

ClickHouse for Analytics-Scale Storage

When you're storing billions of scraped events and need second-level analytical queries, ClickHouse is the right tool. Schema design, ingestion patterns, and the pitfalls.

What you’ll learn

  • Recognise when scraped data has outgrown Postgres for analytics.
  • Design ClickHouse tables with the right ORDER BY and partitioning key.
  • Ingest efficiently via batched INSERTs.

Once a scraping project lands in hundreds-of-millions-of-rows territory (price observations across thousands of products and dozens of sources, every hour), analytical queries in Postgres get slow and expensive. ClickHouse is built for that exact workload: columnar, compressed, and absurdly fast at aggregations over wide tables.

When ClickHouse earns its slot

Use case                                 Postgres      ClickHouse
OLTP (CRUD per row)                      Good          Bad (small writes are expensive)
Reports over 100M+ rows                  Slow          Sub-second
Storing every price observation forever  Cheap-ish     Very cheap (10x compression typical)
Joins across many tables                 Strong        OK, but not its strength
Updates and deletes                      First-class   Async, eventual, painful

Heuristic: append-only, read-heavy analytical data over time = ClickHouse. State-of-the-world rows you update = Postgres or Mongo. Many production stacks use both: Postgres for current state, ClickHouse for history.
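The both-stores pattern can be sketched with in-memory stand-ins: a dict playing the Postgres upsert (one row per key, overwritten in place) and a list playing the ClickHouse append-only history. `DualStore` and its method names are illustrative, not a real library.

```python
class DualStore:
    """Toy model of the Postgres-for-state, ClickHouse-for-history split."""

    def __init__(self):
        self.current = {}   # Postgres stand-in: one mutable row per (source, source_id)
        self.history = []   # ClickHouse stand-in: immutable, append-only events

    def record(self, source, source_id, price_cents, observed_at):
        # Upsert the state-of-the-world row...
        self.current[(source, source_id)] = {
            "price_cents": price_cents,
            "observed_at": observed_at,
        }
        # ...and append the observation to history, never updating it.
        self.history.append((source, source_id, observed_at, price_cents))
```

Two observations of the same product leave one current row but two history events, which is exactly the shape each database is good at storing.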

A canonical scraping schema

Price observations from many sources:

CREATE TABLE price_events (
  source  LowCardinality(String),
  source_id  String,
  observed_at  DateTime,
  price_cents  Int32,
  currency  LowCardinality(FixedString(3)),
  in_stock  UInt8,
  proxy_region LowCardinality(String),
  response_ms  UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(observed_at)
ORDER BY (source, source_id, observed_at);

Three things to notice:

  • LowCardinality(String) for repeated-string columns (source names, region codes): values are stored as dictionaries, which compresses massively and speeds up GROUP BY.
  • PARTITION BY toYYYYMM(observed_at), so old months can be dropped cheaply or moved to cold storage.
  • ORDER BY (source, source_id, observed_at) matches the primary access pattern: ClickHouse skips entire data parts whose (source, source_id) range doesn't overlap the query.

ORDER BY choice is the single most important schema decision. Get it right based on the queries you'll run.
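To build intuition for why the sort key matters, here is a toy Python model of part skipping. The part layout and source names are invented, and real ClickHouse uses a sparse index of marks rather than bare min/max keys, but the pruning idea is the same: a part whose key range can't contain the queried source is never read.

```python
# Each "part" holds rows sorted by (source, source_id); we track the
# key range it covers and how many rows it contains.
parts = [
    {"min_key": ("amazon", "a001"),   "max_key": ("amazon", "z999"),   "rows": 1_000_000},
    {"min_key": ("ebay", "a001"),     "max_key": ("ebay", "z999"),     "rows": 1_000_000},
    {"min_key": ("walmart", "a001"),  "max_key": ("walmart", "z999"),  "rows": 1_000_000},
]

def parts_to_scan(source):
    """Keep only parts whose key range can contain this source."""
    return [
        p for p in parts
        if p["min_key"][0] <= source <= p["max_key"][0]
    ]

# A query filtered on source = 'ebay' touches one part out of three:
# 2M of 3M rows are skipped without being read.
scanned = parts_to_scan("ebay")
```

If the table were instead ordered by observed_at first, every part would span all sources and nothing could be skipped for a per-source query, which is why ORDER BY should lead with your most common filter columns.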

Ingestion: batch, not row-by-row

ClickHouse is bad at single-row inserts. Aim for batches of 1k–100k rows per INSERT. Every INSERT creates a new on-disk part that background merges must later consolidate, so many tiny inserts amplify writes and can trip the "too many parts" error.

from clickhouse_connect import get_client

client = get_client(host="localhost")

rows = []
for item in batch:
    rows.append((
        item["source"], item["source_id"], item["observed_at"],
        item["price_cents"], item["currency"], int(item["in_stock"]),
        item["proxy_region"], item["response_ms"],
    ))

client.insert("price_events", rows, column_names=[
    "source", "source_id", "observed_at", "price_cents", "currency",
    "in_stock", "proxy_region", "response_ms",
])
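The snippet above inserts one batch; in a long-running scraper the rows list keeps growing, so it is worth splitting it into fixed-size chunks before calling the client. A minimal sketch (`chunked` and `insert_batched` are plain helpers written here, not part of the clickhouse_connect API):

```python
def chunked(rows, size=10_000):
    """Yield successive lists of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def insert_batched(client, table, rows, column_names, size=10_000):
    """Send rows as a series of INSERTs, each within the recommended batch range."""
    for chunk in chunked(rows, size):
        client.insert(table, chunk, column_names=column_names)
```

With the default size, a million buffered rows become a hundred INSERTs instead of a million, which keeps part creation and merge pressure under control.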

In PHP via the HTTP interface:

$client = new ClickHouseDB\Client(['host' => 'localhost']);
$client->insert('price_events', $rows, [
  'source', 'source_id', 'observed_at', 'price_cents',
  'currency', 'in_stock', 'proxy_region', 'response_ms'
]);

For high-throughput pipelines, route inserts through Kafka and let ClickHouse's Kafka table engine consume the topic. Direct inserts are fine up to a few thousand rows per second per node.
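If Kafka is overkill, the Buffer engine's flush-on-size-or-age behaviour is easy to approximate in-process. A minimal sketch, with `flush_fn` standing in for the real insert call:

```python
import time

class InsertBuffer:
    """Accumulate rows; flush when the buffer is full OR old enough."""

    def __init__(self, flush_fn, max_rows=10_000, max_age_s=5.0):
        self.flush_fn = flush_fn
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.rows = []
        self.first_at = None  # monotonic timestamp of the oldest buffered row

    def add(self, row):
        if self.first_at is None:
            self.first_at = time.monotonic()
        self.rows.append(row)
        if (len(self.rows) >= self.max_rows
                or time.monotonic() - self.first_at >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.rows:
            self.flush_fn(self.rows)
            self.rows, self.first_at = [], None
```

The age bound matters as much as the size bound: a quiet scraper should still land its last few rows within seconds, not hold them until the batch fills.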

Queries that fly

-- Median price per source over the last 30 days
SELECT
  source,
  quantile(0.5)(price_cents) / 100 AS median_usd,
  count() AS observations
FROM price_events
WHERE observed_at > now() - INTERVAL 30 DAY
GROUP BY source
ORDER BY median_usd DESC
LIMIT 20;

Over 500M rows on a single node, that returns in well under a second. Postgres on the same hardware typically takes tens of seconds even with good indexes, because it still has to read every matching row.

-- Price change detection: when did price drop > 10%?
-- prev_price is computed in a subquery because window functions
-- can't be filtered in the same query's WHERE clause.
SELECT source, source_id, observed_at, price_cents, prev_price
FROM (
  SELECT
    source, source_id,
    observed_at,
    price_cents,
    lagInFrame(price_cents) OVER (
      PARTITION BY source, source_id ORDER BY observed_at
      ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
    ) AS prev_price
  FROM price_events
  WHERE source = 'catalog108'
)
WHERE prev_price > 0 AND price_cents < prev_price * 0.9
LIMIT 100;

Window functions are first-class.

What ClickHouse is bad at

  • Per-row updates. Possible via ALTER TABLE ... UPDATE, but these are asynchronous mutations and expensive. Don't model state you'd update; re-insert with a version column and use ReplacingMergeTree if you must.
  • Many small inserts. Buffer them. Use the Kafka or Buffer table engines.
  • Strong transactions. ClickHouse is eventually-consistent across replicas.
  • JOINs at scale. They work, but plan around them: dictionary lookups and pre-joined tables are faster patterns.
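One common way to sidestep a large JOIN is enriching query results client-side with a plain lookup table, the same idea ClickHouse dictionaries implement server-side. The product names below are made up for illustration:

```python
# Metadata that would otherwise live in a JOINed products table.
product_names = {
    ("catalog108", "sku-1"): "Blue Widget",
    ("catalog108", "sku-2"): "Red Widget",
}

def enrich(rows):
    """Attach a product name to (source, source_id, median_cents) result rows."""
    return [
        (*row, product_names.get((row[0], row[1]), "unknown"))
        for row in rows
    ]
```

An aggregate query returns a few hundred rows at most, so a dict lookup per row is effectively free, while the equivalent JOIN would have touched the metadata table for every scanned event.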

Materialized views for rollups

Instead of querying raw events every time:

CREATE MATERIALIZED VIEW daily_price_summary
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY (source, source_id, day)
AS SELECT
  source, source_id,
  toDate(observed_at) AS day,
  countState() AS observations,
  minState(price_cents) AS min_price,
  maxState(price_cents) AS max_price
FROM price_events
GROUP BY source, source_id, day;

Inserts into price_events aggregate into the rollup automatically. Daily reports read the small summary table, finishing the aggregation with countMerge, minMerge, and maxMerge instead of scanning raw events. The engine choice matters: SummingMergeTree would silently sum min_price and max_price during background merges, so AggregatingMergeTree with -State/-Merge combinators is the correct pattern here.

Cost economics

ClickHouse on a 4-core / 16GB VM happily handles 100M+ events. Compared to a Postgres setup needing 64GB+ to run similar analytical queries, it's substantially cheaper. Disk is the main cost: even at 10x compression, billions of events add up. Old partitions can move to cold object storage (S3-backed tables).
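A back-of-envelope estimate makes the disk math concrete; 50 bytes per raw row and 10x compression are assumptions for this schema, not benchmarks:

```python
def est_gb(events, bytes_per_row=50, compression=10):
    """Rough compressed on-disk size in GB for a given event count."""
    return events * bytes_per_row / compression / 1e9

# A billion price observations lands around 5 GB compressed under
# these assumptions; at 50 GB raw, the 10x ratio is what makes
# "store everything forever" affordable.
one_billion = est_gb(1_000_000_000)
```

The raw-row width is the number to measure on your own data: wide text columns or high-cardinality strings push it up fast, while LowCardinality and numeric types keep it near the low end.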

What to try

Run ClickHouse locally (docker run -d -p 8123:8123 clickhouse/clickhouse-server), create the price_events schema, generate 10M fake observations from Catalog108 prices over a year, then write the median-per-source query. Watch it return in <500ms on a laptop.
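A quick way to produce the fake observations is a per-SKU random walk. This sketch emits tuples in the price_events column order from this article; the SKU counts and price ranges are arbitrary choices:

```python
import random
from datetime import datetime, timedelta

def fake_observations(n_skus=100, per_sku=100, seed=42):
    """Generate n_skus * per_sku rows: one hourly random-walk price series per SKU."""
    rng = random.Random(seed)
    start = datetime(2024, 1, 1)
    rows = []
    for sku in range(n_skus):
        price = rng.randint(500, 50_000)  # starting price in cents
        for step in range(per_sku):
            price = max(100, price + rng.randint(-200, 200))  # drift, floor at $1
            rows.append((
                "catalog108", f"sku-{sku}",        # source, source_id
                start + timedelta(hours=step),      # observed_at
                price, "USD", rng.randint(0, 1),    # price_cents, currency, in_stock
                "us-east", rng.randint(50, 800),    # proxy_region, response_ms
            ))
    return rows
```

Scaled to n_skus=10_000 and per_sku=1_000 this yields the 10M rows for the exercise; feed them through batched client.insert calls rather than one giant list in memory.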
