ClickHouse for Analytics-Scale Storage
When you're storing billions of scraped events and need second-level analytical queries, ClickHouse is the right tool. Schema design, ingestion patterns, and the pitfalls.
What you’ll learn
- Recognise when scraped data has outgrown Postgres for analytics.
- Design ClickHouse tables with the right ORDER BY and partitioning key.
- Ingest efficiently via batched INSERTs.
Once a scraping project lands in hundreds-of-millions-of-rows territory (price observations across thousands of products and dozens of sources, every hour), analytical queries in Postgres get slow and expensive. ClickHouse is built for exactly that workload: columnar, compressed, and absurdly fast at aggregations over wide tables with billions of rows.
When ClickHouse earns its slot
| Use case | Postgres | ClickHouse |
|---|---|---|
| OLTP (CRUD per row) | Good | Bad (small writes are expensive) |
| Reports over 100M+ rows | Slow | Sub-second |
| Storing every price observation forever | Cheap-ish | Very cheap (10x compression typical) |
| Joins across many tables | Strong | OK but not its strength |
| Updates and deletes | First-class | Async, eventual, painful |
Heuristic: append-only, read-heavy analytical data over time = ClickHouse. State-of-the-world rows you update = Postgres or Mongo. Many production stacks use both: Postgres for current state, ClickHouse for history.
A canonical scraping schema
Price observations from many sources:
CREATE TABLE price_events (
source LowCardinality(String),
source_id String,
observed_at DateTime,
price_cents Int32,
currency LowCardinality(FixedString(3)),
in_stock UInt8,
proxy_region LowCardinality(String),
response_ms UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(observed_at)
ORDER BY (source, source_id, observed_at);
Three things to notice:
- `LowCardinality` for repeated-string columns (source names, region codes). Stored as dictionaries; massive compression.
- `PARTITION BY toYYYYMM(observed_at)`: old months can be dropped or moved to cold storage wholesale.
- `ORDER BY (source, source_id, observed_at)`: this is your primary access pattern. ClickHouse skips entire data parts whose `(source, source_id)` range doesn't overlap your query.
ORDER BY choice is the single most important schema decision. Get it right based on the queries you'll run.
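For instance, with the sort key above, a per-product history read touches only the few parts whose key range matches, while a filter on a column outside the key has to scan everything (the SKU value here is illustrative):

-- Fast: the filter matches the (source, source_id) prefix of the sort key
SELECT observed_at, price_cents
FROM price_events
WHERE source = 'catalog108' AND source_id = 'sku-42'
ORDER BY observed_at;

-- Full scan: response_ms appears nowhere in the sort key
SELECT count() FROM price_events WHERE response_ms > 5000;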
Ingestion: batch, not row-by-row
ClickHouse is bad at single-row inserts. Every INSERT creates a new on-disk part, so thousands of tiny inserts flood the background merge queue and can eventually get rejected with "too many parts" errors. Aim for batches of 1k–100k rows per INSERT.
from clickhouse_connect import get_client
client = get_client(host="localhost")
rows = []
for item in batch:
rows.append((
item["source"], item["source_id"], item["observed_at"],
item["price_cents"], item["currency"], int(item["in_stock"]),
item["proxy_region"], item["response_ms"]
))
client.insert("price_events", rows, column_names=[
"source", "source_id", "observed_at", "price_cents", "currency",
"in_stock", "proxy_region", "response_ms"
])
In PHP via the HTTP interface:
$client = new ClickHouseDB\Client(['host' => 'localhost']);
$client->insert('price_events', $rows, [
'source', 'source_id', 'observed_at', 'price_cents',
'currency', 'in_stock', 'proxy_region', 'response_ms'
]);
For high-throughput pipelines, route inserts through Kafka and let ClickHouse's Kafka table engine consume the topic, as sketched below. Direct inserts work fine up to a few thousand rows per second per node.
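A minimal sketch of that pattern (broker, topic, and consumer-group names are assumptions): a Kafka engine table reads the topic, and a materialized view moves each consumed row into price_events.

-- Consumes from Kafka; rows flow through, they are not stored here
CREATE TABLE price_events_queue (
    source String,
    source_id String,
    observed_at DateTime,
    price_cents Int32,
    currency FixedString(3),
    in_stock UInt8,
    proxy_region String,
    response_ms UInt32
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'price_events',
         kafka_group_name = 'clickhouse_ingest',
         kafka_format = 'JSONEachRow';

-- Moves every consumed row into the MergeTree table
CREATE MATERIALIZED VIEW price_events_consumer TO price_events
AS SELECT * FROM price_events_queue;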
Queries that fly
-- Median price per source over the last 30 days
SELECT
source,
quantile(0.5)(price_cents) / 100 AS median_usd,
count() AS observations
FROM price_events
WHERE observed_at > now() - INTERVAL 30 DAY
GROUP BY source
ORDER BY median_usd DESC
LIMIT 20;
On a single node holding 500M rows, that query is sub-second. Postgres on the same hardware would take 30s+ even with good indexes.
-- Price change detection: when did a price drop more than 10%?
SELECT source, source_id, observed_at, price_cents, prev_price
FROM
(
    SELECT
        source, source_id,
        observed_at,
        price_cents,
        lagInFrame(price_cents) OVER (
            PARTITION BY source, source_id ORDER BY observed_at
        ) AS prev_price
    FROM price_events
    WHERE source = 'catalog108'
)
WHERE prev_price > 0            -- lagInFrame returns 0 on the first row per group
  AND price_cents < prev_price * 0.9
LIMIT 100;
Window functions are first-class. The subquery is needed because a WHERE clause can't reference a window-function alias directly: window functions are evaluated after filtering.
What ClickHouse is bad at
- Per-row updates. Possible via `ALTER TABLE ... UPDATE`, but those are asynchronous mutations and expensive. Don't model state you'd update; re-insert with a version column and use `ReplacingMergeTree` if you must (see the sketch after this list).
- Many small inserts. Buffer them, or use the Kafka or Buffer table engines.
- Strong transactions. ClickHouse is eventually consistent across replicas.
- JOINs at scale. They work, but plan around them; dictionary lookups and pre-joined tables are faster patterns.
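A minimal sketch of the versioned-row pattern (table and values are illustrative): ReplacingMergeTree keeps, per sort key, the row with the highest version once background merges run, and FINAL collapses not-yet-merged duplicates at read time.

CREATE TABLE product_state (
    source LowCardinality(String),
    source_id String,
    title String,
    price_cents Int32,
    version UInt64              -- e.g. unix timestamp of the scrape
)
ENGINE = ReplacingMergeTree(version)
ORDER BY (source, source_id);

-- Re-insert the whole row on every change; no UPDATE needed
INSERT INTO product_state VALUES ('catalog108', 'sku-42', 'Widget', 1999, 1718000000);

-- FINAL is slower but correct before merges have caught up
SELECT * FROM product_state FINAL WHERE source = 'catalog108';

FINAL is convenient but pays merge cost at read time; for hot paths, aggregating with argMax(column, version) per key is the usual faster alternative.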
Materialized views for rollups
Instead of querying raw events every time:
CREATE MATERIALIZED VIEW daily_price_summary
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY (source, source_id, day)
AS SELECT
    source, source_id,
    toDate(observed_at) AS day,
    count() AS observations,
    sum(price_cents) AS sum_price_cents
FROM price_events
GROUP BY source, source_id, day;
Inserts to price_events flow into the materialized view automatically, so daily reports read the small rollup table instead of the raw events. Two caveats: SummingMergeTree sums every numeric column outside the sort key on merge, so the rollup can only hold additive metrics like counts and sums (min/max rollups need AggregatingMergeTree with -State/-Merge combinators), and because merges are asynchronous, reads should still aggregate.
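A read that stays correct whether or not background merges have run yet:

SELECT
    source,
    day,
    sum(observations) AS observations,
    sum(sum_price_cents) / sum(observations) / 100 AS avg_usd
FROM daily_price_summary
WHERE day >= today() - 30
GROUP BY source, day
ORDER BY avg_usd DESC;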
Cost economics
ClickHouse on a 4-core / 16GB VM happily handles 100M+ events. Compared to a Postgres setup needing 64GB+ to run similar analytical queries, it's substantially cheaper. Disk is the main cost: even at 10x compression, billions of events add up. Old partitions can move to cold object storage (S3-backed tables).
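A sketch of that tiering with a table TTL, assuming a storage policy with a volume named 'cold' (e.g. S3-backed) is configured on the server; the intervals are arbitrary:

-- Move parts older than a year to the cold volume; drop them after three years
ALTER TABLE price_events
    MODIFY TTL observed_at + INTERVAL 12 MONTH TO VOLUME 'cold',
               observed_at + INTERVAL 36 MONTH DELETE;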
What to try
Run ClickHouse locally (docker run -d -p 8123:8123 clickhouse/clickhouse-server), create the price_events schema, generate 10M fake observations from Catalog108 prices over a year, then write the median-per-source query. Watch it return in <500ms on a laptop.
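One way to generate those fake observations server-side (value ranges are arbitrary; rand(N) takes a dummy argument so each call yields an independent random stream):

INSERT INTO price_events
SELECT
    'catalog108',
    toString(rand(1) % 5000),              -- ~5k fake products
    now() - (rand(2) % 31536000),          -- spread over the past year
    toInt32(500 + rand(3) % 99500),        -- $5.00 up to ~$1,000
    'USD',
    rand(4) % 2,
    ['us-east', 'eu-west', 'ap-south'][1 + rand(5) % 3],
    50 + rand(6) % 1950
FROM numbers(10000000);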
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.