Change Detection (Diff-Only Storage)
Scraping the same pages repeatedly produces 99% redundant data. Store only what changed, and you'll cut storage and downstream load by orders of magnitude.
What you’ll learn
- Distinguish snapshot, diff, and event-stream storage models.
- Build a change-detection pipeline with hashing and equality checks.
- Decide which fields trigger a 'changed' event.
A typical scraper re-fetches the same product pages daily. If you store every observation, you're paying for thousands of nearly-identical rows per product. The smarter pattern: store the current state, and append only changes.
Three storage models
| Model | What you store | Pros | Cons |
|---|---|---|---|
| Full snapshots | Every scrape, every field | Simple; complete history | Huge storage; mostly redundant |
| Diff-only | Current row + change log | 10–100x smaller; "what changed when" is a direct query | More complex writes |
| Event stream | Each scrape as an event in append-only log | Replayable; fits ClickHouse | Reconstructing state requires aggregation |
Diff-only is the sweet spot for most price-tracking, inventory, listing-style scrapes.
The schema
-- Current state
CREATE TABLE products (
id BIGSERIAL PRIMARY KEY,
source TEXT NOT NULL,
source_id TEXT NOT NULL,
title TEXT,
price_cents INTEGER,
in_stock BOOLEAN,
last_seen TIMESTAMPTZ,
UNIQUE (source, source_id)
);
-- Change log
CREATE TABLE product_changes (
id BIGSERIAL PRIMARY KEY,
product_id BIGINT REFERENCES products(id),
field TEXT NOT NULL, -- 'price_cents', 'in_stock', etc.
old_value JSONB,
new_value JSONB,
changed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX product_changes_pid_idx ON product_changes(product_id, changed_at DESC);
CREATE INDEX product_changes_field_idx ON product_changes(field, changed_at DESC);
Two tables, two access patterns: state lookups against products, history queries against product_changes.
The pipeline
def ingest(item):
with conn.cursor() as cur:
cur.execute("""
SELECT id, title, price_cents, in_stock
FROM products
WHERE source = %s AND source_id = %s
""", (item["source"], item["source_id"]))
existing = cur.fetchone()
if existing is None:
# New product, insert, no diff
cur.execute("""
INSERT INTO products (source, source_id, title, price_cents, in_stock, last_seen)
VALUES (%s, %s, %s, %s, %s, now())
RETURNING id
""", (item["source"], item["source_id"], item["title"],
item["price_cents"], item["in_stock"]))
return
pid, old_title, old_price, old_stock = existing
changes = []
if item["title"] != old_title:
changes.append(("title", old_title, item["title"]))
if item["price_cents"] != old_price:
changes.append(("price_cents", old_price, item["price_cents"]))
if item["in_stock"] != old_stock:
changes.append(("in_stock", old_stock, item["in_stock"]))
# Always update last_seen so we know we re-checked
cur.execute("UPDATE products SET last_seen = now() WHERE id = %s", (pid,))
if not changes:
return # unchanged, done
# Apply new state + log every diff
cur.execute("""
UPDATE products SET title = %s, price_cents = %s, in_stock = %s
WHERE id = %s
""", (item["title"], item["price_cents"], item["in_stock"], pid))
for field, old, new in changes:
cur.execute("""
INSERT INTO product_changes (product_id, field, old_value, new_value)
VALUES (%s, %s, %s::jsonb, %s::jsonb)
""", (pid, field, json.dumps(old), json.dumps(new)))
Reads + compares + writes happen atomically inside a transaction in real code.
What counts as "changed"?
Not every field is interesting. Decide explicitly:
- Always interesting: price, stock, availability, title.
- Sometimes interesting: description, image URL, category.
- Usually noise: timestamps in HTML, CSRF tokens, view counts, rating averages that wobble by 0.01.
For numeric fields with natural fuzziness, threshold:
if abs(item["rating"] - old_rating) > 0.05:
changes.append(("rating", old_rating, item["rating"]))
Otherwise you'll log a "change" every time a review trickles in.
PHP version with Doctrine
$existing = $conn->fetchAssociative(
'SELECT id, title, price_cents, in_stock FROM products WHERE source = ? AND source_id = ?',
[$item['source'], $item['source_id']]
);
if (!$existing) {
$conn->insert('products', [
'source' => $item['source'], 'source_id' => $item['source_id'],
'title' => $item['title'], 'price_cents' => $item['price_cents'],
'in_stock' => $item['in_stock'], 'last_seen' => new DateTime()
]);
return;
}
$changes = [];
foreach (['title', 'price_cents', 'in_stock'] as $field) {
if ($item[$field] !== $existing[$field]) {
$changes[] = [$field, $existing[$field], $item[$field]];
}
}
$conn->update('products', ['last_seen' => new DateTime()], ['id' => $existing['id']]);
if (!$changes) return;
$conn->update('products', [
'title' => $item['title'], 'price_cents' => $item['price_cents'],
'in_stock' => $item['in_stock']
], ['id' => $existing['id']]);
foreach ($changes as [$field, $old, $new]) {
$conn->insert('product_changes', [
'product_id' => $existing['id'],
'field' => $field,
'old_value' => json_encode($old),
'new_value' => json_encode($new),
]);
}
Why last_seen matters
If you only update on change, a product unchanged for a year looks identical to one that disappeared from the site. last_seen distinguishes them. Items not seen for N days are flagged as delisted, even if their fields never changed.
Querying the change log
-- All price drops in the last 7 days
SELECT p.title, c.old_value::int AS old, c.new_value::int AS new,
c.changed_at
FROM product_changes c
JOIN products p ON p.id = c.product_id
WHERE c.field = 'price_cents'
AND c.changed_at > now() - interval '7 days'
AND (c.new_value::int) < (c.old_value::int);
Direct, indexed, sub-second over millions of rows.
Hands-on lab
Scrape Catalog108 /products daily for a week. Compare:
- A naive table that stores every observation (
SELECT count(*)). - The diff-only table.
After 7 days, snapshot-mode is ~7× the storage of diff-only for the same information content. Then query "which products changed price during the week", diff-only answers in one query; snapshot mode needs window functions.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.