Scraping Central is reader-supported. When you buy through links on our site, we may earn an affiliate commission.

4.54intermediate4 min read

Change Detection (Diff-Only Storage)

Scraping the same pages repeatedly produces 99% redundant data. Store only what changed, and you'll cut storage and downstream load by orders of magnitude.

What you’ll learn

  • Distinguish snapshot, diff, and event-stream storage models.
  • Build a change-detection pipeline with hashing and equality checks.
  • Decide which fields trigger a 'changed' event.

A typical scraper re-fetches the same product pages daily. If you store every observation, you're paying for thousands of nearly-identical rows per product. The smarter pattern: store the current state, and append only changes.

Three storage models

Model What you store Pros Cons
Full snapshots Every scrape, every field Simple; complete history Huge storage; mostly redundant
Diff-only Current row + change log 10–100x smaller; "what changed when" is a direct query More complex writes
Event stream Each scrape as an event in append-only log Replayable; fits ClickHouse Reconstructing state requires aggregation

Diff-only is the sweet spot for most price-tracking, inventory, listing-style scrapes.

The schema

-- Current state
CREATE TABLE products (
  id  BIGSERIAL PRIMARY KEY,
  source  TEXT NOT NULL,
  source_id  TEXT NOT NULL,
  title  TEXT,
  price_cents INTEGER,
  in_stock  BOOLEAN,
  last_seen  TIMESTAMPTZ,
  UNIQUE (source, source_id)
);

-- Change log
CREATE TABLE product_changes (
  id  BIGSERIAL PRIMARY KEY,
  product_id  BIGINT REFERENCES products(id),
  field  TEXT NOT NULL,  -- 'price_cents', 'in_stock', etc.
  old_value  JSONB,
  new_value  JSONB,
  changed_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX product_changes_pid_idx  ON product_changes(product_id, changed_at DESC);
CREATE INDEX product_changes_field_idx ON product_changes(field, changed_at DESC);

Two tables, two access patterns: state lookups against products, history queries against product_changes.

The pipeline

def ingest(item):
  with conn.cursor() as cur:
  cur.execute("""
  SELECT id, title, price_cents, in_stock
  FROM products
  WHERE source = %s AND source_id = %s
  """, (item["source"], item["source_id"]))
  existing = cur.fetchone()

  if existing is None:
  # New product, insert, no diff
  cur.execute("""
  INSERT INTO products (source, source_id, title, price_cents, in_stock, last_seen)
  VALUES (%s, %s, %s, %s, %s, now())
  RETURNING id
  """, (item["source"], item["source_id"], item["title"],
  item["price_cents"], item["in_stock"]))
  return

  pid, old_title, old_price, old_stock = existing
  changes = []
  if item["title"] != old_title:
  changes.append(("title", old_title, item["title"]))
  if item["price_cents"] != old_price:
  changes.append(("price_cents", old_price, item["price_cents"]))
  if item["in_stock"] != old_stock:
  changes.append(("in_stock", old_stock, item["in_stock"]))

  # Always update last_seen so we know we re-checked
  cur.execute("UPDATE products SET last_seen = now() WHERE id = %s", (pid,))

  if not changes:
  return  # unchanged, done

  # Apply new state + log every diff
  cur.execute("""
  UPDATE products SET title = %s, price_cents = %s, in_stock = %s
  WHERE id = %s
  """, (item["title"], item["price_cents"], item["in_stock"], pid))
  for field, old, new in changes:
  cur.execute("""
  INSERT INTO product_changes (product_id, field, old_value, new_value)
  VALUES (%s, %s, %s::jsonb, %s::jsonb)
  """, (pid, field, json.dumps(old), json.dumps(new)))

Reads + compares + writes happen atomically inside a transaction in real code.

What counts as "changed"?

Not every field is interesting. Decide explicitly:

  • Always interesting: price, stock, availability, title.
  • Sometimes interesting: description, image URL, category.
  • Usually noise: timestamps in HTML, CSRF tokens, view counts, rating averages that wobble by 0.01.

For numeric fields with natural fuzziness, threshold:

if abs(item["rating"] - old_rating) > 0.05:
  changes.append(("rating", old_rating, item["rating"]))

Otherwise you'll log a "change" every time a review trickles in.

PHP version with Doctrine

$existing = $conn->fetchAssociative(
  'SELECT id, title, price_cents, in_stock FROM products WHERE source = ? AND source_id = ?',
  [$item['source'], $item['source_id']]
);

if (!$existing) {
  $conn->insert('products', [
  'source' => $item['source'], 'source_id' => $item['source_id'],
  'title' => $item['title'], 'price_cents' => $item['price_cents'],
  'in_stock' => $item['in_stock'], 'last_seen' => new DateTime()
  ]);
  return;
}

$changes = [];
foreach (['title', 'price_cents', 'in_stock'] as $field) {
  if ($item[$field] !== $existing[$field]) {
  $changes[] = [$field, $existing[$field], $item[$field]];
  }
}
$conn->update('products', ['last_seen' => new DateTime()], ['id' => $existing['id']]);

if (!$changes) return;

$conn->update('products', [
  'title' => $item['title'], 'price_cents' => $item['price_cents'],
  'in_stock' => $item['in_stock']
], ['id' => $existing['id']]);

foreach ($changes as [$field, $old, $new]) {
  $conn->insert('product_changes', [
  'product_id' => $existing['id'],
  'field' => $field,
  'old_value' => json_encode($old),
  'new_value' => json_encode($new),
  ]);
}

Why last_seen matters

If you only update on change, a product unchanged for a year looks identical to one that disappeared from the site. last_seen distinguishes them. Items not seen for N days are flagged as delisted, even if their fields never changed.

Querying the change log

-- All price drops in the last 7 days
SELECT p.title, c.old_value::int AS old, c.new_value::int AS new,
  c.changed_at
FROM product_changes c
JOIN products p ON p.id = c.product_id
WHERE c.field = 'price_cents'
  AND c.changed_at > now() - interval '7 days'
  AND (c.new_value::int) < (c.old_value::int);

Direct, indexed, sub-second over millions of rows.

Hands-on lab

Scrape Catalog108 /products daily for a week. Compare:

  • A naive table that stores every observation (SELECT count(*)).
  • The diff-only table.

After 7 days, snapshot-mode is ~7× the storage of diff-only for the same information content. Then query "which products changed price during the week", diff-only answers in one query; snapshot mode needs window functions.

Quiz, check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.

Change Detection (Diff-Only Storage)1 / 8

Which storage model is most space-efficient for a daily-scrape of fields that change rarely?

Score so far: 0 / 0