
Intermediate · 4 min read

PostgreSQL for Structured Scraped Data

Postgres is the default sink for production scrapers. Schema design, upserts, JSONB for semi-structured fields, and the indexes that keep ingestion fast.

What you’ll learn

  • Design a schema that survives schema drift in scraped sources.
  • Use UPSERT (ON CONFLICT) for idempotent ingestion.
  • Pick the right index for the access pattern.

Most scraped datasets land in Postgres. It's free, battle-tested, has JSONB for the messy bits, and every language has a first-class driver. This lesson is the minimum schema and ingest pattern you need.

A schema that survives drift

Scraped sources change. New fields appear, old ones vanish. Don't model every column rigidly; reserve a JSONB column for "everything else."

CREATE TABLE products (
  id  BIGSERIAL PRIMARY KEY,
  source  TEXT NOT NULL,  -- "catalog108", "shop-x"
  source_id  TEXT NOT NULL,  -- the site's product ID
  url  TEXT NOT NULL,
  title  TEXT,
  price_cents  INTEGER,
  currency  CHAR(3),
  in_stock  BOOLEAN,
  raw  JSONB NOT NULL DEFAULT '{}'::jsonb,
  scraped_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (source, source_id)
);

CREATE INDEX products_source_idx  ON products (source);
CREATE INDEX products_scraped_at_idx ON products (scraped_at DESC);
CREATE INDEX products_raw_gin  ON products USING GIN (raw);

Three things to notice:

  • Composite uniqueness on (source, source_id). Lets you upsert without colliding across sources.
  • raw JSONB. Anything the spider extracts that doesn't fit columns goes here. New fields don't need migrations.
  • GIN index on raw. Lets you query raw @> '{"category": "kitchen"}' quickly.
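
For example, this containment lookup is exactly the kind of query that GIN index serves (assuming the spider stores a "category" key inside raw):

SELECT id, title, price_cents
FROM products
WHERE raw @> '{"category": "kitchen"}';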

Store money as integer cents (INTEGER or BIGINT), never FLOAT. Floats lose pennies; users notice.
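
A conversion sketch in Python (assuming the scraped price arrives as a string such as "19.99"):

from decimal import Decimal

price_cents = int(Decimal("19.99") * 100)  # 1999; Decimal avoids float rounding on the way in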

Idempotent ingest with ON CONFLICT

Scrapers re-run. The same product appears across pages. You want one row per (source, source_id), updated when it changes.

INSERT INTO products (source, source_id, url, title, price_cents, currency, in_stock, raw, scraped_at)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, now())
ON CONFLICT (source, source_id) DO UPDATE SET
  url  = EXCLUDED.url,
  title  = EXCLUDED.title,
  price_cents = EXCLUDED.price_cents,
  currency  = EXCLUDED.currency,
  in_stock  = EXCLUDED.in_stock,
  raw  = EXCLUDED.raw,
  scraped_at  = EXCLUDED.scraped_at;

In Python with psycopg:

import json
import psycopg

with psycopg.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute("""
        INSERT INTO products (source, source_id, url, title, price_cents, currency, in_stock, raw)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s::jsonb)
        ON CONFLICT (source, source_id) DO UPDATE SET
            title = EXCLUDED.title, price_cents = EXCLUDED.price_cents,
            in_stock = EXCLUDED.in_stock, raw = EXCLUDED.raw, scraped_at = now()
        """, ("catalog108", item["id"], item["url"], item["title"],
              item["price_cents"], item["currency"], item["in_stock"], json.dumps(item)))

In PHP with Doctrine DBAL:

$conn->executeStatement(
  'INSERT INTO products (source, source_id, url, title, price_cents, currency, in_stock, raw)
  VALUES (?, ?, ?, ?, ?, ?, ?, ?::jsonb)
  ON CONFLICT (source, source_id) DO UPDATE SET
  title = EXCLUDED.title, price_cents = EXCLUDED.price_cents,
  in_stock = EXCLUDED.in_stock, raw = EXCLUDED.raw, scraped_at = now()',
  [$source, $sourceId, $url, $title, $priceCents, $currency, $inStock, json_encode($raw)]
);

Both statements are idempotent: re-running the scraper is safe.

Batch inserts beat row-by-row

Postgres is fast, but the network round-trip per row isn't. For bulk inserts, use:

  • COPY ... FROM STDIN is fastest. Use psycopg's cursor.copy() or PHP's pg_copy_from()/pg_put_line().
  • Multi-row INSERT is second fastest; aim for 500–1000 rows per statement.
  • Batched executemany() is slower than the above, but easy.

A spider doing 50k inserts row-by-row over the network can spend more time on commit overhead than parsing. Batching to 1000 rows per statement typically gives 10–50x ingest speedup.
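
Here is a sketch of the COPY path with psycopg, staging rows into a temp table so the final write can still be an idempotent upsert (the products_stage name and the rows iterable are illustrative):

import json
import psycopg

# rows: iterable of (source, source_id, url, title, price_cents, currency, in_stock, raw_dict)
with psycopg.connect(DSN) as conn, conn.cursor() as cur:
    # Staging table with the same columns and defaults, dropped at commit
    cur.execute("""
        CREATE TEMP TABLE products_stage
        (LIKE products INCLUDING DEFAULTS) ON COMMIT DROP
    """)
    # Stream everything in with COPY: one round-trip, no per-row statement overhead
    with cur.copy(
        "COPY products_stage (source, source_id, url, title, price_cents, currency, in_stock, raw) FROM STDIN"
    ) as copy:
        for (source, source_id, url, title, price_cents, currency, in_stock, raw) in rows:
            copy.write_row((source, source_id, url, title,
                            price_cents, currency, in_stock, json.dumps(raw)))
    # Fold the staged rows into products with a single upsert
    cur.execute("""
        INSERT INTO products (source, source_id, url, title, price_cents, currency, in_stock, raw)
        SELECT source, source_id, url, title, price_cents, currency, in_stock, raw
        FROM products_stage
        ON CONFLICT (source, source_id) DO UPDATE SET
            url = EXCLUDED.url, title = EXCLUDED.title, price_cents = EXCLUDED.price_cents,
            currency = EXCLUDED.currency, in_stock = EXCLUDED.in_stock,
            raw = EXCLUDED.raw, scraped_at = now()
    """)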

Indexes: pick by access pattern

  • WHERE source = ? → B-tree on source
  • WHERE scraped_at > now() - interval '1 day' → B-tree on scraped_at
  • WHERE raw @> '{"category": "kitchen"}' → GIN on raw
  • WHERE title ILIKE '%blender%' → pg_trgm GIN on title
  • WHERE source = ? AND in_stock = true → composite B-tree on (source, in_stock)

Avoid index sprawl: each index slows writes. Profile with EXPLAIN ANALYZE before adding.
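
For instance, the ILIKE pattern above only benefits once the pg_trgm extension is installed, and EXPLAIN ANALYZE confirms whether the planner actually uses the index (the index name here is illustrative):

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX products_title_trgm_idx ON products USING GIN (title gin_trgm_ops);

EXPLAIN ANALYZE
SELECT id, title FROM products WHERE title ILIKE '%blender%';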

Time-series of prices

If you track price history, don't update rows; append new ones.

CREATE TABLE price_history (
  product_id  BIGINT REFERENCES products(id),
  price_cents INTEGER NOT NULL,
  observed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX price_history_pid_time_idx ON price_history (product_id, observed_at DESC);

Most-recent-only stays on products. Full history lives in price_history. For huge datasets, partition by month.
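
The write and read paths are then one statement each (a sketch; $1 and $2 are driver parameters):

-- Append an observation for a product
INSERT INTO price_history (product_id, price_cents) VALUES ($1, $2);

-- Latest observation for a product; served by the (product_id, observed_at DESC) index
SELECT price_cents, observed_at
FROM price_history
WHERE product_id = $1
ORDER BY observed_at DESC
LIMIT 1;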

Connection pooling

Each Postgres connection costs ~10MB of RAM server-side. A scraper with 200 workers shouldn't open 200 direct connections. Use PgBouncer (transaction-mode) in front. Drivers connect to PgBouncer; PgBouncer multiplexes onto a small backend pool (10–30 connections is typical).
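
A minimal pgbouncer.ini sketch of that layout (the database name, paths, and pool sizes are placeholders to adapt):

[databases]
scraped = host=127.0.0.1 port=5432 dbname=scraped

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction   ; return the server connection at each transaction end
max_client_conn = 500     ; how many scraper workers may connect to PgBouncer
default_pool_size = 20    ; actual Postgres backend connections per database/user pair

Workers point their DSN at port 6432 instead of 5432; nothing else in the scraper changes.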

Vacuum and bloat

Frequent updates create dead tuples. Postgres's autovacuum usually keeps up, but for high-write tables tune:

  • autovacuum_vacuum_scale_factor = 0.05 (vacuum more aggressively)
  • Run VACUUM ANALYZE after big bulk loads.
  • Watch pg_stat_user_tables.n_dead_tup.
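
Those translate to a few statements (a sketch):

-- Per-table override: vacuum once ~5% of rows are dead (the global default is 20%)
ALTER TABLE products SET (autovacuum_vacuum_scale_factor = 0.05);

-- After a big bulk load: reclaim dead tuples and refresh planner statistics
VACUUM ANALYZE products;

-- Watch bloat
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;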

What to try

Spin up Postgres locally (docker run -e POSTGRES_PASSWORD=x -p 5432:5432 postgres:16), create the schema above, then point your Catalog108 scraper at it with the upsert pattern. Run the same scrape twice: the row count shouldn't double; only scraped_at should change.
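
A quick way to check (a sketch, assuming the source label 'catalog108'): run this after each pass; the count should stay flat while the timestamp advances.

SELECT count(*) AS row_count, max(scraped_at) AS last_scraped
FROM products
WHERE source = 'catalog108';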

Quiz: check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.

Question 1 of 8

Why does the lesson model the products table with a composite UNIQUE on (source, source_id) rather than a global UNIQUE on source_id?
