PostgreSQL for Structured Scraped Data
Postgres is the default sink for production scrapers. Schema design, upserts, JSONB for semi-structured fields, and the indexes that keep ingestion fast.
What you’ll learn
- Design a schema that survives schema drift in scraped sources.
- Use UPSERT (ON CONFLICT) for idempotent ingestion.
- Pick the right index for the access pattern.
Most scraped datasets land in Postgres. It's free, battle-tested, has JSONB for the messy bits, and every language has a first-class driver. This lesson is the minimum schema and ingest pattern you need.
A schema that survives drift
Scraped sources change. New fields appear, old ones vanish. Don't model every column rigidly; reserve a JSONB column for "everything else."
CREATE TABLE products (
id BIGSERIAL PRIMARY KEY,
source TEXT NOT NULL, -- "catalog108", "shop-x"
source_id TEXT NOT NULL, -- the site's product ID
url TEXT NOT NULL,
title TEXT,
price_cents INTEGER,
currency CHAR(3),
in_stock BOOLEAN,
raw JSONB NOT NULL DEFAULT '{}'::jsonb,
scraped_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (source, source_id)
);
CREATE INDEX products_source_idx ON products (source);
CREATE INDEX products_scraped_at_idx ON products (scraped_at DESC);
CREATE INDEX products_raw_gin ON products USING GIN (raw);
Three things to notice:
- Composite uniqueness on (source, source_id). Lets you upsert without colliding across sources.
- raw JSONB. Anything the spider extracts that doesn't fit columns goes here. New fields don't need migrations.
- GIN index on raw. Lets you query raw @> '{"category": "kitchen"}' quickly, as sketched below.
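For example, a containment query plus a field extraction might look like this (a sketch; the "category" and "brand" keys are just whatever the spider happened to store in raw):
-- Containment query; served by the GIN index on raw.
SELECT id, title, raw->>'brand' AS brand
FROM products
WHERE raw @> '{"category": "kitchen"}';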
Store money as integer cents (INTEGER or BIGINT), never FLOAT. Floats lose pennies; users notice.
Idempotent ingest with ON CONFLICT
Scrapers re-run. The same product appears across pages. You want one row per (source, source_id), updated when it changes.
INSERT INTO products (source, source_id, url, title, price_cents, currency, in_stock, raw, scraped_at)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, now())
ON CONFLICT (source, source_id) DO UPDATE SET
url = EXCLUDED.url,
title = EXCLUDED.title,
price_cents = EXCLUDED.price_cents,
currency = EXCLUDED.currency,
in_stock = EXCLUDED.in_stock,
raw = EXCLUDED.raw,
scraped_at = EXCLUDED.scraped_at;
In Python with psycopg:
import json
import psycopg

with psycopg.connect(DSN) as conn, conn.cursor() as cur:
    # Same upsert as above; raw gets the whole scraped item as JSONB.
    cur.execute("""
        INSERT INTO products (source, source_id, url, title, price_cents, currency, in_stock, raw)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s::jsonb)
        ON CONFLICT (source, source_id) DO UPDATE SET
            title = EXCLUDED.title, price_cents = EXCLUDED.price_cents,
            in_stock = EXCLUDED.in_stock, raw = EXCLUDED.raw, scraped_at = now()
    """, ("catalog108", item["id"], item["url"], item["title"],
          item["price_cents"], item["currency"], item["in_stock"], json.dumps(item)))
In PHP with Doctrine DBAL:
$conn->executeStatement(
'INSERT INTO products (source, source_id, url, title, price_cents, currency, in_stock, raw)
VALUES (?, ?, ?, ?, ?, ?, ?, ?::jsonb)
ON CONFLICT (source, source_id) DO UPDATE SET
title = EXCLUDED.title, price_cents = EXCLUDED.price_cents,
in_stock = EXCLUDED.in_stock, raw = EXCLUDED.raw, scraped_at = now()',
[$source, $sourceId, $url, $title, $priceCents, $currency, $inStock, json_encode($raw)]
);
Both are idempotent: re-running the scraper is safe.
Batch inserts beat row-by-row
Postgres is fast, but the network round-trip per row isn't. For bulk inserts, use:
- COPY ... FROM STDIN: fastest. Use psycopg's copy() or PHP's pg_put_line.
- Multi-row INSERT: second fastest. 500–1000 rows per statement.
- Batched executemany: slower than the above, but easy.
A spider doing 50k inserts row-by-row over the network can spend more time on commit overhead than parsing. Batching to 1000 rows per statement typically gives 10–50x ingest speedup.
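COPY alone can't do ON CONFLICT, so one common pattern is to COPY into a temporary staging table and upsert from it in a single statement. A sketch with psycopg (bulk_upsert, the items list, and the DSN are illustrative, not part of this lesson's code):
# Sketch: COPY into a temp staging table, then upsert from it.
# Assumes each (source, source_id) appears at most once per batch.
import json
import psycopg

def bulk_upsert(dsn, items):
    with psycopg.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TEMP TABLE staging (
                source TEXT, source_id TEXT, url TEXT, title TEXT,
                price_cents INTEGER, currency CHAR(3), in_stock BOOLEAN, raw JSONB
            ) ON COMMIT DROP
        """)
        # One round-trip for the whole batch.
        with cur.copy("""
            COPY staging (source, source_id, url, title,
                          price_cents, currency, in_stock, raw) FROM STDIN
        """) as copy:
            for it in items:
                copy.write_row(("catalog108", it["id"], it["url"], it["title"],
                                it["price_cents"], it["currency"], it["in_stock"],
                                json.dumps(it)))
        # Fold staged rows into products with the same ON CONFLICT clause as above.
        cur.execute("""
            INSERT INTO products (source, source_id, url, title,
                                  price_cents, currency, in_stock, raw)
            SELECT source, source_id, url, title,
                   price_cents, currency, in_stock, raw
            FROM staging
            ON CONFLICT (source, source_id) DO UPDATE SET
                title = EXCLUDED.title, price_cents = EXCLUDED.price_cents,
                in_stock = EXCLUDED.in_stock, raw = EXCLUDED.raw,
                scraped_at = now()
        """)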
Indexes: pick by access pattern
| Query pattern | Index |
|---|---|
| WHERE source = ? | B-tree on source |
| WHERE scraped_at > now() - interval '1 day' | B-tree on scraped_at |
| WHERE raw @> '{"category": "kitchen"}' | GIN on raw |
| WHERE title ILIKE '%blender%' | pg_trgm GIN on title |
| WHERE source = ? AND in_stock = true | Composite B-tree on (source, in_stock) |
Avoid index sprawl: each index slows writes. Profile with EXPLAIN ANALYZE before adding.
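The ILIKE case is the only one that needs an extension. A sketch, with an arbitrary index name:
-- Trigram index so '%blender%' searches can use an index scan.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX products_title_trgm_idx ON products USING GIN (title gin_trgm_ops);

-- Check that the planner actually uses it before keeping the index.
EXPLAIN ANALYZE
SELECT id, title FROM products WHERE title ILIKE '%blender%';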
Time-series of prices
If you track price history, don't update; append.
CREATE TABLE price_history (
product_id BIGINT REFERENCES products(id),
price_cents INTEGER NOT NULL,
observed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX price_history_pid_time_idx ON price_history (product_id, observed_at DESC);
Most-recent-only stays on products. Full history lives in price_history. For huge datasets, partition by month.
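If you expect the history to get huge, declare the table partitioned from the start; monthly partitions are then one statement each. A sketch (this variant replaces the plain CREATE TABLE above; partition names and dates are illustrative):
CREATE TABLE price_history (
    product_id BIGINT REFERENCES products(id),
    price_cents INTEGER NOT NULL,
    observed_at TIMESTAMPTZ NOT NULL DEFAULT now()
) PARTITION BY RANGE (observed_at);

-- One partition per month; create the next one ahead of time (e.g. from cron).
CREATE TABLE price_history_2025_01 PARTITION OF price_history
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');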
Connection pooling
Each Postgres connection costs ~10MB of RAM server-side. A scraper with 200 workers shouldn't open 200 direct connections. Use PgBouncer (transaction-mode) in front. Drivers connect to PgBouncer; PgBouncer multiplexes onto a small backend pool (10–30 connections is typical).
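A minimal pgbouncer.ini sketch for that setup; the database name, addresses, auth file, and pool sizes here are assumptions to adapt, not prescribed values:
[databases]
; Workers connect to "scrape" on PgBouncer, which forwards to the real server.
scrape = host=127.0.0.1 port=5432 dbname=scrape

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; Release the server connection back to the pool at end of transaction.
pool_mode = transaction
; All scraper workers may connect...
max_client_conn = 200
; ...but they share ~20 real Postgres connections.
default_pool_size = 20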
Vacuum and bloat
Frequent updates create dead tuples. Postgres's autovacuum usually keeps up, but for high-write tables tune:
- autovacuum_vacuum_scale_factor = 0.05 (vacuum more aggressively).
- Run VACUUM ANALYZE after big bulk loads.
- Watch pg_stat_user_tables.n_dead_tup.
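Both the per-table knob and the dead-tuple check are one statement each (a sketch):
-- Vacuum this hot table once ~5% of its rows are dead, instead of the global default.
ALTER TABLE products SET (autovacuum_vacuum_scale_factor = 0.05);

-- Dead-tuple counts and last autovacuum time, worst tables first.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;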
What to try
Spin up Postgres locally (docker run -e POSTGRES_PASSWORD=x -p 5432:5432 postgres:16), create the schema above, then point your Catalog108 scraper at it with the upsert pattern. Run the same scrape twice: the row count shouldn't double, and only scraped_at should change.
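A quick check after the second run (a sketch):
-- Row counts per source should match run one; only last_seen moves.
SELECT source, count(*) AS row_count, max(scraped_at) AS last_seen
FROM products
GROUP BY source;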
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.