
SQLite for Embedded Scraper Storage

SQLite is the perfect scraper backend: zero-config, file-based, queryable. Skip CSV/JSON for scrapes you'll re-run or query.

What you’ll learn

  • Create a SQLite database from Python or PHP.
  • Define schema for scraped records, including timestamps and unique keys.
  • Insert and upsert efficiently in bulk.
  • Query the data for analysis without leaving SQLite.

Once a scraper grows past one-shot runs, CSV/JSON outputs become painful: deduplication, incremental updates, "show me everything I scraped on Tuesday." SQLite solves all of this in a single file with zero setup.

Why SQLite for scrapers

  • Zero config. Single file on disk. No server. Ships with Python and PHP.
  • Real SQL. SELECT, JOIN, GROUP BY, the same queries you'd run in Postgres.
  • Concurrent reads. Multiple processes can read at once.
  • Transactional. Crashes don't corrupt the file (with proper journaling, which is default).
  • Portable. Move the file; the database moves with it.

For scrapers under ~10GB and ~100 writes/second, SQLite is genuinely the right answer. Beyond that, look at Postgres.

Python, the basics

import sqlite3

con = sqlite3.connect("scrape.db")
cur = con.cursor()

cur.execute("""
  CREATE TABLE IF NOT EXISTS products (
  id  INTEGER PRIMARY KEY AUTOINCREMENT,
  source_url  TEXT NOT NULL UNIQUE,
  name  TEXT,
  price  REAL,
  category  TEXT,
  scraped_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
  )
""")
con.commit()

cur.execute(
  "INSERT INTO products (source_url, name, price, category) VALUES (?, ?, ?, ?)",
  ("https://practice.scrapingcentral.com/products/yellow-mug", "Yellow mug", 14.99, "kitchen"),
)
con.commit()

con.close()

Three things every scraper schema should have:

  1. A source_url (or other source key) column with a UNIQUE constraint. This is your dedup anchor.
  2. A scraped_at timestamp. When did this row come in?
  3. Real types: REAL for prices, INTEGER for counts. Don't dump everything as TEXT (see the parsing sketch below).
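
Scraped values usually arrive as strings, so a small cleaning step before insert keeps the types honest. A minimal sketch; clean_price is a hypothetical helper, not part of the schema above:

import re

def clean_price(raw: str):
  """Parse a scraped price string like '£14.99' or '1,299.00' into a float."""
  m = re.search(r"[\d,]+(?:\.\d+)?", raw)
  return float(m.group().replace(",", "")) if m else None

print(clean_price("£14.99"))  # 14.99, stored as REAL
print(clean_price("N/A"))     # None, stored as NULL rather than the text "N/A"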

Parameterised queries always

Never:

cur.execute(f"INSERT INTO products VALUES ('{name}')")  # SQL injection waiting to happen

Always:

cur.execute("INSERT INTO products (name) VALUES (?)", (name,))

Even for trusted scraper data, a field could contain quotes that break your SQL. Parameterised queries are a universal best practice.

UPSERT, insert or update

When re-scraping, you want "insert if new, update if changed":

cur.execute("""
  INSERT INTO products (source_url, name, price)
  VALUES (?, ?, ?)
  ON CONFLICT(source_url) DO UPDATE SET
  name = excluded.name,
  price = excluded.price,
  scraped_at = CURRENT_TIMESTAMP
""", (url, name, price))

ON CONFLICT(...) DO UPDATE is SQLite's (and PostgreSQL's) upsert syntax. Works since SQLite 3.24 (2018). The excluded.X references the values you tried to insert.
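
If your environment might ship an older SQLite, you can check the linked library version at runtime. A small sketch; the fallback choice is an assumption, and note that INSERT OR REPLACE is not a true upsert (it deletes and re-inserts the row):

import sqlite3

# sqlite3.sqlite_version_info is the version of the linked SQLite library,
# not of the Python sqlite3 module itself.
if sqlite3.sqlite_version_info < (3, 24, 0):
  # Assumed fallback: INSERT OR REPLACE deletes and re-inserts the row,
  # resetting any columns you don't supply (and the AUTOINCREMENT id).
  print("No ON CONFLICT upsert here; fall back to INSERT OR REPLACE")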

Bulk insert, much faster

rows = scraped_rows  # a list of (source_url, name, price) tuples

con.execute("BEGIN")  # open one transaction for the whole batch
cur.executemany(
  "INSERT OR REPLACE INTO products (source_url, name, price) VALUES (?, ?, ?)",
  rows,
)
con.commit()

Two changes from naive per-row inserts:

  1. executemany batches the statement binding instead of looping in Python.
  2. An explicit transaction (BEGIN ... COMMIT). Without it, SQLite commits after every row, which is roughly 100x slower.

For 10,000 rows: ~10 seconds without batching, ~0.1 seconds with. Always batch.
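
Exact numbers vary with disk and settings; a quick way to measure on your own machine (bench.db is a throwaway file name):

import sqlite3
import time

con = sqlite3.connect("bench.db")  # file-backed, so per-row commits pay the fsync cost
con.execute("DROP TABLE IF EXISTS t")
con.execute("CREATE TABLE t (url TEXT, name TEXT, price REAL)")
rows = [(f"https://example.com/{i}", f"item {i}", float(i)) for i in range(1000)]

t0 = time.perf_counter()
for r in rows:
  con.execute("INSERT INTO t VALUES (?, ?, ?)", r)
  con.commit()  # one transaction per row
t1 = time.perf_counter()

con.execute("BEGIN")
con.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
con.commit()  # one transaction for the whole batch
t2 = time.perf_counter()

print(f"per-row commits: {t1 - t0:.2f}s, one transaction: {t2 - t1:.2f}s")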

Querying the data

cur.execute("SELECT category, COUNT(*), AVG(price) FROM products GROUP BY category")
for row in cur.fetchall():
  print(row)

cur.execute("SELECT * FROM products WHERE price < 20 ORDER BY price DESC LIMIT 5")
for row in cur.fetchall():
  print(dict(zip([d[0] for d in cur.description], row)))

Real SQL. Grouping, aggregation, JOINs across multiple tables you might create (products + reviews + categories), all native, all fast on a properly-indexed database.
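
For example, given a hypothetical reviews(product_url, rating) table scraped alongside products, a JOIN plus sqlite3.Row gives dict-style access without the zip dance above:

import sqlite3

con = sqlite3.connect("scrape.db")
con.row_factory = sqlite3.Row  # rows support row["name"] access
cur = con.cursor()

cur.execute("""
  SELECT p.name, COUNT(r.rating) AS n_reviews, AVG(r.rating) AS avg_rating
  FROM products p
  LEFT JOIN reviews r ON r.product_url = p.source_url
  GROUP BY p.id
  ORDER BY avg_rating DESC
""")
for row in cur.fetchall():
  print(row["name"], row["n_reviews"], row["avg_rating"])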

Indexes, when scrapes get big

cur.execute("CREATE INDEX IF NOT EXISTS idx_products_category ON products(category)")
cur.execute("CREATE INDEX IF NOT EXISTS idx_products_scraped_at ON products(scraped_at)")

Without indexes, every WHERE clause does a full table scan. For tables over ~10,000 rows, this matters. Index any column you filter or order by frequently.
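
To confirm an index is actually being used, ask SQLite for its query plan. Continuing with the cur from the snippets above:

# EXPLAIN QUERY PLAN reports the chosen strategy without executing the query.
cur.execute(
  "EXPLAIN QUERY PLAN SELECT * FROM products WHERE category = ?", ("kitchen",)
)
print(cur.fetchall())
# Expect something like 'SEARCH products USING INDEX idx_products_category'
# rather than 'SCAN products'.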

PHP, the basics

$db = new PDO('sqlite:scrape.db');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('
  CREATE TABLE IF NOT EXISTS products (
  id  INTEGER PRIMARY KEY AUTOINCREMENT,
  source_url  TEXT NOT NULL UNIQUE,
  name  TEXT,
  price  REAL,
  scraped_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
  )
');

$stmt = $db->prepare(
  'INSERT INTO products (source_url, name, price) VALUES (:url, :name, :price)'
);
$stmt->execute([
  ':url'  => 'https://practice.scrapingcentral.com/products/yellow-mug',
  ':name'  => 'Yellow mug',
  ':price' => 14.99,
]);

PDO is the standard PHP database interface. The SQL is identical to Python's; just different binding syntax.

PHP UPSERT and bulk

$db->beginTransaction();
$stmt = $db->prepare('
  INSERT INTO products (source_url, name, price)
  VALUES (:url, :name, :price)
  ON CONFLICT(source_url) DO UPDATE SET
  name = excluded.name,
  price = excluded.price,
  scraped_at = CURRENT_TIMESTAMP
');
foreach ($rows as $row) {
  $stmt->execute([':url' => $row['url'], ':name' => $row['name'], ':price' => $row['price']]);
}
$db->commit();

beginTransaction + bulk binds + commit, same speed-up pattern as Python.

Concurrent writes

SQLite handles many readers but only one writer at a time. If you have multiple scraper workers writing concurrently, they'll serialize at the database layer. Two options:

  1. Single writer process. All workers send results to a queue and one process drains it (see the sketch below).
  2. WAL mode, enable Write-Ahead Logging for better concurrency:
con.execute("PRAGMA journal_mode=WAL")

WAL allows reads to proceed even while a write is in progress, a significant boost for mixed read/write workloads. Pair it with a busy timeout (for example sqlite3.connect("scrape.db", timeout=30)) so a blocked writer waits instead of immediately raising "database is locked".
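
Option 1 can be as small as one queue and one drain loop. A minimal multiprocessing sketch, assuming the products table from earlier already exists (the names and the single final commit are illustrative):

import sqlite3
from multiprocessing import Process, Queue

def writer(q):
  """The only process that touches the database; workers just q.put() results."""
  con = sqlite3.connect("scrape.db")
  con.execute("PRAGMA journal_mode=WAL")
  while True:
    item = q.get()
    if item is None:  # sentinel: all workers are finished
      break
    con.execute(
      "INSERT INTO products (source_url, name, price) VALUES (?, ?, ?) "
      "ON CONFLICT(source_url) DO UPDATE SET name=excluded.name, price=excluded.price",
      item,
    )
  con.commit()  # one commit at the end; commit every N items for long runs
  con.close()

if __name__ == "__main__":
  q = Queue()
  w = Process(target=writer, args=(q,))
  w.start()
  q.put(("https://example.com/p/1", "Example", 9.99))  # workers would do this
  q.put(None)
  w.join()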

Querying from the CLI

sqlite3 scrape.db "SELECT category, COUNT(*) FROM products GROUP BY category"

Or interactively:

sqlite3 scrape.db
sqlite> .schema
sqlite> .tables
sqlite> SELECT * FROM products LIMIT 5;
sqlite> .quit

A scraper that writes to SQLite gives you an interactive analysis tool for free. No exporting needed.
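
The CLI exports too; -header and -csv are standard sqlite3 flags:

sqlite3 -header -csv scrape.db "SELECT * FROM products" > products.csv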

Exporting to JSON or CSV

import json

cur.execute("SELECT * FROM products")
cols = [d[0] for d in cur.description]
rows = [dict(zip(cols, r)) for r in cur.fetchall()]

with open("export.json", "w") as f:
  json.dump(rows, f, indent=2)
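
For the CSV side, the standard csv module does the same job (export.csv is an arbitrary name):

import csv
import sqlite3

con = sqlite3.connect("scrape.db")
cur = con.execute("SELECT * FROM products")
cols = [d[0] for d in cur.description]

with open("export.csv", "w", newline="") as f:
  writer = csv.writer(f)
  writer.writerow(cols)  # header row from the cursor description
  writer.writerows(cur)  # the cursor iterates lazily, row by row
con.close()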

The SQLite database is your source of truth; exports are derivative.

When SQLite isn't enough

  • Many concurrent writers (>10 processes hammering the same file).
  • Multi-machine scrape farm.
  • Datasets > ~20-50GB where performance falls off.
  • Heavy schema migrations on live data (better tooling in Postgres).

In all these cases, switch to PostgreSQL. The SQL you wrote for SQLite is mostly portable.

Hands-on lab

Scrape /products (first 5 pages), save to a SQLite database with a proper schema (UNIQUE on source URL, timestamps, real types). Use UPSERT so re-running doesn't duplicate rows. Then run sqlite3 scrape.db "SELECT category, AVG(price) FROM products GROUP BY category" and confirm the aggregate matches what you'd expect from inspecting the site.

Practice this lesson on Catalog108, our first-party scraping sandbox. Lab target: /products.
