SQLite for Embedded Scraper Storage
SQLite is the perfect scraper backend: zero-config, file-based, queryable. Skip CSV/JSON for scrapes you'll re-run or query.
What you’ll learn
- Create a SQLite database from Python or PHP.
- Define schema for scraped records, including timestamps and unique keys.
- Insert and upsert efficiently in bulk.
- Query the data for analysis without leaving SQLite.
Once a scraper grows past one-shot runs, CSV/JSON outputs become painful: deduplication, incremental updates, "show me everything I scraped on Tuesday." SQLite solves all of this in a single file with zero setup.
Why SQLite for scrapers
- Zero config. Single file on disk. No server. Ships with Python and PHP.
- Real SQL. SELECT, JOIN, GROUP BY, the same queries you'd run in Postgres.
- Concurrent reads. Multiple processes can read at once.
- Transactional. Crashes don't corrupt the file (with journaling enabled, which is the default).
- Portable. Move the file; the database moves with it.
For scrapers under ~10GB and ~100 writes/second, SQLite is genuinely the right answer. Beyond that, look at Postgres.
Python, the basics
import sqlite3
con = sqlite3.connect("scrape.db")
cur = con.cursor()
cur.execute("""
CREATE TABLE IF NOT EXISTS products (
id INTEGER PRIMARY KEY AUTOINCREMENT,
source_url TEXT NOT NULL UNIQUE,
name TEXT,
price REAL,
category TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
con.commit()
cur.execute(
"INSERT INTO products (source_url, name, price, category) VALUES (?, ?, ?, ?)",
("https://practice.scrapingcentral.com/products/yellow-mug", "Yellow mug", 14.99, "kitchen"),
)
con.commit()
con.close()
Three things every scraper schema should have:
- A source_url (or source-key) column with UNIQUE. This is your dedup anchor.
- A scraped_at timestamp. When did this row come in?
- Real types: REAL for prices, INTEGER for counts. Don't dump everything as TEXT.
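The UNIQUE constraint is what makes dedup actually enforceable: a second insert of the same source_url raises an error instead of silently duplicating the row. A minimal sketch using an in-memory database (the table and URL are illustrative, not from the lab site):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        source_url TEXT NOT NULL UNIQUE,
        name TEXT
    )
""")

con.execute("INSERT INTO products (source_url, name) VALUES (?, ?)",
            ("https://example.com/p/1", "Widget"))

# A second insert with the same source_url violates the UNIQUE constraint
try:
    con.execute("INSERT INTO products (source_url, name) VALUES (?, ?)",
                ("https://example.com/p/1", "Widget again"))
    duplicated = True
except sqlite3.IntegrityError:
    duplicated = False

row_count = con.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

The IntegrityError is your cue to either skip the row or switch to an UPSERT, covered below.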
Parameterised queries, always
Never:
cur.execute(f"INSERT INTO products VALUES ('{name}')") # SQL injection waiting to happen
Always:
cur.execute("INSERT INTO products (name) VALUES (?)", (name,))
Even for trusted scraper data, the field could contain quotes that break your SQL. Parameterised queries are universal best practice.
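To see the quote problem concretely, insert a value containing both single and double quotes; bound parameters round-trip it untouched (in-memory table for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (name TEXT)")

# Quotes like these would break SQL built with string formatting
name = "O'Reilly's \"scraper\" mug"
con.execute("INSERT INTO products (name) VALUES (?)", (name,))

stored = con.execute("SELECT name FROM products").fetchone()[0]
```

The driver passes the value to SQLite out-of-band, so there is nothing to escape and nothing to inject.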
UPSERT, insert or update
When re-scraping, you want "insert if new, update if changed":
cur.execute("""
INSERT INTO products (source_url, name, price)
VALUES (?, ?, ?)
ON CONFLICT(source_url) DO UPDATE SET
name = excluded.name,
price = excluded.price,
scraped_at = CURRENT_TIMESTAMP
""", (url, name, price))
ON CONFLICT(...) DO UPDATE is SQLite's (and PostgreSQL's) upsert syntax. Works since SQLite 3.24 (2018). The excluded.X references the values you tried to insert.
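A quick way to convince yourself it behaves as described: run the same upsert twice with a changed price and check that you end up with one row carrying the newer value. A self-contained sketch (in-memory database, illustrative data; your bundled SQLite needs to be 3.24 or newer, which any recent Python build satisfies):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE products (
        source_url TEXT PRIMARY KEY,
        name TEXT,
        price REAL
    )
""")

upsert = """
    INSERT INTO products (source_url, name, price)
    VALUES (?, ?, ?)
    ON CONFLICT(source_url) DO UPDATE SET
        name = excluded.name,
        price = excluded.price
"""
con.execute(upsert, ("https://example.com/p/1", "Widget", 9.99))   # first scrape
con.execute(upsert, ("https://example.com/p/1", "Widget", 12.49))  # re-scrape, price changed

count, price = con.execute("SELECT COUNT(*), MAX(price) FROM products").fetchone()
```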
Bulk insert, much faster
rows = list(scraped_rows)  # scraped_rows: an iterable of (url, name, price) tuples
con.execute("BEGIN")
cur.executemany(
"INSERT OR REPLACE INTO products (source_url, name, price) VALUES (?, ?, ?)",
rows,
)
con.commit()
Two changes from naive per-row inserts:
- executemany batches the binding.
- An explicit transaction (BEGIN ... COMMIT); without this, SQLite commits per row, ~100x slower.
For 10,000 rows: ~10 seconds without batching, ~0.1 seconds with. Always batch.
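If you want to measure the difference on your own machine, here is a rough benchmark sketch; exact numbers depend heavily on disk and OS, so treat them as relative, not absolute:

```python
import os
import sqlite3
import tempfile
import time

rows = [(f"https://example.com/p/{i}", f"Item {i}", float(i)) for i in range(1000)]

def insert_all(batch: bool) -> float:
    """Insert all rows into a fresh file-based DB; return elapsed seconds."""
    path = os.path.join(tempfile.mkdtemp(), "bench.db")
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE products (source_url TEXT, name TEXT, price REAL)")
    con.commit()
    start = time.perf_counter()
    if batch:
        with con:  # one transaction wrapping the whole batch
            con.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)
    else:
        for r in rows:  # one commit (and one journal sync) per row
            con.execute("INSERT INTO products VALUES (?, ?, ?)", r)
            con.commit()
    elapsed = time.perf_counter() - start
    assert con.execute("SELECT COUNT(*) FROM products").fetchone()[0] == len(rows)
    con.close()
    return elapsed

slow = insert_all(batch=False)
fast = insert_all(batch=True)
print(f"per-row: {slow:.3f}s  batched: {fast:.3f}s")
```

Note that `with con:` is the connection's built-in transaction context manager, so you don't need an explicit BEGIN.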
Querying the data
cur.execute("SELECT category, COUNT(*), AVG(price) FROM products GROUP BY category")
for row in cur.fetchall():
print(row)
cur.execute("SELECT * FROM products WHERE price < 20 ORDER BY price DESC LIMIT 5")
for row in cur.fetchall():
print(dict(zip([d[0] for d in cur.description], row)))
Real SQL. Grouping, aggregation, JOINs across multiple tables you might create (products + reviews + categories), all native, all fast on a properly-indexed database.
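The dict(zip(...)) idiom above works, but sqlite3 also ships a row factory that gives you name-addressable rows directly; a small sketch with illustrative data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.row_factory = sqlite3.Row   # rows become addressable by column name
con.execute("CREATE TABLE products (name TEXT, price REAL)")
con.execute("INSERT INTO products VALUES (?, ?)", ("Yellow mug", 14.99))

row = con.execute("SELECT name, price FROM products").fetchone()
price = row["price"]       # access by column name
as_dict = dict(row)        # sqlite3.Row converts cleanly to a plain dict
```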
Indexes, when scrapes get big
cur.execute("CREATE INDEX IF NOT EXISTS idx_products_category ON products(category)")
cur.execute("CREATE INDEX IF NOT EXISTS idx_products_scraped_at ON products(scraped_at)")
Without indexes, every WHERE clause does a full table scan. For tables over ~10,000 rows, this matters. Index any column you filter or order by frequently.
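You can verify a query actually uses an index with EXPLAIN QUERY PLAN. A sketch (the plan wording varies slightly across SQLite versions, but the index name shows up when it's used):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (name TEXT, category TEXT, price REAL)")
con.execute("CREATE INDEX idx_products_category ON products(category)")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM products WHERE category = ?", ("kitchen",)
).fetchall()
detail = " ".join(row[-1] for row in plan)   # last column holds the plan text
print(detail)
```

If the output says SCAN instead of SEARCH ... USING INDEX, your filter isn't hitting an index.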
PHP, the basics
$db = new PDO('sqlite:scrape.db');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('
CREATE TABLE IF NOT EXISTS products (
id INTEGER PRIMARY KEY AUTOINCREMENT,
source_url TEXT NOT NULL UNIQUE,
name TEXT,
price REAL,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
');
$stmt = $db->prepare(
'INSERT INTO products (source_url, name, price) VALUES (:url, :name, :price)'
);
$stmt->execute([
':url' => 'https://practice.scrapingcentral.com/products/yellow-mug',
':name' => 'Yellow mug',
':price' => 14.99,
]);
PDO is the standard PHP database interface. The SQL is identical to Python's; just different binding syntax.
PHP UPSERT and bulk
$db->beginTransaction();
$stmt = $db->prepare('
INSERT INTO products (source_url, name, price)
VALUES (:url, :name, :price)
ON CONFLICT(source_url) DO UPDATE SET
name = excluded.name,
price = excluded.price,
scraped_at = CURRENT_TIMESTAMP
');
foreach ($rows as $row) {
$stmt->execute([':url' => $row['url'], ':name' => $row['name'], ':price' => $row['price']]);
}
$db->commit();
beginTransaction + bulk binds + commit, same speed-up pattern as Python.
Concurrent writes
SQLite handles many readers but only one writer at a time. If you have multiple scraper workers writing concurrently, they'll serialize at the database layer. Two options:
- Single writer process, all workers send results to a queue, one process drains it.
- WAL mode, enable Write-Ahead Logging for better concurrency:
con.execute("PRAGMA journal_mode=WAL")
WAL allows reads to proceed even while a write is in progress. Significant performance boost for mixed read/write workloads.
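The single-writer option can be sketched with threads and a queue; workers only produce results, and one thread owns the only writing connection. Worker count, table, and URLs below are illustrative:

```python
import os
import queue
import sqlite3
import tempfile
import threading

db_path = os.path.join(tempfile.mkdtemp(), "scrape.db")
results: queue.Queue = queue.Queue()
STOP = object()   # sentinel each worker sends when done

def worker(worker_id: int) -> None:
    # Stand-in for a scraper worker: produce results instead of touching the DB
    for i in range(10):
        results.put((f"https://example.com/w{worker_id}/p{i}", f"Item {i}"))
    results.put(STOP)

def writer(n_workers: int) -> None:
    con = sqlite3.connect(db_path)   # the ONLY connection that ever writes
    con.execute("CREATE TABLE products (source_url TEXT UNIQUE, name TEXT)")
    stops = 0
    while stops < n_workers:
        item = results.get()
        if item is STOP:
            stops += 1
        else:
            con.execute("INSERT OR IGNORE INTO products VALUES (?, ?)", item)
    con.commit()
    con.close()

workers = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
writer_thread = threading.Thread(target=writer, args=(4,))
for t in workers:
    t.start()
writer_thread.start()
for t in workers:
    t.join()
writer_thread.join()

total = sqlite3.connect(db_path).execute(
    "SELECT COUNT(*) FROM products").fetchone()[0]
```

Since only one connection writes, there is no lock contention at all; the queue does the serializing.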
Querying from the CLI
sqlite3 scrape.db "SELECT category, COUNT(*) FROM products GROUP BY category"
Or interactively:
sqlite3 scrape.db
sqlite> .schema
sqlite> .tables
sqlite> SELECT * FROM products LIMIT 5;
sqlite> .quit
A scraper that writes to SQLite gives you an interactive analysis tool for free. No exporting needed.
Exporting to JSON or CSV
import json
cur.execute("SELECT * FROM products")
cols = [d[0] for d in cur.description]
rows = [dict(zip(cols, r)) for r in cur.fetchall()]
with open("export.json", "w") as f:
json.dump(rows, f, indent=2)
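CSV export follows the same pattern with the stdlib csv module. A self-contained sketch (the in-memory table with one sample row stands in for your scrape.db):

```python
import csv
import sqlite3

con = sqlite3.connect(":memory:")   # stands in for sqlite3.connect("scrape.db")
con.execute("CREATE TABLE products (source_url TEXT, name TEXT, price REAL)")
con.execute("INSERT INTO products VALUES (?, ?, ?)",
            ("https://example.com/p/1", "Yellow mug", 14.99))

cur = con.execute("SELECT * FROM products")
cols = [d[0] for d in cur.description]

with open("export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(cols)   # header row from the cursor metadata
    writer.writerows(cur)   # the cursor itself iterates the remaining rows
```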
The SQLite database is your source of truth; exports are derivative.
When SQLite isn't enough
- Many concurrent writers (>10 processes hammering the same file).
- Multi-machine scrape farm.
- Datasets > ~20-50GB where performance falls off.
- Heavy schema migrations on live data (better tooling in Postgres).
In all these cases, switch to PostgreSQL. The SQL you wrote for SQLite is mostly portable.
Hands-on lab
Scrape /products (first 5 pages), save to a SQLite database with a proper schema (UNIQUE on source URL, timestamps, real types). Use UPSERT so re-running doesn't duplicate rows. Then run sqlite3 scrape.db "SELECT category, AVG(price) FROM products GROUP BY category" and confirm the aggregate matches what you'd expect from inspecting the site.
Practice this lesson on Catalog108, our first-party scraping sandbox (lab target: /products).
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.