Resumable Scraping with Checkpoints
Real scrapers crash, get killed, or are politely stopped mid-run. This lesson shows how to resume from where you left off, without re-downloading or duplicating work, using checkpoints and state files.
What you’ll learn
- Track scraper progress in a state file or database.
- Design a queue-of-URLs pattern with status flags.
- Skip already-processed URLs on restart.
- Handle partial failures cleanly without losing progress.
A scraper that has to start from scratch every time isn't a scraper; it's a script. Real scrapes need to survive crashes, network blips, manual interruptions, and OS reboots. The pattern: track what's done, skip it on restart.
The minimum viable checkpoint
Pick a state store, write to it as you go, read from it on startup. Three reasonable stores:
| Store | Best for |
|---|---|
| Plain file (one URL per line) | Simple, append-friendly |
| JSONL output (already has every completed row) | Output IS the state |
| SQLite | Status flags per URL, queryable |
For most projects, SQLite wins: it's queryable, transactional, and the same database can hold your scraped data.
SQLite-backed queue
import sqlite3

con = sqlite3.connect("scrape.db")
cur = con.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS scrape_queue (
        url        TEXT PRIMARY KEY,
        status     TEXT NOT NULL DEFAULT 'pending',  -- pending | in_progress | done | failed
        attempts   INTEGER NOT NULL DEFAULT 0,
        error      TEXT,
        scraped_at TIMESTAMP
    )
""")
con.commit()

# Seed the queue with URLs to scrape (pages 1-50)
urls = [f"https://practice.scrapingcentral.com/products?page={i}" for i in range(1, 51)]
cur.executemany(
    "INSERT OR IGNORE INTO scrape_queue (url) VALUES (?)",
    [(u,) for u in urls],
)
con.commit()
INSERT OR IGNORE is key: seeding the queue is idempotent. Restarting doesn't re-seed.
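You can check this directly: re-running the seed inserts nothing and leaves the row count unchanged.
# Run the seed a second time: every INSERT is ignored.
cur.executemany(
    "INSERT OR IGNORE INTO scrape_queue (url) VALUES (?)",
    [(u,) for u in urls],
)
con.commit()
print(cur.execute("SELECT COUNT(*) FROM scrape_queue").fetchone()[0])  # still 50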
The main loop
def scrape_one(url):
    # Your existing scrape logic; raises on failure
    ...
    return data

while True:
    cur.execute("SELECT url FROM scrape_queue WHERE status = 'pending' LIMIT 1")
    row = cur.fetchone()
    if not row:
        break
    (url,) = row

    cur.execute("UPDATE scrape_queue SET status='in_progress' WHERE url=?", (url,))
    con.commit()

    try:
        data = scrape_one(url)
        save(data)  # write the data wherever your output lives
        cur.execute(
            "UPDATE scrape_queue SET status='done', scraped_at=CURRENT_TIMESTAMP WHERE url=?",
            (url,),
        )
    except Exception as e:
        cur.execute(
            "UPDATE scrape_queue SET status='failed', attempts=attempts+1, error=? WHERE url=?",
            (str(e), url),
        )
    con.commit()
This is the canonical shape. Now a restart picks up at the next pending row.
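Because the queue lives in SQLite, progress reporting is one query away, which is handy for checking where a resumed run stands:
# How far along are we?
for status, count in cur.execute(
    "SELECT status, COUNT(*) FROM scrape_queue GROUP BY status"
):
    print(f"{status}: {count}")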
Retrying failed rows
# After the main loop: retry failures up to N times
cur.execute("""
UPDATE scrape_queue SET status='pending'
WHERE status='failed' AND attempts < 3
""")
con.commit()
# Then re-enter the main loop
This gives you bounded retries: a failed URL is re-queued until it has used up its attempt budget (three total attempts, as written), then stays failed for good. There's no cooldown built in; if the target needs one, sleep between passes.
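A minimal sketch of the full retry driver, assuming the main loop above has been wrapped in a run_queue() function (a name introduced here for illustration):
MAX_ATTEMPTS = 3  # total attempts per URL before giving up

def run_passes():
    while True:
        run_queue()  # the while-True loop from the previous section
        # Re-queue failures that still have attempts left
        cur.execute(
            "UPDATE scrape_queue SET status='pending' "
            "WHERE status='failed' AND attempts < ?",
            (MAX_ATTEMPTS,),
        )
        con.commit()
        if cur.rowcount == 0:  # nothing was re-queued: we're done
            break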
A simpler file-based checkpoint
If SQLite feels heavy for a one-off, append completed URLs to a file:
done_path = "done.txt"

# Load previously completed URLs
try:
    with open(done_path) as f:
        done = set(line.strip() for line in f)
except FileNotFoundError:
    done = set()

# Scrape and append
with open(done_path, "a") as out:
    for url in urls:
        if url in done:
            continue
        data = scrape_one(url)
        save(data)
        out.write(url + "\n")
        out.flush()
        done.add(url)
flush() is essential: without it, Python buffers writes and a crash loses recent state. Note that flush() only empties Python's buffer into the OS; the OS may still hold the bytes in its own cache.
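That means flush() protects against a crashed process, not a crashed machine. If the checkpoint must survive power loss too, a hedged variant that forces the bytes to stable storage:
import os

def record_done(out, url):
    out.write(url + "\n")
    out.flush()             # empty Python's buffer into the OS
    os.fsync(out.fileno())  # ask the OS to commit the file to disk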
State = output: the JSONL trick
If your output is JSONL and you embed the source URL in each row, the output file itself is your state:
import json

done_urls = set()
try:
    with open("output.jsonl") as f:
        for line in f:
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a truncated final line from a crash mid-write
            done_urls.add(row["source_url"])
except FileNotFoundError:
    pass

with open("output.jsonl", "a") as out:
    for url in urls:
        if url in done_urls:
            continue
        row = scrape_one(url)  # must set row["source_url"] = url
        out.write(json.dumps(row) + "\n")
        out.flush()
No separate state file. Output IS state. This is the simplest pattern for scrapers where every row maps to one URL.
PHP version
$db = new PDO('sqlite:scrape.db');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("
    CREATE TABLE IF NOT EXISTS scrape_queue (
        url        TEXT PRIMARY KEY,
        status     TEXT NOT NULL DEFAULT 'pending',
        attempts   INTEGER NOT NULL DEFAULT 0,
        error      TEXT,
        scraped_at TIMESTAMP
    )
");

// Seed
$urls = array_map(fn($i) => "https://practice.scrapingcentral.com/products?page=$i", range(1, 50));
$stmt = $db->prepare('INSERT OR IGNORE INTO scrape_queue (url) VALUES (?)');
$db->beginTransaction();
foreach ($urls as $url) {
    $stmt->execute([$url]);
}
$db->commit();

// Main loop
while (true) {
    $row = $db->query("SELECT url FROM scrape_queue WHERE status='pending' LIMIT 1")->fetch(PDO::FETCH_ASSOC);
    if (!$row) break;
    $url = $row['url'];

    $db->prepare("UPDATE scrape_queue SET status='in_progress' WHERE url=?")->execute([$url]);

    try {
        $data = scrape_one($url);
        save($data);
        $db->prepare("UPDATE scrape_queue SET status='done', scraped_at=CURRENT_TIMESTAMP WHERE url=?")
           ->execute([$url]);
    } catch (Throwable $e) {
        $db->prepare("UPDATE scrape_queue SET status='failed', attempts=attempts+1, error=? WHERE url=?")
           ->execute([$e->getMessage(), $url]);
    }
}
Handling Ctrl-C cleanly
When the user kills your scraper, you want the current row marked properly (not stuck in in_progress forever):
import signal

interrupted = False

def handle_sigint(sig, frame):
    global interrupted
    interrupted = True
    print("\nInterrupt received; finishing current row then exiting.")

signal.signal(signal.SIGINT, handle_sigint)

while not interrupted:
    # ... main loop body ...
    pass
A graceful interrupt finishes the current URL (writing its result) then exits, leaving the queue consistent.
If a scraper dies mid-row with no graceful interrupt (kill -9, power loss), rows can be left stuck in in_progress. Add a startup recovery step:
# On startup: reset stuck rows
cur.execute("UPDATE scrape_queue SET status='pending' WHERE status='in_progress'")
con.commit()
This makes the scraper self-healing.
Concurrency
For multi-worker scrapers reading from the same queue:
# Each worker opens its own connection with isolation_level=None, so the
# explicit BEGIN below isn't fought by sqlite3's implicit transactions.
cur.execute("BEGIN IMMEDIATE")  # acquire the write lock up front
cur.execute("SELECT url FROM scrape_queue WHERE status='pending' LIMIT 1")
row = cur.fetchone()
if row:
    cur.execute("UPDATE scrape_queue SET status='in_progress' WHERE url=?", (row[0],))
con.commit()
# ... work ...
BEGIN IMMEDIATE prevents two workers from grabbing the same URL. SQLite serializes writers, though; with many workers the queue itself becomes the bottleneck, at which point you'd move to Redis or Postgres (SELECT ... FOR UPDATE SKIP LOCKED).
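Putting that together, here is a sketch of a claim helper. It assumes each worker opens its own connection with isolation_level=None (autocommit), so the explicit BEGIN IMMEDIATE controls the transaction directly:
import sqlite3

def claim_next_url(con):
    """Atomically claim one pending URL, or return None if the queue is drained."""
    cur = con.cursor()
    cur.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    cur.execute("SELECT url FROM scrape_queue WHERE status='pending' LIMIT 1")
    row = cur.fetchone()
    if row is None:
        con.commit()
        return None
    cur.execute("UPDATE scrape_queue SET status='in_progress' WHERE url=?", (row[0],))
    con.commit()
    return row[0]

worker_con = sqlite3.connect("scrape.db", isolation_level=None)
while (url := claim_next_url(worker_con)) is not None:
    ...  # scrape, save, mark done/failed as in the main loop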
What to checkpoint, and what not to
Checkpoint:
- URLs / IDs that succeeded or failed
- Pagination cursors at a coarse grain ("done up to page 47")
- Auth state (cookies, tokens), so you don't log in again on every restart (see the sketch below)
Don't checkpoint:
- Every intermediate variable in every loop
- Partial HTML responses (just refetch on resume)
- Random ephemeral state
The rule of thumb: checkpoint anything expensive to recompute (HTTP fetches, complex parses, login flows).
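For the auth-state point above, one hedged approach with requests; the cookie file name and the login() helper are illustrative, not part of any library:
import pickle
import requests

COOKIES_PATH = "cookies.pkl"  # illustrative path

session = requests.Session()
try:
    # Reuse the saved session instead of logging in again
    with open(COOKIES_PATH, "rb") as f:
        session.cookies.update(pickle.load(f))
except FileNotFoundError:
    login(session)  # your login flow; hypothetical helper
    with open(COOKIES_PATH, "wb") as f:
        pickle.dump(session.cookies, f)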
Hands-on lab
Build a resumable scrape over /products?page=N for 50 pages, using SQLite for the queue. Run it; halfway through, kill the process with Ctrl-C. Restart, and confirm it picks up at the right page without re-downloading what's already done. Then deliberately introduce a fault (e.g., raise an exception on one page) and confirm the queue marks it as failed without halting the rest.