
1.34 · Intermediate · 6 min read

Resumable Scraping with Checkpoints

Real scrapers crash, get killed, or are politely stopped mid-run. Resume from where you left off, without re-downloading or duplicating, using checkpoints and state files.

What you’ll learn

  • Track scraper progress in a state file or database.
  • Design a queue-of-URLs pattern with status flags.
  • Skip already-processed URLs on restart.
  • Handle partial failures cleanly without losing progress.

A scraper that has to start from scratch every time isn't a scraper; it's a script. Real scrape runs need to survive crashes, network blips, manual interruptions, and OS reboots. The pattern: track what's done, skip it on restart.

The minimum viable checkpoint

Pick a state store, write to it as you go, read from it on startup. Three reasonable stores:

Store                                            Best for
Plain file (one URL per line)                    Simple, append-friendly
JSONL output (already has every completed row)   Output IS the state
SQLite                                           Status flags per URL, queryable

For most projects, SQLite wins: it's queryable, transactional, and the same database can store your scraped data.

SQLite-backed queue

import sqlite3

con = sqlite3.connect("scrape.db")
cur = con.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS scrape_queue (
        url         TEXT PRIMARY KEY,
        status      TEXT NOT NULL DEFAULT 'pending',  -- pending | in_progress | done | failed
        attempts    INTEGER NOT NULL DEFAULT 0,
        error       TEXT,
        scraped_at  TIMESTAMP
    )
""")
con.commit()

# Seed the queue with the 50 URLs to scrape
urls = [f"https://practice.scrapingcentral.com/products?page={i}" for i in range(1, 51)]
cur.executemany(
    "INSERT OR IGNORE INTO scrape_queue (url) VALUES (?)",
    [(u,) for u in urls],
)
con.commit()

INSERT OR IGNORE is key: it makes seeding idempotent, so re-running the seed on restart adds nothing.
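
To see the idempotence, run the exact same seed a second time and count rows; a minimal sketch against the table above:

# Re-running the seed is a no-op: every URL already exists,
# so INSERT OR IGNORE skips all of them.
cur.executemany(
    "INSERT OR IGNORE INTO scrape_queue (url) VALUES (?)",
    [(u,) for u in urls],
)
con.commit()

cur.execute("SELECT COUNT(*) FROM scrape_queue")
print(cur.fetchone()[0])  # still 50, not 100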

The main loop

def scrape_one(url):
    # Your existing scrape logic; raises on failure
    ...
    return data

while True:
    cur.execute("SELECT url FROM scrape_queue WHERE status = 'pending' LIMIT 1")
    row = cur.fetchone()
    if not row:
        break
    (url,) = row

    cur.execute("UPDATE scrape_queue SET status='in_progress' WHERE url=?", (url,))
    con.commit()

    try:
        data = scrape_one(url)
        # Save data wherever it goes
        save(data)
        cur.execute(
            "UPDATE scrape_queue SET status='done', scraped_at=CURRENT_TIMESTAMP WHERE url=?",
            (url,),
        )
    except Exception as e:
        cur.execute(
            "UPDATE scrape_queue SET status='failed', attempts=attempts+1, error=? WHERE url=?",
            (str(e), url),
        )
    con.commit()

This is the canonical shape. Now a restart picks up at the next pending row.
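
On a restart it's handy to see where things stand before the loop runs; a quick status breakdown against the same table:

# Counts per status, e.g. {'done': 31, 'failed': 2, 'pending': 17}
cur.execute("SELECT status, COUNT(*) FROM scrape_queue GROUP BY status")
print(dict(cur.fetchall()))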

Retrying failed rows

# After the main loop: retry failures up to N times
cur.execute("""
  UPDATE scrape_queue SET status='pending'
  WHERE status='failed' AND attempts < 3
""")
con.commit()
# Then re-enter the main loop

This gives you bounded retry semantics: a failed URL is retried until it has accumulated N attempts, after which it stays marked as failed.
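
The snippet above does a single reset. If you want several retry passes with a pause between them, a sketch along these lines works (the pause and the run_main_loop() wrapper are assumptions, not part of the code above):

import time

for _ in range(3):
    cur.execute("""
        UPDATE scrape_queue SET status='pending'
        WHERE status='failed' AND attempts < 3
    """)
    con.commit()
    if cur.rowcount == 0:
        break              # nothing left worth retrying
    time.sleep(30)         # cool down before the next pass
    run_main_loop()        # hypothetical: the while-loop above wrapped in a function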

A simpler file-based checkpoint

If SQLite feels heavy for a one-off, append completed URLs to a file:

done_path = "done.txt"

# Load
try:
    with open(done_path) as f:
        done = set(line.strip() for line in f)
except FileNotFoundError:
    done = set()

# Scrape and append
with open(done_path, "a") as out:
    for url in urls:
        if url in done:
            continue
        data = scrape_one(url)
        save(data)
        out.write(url + "\n")
        out.flush()
        done.add(url)

flush() is essential: without it, Python buffers writes in memory and you'll lose state on a crash.
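
Note that flush() only empties Python's buffer; the OS may still be holding the data. That's enough to survive a crashing process, but if you also want to survive power loss, add an fsync. A sketch of the write step above:

import os

out.write(url + "\n")
out.flush()                # push Python's buffer to the OS
os.fsync(out.fileno())     # ask the OS to push it to disk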

State = output: the JSONL trick

If your output is JSONL and you embed the source URL in each row, the output file itself is your state:

import json

done_urls = set()
try:
    with open("output.jsonl") as f:
        for line in f:
            row = json.loads(line)
            done_urls.add(row["source_url"])
except FileNotFoundError:
    pass

with open("output.jsonl", "a") as out:
    for url in urls:
        if url in done_urls:
            continue
        row = scrape_one(url)           # each row is assumed to include "source_url"
        out.write(json.dumps(row) + "\n")
        out.flush()

No separate state file. Output IS state. This is the simplest pattern for scrapers where every row maps to one URL.

PHP version

$db = new PDO('sqlite:scrape.db');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("
    CREATE TABLE IF NOT EXISTS scrape_queue (
        url         TEXT PRIMARY KEY,
        status      TEXT DEFAULT 'pending',
        attempts    INTEGER DEFAULT 0,
        error       TEXT,
        scraped_at  TIMESTAMP
    )
");

// Seed
$urls = array_map(fn($i) => "https://practice.scrapingcentral.com/products?page=$i", range(1, 50));
$stmt = $db->prepare('INSERT OR IGNORE INTO scrape_queue (url) VALUES (?)');
$db->beginTransaction();
foreach ($urls as $url) { $stmt->execute([$url]); }
$db->commit();

// Main loop
while (true) {
    $row = $db->query("SELECT url FROM scrape_queue WHERE status='pending' LIMIT 1")->fetch(PDO::FETCH_ASSOC);
    if (!$row) break;

    $url = $row['url'];
    $db->prepare("UPDATE scrape_queue SET status='in_progress' WHERE url=?")->execute([$url]);

    try {
        $data = scrape_one($url);
        save($data);
        $db->prepare("UPDATE scrape_queue SET status='done', scraped_at=CURRENT_TIMESTAMP WHERE url=?")
           ->execute([$url]);
    } catch (Throwable $e) {
        $db->prepare("UPDATE scrape_queue SET status='failed', attempts=attempts+1, error=? WHERE url=?")
           ->execute([$e->getMessage(), $url]);
    }
}

Handling Ctrl-C cleanly

When the user kills your scraper, you want the current row marked properly (not stuck in in_progress forever):

import signal

interrupted = False

def handle_sigint(sig, frame):
    global interrupted
    interrupted = True
    print("\nInterrupt received; finishing current row then exiting.")

signal.signal(signal.SIGINT, handle_sigint)

while not interrupted:
    # ... main loop body ...
    pass

A graceful interrupt finishes the current URL (writing its result) then exits, leaving the queue consistent.

If a scraper dies mid-row (no graceful interrupt, e.g. kill -9 or power loss), rows can be left stuck in in_progress. Add a startup recovery step:

# On startup: reset stuck rows
cur.execute("UPDATE scrape_queue SET status='pending' WHERE status='in_progress'")
con.commit()

This makes the scraper self-healing.
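
Resetting every in_progress row is fine for a single process, but with several workers (next section) it would steal rows a live worker is still holding. A sketch of a gentler sweep, assuming you add a started_at column and stamp it with datetime('now') when claiming a row:

# Only reclaim rows that have been in_progress for longer than a cutoff.
cur.execute("""
    UPDATE scrape_queue SET status='pending'
    WHERE status='in_progress'
      AND started_at < datetime('now', '-15 minutes')
""")
con.commit()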

Concurrency

For multi-worker scrapers reading from the same queue:

# Each worker:
cur.execute("BEGIN IMMEDIATE")  # acquire write lock
cur.execute("SELECT url FROM scrape_queue WHERE status='pending' LIMIT 1")
row = cur.fetchone()
if row:
    cur.execute("UPDATE scrape_queue SET status='in_progress' WHERE url=?", (row[0],))
con.commit()
# ... work ...

BEGIN IMMEDIATE prevents two workers from grabbing the same URL. SQLite serializes writes, though; with hundreds of workers this becomes a bottleneck, and you'd want Redis or Postgres with SELECT ... FOR UPDATE SKIP LOCKED.
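
For reference, the claim step in Postgres looks like this. A sketch assuming psycopg2 and a scrape_queue table with the same columns; SKIP LOCKED lets many workers pull rows concurrently without blocking each other:

import psycopg2

pg = psycopg2.connect("dbname=scrape")   # hypothetical connection string
with pg, pg.cursor() as c:
    c.execute("""
        UPDATE scrape_queue
           SET status = 'in_progress'
         WHERE url = (
             SELECT url FROM scrape_queue
             WHERE status = 'pending'
             LIMIT 1
             FOR UPDATE SKIP LOCKED
         )
        RETURNING url
    """)
    claimed = c.fetchone()   # None once the queue is drained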

What to checkpoint, and what not to

Checkpoint:

  • URLs / IDs that succeeded or failed
  • Pagination cursors at a coarse grain ("done up to page 47")
  • Auth state (cookies, tokens), to avoid re-logging-in every restart

Don't checkpoint:

  • Every intermediate variable in every loop
  • Partial HTML responses (just refetch on resume)
  • Random ephemeral state

The rule of thumb: checkpoint anything expensive to recompute (HTTP fetches, complex parses, login flows).
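
The auth-state item above deserves its own mini-checkpoint. A sketch assuming the requests library and a hypothetical login() helper: persist cookies to disk so a restart skips the login flow.

import json, os, requests

session = requests.Session()

if os.path.exists("cookies.json"):
    # Resume: restore the cookie jar saved by a previous run
    with open("cookies.json") as f:
        session.cookies = requests.utils.cookiejar_from_dict(json.load(f))
else:
    login(session)  # hypothetical: performs the expensive login flow
    with open("cookies.json", "w") as f:
        json.dump(requests.utils.dict_from_cookiejar(session.cookies), f)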

Hands-on lab

Build a resumable scrape over /products?page=N for 50 pages. Use SQLite for the queue. Run it and, halfway through, kill the process with Ctrl-C. Restart it and confirm it picks up at the right page without re-downloading anything already done. Then deliberately introduce a fault (e.g., raise an exception on one page) and confirm the queue marks that page as failed without halting the rest.
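
One way to inject the deliberate fault (an illustration, not part of the lab target): make scrape_one() raise on a single page and watch the queue record the failure while the other pages complete.

def scrape_one(url):
    if url.endswith("page=25"):        # arbitrary page chosen for the test
        raise RuntimeError("injected fault for testing")
    ...  # normal scrape logic
    return data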
