Incident Runbooks for Scrapers
A runbook is the checklist you wish you had at 3am. This lesson covers the structure, the most common scraper incidents, and the runbook entries that cut mean time to recover.
What you’ll learn
- Write a runbook entry that a sleepy on-call can follow.
- Catalog the five most common scraper incident types.
- Build the muscle memory: triage, mitigate, fix, postmortem.
A runbook is a per-alert, per-incident playbook. It exists because at 3am, the human on call isn't going to remember the right CLI flag, and the team's most senior engineer is asleep. A good runbook makes "I'm not the expert" matter less.
The runbook shape
For each known incident, document:
## Alert: HighScrapeFailureRate
### Symptom
Failure rate >10% for 10+ minutes. Slack-notified to #scraping-oncall.
### Triage (do these in order)
1. Open Grafana → "Scrapers" dashboard. Which spider? Which status codes?
2. If 5xx-heavy, check target's status page (target.com/status, downforeveryone).
3. If 429-heavy, check our request rate vs the published limits.
4. If 403-heavy, check the proxy-region ban panel: single region or all?
### Likely causes and mitigations
- Target site degraded (5xx): pause writes for 30 min, monitor for resolution. No code change.
- We're rate-limited (429): reduce `CONCURRENT_REQUESTS_PER_DOMAIN` by 50%; redeploy.
- Proxy region banned (403): rotate proxies via `kubectl set env deployment/scraper PROXY_POOL=backup`.
### Rollback
If you just deployed: `kubectl rollout undo deployment/scraper`.
### Escalation
If unresolved in 30 min, page the owning team (#scraping-oncall + @scraping-lead in PagerDuty).
### Postmortem checklist
- Was this alert symptom or cause? Update wording if cause-based.
- New mitigation step learned? Add to triage list.
- New automated remediation worth building? File ticket.
Six sections, each short. Optimised for skim, not study.
The five common scraper incidents
1. Target site changed HTML
Signal: Items/sec drops to ~0 while request rate is steady.
Triage: Pick a recent successful URL from logs. Curl it. Diff against an older saved snapshot. Run the spider against the page locally and inspect what the parser extracted.
Mitigation: Hotfix the selector. Most teams keep a "selector fallback chain": try the new selector, fall back to the old one, and log which matched (see the sketch below). While you patch, let downstream consumers know to expect stale data.
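A fallback chain can live entirely in the parse function. A minimal sketch, assuming Scrapy-style `response.css()`; the selectors are hypothetical placeholders, not any real site's markup:

```python
import logging

logger = logging.getLogger(__name__)

# Ordered newest-first: the hotfixed selector, then the one it replaced.
TITLE_SELECTORS = [
    "h1.product-title::text",   # selector matching the new markup
    "h1#title::text",           # old selector, kept as a fallback
]

def extract_title(response):
    """Try each selector in order and log which one matched, so selector
    drift shows up in the logs instead of silently dropping fields."""
    for selector in TITLE_SELECTORS:
        value = response.css(selector).get()
        if value:
            logger.info("title matched via %r", selector)
            return value.strip()
    logger.warning("title: no selector matched for %s", response.url)
    return None
```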
2. Proxy pool degraded
Signal: 403 / 429 spike. Proxy success rate panel shows one pool collapsing.
Triage: Confirm with the proxy provider's status page. Test the pool directly with `curl --proxy ...`.
Mitigation: Failover to a backup pool. Most providers offer multiple pools; configure the scraper to switch via env var without redeploy.
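Failover is only env-var-fast if the scraper resolves the pool at request time rather than hard-coding it. A minimal sketch, assuming the `PROXY_POOL` variable from the runbook entry above; the pool URLs are hypothetical:

```python
import os
import random

# Hypothetical pools; in practice the URLs come from your proxy provider.
PROXY_POOLS = {
    "primary": ["http://proxy-a.example:8080", "http://proxy-b.example:8080"],
    "backup": ["http://proxy-c.example:8080", "http://proxy-d.example:8080"],
}

def pick_proxy():
    """Resolve PROXY_POOL at call time, so switching pools is an env-var
    change (e.g. via `kubectl set env`) rather than a code change."""
    pool_name = os.environ.get("PROXY_POOL", "primary")
    pool = PROXY_POOLS.get(pool_name, PROXY_POOLS["primary"])
    return random.choice(pool)
```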
3. Database / queue down
Signal: Writes erroring. Queue depth ballooning. Worker logs show connection refused.
Triage: Is the DB host reachable? PgBouncer / Redis up? Disk full?
Mitigation: If the DB is recovering, workers can buffer to local disk or pause cleanly. Plan for this: don't let workers crash on transient DB unavailability; back off and retry instead (a sketch follows).
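A minimal sketch of that behaviour, assuming a caller-supplied `write_item` function that raises `ConnectionError` while the database is down; the spill-file path is a placeholder:

```python
import json
import time

MAX_RETRIES = 5
SPILL_FILE = "items.spill.jsonl"   # placeholder path for the local buffer

def write_with_backoff(write_item, item):
    """Retry transient DB failures with exponential backoff; if the outage
    outlasts the retries, spill the item to local disk instead of crashing."""
    delay = 1.0
    for _ in range(MAX_RETRIES):
        try:
            write_item(item)
            return True
        except ConnectionError:
            time.sleep(delay)
            delay = min(delay * 2, 60)   # cap backoff at 60 seconds
    with open(SPILL_FILE, "a", encoding="utf-8") as f:
        f.write(json.dumps(item) + "\n")
    return False
```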
4. Memory leak / runaway worker
Signal: Worker pod restarted by OOMKiller; latency spike before restart.
Triage: Check memory graph. Did the leak start after a deploy?
Mitigation: Kubernetes will restart the pod; while it does, capacity drops temporarily. Set memory requests/limits and restart policies so the blast radius is bounded.
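If the workers are Scrapy spiders, the in-process counterpart to the pod's memory limit is Scrapy's built-in memusage extension, which can close the spider cleanly before the kernel OOM-kills it. The thresholds below are placeholders:

```python
# settings.py: an in-process memory bound, complementary to the pod's limit.
# Keep MEMUSAGE_LIMIT_MB below the container limit so the spider shuts down
# cleanly instead of being OOM-killed.
MEMUSAGE_ENABLED = True
MEMUSAGE_WARNING_MB = 1536   # log a warning at ~1.5 GiB of resident memory
MEMUSAGE_LIMIT_MB = 2048     # close the spider at 2 GiB
```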
5. Captcha wall
Signal: Sudden ban rate spike, often after weeks of clean traffic; specific URLs return captcha pages.
Triage: Check what changed: new IP block, fingerprint, missed challenge.
Mitigation: Slow down. Rotate IPs. Engage the captcha solver if you have one. If it's a serious wall, this is a project, not a runbook fix.
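The part worth automating is recognising a captcha page at all, because captcha walls often come back with a 200 status and only the body gives them away. A minimal sketch with hypothetical marker strings; real detection depends on the target:

```python
# Hypothetical markers; tune per target. Status codes alone will not catch
# captcha walls that return HTTP 200.
CAPTCHA_MARKERS = (
    "g-recaptcha",
    "cf-challenge",
    "verify you are a human",
)

def looks_like_captcha(body: str) -> bool:
    """Cheap content check so captcha pages count as failures in metrics."""
    lowered = body.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```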
Incident response phases
Detect → Triage → Mitigate → Fix → Postmortem
- Detect is automatic: the alerts fire.
- Triage is the runbook's triage section. Goal: identify the rough cause in 5 minutes.
- Mitigate is "stop the bleeding." Not necessarily a permanent fix.
- Fix comes after mitigation, often the next day, in code.
- Postmortem is the meta: what could we have caught earlier?
Don't conflate mitigate and fix. At 3am, mitigate. Fix in office hours.
Postmortems
For every P1 incident (and important P2s), write a postmortem:
## Incident: 2026-05-12, Catalog108 scraper down 2 hours
### Timeline
- 14:02 UTC: Alert fired. Failure rate 80%.
- 14:08 UTC: On-call confirmed, page acknowledged.
- 14:15 UTC: Identified: Catalog108 rolled out a new login flow that 302'd all our anonymous requests.
- 14:30 UTC: Workaround deployed (skip protected URLs, keep public).
- 16:00 UTC: Real fix deployed.
### Root cause
Target site changed the SSO redirect; our scraper followed redirects into a login wall.
### Contributing factors
- No allowlist of expected response patterns.
- Our 'success' metric was 2xx; because redirects were followed, a 302 into the login flow still counted as success until the follow-up request failed.
### What we got right
- Detection: 4 minutes from change to alert.
- Mitigation: workaround in 22 minutes.
### Action items
- Add a content-validation check (is the page recognisable as the product page?).
- Treat 302-to-login-domain as a failure in metrics.
- Add Catalog108 SSO test to the daily smoke tests.
### Was this preventable?
Yes, if we had had a content-validation check. Filed: SCRAPE-201.
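The first two action items are small code changes. A minimal sketch of both, assuming a response object with `status`, `url`, and `text` attributes; the marker string and login domain are hypothetical:

```python
from urllib.parse import urlparse

LOGIN_DOMAINS = {"sso.catalog108.example"}   # hypothetical login/SSO hosts
PRODUCT_MARKER = "product-detail"            # hypothetical string unique to product pages

def is_valid_product_page(response) -> bool:
    """Content validation: a 200 only counts as success if the body still
    looks like the page we intended to scrape."""
    return response.status == 200 and PRODUCT_MARKER in response.text

def landed_on_login(response) -> bool:
    """Treat a followed redirect whose final URL is on a login/SSO domain as
    a failure in metrics, even though the status code looks healthy."""
    return urlparse(response.url).netloc in LOGIN_DOMAINS
```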
Postmortems are blameless. The goal is system improvement, not punishment.
Runbook hygiene
- Co-locate with alerts. Every alert's `runbook` annotation links to a heading in this doc.
- Update during the incident. When you learn a new triage step in real time, add it before you forget.
- Test runbooks. Game days where you trigger known failure modes; ensure the runbook actually works.
- Delete stale entries. Half-true runbooks are worse than none.
What to write today
Pick the highest-frequency alert in your scraper system. Write a runbook entry for it using the six-section template. Save it somewhere all on-call engineers can reach in <30 seconds (internal wiki, a runbooks/ folder in the repo, Notion). Link the alert's annotation to that URL.
The discipline of writing one runbook entry tells you whether the alert itself is well-defined. If you can't write triage steps because the alert is too vague, fix the alert.