Incident Runbooks for Scrapers
A runbook is the checklist you wish you had at 3am. This lesson covers the structure, the most common scraper incidents, and the runbook entries that cut mean time to recover.
What you’ll learn
- Write a runbook entry that a sleepy on-call can follow.
- Catalog the five most common scraper incident types.
- Build the muscle memory: triage, mitigate, fix, postmortem.
A runbook is a per-alert, per-incident playbook. It exists because at 3am, the human on call isn't going to remember the right CLI flag, and the team's most senior engineer is asleep. A good runbook makes "I'm not the expert" matter less.
The runbook shape
For each known incident, document:
## Alert: HighScrapeFailureRate
### Symptom
Failure rate >10% for 10+ minutes. Slack-notified to #scraping-oncall.
### Triage (do these in order)
1. Open Grafana → "Scrapers" dashboard. Which spider? Which status codes?
2. If 5xx-heavy, check target's status page (target.com/status, downforeveryone).
3. If 429-heavy, check our request rate vs the published limits.
4. If 403-heavy, check the proxy-region ban panel: single region or all?
### Likely causes and mitigations
- Target site degraded (5xx): pause writes for 30 min, monitor for resolution. No code change.
- We're rate-limited (429): reduce `CONCURRENT_REQUESTS_PER_DOMAIN` by 50%; redeploy.
- Proxy region banned (403): rotate proxies via `kubectl set env deployment/scraper PROXY_POOL=backup`.
### Rollback
If you just deployed: `kubectl rollout undo deployment/scraper`.
### Escalation
If unresolved in 30 min, page the owning team (#scraping-oncall + @scraping-lead in PagerDuty).
### Postmortem checklist
- Was this alert symptom or cause? Update wording if cause-based.
- New mitigation step learned? Add to triage list.
- New automated remediation worth building? File ticket.
Six sections, each short. Optimised for skim, not study.
The five common scraper incidents
1. Target site changed HTML
Signal: Items/sec drops to ~0 while request rate is steady.
Triage: Pick a recent successful URL from logs. Curl it. Diff against an older saved snapshot. Run the spider against the page locally and inspect what the parser extracted.
Mitigation: Hotfix the selector. Most teams keep a "selector fallback chain": try the new selector, fall back to the old one, and log which matched (see the sketch below). While you patch, let downstream consumers know to expect stale data.
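A fallback chain can live entirely in the parse function. A minimal sketch, assuming Scrapy-style `response.css()`; the selectors are hypothetical placeholders, not any real site's markup:

```python
import logging

logger = logging.getLogger(__name__)

# Ordered newest-first: the hotfixed selector, then the one it replaced.
TITLE_SELECTORS = [
    "h1.product-title::text",   # selector matching the new markup
    "h1#title::text",           # old selector, kept as a fallback
]

def extract_title(response):
    """Try each selector in order and log which one matched, so selector
    drift shows up in the logs instead of silently dropping fields."""
    for selector in TITLE_SELECTORS:
        value = response.css(selector).get()
        if value:
            logger.info("title matched via %r", selector)
            return value.strip()
    logger.warning("title: no selector matched for %s", response.url)
    return None
```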
2. Proxy pool degraded
Signal: 403 / 429 spike. Proxy success rate panel shows one pool collapsing.
Triage: Confirm with the proxy provider's status page. Test the pool directly with `curl --proxy ...`.
Mitigation: Failover to a backup pool. Most providers offer multiple pools; configure the scraper to switch via env var without redeploy.
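Failover is only env-var-fast if the scraper resolves the pool at request time rather than hard-coding it. A minimal sketch, assuming the `PROXY_POOL` variable from the runbook entry above; the pool URLs are hypothetical:

```python
import os
import random

# Hypothetical pools; in practice the URLs come from your proxy provider.
PROXY_POOLS = {
    "primary": ["http://proxy-a.example:8080", "http://proxy-b.example:8080"],
    "backup": ["http://proxy-c.example:8080", "http://proxy-d.example:8080"],
}

def pick_proxy():
    """Resolve PROXY_POOL at call time, so switching pools is an env-var
    change (e.g. via `kubectl set env`) rather than a code change."""
    pool_name = os.environ.get("PROXY_POOL", "primary")
    pool = PROXY_POOLS.get(pool_name, PROXY_POOLS["primary"])
    return random.choice(pool)
```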
3. Database / queue down
Signal: Writes erroring. Queue depth ballooning. Worker logs show connection refused.
Triage: Is the DB host reachable? PgBouncer / Redis up? Disk full?
Mitigation: If the DB is recovering, workers can buffer to local disk or pause cleanly. Plan for this: don't let workers crash on transient DB unavailability; back off and retry instead (a sketch follows).
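A minimal sketch of that behaviour, assuming a caller-supplied `write_item` function that raises `ConnectionError` while the database is down; the spill-file path is a placeholder:

```python
import json
import time

MAX_RETRIES = 5
SPILL_FILE = "items.spill.jsonl"   # placeholder path for the local buffer

def write_with_backoff(write_item, item):
    """Retry transient DB failures with exponential backoff; if the outage
    outlasts the retries, spill the item to local disk instead of crashing."""
    delay = 1.0
    for _ in range(MAX_RETRIES):
        try:
            write_item(item)
            return True
        except ConnectionError:
            time.sleep(delay)
            delay = min(delay * 2, 60)   # cap backoff at 60 seconds
    with open(SPILL_FILE, "a", encoding="utf-8") as f:
        f.write(json.dumps(item) + "\n")
    return False
```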
4. Memory leak / runaway worker
Signal: Worker pod restarted by OOMKiller; latency spike before restart.
Triage: Check memory graph. Did the leak start after a deploy?
Mitigation: Kubernetes will restart the pod; while it does, capacity drops temporarily. Set memory requests/limits and restart policies so the blast radius is bounded.
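If the workers are Scrapy spiders, the in-process counterpart to the pod's memory limit is Scrapy's built-in memusage extension, which can close the spider cleanly before the kernel OOM-kills it. The thresholds below are placeholders:

```python
# settings.py: an in-process memory bound, complementary to the pod's limit.
# Keep MEMUSAGE_LIMIT_MB below the container limit so the spider shuts down
# cleanly instead of being OOM-killed.
MEMUSAGE_ENABLED = True
MEMUSAGE_WARNING_MB = 1536   # log a warning at ~1.5 GiB of resident memory
MEMUSAGE_LIMIT_MB = 2048     # close the spider at 2 GiB
```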
5. Captcha wall
Signal: Sudden ban rate spike, often after weeks of clean traffic; specific URLs return captcha pages.
Triage: Check what changed: new IP block, fingerprint, missed challenge.
Mitigation: Slow down. Rotate IPs. Engage the captcha solver if you have one. If it's a serious wall, this is a project, not a runbook fix.
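The part worth automating is recognising a captcha page at all, because captcha walls often come back with a 200 status and only the body gives them away. A minimal sketch with hypothetical marker strings; real detection depends on the target:

```python
# Hypothetical markers; tune per target. Status codes alone will not catch
# captcha walls that return HTTP 200.
CAPTCHA_MARKERS = (
    "g-recaptcha",
    "cf-challenge",
    "verify you are a human",
)

def looks_like_captcha(body: str) -> bool:
    """Cheap content check so captcha pages count as failures in metrics."""
    lowered = body.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```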
Incident response phases
Detect → Triage → Mitigate → Fix → Postmortem
- Detect is automatic: the alerts fire.
- Triage is the runbook's triage section. Goal: identify the rough cause in 5 minutes.
- Mitigate is "stop the bleeding." Not necessarily a permanent fix.
- Fix comes after mitigation, often the next day, in code.
- Postmortem is the meta: what could we have caught earlier?
Don't conflate mitigate and fix. At 3am, mitigate. Fix in office hours.
Postmortems
For every P1 incident (and important P2s), write a postmortem:
## Incident: 2026-05-12, Catalog108 scraper down 2 hours
### Timeline
- 14:02 UTC: Alert fired. Failure rate 80%.
- 14:08 UTC: On-call confirmed, page acknowledged.
- 14:15 UTC: Identified: Catalog108 rolled out a new login flow that 302'd all our anonymous requests.
- 14:30 UTC: Workaround deployed (skip protected URLs, keep public).
- 16:00 UTC: Real fix deployed.
### Root cause
Target site changed the SSO redirect; our scraper followed redirects into a login wall.
### Contributing factors
- No allowlist of expected response patterns.
- Our 'success' metric was 2xx; because redirects were followed, a 302 into the login flow still counted as success until the follow-up request failed.
### What we got right
- Detection: 4 minutes from change to alert.
- Mitigation: workaround in 22 minutes.
### Action items
- Add a content-validation check (is the page recognisable as the product page?).
- Treat 302-to-login-domain as a failure in metrics.
- Add Catalog108 SSO test to the daily smoke tests.
### Was this preventable?
Yes, if we had had a content-validation check. Filed: SCRAPE-201.
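The first two action items are small code changes. A minimal sketch of both, assuming a response object with `status`, `url`, and `text` attributes; the marker string and login domain are hypothetical:

```python
from urllib.parse import urlparse

LOGIN_DOMAINS = {"sso.catalog108.example"}   # hypothetical login/SSO hosts
PRODUCT_MARKER = "product-detail"            # hypothetical string unique to product pages

def is_valid_product_page(response) -> bool:
    """Content validation: a 200 only counts as success if the body still
    looks like the page we intended to scrape."""
    return response.status == 200 and PRODUCT_MARKER in response.text

def landed_on_login(response) -> bool:
    """Treat a followed redirect whose final URL is on a login/SSO domain as
    a failure in metrics, even though the status code looks healthy."""
    return urlparse(response.url).netloc in LOGIN_DOMAINS
```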
Postmortems are blameless. The goal is system improvement, not punishment.
Runbook hygiene
- Co-locate with alerts. Every alert's `runbook` annotation links to a heading in this doc.
- Update during the incident. When you learn a new triage step in real time, add it before you forget.
- Test runbooks. Game days where you trigger known failure modes; ensure the runbook actually works.
- Delete stale entries. Half-true runbooks are worse than none.
What to write today
Pick the highest-frequency alert in your scraper system. Write a runbook entry for it using the six-section template. Save it somewhere all on-call engineers can reach in <30 seconds (internal wiki, a runbooks/ folder in the repo, Notion). Link the alert's annotation to that URL.
The discipline of writing one runbook entry tells you whether the alert itself is well-defined. If you can't write triage steps because the alert is too vague, fix the alert.