Alerting (Slack, Email, PagerDuty)
Alerts wake people up. Wrong alerts wake them up for nothing. The principles and concrete configs for scraper alerts that engineers thank you for.
What you’ll learn
- Design alerts on rate-based thresholds, not raw values.
- Route alerts by severity to the right channel.
- Avoid alert fatigue with deduplication and grouping.
Metrics tell you what's happening. Alerts decide what humans hear about. The hard problem isn't writing alert rules; it's making sure alerts mean "drop what you're doing" and never "ignore me, this happens every Tuesday."
This lesson covers the principles and the concrete Alertmanager / PagerDuty wiring for scraper alerts.
The three tiers
| Severity | Where it goes | Response time | Example |
|---|---|---|---|
| Page (P1) | PagerDuty / OpsGenie → SMS / call | 5–15 min | Scraper down for >30min, data store unreachable |
| Ticket (P2) | Slack channel + on-call queue | 1–4 hours | Sustained ban rate > 10% |
| Info (P3) | Slack channel, low-priority | Next business day | Single proxy region degraded |
Get the routing right. P1s that turn out to be P3s burn trust in the alerting system; P3s routed as P1s burn the human.
Alertmanager rules, the right shape
A bad alert:
```yaml
- alert: BadScraper
  expr: scraper_errors_total > 100
```
This fires forever once errors hit 100: there's no time window, no rate, and no recovery condition.
A good alert:
```yaml
- alert: HighScrapeFailureRate
  expr: |
    sum(rate(scraper_requests_total{status=~"5..|429"}[5m]))
      / sum(rate(scraper_requests_total[5m])) > 0.10
  for: 10m
  labels:
    severity: ticket
  annotations:
    summary: "Scraper failure rate >10% for 10+ minutes"
    description: "Failure rate is {{ $value | humanizePercentage }}. Check /metrics, then proxy and target."
    runbook: "https://docs.internal/runbooks/scraper-failure-rate"
```
Five properties of a good alert:
- Rate-based, not raw count. Catches problems independent of total scale.
- `for: 10m` suppresses transient flaps.
- Severity label routes to the right channel.
- Runbook link. When paged at 3am, the engineer doesn't want to think, they want a checklist.
- Description with templated context. Show the actual value, not "something's wrong."
Sample rule set for scrapers
```yaml
groups:
  - name: scraper-critical
    rules:
      - alert: ScraperDown
        expr: up{job="scrapers"} == 0
        for: 5m
        labels: {severity: page}
        annotations: {summary: "Scraper {{ $labels.instance }} unreachable for 5m"}
      - alert: NoItemsScraped
        expr: rate(scraper_items_total[15m]) == 0
        for: 15m
        labels: {severity: page}
        annotations: {summary: "No items scraped for 15m, parser broken?"}
  - name: scraper-warnings
    rules:
      - alert: HighFailureRate
        expr: |
          sum(rate(scraper_requests_total{status=~"5..|429"}[5m]))
            / sum(rate(scraper_requests_total[5m])) > 0.10
        for: 10m
        labels: {severity: ticket}
      - alert: ProxyPoolDegraded
        expr: |
          sum by (proxy_pool) (rate(proxy_requests_total{outcome="success"}[5m]))
            / sum by (proxy_pool) (rate(proxy_requests_total[5m])) < 0.7
        for: 10m
        labels: {severity: ticket}
      - alert: FreshnessLagHigh
        expr: scraper_freshness_lag_seconds > 86400
        for: 30m
        labels: {severity: ticket}
        annotations: {summary: "Freshness lag > 24h"}
```
Notice the patterns: critical alerts use strict absolute conditions and short windows; warnings use rates and longer windows.
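Rule sets like this can be unit-tested before deploy with `promtool test rules`, which replays synthetic series against your rules. A minimal sketch (file names and label values are assumptions):

```yaml
# scraper-rules-test.yml -- run with: promtool test rules scraper-rules-test.yml
rule_files:
  - scraper-rules.yml          # file containing the ScraperDown rule above

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # target reports down for the whole window
      - series: 'up{job="scrapers", instance="spider-1"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 6m          # past the 5m `for:` hold
        alertname: ScraperDown
        exp_alerts:
          - exp_labels:
              severity: page
              job: scrapers
              instance: spider-1
            exp_annotations:
              summary: "Scraper spider-1 unreachable for 5m"
```

Running this in CI catches a broken PromQL expression or a mis-set `for:` before it silently stops paging anyone.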
Routing via Alertmanager
```yaml
route:
  receiver: slack-noisy
  group_by: [alertname, spider]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: [severity="page"]
      receiver: pagerduty
      group_wait: 0s
    - matchers: [severity="ticket"]
      receiver: slack-oncall
    - matchers: [severity="info"]
      receiver: slack-noisy

receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: ${PAGERDUTY_KEY}
  - name: slack-oncall
    slack_configs:
      - api_url: ${SLACK_ONCALL_WEBHOOK}
        channel: "#scraping-oncall"
        send_resolved: true
  - name: slack-noisy
    slack_configs:
      - api_url: ${SLACK_NOISY_WEBHOOK}
        channel: "#scraping-info"
        send_resolved: false
```
Three properties to tune:
- `group_by` collapses similar alerts. If five scrapers all hit HighFailureRate, you get one notification listing all five.
- `group_wait` is the delay before sending the first notification; it lets correlated alerts bundle.
- `repeat_interval` stops the same alert spamming every minute. 4h is a sensible default.
Slack message templates
Default Slack output is ugly. Customize:
```yaml
slack_configs:
  - api_url: ${SLACK_ONCALL_WEBHOOK}
    channel: "#scraping-oncall"
    title: '{{ if eq .Status "firing" }}FIRING{{ else }}RESOLVED{{ end }}: {{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
      *{{ .Annotations.summary }}*
      Spider: {{ .Labels.spider }}
      {{ .Annotations.description }}
      Runbook: {{ .Annotations.runbook }}
      {{ end }}
```
Two things make Slack alerts useful: a clear status (firing/resolved) and the runbook link inline.
PagerDuty integration
PagerDuty's Events API v2 is what Alertmanager talks to. Set up a "service" in PagerDuty, get the integration key, and paste it into `service_key`. Create on-call schedules and escalation policies in PagerDuty's UI; don't try to encode them in Alertmanager.
Test by intentionally triggering an alert (or use PagerDuty's "send test event" feature) before depending on the integration in production.
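For a quick end-to-end check you can also post directly to the Events API v2, bypassing Alertmanager, to confirm the service and escalation policy work. A minimal Python sketch; the integration key and dedup key are placeholders:

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_test_event(routing_key: str) -> dict:
    """Build an Events API v2 'trigger' payload for a harmless test alert."""
    return {
        "routing_key": routing_key,          # the service's integration key
        "event_action": "trigger",
        "dedup_key": "scraper-alerting-test",  # repeated sends collapse into one incident
        "payload": {
            "summary": "Test: scraper alerting integration check",
            "source": "scraper-monitoring",
            "severity": "info",
        },
    }

def send_event(routing_key: str) -> None:
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(build_test_event(routing_key)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # PagerDuty answers 202 when it accepts the event
        print(resp.status)

# send_event("YOUR_INTEGRATION_KEY")  # run with a real key to fire the test incident
```

Resolve the test incident afterwards by sending the same `dedup_key` with `"event_action": "resolve"`.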
Avoiding alert fatigue
Symptoms: engineers muting Slack channels, alerts ignored, an actual incident missed because "the alerts looked normal."
Fixes:
- Audit weekly. Of last week's alerts, which led to action? Disable or retune the rest.
- Inhibit downstream alerts when upstream is firing. If "Database down" is firing, don't also page "scraper_db_writes_failing"; Alertmanager's `inhibit_rules` handles this.
- Maintenance windows. Suppress alerts during planned deploys/migrations.
- Symptom-based, not cause-based. Alert on "users see errors" not "this internal counter is X." The latter creates dozens of alerts for one problem.
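The inhibition fix above is a few lines of Alertmanager config. A sketch, assuming hypothetical alert names `DatabaseDown` and `ScraperDBWritesFailing` and a shared `env` label (`source_matchers`/`target_matchers` is the syntax in Alertmanager 0.22+):

```yaml
inhibit_rules:
  # While DatabaseDown is firing, suppress the downstream write-failure alert
  - source_matchers: [alertname="DatabaseDown"]
    target_matchers: [alertname="ScraperDBWritesFailing"]
    # only inhibit when both alerts carry the same env label value
    equal: [env]
```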
When to alert vs not
- Alert when there's a clear human action required.
- Don't alert when the system will self-heal in minutes (transient retries).
- Don't alert when the dashboard will show it just as well during business hours.
Every alert should pass the test: "Would I want to be woken up for this at 3am?" If no, Slack-only, or no alert at all.
What to try
Set up Alertmanager + a Slack webhook. Write three rules for your Catalog108 scraper: ScraperDown (page), HighFailureRate (ticket), ProxyDegraded (ticket). Trigger them deliberately by stopping the scraper and by simulating bans. Notice the difference in how each is delivered.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.