Alerting (Slack, Email, PagerDuty)
Alerts wake people up. Wrong alerts wake them up for nothing. The principles and concrete configs for scraper alerts that engineers thank you for.
What you’ll learn
- Design alerts on rate-based thresholds, not raw values.
- Route alerts by severity to the right channel.
- Avoid alert fatigue with deduplication and grouping.
Metrics tell you what's happening. Alerts decide what humans hear about. The hard problem isn't writing alert rules; it's making sure alerts mean "drop what you're doing" and never "ignore me, this happens every Tuesday."
This lesson covers the principles and the concrete Alertmanager / PagerDuty wiring for scraper alerts.
The three tiers
| Severity | Where it goes | Response time | Example |
|---|---|---|---|
| Page (P1) | PagerDuty / OpsGenie → SMS / call | 5–15 min | Scraper down for >30min, data store unreachable |
| Ticket (P2) | Slack channel + on-call queue | 1–4 hours | Sustained ban rate > 10% |
| Info (P3) | Slack channel, low-priority | Next business day | Single proxy region degraded |
Get the routing right. P1s that turn out to be P3s burn trust in the alerting system; P3s routed as P1s burn the human.
Alertmanager rules, the right shape
A bad alert:
```yaml
- alert: BadScraper
  expr: scraper_errors_total > 100
```
This fires forever once errors hit 100: there's no time window, no rate, and no recovery condition.
A good alert:
```yaml
- alert: HighScrapeFailureRate
  expr: |
    sum(rate(scraper_requests_total{status=~"5..|429"}[5m]))
      / sum(rate(scraper_requests_total[5m])) > 0.10
  for: 10m
  labels:
    severity: ticket
  annotations:
    summary: "Scraper failure rate >10% for 10+ minutes"
    description: "Failure rate is {{ $value | humanizePercentage }}. Check /metrics, then proxy and target."
    runbook: "https://docs.internal/runbooks/scraper-failure-rate"
```
Five properties of a good alert:
- Rate-based, not raw count. Catches problems independent of total scale.
- `for: 10m` suppresses transient flaps.
- Severity label routes to the right channel.
- Runbook link. When paged at 3am, the engineer doesn't want to think, they want a checklist.
- Description with templated context. Show the actual value, not "something's wrong."
Sample rule set for scrapers
```yaml
groups:
  - name: scraper-critical
    rules:
      - alert: ScraperDown
        expr: up{job="scrapers"} == 0
        for: 5m
        labels: {severity: page}
        annotations: {summary: "Scraper {{ $labels.instance }} unreachable for 5m"}
      - alert: NoItemsScraped
        expr: rate(scraper_items_total[15m]) == 0
        for: 15m
        labels: {severity: page}
        annotations: {summary: "No items scraped for 15m, parser broken?"}
  - name: scraper-warnings
    rules:
      - alert: HighFailureRate
        expr: |
          sum(rate(scraper_requests_total{status=~"5..|429"}[5m]))
            / sum(rate(scraper_requests_total[5m])) > 0.10
        for: 10m
        labels: {severity: ticket}
      - alert: ProxyPoolDegraded
        expr: |
          sum by (proxy_pool) (rate(proxy_requests_total{outcome="success"}[5m]))
            / sum by (proxy_pool) (rate(proxy_requests_total[5m])) < 0.7
        for: 10m
        labels: {severity: ticket}
      - alert: FreshnessLagHigh
        expr: scraper_freshness_lag_seconds > 86400
        for: 30m
        labels: {severity: ticket}
        annotations: {summary: "Freshness lag > 24h"}
```
Notice the patterns: critical alerts use strict absolute conditions and short windows; warnings use rates and longer windows.
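Rule sets like this can be unit-tested before deploy with `promtool test rules`, which replays synthetic series against your rules. A minimal sketch (file names and label values are assumptions):

```yaml
# scraper-rules-test.yml -- run with: promtool test rules scraper-rules-test.yml
rule_files:
  - scraper-rules.yml          # file containing the ScraperDown rule above

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # target reports down for the whole window
      - series: 'up{job="scrapers", instance="spider-1"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 6m          # past the 5m `for:` hold
        alertname: ScraperDown
        exp_alerts:
          - exp_labels:
              severity: page
              job: scrapers
              instance: spider-1
            exp_annotations:
              summary: "Scraper spider-1 unreachable for 5m"
```

Running this in CI catches a broken PromQL expression or a mis-set `for:` before it silently stops paging anyone.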
Routing via Alertmanager
```yaml
route:
  receiver: slack-noisy
  group_by: [alertname, spider]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: [severity="page"]
      receiver: pagerduty
      group_wait: 0s
    - matchers: [severity="ticket"]
      receiver: slack-oncall
    - matchers: [severity="info"]
      receiver: slack-noisy

receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: ${PAGERDUTY_KEY}
  - name: slack-oncall
    slack_configs:
      - api_url: ${SLACK_ONCALL_WEBHOOK}
        channel: "#scraping-oncall"
        send_resolved: true
  - name: slack-noisy
    slack_configs:
      - api_url: ${SLACK_NOISY_WEBHOOK}
        channel: "#scraping-info"
        send_resolved: false
```
Three properties to tune:
- `group_by` collapses similar alerts. If five scrapers all hit HighFailureRate, you get one notification listing all five.
- `group_wait` is the delay before sending the first notification; it lets correlated alerts bundle.
- `repeat_interval` stops the same alert spamming every minute. 4h is a sensible default.
Slack message templates
Default Slack output is ugly. Customize:
```yaml
slack_configs:
  - api_url: ${SLACK_ONCALL_WEBHOOK}
    channel: "#scraping-oncall"
    title: '{{ if eq .Status "firing" }}FIRING{{ else }}RESOLVED{{ end }}: {{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
      *{{ .Annotations.summary }}*
      Spider: {{ .Labels.spider }}
      {{ .Annotations.description }}
      Runbook: {{ .Annotations.runbook }}
      {{ end }}
```
Two things make Slack alerts useful: a clear status (firing/resolved) and the runbook link inline.
PagerDuty integration
PagerDuty's Events API v2 is what Alertmanager talks to. Set up a "service" in PagerDuty, get the integration key, and paste it into `service_key`. Create on-call schedules and escalation policies in PagerDuty's UI; don't try to encode them in Alertmanager.
Test by intentionally triggering an alert (or use PagerDuty's "send test event" feature) before depending on the integration in production.
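For a quick end-to-end check you can also post directly to the Events API v2, bypassing Alertmanager, to confirm the service and escalation policy work. A minimal Python sketch; the integration key and dedup key are placeholders:

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_test_event(routing_key: str) -> dict:
    """Build an Events API v2 'trigger' payload for a harmless test alert."""
    return {
        "routing_key": routing_key,          # the service's integration key
        "event_action": "trigger",
        "dedup_key": "scraper-alerting-test",  # repeated sends collapse into one incident
        "payload": {
            "summary": "Test: scraper alerting integration check",
            "source": "scraper-monitoring",
            "severity": "info",
        },
    }

def send_event(routing_key: str) -> None:
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(build_test_event(routing_key)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # PagerDuty answers 202 when it accepts the event
        print(resp.status)

# send_event("YOUR_INTEGRATION_KEY")  # run with a real key to fire the test incident
```

Resolve the test incident afterwards by sending the same `dedup_key` with `"event_action": "resolve"`.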
Avoiding alert fatigue
Symptoms: engineers muting Slack channels, alerts ignored, an actual incident missed because "the alerts looked normal."
Fixes:
- Audit weekly. Of last week's alerts, which led to action? Disable or retune the rest.
- Inhibit downstream alerts when upstream is firing. If "Database down" is firing, don't also page "scraper_db_writes_failing"; Alertmanager's `inhibit_rules` handles this.
- Maintenance windows. Suppress alerts during planned deploys/migrations.
- Symptom-based, not cause-based. Alert on "users see errors" not "this internal counter is X." The latter creates dozens of alerts for one problem.
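The inhibition fix above is a few lines of Alertmanager config. A sketch, assuming hypothetical alert names `DatabaseDown` and `ScraperDBWritesFailing` and a shared `env` label (`source_matchers`/`target_matchers` is the syntax in Alertmanager 0.22+):

```yaml
inhibit_rules:
  # While DatabaseDown is firing, suppress the downstream write-failure alert
  - source_matchers: [alertname="DatabaseDown"]
    target_matchers: [alertname="ScraperDBWritesFailing"]
    # only inhibit when both alerts carry the same env label value
    equal: [env]
```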
When to alert vs not
- Alert when there's a clear human action required.
- Don't alert when the system will self-heal in minutes (transient retries).
- Don't alert when the dashboard will show it just as well during business hours.
Every alert should pass the test: "Would I want to be woken up for this at 3am?" If no, Slack-only, or no alert at all.
What to try
Set up Alertmanager + a Slack webhook. Write three rules for your Catalog108 scraper: ScraperDown (page), HighFailureRate (ticket), ProxyDegraded (ticket). Trigger them deliberately by stopping the scraper and by simulating bans. Notice the difference in how each is delivered.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.