

Metrics with Prometheus + Grafana

Logs tell you what happened. Metrics tell you the shape of normal vs abnormal. Prometheus scraping plus Grafana dashboards is the de facto stack.

What you’ll learn

  • Differentiate logs, metrics, and traces.
  • Expose a /metrics endpoint from a Python or PHP scraper.
  • Build the four core metric types: counter, gauge, histogram, summary.

Logs are a per-event record. Metrics are a numerical time-series of system behaviour: how many requests succeeded per minute, how long they took, what the proxy success rate is. They answer "is the system healthy right now?" much faster than logs do.

Prometheus is the standard open-source metrics database. Grafana is the standard UI for it. Together they're the spine of most production observability stacks.

The pull model

Prometheus pulls. Each scraper (or sidecar) exposes an HTTP endpoint, conventionally /metrics, returning a plain-text snapshot:

# HELP scraper_requests_total Total requests sent.
# TYPE scraper_requests_total counter
scraper_requests_total{spider="products",status="200"} 12453
scraper_requests_total{spider="products",status="503"} 47

# HELP scraper_request_duration_seconds Histogram of request latency.
# TYPE scraper_request_duration_seconds histogram
scraper_request_duration_seconds_bucket{spider="products",le="0.5"} 8400
scraper_request_duration_seconds_bucket{spider="products",le="1.0"} 12000
scraper_request_duration_seconds_bucket{spider="products",le="+Inf"} 12500
scraper_request_duration_seconds_sum{spider="products"} 5482.3
scraper_request_duration_seconds_count{spider="products"} 12500

Prometheus scrapes this every 15s. Each scrape is a sample in the time series.
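The `_sum` and `_count` series alone are enough to recover the mean latency. A quick sanity check in Python, using the numbers from the snippet above:

```python
# Mean request latency from the histogram's running sum and count.
duration_sum = 5482.3    # scraper_request_duration_seconds_sum
duration_count = 12500   # scraper_request_duration_seconds_count

mean_latency = duration_sum / duration_count
print(round(mean_latency, 4))  # → 0.4386
```

In PromQL you would compute the same thing over a window with `rate(..._sum[5m]) / rate(..._count[5m])`, covered below.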

The four metric types

  • Counter: a monotonically increasing total. Examples: total requests, total bytes downloaded.
  • Gauge: a value that can go up and down. Examples: current queue depth, in-flight requests.
  • Histogram: bucketed observations plus a running sum and count. Example: request latency distribution.
  • Summary: quantiles calculated client-side. Rarely needed; histograms are usually the better choice.

Most scrapers need counters (requests, items scraped) and histograms (request latency, parse time). Gauges show queue depth or active workers.

Python: prometheus_client

import httpx
from prometheus_client import Counter, Histogram, Gauge, start_http_server

REQUESTS = Counter(
    "scraper_requests_total",
    "Total HTTP requests.",
    ["spider", "status"],
)
LATENCY = Histogram(
    "scraper_request_duration_seconds",
    "Request latency in seconds.",
    ["spider"],
    buckets=[0.1, 0.25, 0.5, 1, 2, 5, 10, 30],
)
QUEUE_DEPTH = Gauge("scraper_queue_depth", "Items in queue.")

start_http_server(8000)  # exposes /metrics on :8000

def fetch(url):
    spider = "products"
    with LATENCY.labels(spider=spider).time():  # duration recorded on exit
        resp = httpx.get(url)
    REQUESTS.labels(spider=spider, status=str(resp.status_code)).inc()
    QUEUE_DEPTH.set(len(queue))  # `queue` is your scraper's work queue

The .time() context manager auto-records duration. The .labels() dimension lets one metric carry multiple sub-series.

PHP: prometheus_client_php

use Prometheus\CollectorRegistry;
use Prometheus\Storage\Redis;

$adapter = new Redis(['host' => 'redis', 'port' => 6379]);
$registry = new CollectorRegistry($adapter);

$counter = $registry->getOrRegisterCounter(
  'scraper', 'requests_total', 'Total HTTP requests.', ['spider', 'status']
);
$counter->inc(['products', '200']);

$histogram = $registry->getOrRegisterHistogram(
  'scraper', 'request_duration_seconds', 'Request latency.', ['spider'],
  [0.1, 0.25, 0.5, 1, 2, 5, 10, 30]
);
$histogram->observe($durationSec, ['products']);

// Expose at /metrics (e.g. from a Symfony controller):
$renderer = new \Prometheus\RenderTextFormat();
return new Response($renderer->render($registry->getMetricFamilySamples()), 200,
  ['Content-Type' => \Prometheus\RenderTextFormat::MIME_TYPE]);

PHP processes are short-lived; the Redis storage adapter aggregates across requests/workers.

Labels: useful but dangerous

Labels add dimensions. Critical rule: labels must be low cardinality. Don't label by URL or user ID. Do label by spider, status code (small set), region, env.

Why: Prometheus stores each unique label combination as a separate time-series. URL labels explode into millions of series and kill the database.
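If even the status code set threatens to grow (redirect chains, exotic proxy errors), one option is to collapse codes into classes before labelling. A hypothetical helper, sketched here; the snippets above label by exact code, which is fine when the set stays small:

```python
def status_class(code: int) -> str:
    """Collapse raw HTTP status codes into a small, bounded label set."""
    if code == 429:
        return "429"           # rate-limiting is worth tracking separately for scrapers
    return f"{code // 100}xx"  # 200 -> "2xx", 503 -> "5xx"

print(status_class(503))  # → 5xx
```

Either way, the invariant to protect is a bounded label set whose size you can state in advance.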

Histograms: latency done right

A histogram of request latency lets you compute:

  • Average: rate(scraper_request_duration_seconds_sum[5m]) / rate(scraper_request_duration_seconds_count[5m])
  • p95: histogram_quantile(0.95, sum by (le) (rate(scraper_request_duration_seconds_bucket[5m])))

Pick bucket boundaries that span your real distribution. For scrapers, common buckets: 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60 seconds. Too coarse and you can't compute p95 precisely; too fine and storage costs balloon.
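histogram_quantile works by finding the bucket containing the target rank and linearly interpolating inside it. Reproducing that by hand for the cumulative buckets in the earlier /metrics snippet (a simplified sketch of the PromQL behaviour, ignoring edge cases like the lowest and +Inf buckets):

```python
# Cumulative bucket counts from the /metrics example above.
buckets = [(0.5, 8400), (1.0, 12000), (float("inf"), 12500)]
total = 12500
rank = 0.95 * total  # 11875: the observation sitting at the 95th percentile

prev_le, prev_count = 0.0, 0
for le, count in buckets:
    if rank <= count:
        # Linear interpolation within this bucket, as histogram_quantile does.
        p95 = prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        break
    prev_le, prev_count = le, count

print(round(p95, 3))  # → 0.983
```

This also makes the bucket-boundary trade-off concrete: the estimate can never be more precise than the bucket it lands in.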

Prometheus scrape config

# prometheus.yml
scrape_configs:
  - job_name: scrapers
    static_configs:
      - targets:
          - scraper-1:8000
          - scraper-2:8000
    metrics_path: /metrics
    scrape_interval: 15s

In Kubernetes, use the Prometheus Operator's ServiceMonitor; pods are discovered automatically.
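A minimal ServiceMonitor for the setup above might look like this; the name, port name, and label selector are assumptions you should match to your own Service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: scrapers             # assumed name
spec:
  selector:
    matchLabels:
      app: scraper           # must match your Service's labels
  endpoints:
    - port: metrics          # named Service port exposing :8000
      path: /metrics
      interval: 15s
```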

Grafana dashboards

Connect Grafana to Prometheus as a data source. Build panels for:

  • Requests/sec by status: sum by (status) (rate(scraper_requests_total[1m]))
  • Error rate: sum(rate(scraper_requests_total{status=~"5.."}[5m])) / sum(rate(scraper_requests_total[5m]))
  • p95 latency: histogram_quantile(0.95, sum by (le) (rate(scraper_request_duration_seconds_bucket[5m])))
  • Queue depth: scraper_queue_depth
  • Active workers: count(up{job="scrapers"})

These five panels are 80% of what an on-call engineer looks at first.

Alerts (preview)

Prometheus Alertmanager fires rules like:

- alert: HighScrapeFailureRate
  expr: |
    sum(rate(scraper_requests_total{status=~"5..|429"}[5m]))
      / sum(rate(scraper_requests_total[5m])) > 0.05
  for: 10m
  annotations:
    summary: ">5% failure rate for 10+ minutes"

Lesson 59 covers alerting end-to-end.

What to try

Add prometheus_client to your Catalog108 scraper. Expose :8000/metrics. Run Prometheus + Grafana locally with a docker-compose stack. Build a dashboard with requests/sec, error rate, and p95 latency. Watch it update in real time as your scraper runs.
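A starting point for that local stack; image tags, ports, and file paths here are assumptions, not a pinned recommendation:

```yaml
# docker-compose.yml: minimal local Prometheus + Grafana
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"   # default login admin/admin
```

Point the prometheus.yml targets at your machine's address (on Docker Desktop, `host.docker.internal:8000`) so the container can reach the scraper's /metrics endpoint.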

Quiz: check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.


Which metric type should you use for 'total requests sent'?
