
Monitoring Scrapers - Logging and Alerts

Set up logging, monitoring, and alerting for your web scrapers to catch failures before they become data gaps.


A scraper that silently fails is worse than one that loudly crashes. Proper monitoring ensures you know the moment something goes wrong, before missing data becomes a problem.

What to Monitor

Metric         | Why It Matters
Success rate   | Dropping below 95% signals blocks or site changes
Response time  | Sudden spikes may indicate throttling
Items scraped  | Zero items usually means the page structure changed
Error count    | Spike in errors means something broke
Proxy health   | Track which proxies are working
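
As a rough illustration, the thresholds in the table can be turned into a single check that runs at the end of each scrape. This is a minimal sketch: `RunStats`, `check_run`, the baseline response time, and the spike factor of 3 are all assumptions made for the example, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical per-run numbers a scraper might collect."""
    requests: int
    successes: int
    items: int
    avg_response_seconds: float


def check_run(stats: RunStats, baseline_response_seconds: float,
              min_success_rate: float = 95.0,
              spike_factor: float = 3.0) -> list[str]:
    """Return human-readable warnings for one scrape run."""
    if stats.requests == 0:
        return ["no requests were made"]

    warnings = []
    success_rate = stats.successes / stats.requests * 100
    if success_rate < min_success_rate:
        warnings.append(
            f"success rate {success_rate:.1f}% is below {min_success_rate:.0f}%"
        )
    if stats.items == 0:
        warnings.append("zero items scraped; page structure may have changed")
    # spike_factor is an assumed multiplier; tune it to your own traffic
    if stats.avg_response_seconds > baseline_response_seconds * spike_factor:
        warnings.append("response time spike; possible throttling")
    return warnings


run = RunStats(requests=100, successes=90, items=0, avg_response_seconds=4.0)
for warning in check_run(run, baseline_response_seconds=1.0):
    print(warning)
```

An empty list means the run looks healthy by these measures; anything else is a candidate for an alert.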

Structured Logging

Use structured logging so you can query and filter logs:

import logging
import json
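from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            # Use the record's own creation time (UTC), not the time of formatting
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),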
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
        }
        if hasattr(record, "url"):
            log_entry["url"] = record.url
        if hasattr(record, "status_code"):
            log_entry["status_code"] = record.status_code
        if hasattr(record, "items_count"):
            log_entry["items_count"] = record.items_count
        return json.dumps(log_entry)

# Setup
logger = logging.getLogger("scraper")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)

# Usage
logger.info("Scrape complete", extra={
    "url": "https://example.com",
    "status_code": 200,
    "items_count": 42,
})
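
The checklist at the end of this guide also asks for the duration of every request. One possible way to capture it is sketched below. The `duration_ms` field name, the `ExtraFieldsFormatter` class, and the `log_request` helper are assumptions for illustration, and `fetch` stands in for whatever function performs the HTTP call in your scraper.

```python
import json
import logging
import time
from datetime import datetime, timezone

# Extra fields copied into the JSON output when present on the record.
# "duration_ms" is an assumed field name chosen for this sketch.
EXTRA_FIELDS = ("url", "status_code", "items_count", "duration_ms")


class ExtraFieldsFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for name in EXTRA_FIELDS:
            if hasattr(record, name):
                entry[name] = getattr(record, name)
        return json.dumps(entry)


def log_request(logger, url, fetch):
    """Call fetch(url), then log its status code and elapsed time.

    `fetch` is a placeholder for your own function; it is assumed to
    return an object with a `status_code` attribute.
    """
    start = time.perf_counter()
    response = fetch(url)
    duration_ms = round((time.perf_counter() - start) * 1000, 1)
    logger.info("Request finished", extra={
        "url": url,
        "status_code": response.status_code,
        "duration_ms": duration_ms,
    })
    return response
```

With the `requests` library, `log_request(logger, url, requests.get)` would fit this shape, since its responses expose `status_code`.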

Tracking Metrics

Build a simple metrics tracker for your scraper:

import time
from dataclasses import dataclass, field

@dataclass
class ScrapeMetrics:
    total_requests: int = 0
    successful: int = 0
    failed: int = 0
    items_scraped: int = 0
    start_time: float = field(default_factory=time.time)
    errors: list = field(default_factory=list)

    @property
    def success_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return self.successful / self.total_requests * 100

    @property
    def duration(self) -> float:
        return time.time() - self.start_time

    def record_success(self, items: int = 0):
        self.total_requests += 1
        self.successful += 1
        self.items_scraped += items

    def record_failure(self, error: str):
        self.total_requests += 1
        self.failed += 1
        self.errors.append(error)

    def summary(self) -> dict:
        return {
            "total_requests": self.total_requests,
            "success_rate": f"{self.success_rate:.1f}%",
            "items_scraped": self.items_scraped,
            "failures": self.failed,
            "duration_seconds": f"{self.duration:.1f}",
        }

# Usage: `urls` and `scrape()` stand in for your own URL list and scraping function
metrics = ScrapeMetrics()

for url in urls:
    try:
        items = scrape(url)
        metrics.record_success(items=len(items))
    except Exception as e:
        metrics.record_failure(str(e))

print(metrics.summary())
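
The same counting approach can cover proxy health, which the metrics table lists but the tracker above does not. The following is a sketch under stated assumptions: `ProxyHealth` is a hypothetical helper, and the 50% cutoff and five-request minimum are arbitrary defaults.

```python
from collections import defaultdict


class ProxyHealth:
    """Counts successes and failures per proxy (hypothetical helper)."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"ok": 0, "failed": 0})

    def record(self, proxy: str, ok: bool) -> None:
        self.stats[proxy]["ok" if ok else "failed"] += 1

    def success_rate(self, proxy: str) -> float:
        # .get() avoids creating an entry for a proxy that was never used
        s = self.stats.get(proxy, {"ok": 0, "failed": 0})
        total = s["ok"] + s["failed"]
        return s["ok"] / total * 100 if total else 0.0

    def unhealthy(self, min_rate: float = 50.0, min_requests: int = 5) -> list[str]:
        """Proxies with enough traffic to judge and a success rate below min_rate."""
        return [
            proxy for proxy, s in self.stats.items()
            if s["ok"] + s["failed"] >= min_requests
            and s["ok"] / (s["ok"] + s["failed"]) * 100 < min_rate
        ]
```

Call `record(proxy, ok=True)` or `record(proxy, ok=False)` after each request, then check `unhealthy()` periodically to decide which proxies to rotate out.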

Slack Alerts on Failure

Send a Slack notification when your scraper fails:

import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

def send_slack_alert(message: str, level: str = "warning"):
    color = {"warning": "#ff9900", "error": "#ff0000", "info": "#36a64f"}
    payload = {
        "attachments": [{
            "color": color.get(level, "#cccccc"),
            "title": "Scraper Alert",
            "text": message,
            "footer": "Scraping Central Monitor",
        }]
    }
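    try:
        response = requests.post(SLACK_WEBHOOK, json=payload, timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        # A failed alert should not crash the scraper itself
        print(f"Could not send Slack alert: {exc}")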

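# Alert when success rate drops (skip runs that made no requests,
# where success_rate reports 0.0 and would trigger a false alarm)
if metrics.total_requests > 0 and metrics.success_rate < 90: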
    send_slack_alert(
        f"Success rate dropped to {metrics.success_rate:.1f}%\n"
        f"Failed: {metrics.failed}/{metrics.total_requests}\n"
        f"Recent errors: {', '.join(metrics.errors[-3:])}",
        level="error",
    )

Health Check Endpoint

If your scraper runs as a web service, add a health check that returns 503 when no scrape has succeeded recently:

from flask import Flask, jsonify
from datetime import datetime, timezone

app = Flask(__name__)
last_successful_scrape = None  # timezone-aware datetime of the last good run

def mark_scrape_success():
    # Call this from your scraper after each successful run
    global last_successful_scrape
    last_successful_scrape = datetime.now(timezone.utc)

@app.route("/health")
def health():
    if last_successful_scrape is None:
        return jsonify({"status": "starting"}), 503

    age = (datetime.now(timezone.utc) - last_successful_scrape).total_seconds()
    if age > 7200:  # No successful scrape in 2 hours
        return jsonify({"status": "stale", "last_scrape_age_seconds": age}), 503

    return jsonify({"status": "healthy", "last_scrape_age_seconds": age})

Monitoring Checklist

  • Log every request with URL, status code, and duration
  • Track success rate and alert when it drops below your threshold
  • Monitor scraped item count to detect site structure changes
  • Set up alerting (Slack, email, PagerDuty) for critical failures
  • Keep at least 7 days of logs for debugging
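
For the last item, Python's standard library can handle retention on its own. The sketch below uses `logging.handlers.TimedRotatingFileHandler` to rotate the file at midnight and keep seven rotated files; the `build_file_handler` name, the file path, and the plain-text format string are assumptions for the example.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def build_file_handler(path: str, days: int = 7) -> TimedRotatingFileHandler:
    """Rotate the log file at midnight (UTC) and keep `days` rotated files."""
    handler = TimedRotatingFileHandler(
        path, when="midnight", backupCount=days, encoding="utf-8", utc=True
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    return handler


# Example wiring (the path is a placeholder):
# logging.getLogger("scraper").addHandler(build_file_handler("scraper.log"))
```

Older rotated files are deleted automatically once more than `days` of them exist, so disk usage stays bounded without a separate cleanup job.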