Monitoring Scrapers - Logging and Alerts
Set up logging, monitoring, and alerting for your web scrapers to catch failures before they become data gaps.
A scraper that silently fails is worse than one that loudly crashes. Proper monitoring ensures you know the moment something goes wrong, before missing data becomes a problem.
What to Monitor
| Metric | Why It Matters |
|---|---|
| Success rate | Dropping below 95% signals blocks or site changes |
| Response time | Sudden spikes may indicate throttling |
| Items scraped | Zero items usually means the page structure changed |
| Error count | A sudden spike in errors means something broke and needs attention |
| Proxy health | Knowing which proxies still work lets you rotate out the failing ones |
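
The table above can be turned into a simple end-of-run check. This is a minimal sketch, not a definitive implementation: `RunStats`, `find_problems`, and the 5-second response-time limit are hypothetical names and values chosen for illustration, and only the 95% success-rate threshold comes from the table.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical per-run numbers; the field names are illustrative."""
    requests: int
    successes: int
    items: int
    avg_response_seconds: float


def find_problems(stats: RunStats,
                  min_success_rate: float = 95.0,
                  max_avg_response_seconds: float = 5.0) -> list[str]:
    """Return one warning per metric that looks unhealthy."""
    if stats.requests == 0:
        return ["no requests were made"]

    problems = []
    success_rate = stats.successes / stats.requests * 100
    if success_rate < min_success_rate:
        problems.append(
            f"success rate {success_rate:.1f}% is below {min_success_rate:.0f}%"
        )
    # The 5-second default is an assumed limit; tune it to the target site.
    if stats.avg_response_seconds > max_avg_response_seconds:
        problems.append(
            f"average response time {stats.avg_response_seconds:.1f}s "
            "suggests throttling"
        )
    if stats.items == 0:
        problems.append("zero items scraped; the page structure may have changed")
    return problems
```

An empty list means the run looks healthy; anything else is a candidate for an alert.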
Structured Logging
Use structured logging so you can query and filter logs:
```python
import json
import logging
from datetime import datetime, timezone


class JSONFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        log_entry = {
            # Use the record's own creation time as a timezone-aware UTC timestamp
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
        }
        # Copy the optional fields passed through `extra=`
        for key in ("url", "status_code", "items_count"):
            if hasattr(record, key):
                log_entry[key] = getattr(record, key)
        return json.dumps(log_entry)


# Setup
logger = logging.getLogger("scraper")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)

# Usage
logger.info("Scrape complete", extra={
    "url": "https://example.com",
    "status_code": 200,
    "items_count": 42,
})
```
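
Once every log line is a JSON object, filtering becomes a few lines of code. The sketch below shows one possible way to query such logs; `filter_log_lines` is a hypothetical helper, and the sample lines are made up and trimmed to a few fields for illustration.

```python
import json


def filter_log_lines(lines, **criteria):
    """Yield parsed log entries whose fields equal every given criterion."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines, e.g. output from other libraries
        if not isinstance(entry, dict):
            continue
        if all(entry.get(key) == value for key, value in criteria.items()):
            yield entry


# Made-up sample lines in the shape the formatter produces (fields trimmed)
sample_lines = [
    '{"level": "INFO", "message": "Scrape complete", "url": "https://example.com/a", "status_code": 200}',
    '{"level": "ERROR", "message": "Request failed", "url": "https://example.com/b", "status_code": 403}',
    'plain text line from another library',
]

blocked = list(filter_log_lines(sample_lines, status_code=403))
```

In practice `lines` would be an open log file, which can be iterated line by line in the same way.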
Tracking Metrics
Build a simple metrics tracker for your scraper:
```python
import time
from dataclasses import dataclass, field


@dataclass
class ScrapeMetrics:
    total_requests: int = 0
    successful: int = 0
    failed: int = 0
    items_scraped: int = 0
    start_time: float = field(default_factory=time.time)
    errors: list[str] = field(default_factory=list)

    @property
    def success_rate(self) -> float:
        """Percentage of successful requests (0.0 before any request is made)."""
        if self.total_requests == 0:
            return 0.0
        return self.successful / self.total_requests * 100

    @property
    def duration(self) -> float:
        """Seconds elapsed since the tracker was created."""
        return time.time() - self.start_time

    def record_success(self, items: int = 0):
        self.total_requests += 1
        self.successful += 1
        self.items_scraped += items

    def record_failure(self, error: str):
        self.total_requests += 1
        self.failed += 1
        self.errors.append(error)

    def summary(self) -> dict:
        return {
            "total_requests": self.total_requests,
            "success_rate": f"{self.success_rate:.1f}%",
            "items_scraped": self.items_scraped,
            "failures": self.failed,
            "duration_seconds": f"{self.duration:.1f}",
        }


# Usage: `urls` and `scrape()` stand in for your own URL list and scraping function
metrics = ScrapeMetrics()
for url in urls:
    try:
        items = scrape(url)
        metrics.record_success(items=len(items))
    except Exception as e:
        metrics.record_failure(str(e))

print(metrics.summary())
```
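
The tracker above counts requests but does not measure response time, which is also listed under What to Monitor. One way to add that is a small timing helper; this is a sketch under assumptions, and `ResponseTimes` and its method names are hypothetical.

```python
import statistics
import time
from contextlib import contextmanager


class ResponseTimes:
    """Collect per-request durations, in seconds."""

    def __init__(self):
        self.samples: list[float] = []

    @contextmanager
    def measure(self):
        """Time the enclosed block, recording it even if the block raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples.append(time.perf_counter() - start)

    def summary(self) -> dict:
        if not self.samples:
            return {"count": 0}
        return {
            "count": len(self.samples),
            "mean_seconds": statistics.mean(self.samples),
            "max_seconds": max(self.samples),
        }
```

Wrapping each request in `with timings.measure():` records its duration, so a rising mean or maximum shows up in the run summary.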
Slack Alerts on Failure
Send a Slack notification when your scraper fails:
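```python
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

# Attachment sidebar color for each severity level
LEVEL_COLORS = {"warning": "#ff9900", "error": "#ff0000", "info": "#36a64f"}


def send_slack_alert(message: str, level: str = "warning"):
    payload = {
        "attachments": [{
            "color": LEVEL_COLORS.get(level, "#cccccc"),
            "title": "Scraper Alert",
            "text": message,
            "footer": "Scraping Central Monitor",
        }]
    }
    response = requests.post(SLACK_WEBHOOK, json=payload, timeout=10)
    response.raise_for_status()  # surface a failed alert instead of dropping it silently


# Alert when the success rate drops (`metrics` is the ScrapeMetrics instance from above).
# Runs with no requests are skipped, because success_rate reports 0.0 for them.
if metrics.total_requests > 0 and metrics.success_rate < 90:
    send_slack_alert(
        f"Success rate dropped to {metrics.success_rate:.1f}%\n"
        f"Failed: {metrics.failed}/{metrics.total_requests}\n"
        f"Recent errors: {', '.join(metrics.errors[-3:])}",
        level="error",
    )
```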
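
If the alert condition is checked on every run or every batch, a persistent failure can flood the channel with identical messages. A cooldown is one common way to limit that. The sketch below is an assumption about how you might do it, not part of the setup above: `AlertThrottle` and the 15-minute default are hypothetical.

```python
import time


class AlertThrottle:
    """Allow at most one alert per key within `cooldown_seconds`."""

    def __init__(self, cooldown_seconds: float = 900.0, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._last_sent: dict[str, float] = {}

    def should_send(self, key: str) -> bool:
        """Return True and start a new cooldown if `key` is not cooling down."""
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._last_sent[key] = now
        return True
```

A caller would check `throttle.should_send("low_success_rate")` before sending the Slack message, using one key per kind of alert.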
Health Check Endpoint
If your scraper runs as a web service, add a health check:
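```python
from datetime import datetime, timezone

from flask import Flask, jsonify

app = Flask(__name__)

# Timezone-aware UTC time of the last successful scrape; None until the first one
last_successful_scrape = None


def mark_scrape_success():
    """Call from the scraping loop (in this process) after each successful run."""
    global last_successful_scrape
    last_successful_scrape = datetime.now(timezone.utc)


@app.route("/health")
def health():
    if last_successful_scrape is None:
        return jsonify({"status": "starting"}), 503
    age = (datetime.now(timezone.utc) - last_successful_scrape).total_seconds()
    if age > 7200:  # No successful scrape in 2 hours
        return jsonify({"status": "stale", "last_scrape_age_seconds": age}), 503
    return jsonify({"status": "healthy", "last_scrape_age_seconds": age})
```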
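
The staleness decision can also live in a plain function, which makes it testable without starting Flask. This is a minimal sketch of that idea; `health_status` is a hypothetical helper that mirrors the 2-hour limit and status codes used in the endpoint.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def health_status(last_success: Optional[datetime],
                  now: datetime,
                  max_age_seconds: float = 7200):
    """Return (body, http_status) for the health endpoint."""
    if last_success is None:
        return {"status": "starting"}, 503
    age = (now - last_success).total_seconds()
    if age > max_age_seconds:
        return {"status": "stale", "last_scrape_age_seconds": age}, 503
    return {"status": "healthy", "last_scrape_age_seconds": age}, 200


# Example with fixed, timezone-aware timestamps
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = health_status(now - timedelta(minutes=5), now)
stale = health_status(now - timedelta(hours=3), now)
```

The route handler then only needs to pass in the current time and wrap the returned body with `jsonify`.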
Monitoring Checklist
- Log every request with URL, status code, and duration
- Track success rate and alert when it drops below your threshold
- Monitor scraped item count to detect site structure changes
- Set up alerting (Slack, email, PagerDuty) for critical failures
- Keep at least 7 days of logs for debugging
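
For the last item, the standard library can handle retention on its own. The sketch below uses `logging.handlers.TimedRotatingFileHandler` to rotate the log file daily and keep seven rotated files; `build_rotating_handler` and the file path are hypothetical names.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def build_rotating_handler(path: str, days_to_keep: int = 7) -> TimedRotatingFileHandler:
    """Create a file handler that rotates at midnight UTC and prunes old files."""
    handler = TimedRotatingFileHandler(
        path,
        when="midnight",           # start a new file once per day
        backupCount=days_to_keep,  # older rotated files are deleted
        encoding="utf-8",
        delay=True,                # do not open the file until the first log call
        utc=True,
    )
    handler.setLevel(logging.INFO)
    return handler
```

Attaching it with `logger.addHandler(build_rotating_handler("scraper.log"))` keeps the current file plus seven rotated ones. A custom formatter can be set on this handler in the same way as on any other.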