
Structured Logging (JSON Logs in Python and PHP)

Print statements don't scale. Structured logging, where every event is a JSON object, makes scrapers queryable, alertable, and debuggable in production.

What you’ll learn

  • Explain why structured logs beat free-text logs in production.
  • Configure JSON logging in Python (structlog / standard logging).
  • Configure JSON logging in Symfony / Monolog.

When a scraper runs once and you're watching the terminal, print is fine. When a scraper runs continuously across 50 worker pods, every log line ships to Loki or Elasticsearch, and an on-call engineer needs to find out why 0.3% of requests are failing, print is a tragedy.

The fix is structured logging: every event is a JSON object with named fields. Search becomes "filter by status_code = 503" instead of grepping unstructured strings.

Free-text vs structured

Free-text:

2026-05-12 14:23:01 INFO Fetched https://practice.scrapingcentral.com/products/42 in 230ms with status 200

Structured:

{"ts":"2026-05-12T14:23:01Z","level":"info","event":"fetch","url":"https://practice.scrapingcentral.com/products/42","duration_ms":230,"status":200,"spider":"products","worker_id":"w-7"}

The second can be queried: status:>=500 AND spider:products | stats count by url returns failure counts per URL in seconds. The first requires regex.

Python, standard logging with python-json-logger

import logging
import sys

from pythonjsonlogger import jsonlogger

logger = logging.getLogger("scraper")
handler = logging.StreamHandler(sys.stdout)   # StreamHandler defaults to stderr; be explicit about stdout
fmt = jsonlogger.JsonFormatter(
    "%(asctime)s %(levelname)s %(name)s %(message)s",
    rename_fields={"asctime": "ts", "levelname": "level"},
)
handler.setFormatter(fmt)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Everything passed via `extra` becomes a top-level key in the JSON line.
logger.info("fetch", extra={
    "url": "https://practice.scrapingcentral.com/products/42",
    "duration_ms": 230,
    "status": 200,
    "spider": "products",
})

Output is JSON to stdout. Containers and log shippers (Vector, Promtail) pick it up.
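A single call like the one above produces roughly this line (key order and the exact timestamp format depend on the formatter settings, e.g. datefmt):

{"ts": "2026-05-12 14:23:01,230", "level": "INFO", "name": "scraper", "message": "fetch", "url": "https://practice.scrapingcentral.com/products/42", "duration_ms": 230, "status": 200, "spider": "products"}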

Or structlog (preferred for new code)

import structlog

structlog.configure(processors=[
    structlog.processors.TimeStamper(fmt="iso"),  # ISO 8601 timestamp on every event
    structlog.processors.add_log_level,           # adds the "level" field
    structlog.processors.JSONRenderer(),          # renders the event dict as one JSON line
])
log = structlog.get_logger()
log = log.bind(spider="products", worker_id="w-7")

url = "https://practice.scrapingcentral.com/products/42"
log.info("fetch", url=url, duration_ms=230, status=200)

bind() lets you attach context once and have every subsequent call include it. Excellent for per-request context.
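For instance, a minimal sketch of per-request binding inside a crawl loop; PRODUCT_URLS and fetch_page are hypothetical stand-ins for your own URL list and HTTP client, and log is the bound logger from above:

import time

for url in PRODUCT_URLS:                          # hypothetical list of URLs to crawl
    req_log = log.bind(url=url)                   # per-request context, attached once
    started = time.monotonic()
    response = fetch_page(url)                    # hypothetical HTTP helper
    req_log.info(
        "fetch",
        status=response.status_code,
        duration_ms=int((time.monotonic() - started) * 1000),
    )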

PHP, Monolog with the JsonFormatter

use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Formatter\JsonFormatter;

$log = new Logger('scraper');
$handler = new StreamHandler('php://stdout', Logger::INFO);
$handler->setFormatter(new JsonFormatter(JsonFormatter::BATCH_MODE_NEWLINES)); // one JSON object per line
$log->pushHandler($handler);

// The context array ends up in the "context" object of the JSON record.
$log->info('fetch', [
    'url' => 'https://practice.scrapingcentral.com/products/42',
    'duration_ms' => 230,
    'status' => 200,
    'spider' => 'products',
]);

In Symfony, monolog.yaml:

monolog:
  handlers:
    main:
      type: stream
      path: "php://stdout"
      level: info
      formatter: monolog.formatter.json

Use Monolog's processors to enrich every log with context (request ID, user, hostname) automatically.

Always include

Make these fields ubiquitous across every log line:

  • ts: timestamp, ISO 8601 with timezone.
  • level: info / warn / error.
  • event: a short type identifier (fetch, parse, retry, dead_letter).
  • spider / job: which scraper.
  • run_id: a UUID per scraper run that groups all related logs.
  • worker_id: which worker process emitted it.
  • url: what was being fetched (when applicable).

Adding run_id is the single most useful upgrade. Filter by run ID and you see exactly what one execution did.
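A minimal sketch of that upgrade with structlog, assuming the configuration shown earlier; uuid4 gives each execution its own ID:

import uuid
import structlog

# One run_id per execution; every line this process emits carries it.
log = structlog.get_logger().bind(
    spider="products",
    run_id=str(uuid.uuid4()),
)

log.info("run_started")
# ... crawl ...
log.info("run_finished", items_scraped=1342)  # illustrative count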

Log levels, disciplined use

  • debug: per-request detail you can turn off in prod.
  • info: normal lifecycle events: started, fetched, finished.
  • warn: recoverable issues: retry, fallback, slow response.
  • error: a real failure that needs attention (in moderation; flooding error logs trains people to ignore them).
  • critical: system-level failure: data store unreachable, ran out of disk.

Resist using error for transient retries. Production teams calibrate alerts to error+; cry-wolf logging makes the next real issue invisible.
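As a sketch of that discipline (fetch_page, TransientHTTPError, and the bound log are hypothetical stand-ins for your own client, exception type, and logger):

MAX_ATTEMPTS = 3                               # hypothetical retry budget

for attempt in range(1, MAX_ATTEMPTS + 1):
    try:
        response = fetch_page(url)             # hypothetical HTTP helper
        log.info("fetch", url=url, status=response.status_code, attempt=attempt)
        break
    except TransientHTTPError as exc:          # hypothetical transient-error type
        if attempt < MAX_ATTEMPTS:
            # Recoverable and about to be retried: warn, don't error.
            log.warning("retry", url=url, attempt=attempt, reason=str(exc))
        else:
            # Retries exhausted: this is the failure worth alerting on.
            log.error("fetch_failed", url=url, attempts=attempt, reason=str(exc))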

Sensitive data

Never log:

  • Plain credentials (proxy URLs with embedded user:pass, API keys).
  • PII you scraped (emails, phone numbers).
  • Full request bodies for endpoints that contain secrets.

Mask in code, or use Monolog/structlog processors that scrub specific keys.
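A sketch of the structlog route: a custom processor that masks a (hypothetical) set of sensitive keys before the renderer sees them:

import structlog

SENSITIVE_KEYS = {"password", "api_key", "proxy_url", "authorization"}  # adjust to your payloads

def scrub_sensitive(logger, method_name, event_dict):
    """Replace the values of sensitive keys before the event is rendered."""
    for key in SENSITIVE_KEYS & event_dict.keys():
        event_dict[key] = "***"
    return event_dict

structlog.configure(processors=[
    scrub_sensitive,                              # must run before the JSON renderer
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.add_log_level,
    structlog.processors.JSONRenderer(),
])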

Sampling at scale

A 50-worker scraper handling 100 requests per second in aggregate logs 360k fetch lines an hour. If most are uninteresting INFO fetch events, sample them:

import random

if random.random() < 0.01:  # 1% sample of routine fetch events
    log.info("fetch", url=url, duration_ms=ms, status=status)

Errors and warnings still log at 100%; the high-volume info traffic samples down, and Loki / Elasticsearch costs drop proportionally.
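The same idea can live in a structlog processor so call sites stay clean; this sketch keeps every warning and error but drops all except roughly 1% of info events (the rate is an assumption to tune):

import random
import structlog

def sample_info(logger, method_name, event_dict, _rate=0.01):
    """Keep warnings and errors; keep only a sample of info-level events."""
    if method_name == "info" and random.random() >= _rate:
        raise structlog.DropEvent              # silently discard this event
    return event_dict

structlog.configure(processors=[
    sample_info,
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.add_log_level,
    structlog.processors.JSONRenderer(),
])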

Local readability

JSON is ugly to read raw. Pipe through jq for local dev:

python scraper.py | jq -C 'select(.level == "error")'

Or use a dev-only formatter:

import os

if os.environ.get("ENV") == "dev":
    structlog.configure(processors=[..., structlog.dev.ConsoleRenderer()])  # human-friendly, colorized output
else:
    structlog.configure(processors=[..., structlog.processors.JSONRenderer()])

Hands-on lab

Add structured logging to your Catalog108 scraper. Log every fetch with url, status, duration_ms, and a run_id. Run the scraper, pipe stdout to a file, and answer two questions with jq queries:

  1. What was the slowest URL this run?
  2. Which URLs returned 5xx?

If both are one-line jq filters, you've structured the logs well.

Quiz, check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.

What's the primary advantage of JSON structured logs over free-text logs in production?
