Structured Logging (JSON Logs in Python and PHP)
Print statements don't scale. Structured logging, where every event is a JSON object, makes scrapers queryable, alertable, and debuggable in production.
What you’ll learn
- Explain why structured logs beat free-text logs in production.
- Configure JSON logging in Python (structlog / standard logging).
- Configure JSON logging in Symfony / Monolog.
When a scraper runs once and you're watching the terminal, print is fine. When a scraper runs continuously across 50 worker pods, every log line ships to Loki or Elasticsearch, and an on-call engineer needs to find why 0.3% of requests are failing, print is a tragedy.
The fix is structured logging: every event is a JSON object with named fields. Search becomes "filter by status_code = 503" instead of grepping unstructured strings.
Free-text vs structured
Free-text:
2026-05-12 14:23:01 INFO Fetched https://practice.scrapingcentral.com/products/42 in 230ms with status 200
Structured:
{"ts":"2026-05-12T14:23:01Z","level":"info","event":"fetch","url":"https://practice.scrapingcentral.com/products/42","duration_ms":230,"status":200,"spider":"products","worker_id":"w-7"}
The second can be queried: status:>=500 AND spider:products | stats count by url returns failure rates per URL in seconds. The first requires regex.
Python, standard logging with python-json-logger
import logging
from pythonjsonlogger import jsonlogger

logger = logging.getLogger("scraper")
handler = logging.StreamHandler()
fmt = jsonlogger.JsonFormatter(
    "%(asctime)s %(levelname)s %(name)s %(message)s",
    rename_fields={"asctime": "ts", "levelname": "level"},
)
handler.setFormatter(fmt)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("fetch", extra={
    "url": "https://practice.scrapingcentral.com/products/42",
    "duration_ms": 230,
    "status": 200,
    "spider": "products",
})
Output is JSON to stdout. Containers and log shippers (Vector, Promtail) pick it up.
Or structlog (preferred for new code)
import structlog

structlog.configure(processors=[
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.add_log_level,
    structlog.processors.JSONRenderer(),
])

log = structlog.get_logger()
log = log.bind(spider="products", worker_id="w-7")

url = "https://practice.scrapingcentral.com/products/42"
log.info("fetch", url=url, duration_ms=230, status=200)
bind() lets you attach context once and have every subsequent call include it. Excellent for per-request context.
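For example, a per-request child logger can carry the URL and a request ID through every log call in that scope. A minimal sketch; the request_id and the hardcoded status/duration values are illustrative:

import uuid
import structlog

log = structlog.get_logger().bind(spider="products", worker_id="w-7")

def fetch(url: str) -> None:
    # Bind per-request fields once; every call on req_log includes them.
    req_log = log.bind(url=url, request_id=str(uuid.uuid4()))
    req_log.info("fetch_start")
    # ... perform the HTTP request here ...
    req_log.info("fetch_done", status=200, duration_ms=230)  # placeholder values

fetch("https://practice.scrapingcentral.com/products/42")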
PHP, Monolog with the JsonFormatter
use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Formatter\JsonFormatter;

$log = new Logger('scraper');
$handler = new StreamHandler('php://stdout', Logger::INFO);
$handler->setFormatter(new JsonFormatter(JsonFormatter::BATCH_MODE_NEWLINES));
$log->pushHandler($handler);

$log->info('fetch', [
    'url' => 'https://practice.scrapingcentral.com/products/42',
    'duration_ms' => 230,
    'status' => 200,
    'spider' => 'products',
]);
In Symfony, monolog.yaml:
monolog:
    handlers:
        main:
            type: stream
            path: "php://stdout"
            level: info
            formatter: monolog.formatter.json
Use Monolog's processors to enrich every log with context (request ID, user, hostname) automatically.
Always include
Make these fields ubiquitous across every log line:
| Field | Why |
|---|---|
| ts | Timestamp, ISO 8601 with timezone |
| level | info / warn / error |
| event | A short type identifier (fetch, parse, retry, dead_letter) |
| spider / job | Which scraper |
| run_id | A UUID per scraper run; groups all related logs |
| worker_id | Which worker process emitted it |
| url (when applicable) | What was being fetched |
Adding run_id is the single most useful upgrade. Filter by run ID and you see exactly what one execution did.
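A minimal sketch of wiring that up with structlog; the field values are illustrative:

import uuid
import structlog

run_id = str(uuid.uuid4())  # one UUID per scraper run

# Bind it once at startup; every subsequent log line carries run_id.
log = structlog.get_logger().bind(run_id=run_id, spider="products")

log.info("run_start")
# ... do the scraping ...
log.info("run_finished", pages_fetched=1234)  # placeholder count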
Log levels, disciplined use
| Level | When |
|---|---|
| debug | Per-request detail you can turn off in prod |
| info | Normal lifecycle: started, fetched, finished |
| warn | Recoverable issues: retry, fallback, slow response |
| error | A real failure that needs attention (sparingly; flooding error logs trains people to ignore them) |
| critical | System-level failure: data store unreachable, ran out of disk |
Resist using error for transient retries. Production teams calibrate alerts to error+; cry-wolf logging makes the next real issue invisible.
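One way to keep that discipline is to log each retry at warn and reserve error for the final, unrecoverable failure. A self-contained sketch, with a fake request function standing in for your HTTP client:

import random
import time
import structlog

log = structlog.get_logger()

class TransientError(Exception):
    """Stand-in for a timeout, 503, or connection reset."""

def do_request(url: str) -> str:
    # Placeholder request that fails randomly to exercise the retry path.
    if random.random() < 0.5:
        raise TransientError("connection reset")
    return "<html>...</html>"

MAX_RETRIES = 3

def fetch_with_retries(url: str) -> str:
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return do_request(url)
        except TransientError as exc:
            if attempt < MAX_RETRIES:
                # Recoverable and retried: warn, not error.
                log.warning("retry", url=url, attempt=attempt, reason=str(exc))
                time.sleep(2 ** attempt)
            else:
                # Only the final, unrecoverable failure is an error.
                log.error("fetch_failed", url=url, attempts=attempt, reason=str(exc))
                raise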
Sensitive data
Never log:
- Plain credentials (proxy URLs with embedded user:pass, API keys).
- PII you scraped (emails, phone numbers).
- Full request bodies for endpoints that contain secrets.
Mask in code, or use Monolog/structlog processors that scrub specific keys.
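In structlog, a small custom processor placed before the renderer can do the scrubbing. A sketch; the key names are illustrative, so adjust them to whatever your scraper actually logs:

import structlog

SENSITIVE_KEYS = {"password", "api_key", "proxy_url", "authorization"}

def scrub_sensitive(logger, method_name, event_dict):
    # Replace values of known-sensitive keys before they are rendered.
    for key in SENSITIVE_KEYS & event_dict.keys():
        event_dict[key] = "***"
    return event_dict

structlog.configure(processors=[
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.add_log_level,
    scrub_sensitive,  # must run before the renderer
    structlog.processors.JSONRenderer(),
])

log = structlog.get_logger()
log.info("fetch", url="https://practice.scrapingcentral.com/products/42",
         proxy_url="http://user:pass@proxy:8080")  # proxy_url is emitted as "***"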
Sampling at scale
A 50-worker scraper doing a combined 100 req/s logs 360k lines/hour. If most are uninteresting INFO fetch events, sample them:
import random

if random.random() < 0.01:  # 1% sample
    log.info("fetch", url=url, duration_ms=ms, status=status)
Errors and warns log at 100%. The expensive info traffic samples down. Loki / Elasticsearch costs drop proportionally.
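Instead of sampling at every call site, you can centralize the decision in a processor that drops a fraction of info events while letting warn and above through untouched. A sketch assuming structlog; the 1% rate is illustrative:

import random
import structlog

SAMPLE_RATE = 0.01  # keep 1% of info-level events

def sample_info(logger, method_name, event_dict):
    # warn/error/critical always pass through; only info is sampled down.
    if method_name == "info" and random.random() >= SAMPLE_RATE:
        raise structlog.DropEvent
    return event_dict

structlog.configure(processors=[
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.add_log_level,
    sample_info,
    structlog.processors.JSONRenderer(),
])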
Local readability
JSON is ugly to read raw. Pipe through jq for local dev:
python scraper.py | jq -C 'select(.level == "error")'
Or use a dev-only formatter:
import os
import structlog

if os.environ.get("ENV") == "dev":
    structlog.configure(processors=[..., structlog.dev.ConsoleRenderer()])
else:
    structlog.configure(processors=[..., structlog.processors.JSONRenderer()])
Hands-on lab
Add structured logging to your Catalog108 scraper. Log every fetch with url, status, duration_ms, and a run_id. Run the scraper, pipe stdout to a file, and answer two questions with jq queries:
- What was the slowest URL this run?
- Which URLs returned 5xx?
If both are one-line jq filters, you've structured the logs well.
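If you want to cross-check your jq answers, a short Python read of the same file should agree with them. This sketch assumes one JSON object per line with the field names above and a log file called scraper.log:

import json

events = []
with open("scraper.log") as fh:
    for line in fh:
        entry = json.loads(line)
        if entry.get("event") == "fetch":
            events.append(entry)

slowest = max(events, key=lambda e: e["duration_ms"])
print("slowest:", slowest["url"], slowest["duration_ms"], "ms")

print("5xx URLs:", sorted({e["url"] for e in events if e["status"] >= 500}))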
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.