Structured Logging (JSON Logs in Python and PHP)
Print statements don't scale. Structured logging, where every event is a JSON object, makes scrapers queryable, alertable, and debuggable in production.
What you’ll learn
- Explain why structured logs beat free-text logs in production.
- Configure JSON logging in Python (structlog / standard logging).
- Configure JSON logging in Symfony / Monolog.
When a scraper runs once and you're watching the terminal, print is fine. When a scraper runs continuously across 50 worker pods, every log line ships to Loki or Elasticsearch, and an on-call engineer needs to find why 0.3% of requests are failing, print is a tragedy.
The fix is structured logging: every event is a JSON object with named fields. Search becomes "filter by status_code = 503" instead of grepping unstructured strings.
Free-text vs structured
Free-text:
2026-05-12 14:23:01 INFO Fetched https://practice.scrapingcentral.com/products/42 in 230ms with status 200
Structured:
{"ts":"2026-05-12T14:23:01Z","level":"info","event":"fetch","url":"https://practice.scrapingcentral.com/products/42","duration_ms":230,"status":200,"spider":"products","worker_id":"w-7"}
The second can be queried: status:>=500 AND spider:products | stats count by url returns failure rates per URL in seconds. The first requires regex.
Python, standard logging with python-json-logger
import logging
from pythonjsonlogger import jsonlogger

logger = logging.getLogger("scraper")
handler = logging.StreamHandler()
fmt = jsonlogger.JsonFormatter(
    "%(asctime)s %(levelname)s %(name)s %(message)s",
    rename_fields={"asctime": "ts", "levelname": "level"},
)
handler.setFormatter(fmt)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("fetch", extra={
    "url": "https://practice.scrapingcentral.com/products/42",
    "duration_ms": 230,
    "status": 200,
    "spider": "products",
})
Output is JSON to stdout. Containers and log shippers (Vector, Promtail) pick it up.
Or structlog (preferred for new code)
import structlog

structlog.configure(processors=[
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.add_log_level,
    structlog.processors.JSONRenderer(),
])

log = structlog.get_logger()
log = log.bind(spider="products", worker_id="w-7")

url = "https://practice.scrapingcentral.com/products/42"
log.info("fetch", url=url, duration_ms=230, status=200)
bind() lets you attach context once and have every subsequent call include it. Excellent for per-request context.
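For example, a per-request child logger can carry the URL and a request ID through every log call in that scope. A minimal sketch; the request_id and the hardcoded status/duration values are illustrative:

import uuid
import structlog

log = structlog.get_logger().bind(spider="products", worker_id="w-7")

def fetch(url: str) -> None:
    # Bind per-request fields once; every call on req_log includes them.
    req_log = log.bind(url=url, request_id=str(uuid.uuid4()))
    req_log.info("fetch_start")
    # ... perform the HTTP request here ...
    req_log.info("fetch_done", status=200, duration_ms=230)  # placeholder values

fetch("https://practice.scrapingcentral.com/products/42")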
PHP, Monolog with the JsonFormatter
use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Formatter\JsonFormatter;

$log = new Logger('scraper');
$handler = new StreamHandler('php://stdout', Logger::INFO);
$handler->setFormatter(new JsonFormatter(JsonFormatter::BATCH_MODE_NEWLINES));
$log->pushHandler($handler);

$log->info('fetch', [
    'url' => 'https://practice.scrapingcentral.com/products/42',
    'duration_ms' => 230,
    'status' => 200,
    'spider' => 'products',
]);
In Symfony, monolog.yaml:
monolog:
    handlers:
        main:
            type: stream
            path: "php://stdout"
            level: info
            formatter: monolog.formatter.json
Use Monolog's processors to enrich every log with context (request ID, user, hostname) automatically.
Always include
Make these fields ubiquitous across every log line:
| Field | Why |
|---|---|
| ts | Timestamp, ISO 8601 with timezone |
| level | info / warn / error |
| event | A short type identifier (fetch, parse, retry, dead_letter) |
| spider / job | Which scraper |
| run_id | A UUID per scraper run; groups all related logs |
| worker_id | Which worker process emitted it |
| url (when applicable) | What was being fetched |
Adding run_id is the single most useful upgrade. Filter by run ID and you see exactly what one execution did.
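A minimal sketch of wiring that up with structlog; the field values are illustrative:

import uuid
import structlog

run_id = str(uuid.uuid4())  # one UUID per scraper run

# Bind it once at startup; every subsequent log line carries run_id.
log = structlog.get_logger().bind(run_id=run_id, spider="products")

log.info("run_start")
# ... do the scraping ...
log.info("run_finished", pages_fetched=1234)  # placeholder count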
Log levels, disciplined use
| Level | When |
|---|---|
| debug | Per-request detail you can turn off in prod |
| info | Normal lifecycle: started, fetched, finished |
| warn | Recoverable issues: retry, fallback, slow response |
| error | A real failure that needs attention (sparingly; flooding error logs trains people to ignore them) |
| critical | System-level failure: data store unreachable, ran out of disk |
Resist using error for transient retries. Production teams calibrate alerts to error+; cry-wolf logging makes the next real issue invisible.
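One way to keep that discipline is to log each retry at warn and reserve error for the final, unrecoverable failure. A self-contained sketch, with a fake request function standing in for your HTTP client:

import random
import time
import structlog

log = structlog.get_logger()

class TransientError(Exception):
    """Stand-in for a timeout, 503, or connection reset."""

def do_request(url: str) -> str:
    # Placeholder request that fails randomly to exercise the retry path.
    if random.random() < 0.5:
        raise TransientError("connection reset")
    return "<html>...</html>"

MAX_RETRIES = 3

def fetch_with_retries(url: str) -> str:
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return do_request(url)
        except TransientError as exc:
            if attempt < MAX_RETRIES:
                # Recoverable and retried: warn, not error.
                log.warning("retry", url=url, attempt=attempt, reason=str(exc))
                time.sleep(2 ** attempt)
            else:
                # Only the final, unrecoverable failure is an error.
                log.error("fetch_failed", url=url, attempts=attempt, reason=str(exc))
                raise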
Sensitive data
Never log:
- Plain credentials (proxy URLs with embedded user:pass, API keys).
- PII you scraped (emails, phone numbers).
- Full request bodies for endpoints that contain secrets.
Mask in code, or use Monolog/structlog processors that scrub specific keys.
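In structlog, a small custom processor placed before the renderer can do the scrubbing. A sketch; the key names are illustrative, so adjust them to whatever your scraper actually logs:

import structlog

SENSITIVE_KEYS = {"password", "api_key", "proxy_url", "authorization"}

def scrub_sensitive(logger, method_name, event_dict):
    # Replace values of known-sensitive keys before they are rendered.
    for key in SENSITIVE_KEYS & event_dict.keys():
        event_dict[key] = "***"
    return event_dict

structlog.configure(processors=[
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.add_log_level,
    scrub_sensitive,  # must run before the renderer
    structlog.processors.JSONRenderer(),
])

log = structlog.get_logger()
log.info("fetch", url="https://practice.scrapingcentral.com/products/42",
         proxy_url="http://user:pass@proxy:8080")  # proxy_url is emitted as "***"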
Sampling at scale
A 50-worker scraper doing a combined 100 req/s logs 360k lines/hour. If most are uninteresting INFO fetch events, sample them:
import random

if random.random() < 0.01:  # 1% sample
    log.info("fetch", url=url, duration_ms=ms, status=status)
Errors and warns log at 100%. The expensive info traffic samples down. Loki / Elasticsearch costs drop proportionally.
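Instead of sampling at every call site, you can centralize the decision in a processor that drops a fraction of info events while letting warn and above through untouched. A sketch assuming structlog; the 1% rate is illustrative:

import random
import structlog

SAMPLE_RATE = 0.01  # keep 1% of info-level events

def sample_info(logger, method_name, event_dict):
    # warn/error/critical always pass through; only info is sampled down.
    if method_name == "info" and random.random() >= SAMPLE_RATE:
        raise structlog.DropEvent
    return event_dict

structlog.configure(processors=[
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.add_log_level,
    sample_info,
    structlog.processors.JSONRenderer(),
])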
Local readability
JSON is ugly to read raw. Pipe through jq for local dev:
python scraper.py | jq -C 'select(.level == "error")'
Or use a dev-only formatter:
import os
import structlog

if os.environ.get("ENV") == "dev":
    structlog.configure(processors=[..., structlog.dev.ConsoleRenderer()])
else:
    structlog.configure(processors=[..., structlog.processors.JSONRenderer()])
Hands-on lab
Add structured logging to your Catalog108 scraper. Log every fetch with url, status, duration_ms, and a run_id. Run the scraper, pipe stdout to a file, and answer two questions with jq queries:
- What was the slowest URL this run?
- Which URLs returned 5xx?
If both are one-line jq filters, you've structured the logs well.
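If you want to cross-check your jq answers, a short Python read of the same file should agree with them. This sketch assumes one JSON object per line with the field names above and a log file called scraper.log:

import json

events = []
with open("scraper.log") as fh:
    for line in fh:
        entry = json.loads(line)
        if entry.get("event") == "fetch":
            events.append(entry)

slowest = max(events, key=lambda e: e["duration_ms"])
print("slowest:", slowest["url"], slowest["duration_ms"], "ms")

print("5xx URLs:", sorted({e["url"] for e in events if e["status"] >= 500}))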
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.