Centralized Logging (Loki, Elasticsearch)
Once scrapers run on multiple hosts, you need a central place to query logs. This lesson compares the two main options, Loki and Elasticsearch, and covers their tradeoffs, pipelines, and costs.
What you’ll learn
- Compare Loki and Elasticsearch for scraper logs.
- Set up a log-shipping pipeline from a containerised scraper.
- Choose retention and labelling strategies that don't bankrupt you.
A single-host scraper logs to a file. A fleet of scrapers across containers and hosts needs a central destination. Two open-source projects dominate this space: Grafana Loki and Elasticsearch (the ELK / OpenSearch family).
Loki vs Elasticsearch
| Aspect | Loki | Elasticsearch |
|---|---|---|
| Indexing | Indexes labels only; log lines themselves are not full-text indexed | Full-text indexes every field |
| Storage cost | Very low, log lines stored compressed in S3/MinIO | Higher, full inverted index doubles or triples storage |
| Query speed (label filter) | Fast | Fast |
| Query speed (free-text grep through millions of lines) | Slower | Faster |
| Operational complexity | Simple | More moving parts (master, data, ingest nodes) |
| UI | Grafana | Kibana / OpenSearch Dashboards |
| Good fit | "Filter by labels first, grep the rest" | "Search any text across everything" |
For scraping workloads, Loki is usually the right default. You'll filter by labels (spider, env, level) before grepping. Elasticsearch's full-text power matters less when your logs are structured JSON.
The pipeline
Logs flow:
```
Scraper (stdout JSON) → Log shipper (Promtail, Vector, Fluent Bit) → Storage backend → Query UI
```
Containers write JSON to stdout. The container runtime captures it. A shipper reads it, attaches metadata (pod name, namespace, label set), and forwards to Loki or Elasticsearch.
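To make that concrete, here is a hypothetical log line as the scraper emits it; the field names are illustrative, chosen to match the examples later in this lesson:

```json
{"ts": "2024-05-04T12:00:00Z", "level": "error", "spider": "products",
 "url": "https://example.com/p/123", "status": 503, "duration_ms": 842,
 "run_id": "a1b2c3"}
```

The shipper then attaches its own metadata (e.g. `pod`, `namespace`, `job=scrapers`) before pushing, so the line arrives with both app-level fields and infrastructure context.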
Promtail for Loki
A minimal promtail-config.yaml:
```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: scrapers
    static_configs:
      - targets: [localhost]
        labels:
          job: scrapers
          __path__: /var/log/scrapers/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            spider: spider
            url: url
      - labels:
          level:
          spider:
```
Two important rules:
- Promote a few fields to labels (level, spider, env). These are indexed; queries by label are cheap.
- Don't promote high-cardinality fields (URL, run_id, IP). Loki indexes labels in memory; cardinality explosion ruins it. Keep URL and run_id inside the log line and grep for them via LogQL.
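Keeping `run_id` in the line body still leaves it queryable. A sketch of the pattern, assuming the JSON field names used elsewhere in this lesson and a hypothetical `run_id` value:

```logql
# Cheap line-body grep first, then parse and match the field exactly
{job="scrapers"} |= "a1b2c3" | json | run_id="a1b2c3"
```

The `|=` filter narrows the stream before the more expensive `| json` parse runs, which is the idiomatic ordering in LogQL.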
LogQL, the query language
LogQL feels like PromQL.
```logql
# Errors from product spider in the last hour
{spider="products", level="error"}

# 5xx responses, grepping for the status field
{spider="products"} |= "\"status\":5"

# Top 10 most-failed URLs
topk(10, sum by (url) (
  count_over_time({spider="products", level="error"} | json [1h])
))
```
LogQL filters by label first (fast), then text-greps within the matched streams (still fast, since each label set selects a small slice of the data).
Vector, the modern shipper
Promtail is fine; Vector is more flexible. It can route to Loki, Elasticsearch, Kafka, and S3 simultaneously. A Vector config for scrapers:
```toml
[sources.scraper_logs]
type = "kubernetes_logs"

[transforms.parse_json]
type = "remap"
inputs = ["scraper_logs"]
source = '''
parsed = parse_json!(string!(.message))
. = merge!(., parsed)
'''

[sinks.loki]
type = "loki"
inputs = ["parse_json"]
endpoint = "http://loki:3100"
labels.spider = "{{ spider }}"
labels.level = "{{ level }}"

[sinks.cold_archive]
type = "aws_s3"
inputs = ["parse_json"]
bucket = "scraper-cold-logs"
compression = "gzip"
```
Same logs, two destinations: Loki for queries, S3 for long-term cold archive.
Retention
Logs aren't free. A typical sane retention:
| Tier | Where | Duration |
|---|---|---|
| Hot, queryable | Loki / Elasticsearch | 7–30 days |
| Cold, archived | S3 gzipped | 1–3 years |
| Compliance / audit | S3 Glacier | As required |
Loki's S3-backed chunk storage already amortizes well; keeping ~30 days hot is realistic on modest budgets. Elasticsearch retention is more expensive; closing or rolling old indices to cheap storage tiers matters.
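On the Loki side, retention is enforced by the compactor. A minimal config fragment, assuming a recent Loki with the retention feature enabled (paths and durations are illustrative):

```yaml
# loki-config.yaml (fragment)
compactor:
  working_directory: /loki/compactor
  retention_enabled: true        # actually delete expired chunks

limits_config:
  retention_period: 720h         # 30 days hot, then gone from Loki
```

Note that without `retention_enabled: true` the compactor only compacts; expired data is never deleted.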
Elasticsearch when it's right
ES becomes the better choice when:
- You need free-text search across log content frequently.
- You already run ES for application search and don't want a second system.
- You want rich aggregations / dashboards in Kibana.
Be aware of operational cost: ES masters, dedicated data nodes, hot-warm-cold tiering, index lifecycle management. It's a real system to operate.
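Index lifecycle management is where ES retention lives. A minimal policy sketch (the policy name, sizes, and ages are illustrative):

```json
PUT _ilm/policy/scraper-logs
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_size": "50gb", "max_age": "1d" } } },
      "warm":   { "min_age": "7d",  "actions": { "shrink": { "number_of_shards": 1 } } },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}
```

Attach the policy to the index template for your scraper log indices and rollover/deletion happen without manual curation.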
Symfony + Monolog → Loki direct
Monolog has a Loki handler. For non-containerised PHP, you can push directly:
```php
<?php

use Itspire\MonologLoki\Handler\LokiHandler;
use Monolog\Logger;

$log = new Logger('scraper');
$handler = new LokiHandler([
    'entrypoint' => 'http://loki:3100',
    'context'    => ['service' => 'scraper'],
]);
$log->pushHandler($handler);
```
For container deployments, prefer stdout + a shipper; it keeps dependencies out of the app.
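The stdout route needs only plain Monolog, with no Loki-specific package in the app. A sketch:

```php
<?php

use Monolog\Formatter\JsonFormatter;
use Monolog\Handler\StreamHandler;
use Monolog\Logger;

$log = new Logger('scraper');

// One JSON object per line on stdout; the container runtime
// captures it and the shipper (Promtail/Vector) does the rest.
$handler = new StreamHandler('php://stdout');
$handler->setFormatter(new JsonFormatter());
$log->pushHandler($handler);

$log->error('fetch failed', ['spider' => 'products', 'status' => 503]);
```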
Costs at scale
Rough order of magnitude for 100GB/day of logs:
- Loki on a single 4-vCPU/16GB node + S3: ~$50–100/month.
- Elasticsearch on managed (e.g. Elastic Cloud, OpenSearch managed): $500–2000/month depending on retention.
- Self-hosted ES on big VMs: $200–600/month plus your operational time.
Cost discipline: sample low-value INFO logs (lesson 55), retain less than you think you need (you almost never query >30-day-old logs), and offload to S3.
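With the Vector pipeline from earlier, sampling can live in the same config as a transform. A sketch, assuming Vector's `sample` transform semantics of keeping 1 in `rate` events (the rate and condition are illustrative):

```toml
[transforms.sample_info]
type = "sample"
inputs = ["parse_json"]
rate = 10                          # keep roughly 1 in 10 events...
exclude = '.level == "error"'      # ...but never drop errors
```

Point the sinks at `sample_info` instead of `parse_json` and the volume drop applies to both Loki and the S3 archive.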
Hands-on lab
Run Loki + Promtail + Grafana with docker-compose (sample stacks are widely available). Point Promtail at your Catalog108 scraper's stdout. Query:
- Errors in the last hour, by URL.
- Average request duration over time (`avg_over_time` on `duration_ms`).
- Logs from a specific `run_id`.
You'll spend more time tuning labels than running queries, which is the right ratio.
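A minimal compose file for the lab might look like this; image tags and mount paths are illustrative, and published sample stacks differ in detail:

```yaml
version: "3"
services:
  loki:
    image: grafana/loki:2.9.0
    ports: ["3100:3100"]
  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - ./promtail-config.yaml:/etc/promtail/config.yml
      - /var/log/scrapers:/var/log/scrapers:ro
    command: -config.file=/etc/promtail/config.yml
  grafana:
    image: grafana/grafana:10.0.0
    ports: ["3000:3000"]
```

Start it, add Loki (`http://loki:3100`) as a Grafana data source, and the LogQL queries above should work against your scraper's output.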
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.