Intermediate · 5 min read

Centralized Logging (Loki, Elasticsearch)

Once scrapers run on multiple hosts, you need a central place to query logs. Loki and Elasticsearch are the two main options; this lesson covers their tradeoffs, pipelines, and costs.

What you’ll learn

  • Compare Loki and Elasticsearch for scraper logs.
  • Set up a log-shipping pipeline from a containerised scraper.
  • Choose retention and labelling strategies that don't bankrupt you.

A single-host scraper logs to a file. A fleet of scrapers across containers and hosts needs a central destination. Two open-source projects dominate this space: Grafana Loki and Elasticsearch (the ELK / OpenSearch family).

Loki vs Elasticsearch

| Aspect | Loki | Elasticsearch |
|---|---|---|
| Indexing | Labels only; log lines themselves are not full-text indexed | Full-text indexes every field |
| Storage cost | Very low; compressed log lines in S3/MinIO | Higher; a full inverted index doubles or triples storage |
| Query speed (label filter) | Fast | Fast |
| Query speed (free-text grep through millions of lines) | Slower | Faster |
| Operational complexity | Simple | More moving parts (master, data, ingest nodes) |
| UI | Grafana | Kibana / OpenSearch Dashboards |
| Good fit | "Filter by labels first, grep the rest" | "Search any text across everything" |

For scraping workloads, Loki is usually the right default. You'll filter by labels (spider, env, level) before grepping. Elasticsearch's full-text power matters less when your logs are structured JSON.

The pipeline

Logs flow:

Scraper (stdout JSON) → Shipper (Promtail, Vector, Fluent Bit) → Storage (Loki / Elasticsearch) → Query UI (Grafana / Kibana)

Containers write JSON to stdout. The container runtime captures it. A shipper reads it, attaches metadata (pod name, namespace, label set), and forwards to Loki or Elasticsearch.
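
Concretely, one event might look like this at each stage (the pod and namespace fields are illustrative of what a Kubernetes shipper adds):

# As the scraper writes it to stdout
{"level":"error","spider":"products","url":"https://example.com/p/42","status":503,"duration_ms":812}

# After the shipper attaches runtime metadata
{"level":"error","spider":"products","url":"https://example.com/p/42","status":503,"duration_ms":812,"pod":"scraper-7d4f9","namespace":"scraping"}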

Promtail for Loki

A minimal promtail-config.yaml:

server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: scrapers
    static_configs:
      - targets: [localhost]
        labels:
          job: scrapers
          __path__: /var/log/scrapers/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            spider: spider
            url: url
      - labels:
          level:
          spider:

Two important rules:

  1. Promote a few fields to labels (level, spider, env). These are indexed; queries by label are cheap.
  2. Don't promote high-cardinality fields (URL, run_id, IP). Loki indexes labels in memory; cardinality explosion ruins it. Keep URL and run_id inside the log line and grep for them via LogQL (examples below).
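
For illustration, here's what that grep looks like once you know LogQL (covered next); the run_id value is invented:

# Line filter: substring match inside the raw log line
{job="scrapers"} |= "\"run_id\":\"a1b2c3\""

# Or parse the JSON and filter on the extracted field
{job="scrapers"} | json | run_id = "a1b2c3"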

LogQL, the query language

LogQL feels like PromQL.

# Errors from product spider in the last hour
{spider="products", level="error"}

# 5xx responses, grepping for the status field
{spider="products"} |= "\"status\":5"

# Top 10 most-failed URLs
topk(10, sum by (url) (
  count_over_time({spider="products", level="error"} | json [1h])
))

LogQL filters by label first (fast), then text-greps within the matched streams (still fast, since each label set is a small slice of the data).
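
The same pipeline machinery powers metric queries over fields you kept inside the line, which is how the lab's avg_over_time exercise works. A sketch, assuming your logs carry a numeric duration_ms field:

# 95th-percentile request duration per spider, unwrapped from the JSON payload
quantile_over_time(0.95,
  {job="scrapers"} | json | unwrap duration_ms [5m]
) by (spider)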

Vector, the modern shipper

Promtail is fine; Vector is more flexible. It can route to Loki, Elasticsearch, Kafka, and S3 simultaneously. A Vector config for a scraper fleet:

[sources.scraper_logs]
type = "kubernetes_logs"

[transforms.parse_json]
type = "remap"
inputs = ["scraper_logs"]
source = '''
  parsed = parse_json!(string!(.message))
  . = merge!(., parsed)
'''

[sinks.loki]
type = "loki"
inputs = ["parse_json"]
endpoint = "http://loki:3100"
encoding.codec = "json"
labels.spider = "{{ spider }}"
labels.level = "{{ level }}"

[sinks.cold_archive]
type = "aws_s3"
inputs = ["parse_json"]
bucket = "scraper-cold-logs"
region = "us-east-1"  # placeholder; match your bucket's region
compression = "gzip"
encoding.codec = "json"

Same logs, two destinations: Loki for queries, S3 for long-term cold archive.

Retention

Logs aren't free. A typical sane retention:

| Tier | Where | Duration |
|---|---|---|
| Hot, queryable | Loki / Elasticsearch | 7–30 days |
| Cold, archived | S3, gzipped | 1–3 years |
| Compliance / audit | S3 Glacier | As required |

Loki's S3-backed chunk storage amortizes well; keeping ~30 days hot is realistic on a modest budget. Elasticsearch retention is more expensive; closing old indices or rolling them to cheap storage tiers matters.
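
In Loki, hot retention is a couple of config keys. A minimal sketch for ~30 days, assuming Loki 2.x with the compactor handling deletion:

limits_config:
  retention_period: 720h   # ~30 days of hot, queryable logs

compactor:
  retention_enabled: true  # the compactor deletes expired chunks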

Elasticsearch when it's right

ES becomes the better choice when:

  • You need free-text search across log content frequently.
  • You already run ES for application search and don't want a second system.
  • You want rich aggregations / dashboards in Kibana.

Be aware of operational cost: ES masters, dedicated data nodes, hot-warm-cold tiering, index lifecycle management. It's a real system to operate.
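
If you do run ES, set up index lifecycle management on day one. A sketch of a rollover-then-delete policy; the policy name and thresholds are assumptions to adjust for your volume:

PUT _ilm/policy/scraper-logs
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}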

Symfony + Monolog → Loki direct

Monolog has a Loki handler. For non-containerised PHP, you can push directly:

use Itspire\MonologLoki\Handler\LokiHandler;
use Monolog\Logger;

$log = new Logger('scraper');
$log->pushHandler(new LokiHandler([
    'entrypoint' => 'http://loki:3100',
    'context' => ['service' => 'scraper'],
]));

For container deployments, prefer stdout + shipper: fewer dependencies in the app.
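
The stdout variant needs no Loki-specific dependency, just plain Monolog. A minimal sketch:

use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Formatter\JsonFormatter;

// JSON lines to stdout; the container runtime and shipper take it from there
$handler = new StreamHandler('php://stdout');
$handler->setFormatter(new JsonFormatter());

$log = new Logger('scraper');
$log->pushHandler($handler);
$log->info('fetched', ['spider' => 'products', 'status' => 200]);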

Costs at scale

Rough order of magnitude for 100GB/day of logs:

  • Loki on a single 4-vCPU/16GB node + S3: ~$50–100/month.
  • Elasticsearch on managed (e.g. Elastic Cloud, OpenSearch managed): $500–2000/month depending on retention.
  • Self-hosted ES on big VMs: $200–600/month plus your operational time.

Cost discipline: sample low-value INFO logs (lesson 55), retain less than you think you need (you almost never query >30-day-old logs), and offload to S3.
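
If you ship with Vector, sampling is one extra transform between parse_json and the Loki sink. A sketch; the 1-in-10 rate is an assumption to tune:

[transforms.sample_info]
type = "sample"
inputs = ["parse_json"]
rate = 10                     # keep roughly 1 in 10 events
exclude = '.level != "info"'  # never sample away warnings or errors

Point the Loki sink's inputs at sample_info; the S3 archive can keep reading from parse_json so the cold tier stays complete.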

Hands-on lab

Run Loki + Promtail + Grafana with docker-compose (a minimal compose sketch follows the query list). Point Promtail at your Catalog108 scraper's stdout. Query:

  1. Errors in the last hour, by URL.
  2. Average request duration over time (avg_over_time on duration_ms).
  3. Logs from a specific run_id.
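
A minimal compose sketch to start from; image tags and the anonymous-auth shortcut are assumptions for a local lab, not production settings:

services:
  loki:
    image: grafana/loki:2.9.4
    ports: ["3100:3100"]
  promtail:
    image: grafana/promtail:2.9.4
    volumes:
      - ./promtail-config.yaml:/etc/promtail/config.yaml:ro
      - /var/log/scrapers:/var/log/scrapers:ro
    command: -config.file=/etc/promtail/config.yaml
  grafana:
    image: grafana/grafana:10.4.2
    ports: ["3000:3000"]
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true  # lab convenience only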

You'll spend more time tuning labels than running queries, which is the right ratio.

Quiz, check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.


What's the architectural difference between Loki and Elasticsearch?
