Centralized Logging (Loki, Elasticsearch)
Once scrapers run on multiple hosts, you need a central place to query logs. This lesson compares the two main options, Loki and Elasticsearch, and covers their tradeoffs, pipelines, and costs.
What you’ll learn
- Compare Loki and Elasticsearch for scraper logs.
- Set up a log-shipping pipeline from a containerised scraper.
- Choose retention and labelling strategies that don't bankrupt you.
A single-host scraper logs to a file. A fleet of scrapers across containers and hosts needs a central destination. Two open-source projects dominate this space: Grafana Loki and Elasticsearch (the ELK / OpenSearch family).
Loki vs Elasticsearch
| Aspect | Loki | Elasticsearch |
|---|---|---|
| Indexing | Indexes labels only; log lines themselves are not full-text indexed | Full-text indexes every field |
| Storage cost | Very low, log lines stored compressed in S3/MinIO | Higher, full inverted index doubles or triples storage |
| Query speed (label filter) | Fast | Fast |
| Query speed (free-text grep through millions of lines) | Slower | Faster |
| Operational complexity | Simple | More moving parts (master, data, ingest nodes) |
| UI | Grafana | Kibana / OpenSearch Dashboards |
| Good fit | "Filter by labels first, grep the rest" | "Search any text across everything" |
For scraping workloads, Loki is usually the right default. You'll filter by labels (spider, env, level) before grepping. Elasticsearch's full-text power matters less when your logs are structured JSON.
The pipeline
Logs flow:
```
Scraper (stdout JSON) → Log shipper (Promtail, Vector, Fluent Bit) → Storage backend → Query UI
```
Containers write JSON to stdout. The container runtime captures it. A shipper reads it, attaches metadata (pod name, namespace, label set), and forwards to Loki or Elasticsearch.
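To make that concrete, here is a hypothetical log line as the scraper emits it; the field names are illustrative, chosen to match the examples later in this lesson:

```json
{"ts": "2024-05-04T12:00:00Z", "level": "error", "spider": "products",
 "url": "https://example.com/p/123", "status": 503, "duration_ms": 842,
 "run_id": "a1b2c3"}
```

The shipper then attaches its own metadata (e.g. `pod`, `namespace`, `job=scrapers`) before pushing, so the line arrives with both app-level fields and infrastructure context.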
Promtail for Loki
A minimal promtail-config.yaml:
```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: scrapers
    static_configs:
      - targets: [localhost]
        labels:
          job: scrapers
          __path__: /var/log/scrapers/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            spider: spider
            url: url
      - labels:
          level:
          spider:
```
Two important rules:
- Promote a few fields to labels (level, spider, env). These are indexed; queries by label are cheap.
- Don't promote high-cardinality fields (URL, run_id, IP). Loki indexes labels in memory; cardinality explosion ruins it. Keep URL and run_id inside the log line and grep for them via LogQL.
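Keeping `run_id` in the line body still leaves it queryable. A sketch of the pattern, assuming the JSON field names used elsewhere in this lesson and a hypothetical `run_id` value:

```logql
# Cheap line-body grep first, then parse and match the field exactly
{job="scrapers"} |= "a1b2c3" | json | run_id="a1b2c3"
```

The `|=` filter narrows the stream before the more expensive `| json` parse runs, which is the idiomatic ordering in LogQL.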
LogQL, the query language
LogQL feels like PromQL.
```logql
# Errors from product spider in the last hour
{spider="products", level="error"}

# 5xx responses, grepping for the status field
{spider="products"} |= "\"status\":5"

# Top 10 most-failed URLs
topk(10, sum by (url) (
  count_over_time({spider="products", level="error"} | json [1h])
))
```
LogQL filters by label first (fast), then text-greps within the matched streams (still fast, since each label set selects a small slice of the data).
Vector, the modern shipper
Promtail is fine; Vector is more flexible. It can route to Loki, Elasticsearch, Kafka, and S3 simultaneously. A Vector config for scrapers:
```toml
[sources.scraper_logs]
type = "kubernetes_logs"

[transforms.parse_json]
type = "remap"
inputs = ["scraper_logs"]
source = '''
parsed = parse_json!(string!(.message))
. = merge!(., parsed)
'''

[sinks.loki]
type = "loki"
inputs = ["parse_json"]
endpoint = "http://loki:3100"
labels.spider = "{{ spider }}"
labels.level = "{{ level }}"

[sinks.cold_archive]
type = "aws_s3"
inputs = ["parse_json"]
bucket = "scraper-cold-logs"
compression = "gzip"
```
Same logs, two destinations: Loki for queries, S3 for long-term cold archive.
Retention
Logs aren't free. A typical sane retention:
| Tier | Where | Duration |
|---|---|---|
| Hot, queryable | Loki / Elasticsearch | 7–30 days |
| Cold, archived | S3 gzipped | 1–3 years |
| Compliance / audit | S3 Glacier | As required |
Loki's S3-backed chunk storage already amortizes well; keeping ~30 days hot is realistic on modest budgets. Elasticsearch retention is more expensive; closing or rolling old indices to cheap storage tiers matters.
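On the Loki side, retention is enforced by the compactor. A minimal config fragment, assuming a recent Loki with the retention feature enabled (paths and durations are illustrative):

```yaml
# loki-config.yaml (fragment)
compactor:
  working_directory: /loki/compactor
  retention_enabled: true        # actually delete expired chunks

limits_config:
  retention_period: 720h         # 30 days hot, then gone from Loki
```

Note that without `retention_enabled: true` the compactor only compacts; expired data is never deleted.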
Elasticsearch when it's right
ES becomes the better choice when:
- You need free-text search across log content frequently.
- You already run ES for application search and don't want a second system.
- You want rich aggregations / dashboards in Kibana.
Be aware of operational cost: ES masters, dedicated data nodes, hot-warm-cold tiering, index lifecycle management. It's a real system to operate.
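Index lifecycle management is where ES retention lives. A minimal policy sketch (the policy name, sizes, and ages are illustrative):

```json
PUT _ilm/policy/scraper-logs
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_size": "50gb", "max_age": "1d" } } },
      "warm":   { "min_age": "7d",  "actions": { "shrink": { "number_of_shards": 1 } } },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}
```

Attach the policy to the index template for your scraper log indices and rollover/deletion happen without manual curation.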
Symfony + Monolog → Loki direct
Monolog has a Loki handler. For non-containerised PHP, you can push directly:
```php
<?php

use Itspire\MonologLoki\Handler\LokiHandler;
use Monolog\Logger;

$log = new Logger('scraper');
$handler = new LokiHandler([
    'entrypoint' => 'http://loki:3100',
    'context'    => ['service' => 'scraper'],
]);
$log->pushHandler($handler);
```
For container deployments, prefer stdout + a shipper; it keeps dependencies out of the app.
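The stdout route needs only plain Monolog, with no Loki-specific package in the app. A sketch:

```php
<?php

use Monolog\Formatter\JsonFormatter;
use Monolog\Handler\StreamHandler;
use Monolog\Logger;

$log = new Logger('scraper');

// One JSON object per line on stdout; the container runtime
// captures it and the shipper (Promtail/Vector) does the rest.
$handler = new StreamHandler('php://stdout');
$handler->setFormatter(new JsonFormatter());
$log->pushHandler($handler);

$log->error('fetch failed', ['spider' => 'products', 'status' => 503]);
```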
Costs at scale
Rough order of magnitude for 100GB/day of logs:
- Loki on a single 4-vCPU/16GB node + S3: ~$50–100/month.
- Elasticsearch on managed (e.g. Elastic Cloud, OpenSearch managed): $500–2000/month depending on retention.
- Self-hosted ES on big VMs: $200–600/month plus your operational time.
Cost discipline: sample low-value INFO logs (lesson 55), retain less than you think you need (you almost never query >30-day-old logs), and offload to S3.
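With the Vector pipeline from earlier, sampling can live in the same config as a transform. A sketch, assuming Vector's `sample` transform semantics of keeping 1 in `rate` events (the rate and condition are illustrative):

```toml
[transforms.sample_info]
type = "sample"
inputs = ["parse_json"]
rate = 10                          # keep roughly 1 in 10 events...
exclude = '.level == "error"'      # ...but never drop errors
```

Point the sinks at `sample_info` instead of `parse_json` and the volume drop applies to both Loki and the S3 archive.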
Hands-on lab
Run Loki + Promtail + Grafana with docker-compose (sample stacks are widely available). Point Promtail at your Catalog108 scraper's stdout. Query:
- Errors in the last hour, by URL.
- Average request duration over time (`avg_over_time` on `duration_ms`).
- Logs from a specific `run_id`.
You'll spend more time tuning labels than running queries, which is the right ratio.
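A minimal compose file for the lab might look like this; image tags and mount paths are illustrative, and published sample stacks differ in detail:

```yaml
version: "3"
services:
  loki:
    image: grafana/loki:2.9.0
    ports: ["3100:3100"]
  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - ./promtail-config.yaml:/etc/promtail/config.yml
      - /var/log/scrapers:/var/log/scrapers:ro
    command: -config.file=/etc/promtail/config.yml
  grafana:
    image: grafana/grafana:10.0.0
    ports: ["3000:3000"]
```

Start it, add Loki (`http://loki:3100`) as a Grafana data source, and the LogQL queries above should work against your scraper's output.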
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.