
Advanced · 5 min read

Kubernetes for Scraper Workloads (Overview)

What Kubernetes actually gives a scraping team, and when it's worth the operational cost. The minimum vocabulary and a runnable scraper deployment.

What you’ll learn

  • Name the core Kubernetes resources used by a scraper.
  • Decide between Deployment, StatefulSet, Job, and CronJob for scraper variants.
  • Write a minimal manifest for a scraper + worker fleet.

Kubernetes is overkill for a single scraper on one VM. It earns its slot when you have:

  • Multiple scraper fleets (different sites, different proxy configs).
  • Worker counts that scale up and down (e.g. a daily burst).
  • Multi-host deployments.
  • A team that already runs K8s for other workloads.

This lesson is a tour of what you need to know to run scrapers on K8s, not a Kubernetes course.

The objects you'll actually use

| Object | What it is | Scraper use |
|---|---|---|
| Pod | One or more co-located containers | The unit of execution; you rarely create these directly |
| Deployment | Declarative spec for N replicas of a Pod | Long-running scrapers / workers |
| StatefulSet | Like Deployment, but with stable identities and persistent volumes | Stateful workers (rare for scrapers) |
| Job | One-shot pods that run to completion | Run a backfill once |
| CronJob | Scheduled Jobs | Daily / hourly scrapes |
| Service | Stable DNS / load-balancer for a set of pods | Expose a scraper's /metrics endpoint |
| ConfigMap | Non-secret config | Spider settings, target lists |
| Secret | Sensitive config | API keys, proxy credentials |
| HorizontalPodAutoscaler | Scale replicas based on metrics | Auto-grow worker fleet by queue depth |

90% of scraper workloads fit into Deployment + CronJob + Service + ConfigMap + Secret.

A scraper worker Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-worker
  labels: {app: scraper, role: worker}
spec:
  replicas: 4
  selector:
    matchLabels: {app: scraper, role: worker}
  template:
    metadata:
      labels: {app: scraper, role: worker}
    spec:
      containers:
      - name: worker
        image: myreg/scraper:1.4.2
        command: ["python", "-m", "scraper.worker"]
        resources:
          requests: {cpu: "200m", memory: "256Mi"}
          limits:   {cpu: "1", memory: "1Gi"}
        env:
        - name: REDIS_URL
          valueFrom: {configMapKeyRef: {name: scraper-config, key: redis_url}}
        - name: PROXY_PASSWORD
          valueFrom: {secretKeyRef: {name: scraper-secrets, key: proxy_password}}
        ports:
        - {name: metrics, containerPort: 8000}
        livenessProbe:
          httpGet: {path: /health, port: 8000}
          initialDelaySeconds: 30
          periodSeconds: 30
        readinessProbe:
          httpGet: {path: /health, port: 8000}
          periodSeconds: 10

Four workers, each in its own pod, each exposing metrics on :8000. The cluster restarts unhealthy ones automatically.

Resource requests and limits

| Field | What it means |
|---|---|
| requests | Guaranteed share. Used by the scheduler to place the pod on a node with capacity. |
| limits | Maximum allowed. Exceed the memory limit and the pod is OOMKilled; exceed the CPU limit and the container is throttled. |

Set both. Without requests, your pod may be scheduled on an overloaded node. Without limits, one buggy worker can take down the whole node.
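One consequence worth knowing: when requests equal limits for every container, Kubernetes assigns the pod the Guaranteed QoS class, which makes it the last candidate for eviction under node memory pressure. A sketch of that variant for the worker container (values are illustrative):

```yaml
# Guaranteed QoS: requests == limits for both cpu and memory.
# Such pods are evicted last when a node runs out of memory.
resources:
  requests: {cpu: "500m", memory: "512Mi"}
  limits:   {cpu: "500m", memory: "512Mi"}
```

For scraper workers that accumulate memory (browser instances, parse buffers), Guaranteed QoS trades scheduling density for predictability.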

A daily CronJob

For a scraper that runs once a day:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: catalog108-daily
spec:
  schedule: "0 3 * * *"      # 3am UTC
  concurrencyPolicy: Forbid  # don't start a new run if previous still running
  jobTemplate:
    spec:
      backoffLimit: 2  # retry twice on failure
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: scrape
            image: myreg/scraper:1.4.2
            command: ["python", "-m", "scraper.daily"]
            resources:
              requests: {cpu: "500m", memory: "512Mi"}
              limits:   {cpu: "2", memory: "2Gi"}

concurrencyPolicy: Forbid is essential for scrapers: running two instances of the same scrape concurrently usually means double the load on the target.

Coordinator + worker pattern in K8s

Standard production setup:

  • Deployment: scraper-worker (N replicas, pull from queue).
  • CronJob: scraper-coordinator (every 15 min, push URLs to queue).
  • Deployment: scraper-reaper (1 replica, recovers stuck jobs).
  • Service: scraper-metrics (front for workers' /metrics endpoints).

Each is independently scalable, deployable, and observable.
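The reaper from the list above can be a one-replica Deployment. A minimal sketch, assuming a hypothetical scraper.reaper module in the same image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-reaper
  labels: {app: scraper, role: reaper}
spec:
  replicas: 1  # exactly one reaper; it scans the queue for stuck jobs
  selector:
    matchLabels: {app: scraper, role: reaper}
  template:
    metadata:
      labels: {app: scraper, role: reaper}
    spec:
      containers:
      - name: reaper
        image: myreg/scraper:1.4.2
        command: ["python", "-m", "scraper.reaper"]  # hypothetical module name
        resources:
          requests: {cpu: "100m", memory: "128Mi"}
          limits:   {cpu: "500m", memory: "512Mi"}
```

The role label keeps reaper pods out of the worker Service's selector and lets you target each fleet separately with kubectl.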

HPA: scale workers by queue depth

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scraper-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scraper-worker
  minReplicas: 2
  maxReplicas: 30
  metrics:
  - type: External
    external:
      metric: {name: redis_queue_depth, selector: {matchLabels: {queue: "scrape:default"}}}
      target: {type: AverageValue, averageValue: "100"}

When the queue grows past ~100 items per worker, the HPA spawns more pods, up to 30. When it drains, it scales back down. This requires a metrics adapter (prometheus-adapter or KEDA) that exposes the queue-depth metric to the Kubernetes metrics API.

KEDA (Kubernetes Event-Driven Autoscaling) is the idiomatic choice for queue-driven scaling: it has first-class scalers for Redis, RabbitMQ, SQS, and Kafka.
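With KEDA installed, the HPA above collapses into a ScaledObject. A sketch against the same Redis list (the queue name and address are assumptions carried over from the earlier config):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scraper-worker
spec:
  scaleTargetRef:
    name: scraper-worker        # the worker Deployment
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
  - type: redis
    metadata:
      address: redis-primary:6379  # host:port, matching redis_url in the ConfigMap
      listName: "scrape:default"
      listLength: "100"            # target items per replica
```

KEDA creates and manages the underlying HPA for you, so you don't maintain both objects.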

Networking and Services

If workers need a stable address (rare; workers typically pull work rather than being pushed to), use a Service:

apiVersion: v1
kind: Service
metadata:
  name: scraper-metrics
spec:
  selector: {app: scraper}
  ports:
  - {name: metrics, port: 8000, targetPort: 8000}

The Service exposes worker pods at scraper-metrics:8000 inside the cluster. A Prometheus ServiceMonitor can then discover them.
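If the cluster runs the Prometheus Operator, a ServiceMonitor for that Service might look like the sketch below. Note it matches on the Service's own labels, so this assumes the Service metadata also carries an app: scraper label:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: scraper-metrics
spec:
  selector:
    matchLabels: {app: scraper}  # matches labels on the Service, not the pods
  endpoints:
  - port: metrics   # the named port from the Service
    interval: 30s
```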

ConfigMap and Secret patterns

apiVersion: v1
kind: ConfigMap
metadata: {name: scraper-config}
data:
  redis_url: "redis://redis-primary:6379"
  target_sitemap: "https://practice.scrapingcentral.com/sitemap.xml"
  rate_limit_rps: "10"
---
apiVersion: v1
kind: Secret
metadata: {name: scraper-secrets}
type: Opaque
stringData:
  proxy_password: "..."
  api_key: "sk-..."

For real secret management, integrate with Vault, AWS Secrets Manager, or the External Secrets Operator. Don't commit secret YAML to git.
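With the External Secrets Operator, the Secret above can be synced from an external store instead of committed. A sketch assuming an already-configured ClusterSecretStore named aws-sm and a hypothetical scraper/proxy entry in AWS Secrets Manager:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: scraper-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-sm             # assumed ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: scraper-secrets    # the K8s Secret the operator creates and keeps in sync
  data:
  - secretKey: proxy_password
    remoteRef:
      key: scraper/proxy     # hypothetical path in the external store
      property: password
```

The manifest in git then contains only references, never the credential values themselves.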

When NOT to use K8s for scrapers

  • Single scraper, single VM, simple cron.
  • You don't already have a K8s cluster.
  • Your team doesn't have on-call coverage for K8s itself.

Running K8s for one scraper is like buying a forklift to move one box. docker-compose or systemd + cron is enough.
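For the single-VM case, the equivalent of the CronJob above is one crontab line. A sketch assuming the same image on a Docker host:

```
# crontab fragment: daily scrape at 03:00 UTC, same schedule as the CronJob example
0 3 * * * docker run --rm myreg/scraper:1.4.2 python -m scraper.daily >> /var/log/scraper.log 2>&1
```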

What to try

Spin up a local K8s cluster with kind or k3d. Deploy a minimal scraper worker (your Catalog108 image) with 4 replicas. Watch:

  1. kubectl get pods, see them running.
  2. kubectl delete pod <name>, see Kubernetes recreate it.
  3. kubectl scale deployment scraper-worker --replicas=8, instantly more pods.

The control-loop pattern (declarative state, reconciler enforces it) is the entire mental model.

Quiz: check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.

Question 1 of 8

Which Kubernetes object is best for a daily scraper that should run on a schedule and not run two copies concurrently?
