Kubernetes for Scraper Workloads (Overview)
What Kubernetes actually gives a scraping team, and when it's worth the operational cost. The minimum vocabulary and a runnable scraper deployment.
What you’ll learn
- Name the core Kubernetes resources used by a scraper.
- Decide between Deployment, StatefulSet, Job, and CronJob for scraper variants.
- Write a minimal manifest for a scraper + worker fleet.
Kubernetes is overkill for a single scraper on one VM. It earns its slot when you have:
- Multiple scraper fleets (different sites, different proxy configs).
- Worker counts that scale up and down (e.g. a daily burst).
- Multi-host deployments.
- A team that already runs K8s for other workloads.
This lesson is a tour of what you need to know to run scrapers on K8s, not a Kubernetes course.
The objects you'll actually use
| Object | What it is | Scraper use |
|---|---|---|
| Pod | One or more co-located containers | The unit of execution; you rarely create these directly |
| Deployment | Declarative spec for N replicas of a Pod | Long-running scrapers / workers |
| StatefulSet | Like Deployment, but with stable identities and persistent volumes | Stateful workers (rare for scrapers) |
| Job | One-shot pods that run to completion | Run a backfill once |
| CronJob | Scheduled Jobs | Daily / hourly scrapes |
| Service | Stable DNS / load-balancer for a set of pods | Expose a scraper's /metrics endpoint |
| ConfigMap | Non-secret config | spider settings, target lists |
| Secret | Sensitive config | API keys, proxy credentials |
| HorizontalPodAutoscaler | Scale replicas based on metrics | Auto-grow worker fleet by queue depth |
90% of scraper workloads fit into Deployment + CronJob + Service + ConfigMap + Secret.
A scraper worker Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-worker
  labels: {app: scraper, role: worker}
spec:
  replicas: 4
  selector:
    matchLabels: {app: scraper, role: worker}
  template:
    metadata:
      labels: {app: scraper, role: worker}
    spec:
      containers:
      - name: worker
        image: myreg/scraper:1.4.2
        command: ["python", "-m", "scraper.worker"]
        resources:
          requests: {cpu: "200m", memory: "256Mi"}
          limits: {cpu: "1", memory: "1Gi"}
        env:
        - name: REDIS_URL
          valueFrom: {configMapKeyRef: {name: scraper-config, key: redis_url}}
        - name: PROXY_PASSWORD
          valueFrom: {secretKeyRef: {name: scraper-secrets, key: proxy_password}}
        ports:
        - {name: metrics, containerPort: 8000}
        livenessProbe:
          httpGet: {path: /health, port: 8000}
          initialDelaySeconds: 30
          periodSeconds: 30
        readinessProbe:
          httpGet: {path: /health, port: 8000}
          periodSeconds: 10
Four workers, each in their own pod, each with metrics on :8000. The cluster restarts unhealthy ones automatically.
Resource requests and limits
| Field | What it means |
|---|---|
| requests | Guaranteed share. Used by the scheduler to place the pod on a node with capacity. |
| limits | Maximum allowed. If memory exceeds the limit, the pod is OOMKilled; if CPU exceeds it, the container is throttled. |
Set both. Without requests, your pod may be scheduled on an overloaded node. Without limits, one buggy worker can take down the whole node.
A daily CronJob
For a scraper that runs once a day:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: catalog108-daily
spec:
  schedule: "0 3 * * *"        # 3am UTC
  concurrencyPolicy: Forbid    # don't start a new run if the previous one is still running
  jobTemplate:
    spec:
      backoffLimit: 2          # retry twice on failure
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: scrape
            image: myreg/scraper:1.4.2
            command: ["python", "-m", "scraper.daily"]
            resources:
              requests: {cpu: "500m", memory: "512Mi"}
              limits: {cpu: "2", memory: "2Gi"}
concurrencyPolicy: Forbid is essential for scrapers: running two instances of the same scrape concurrently usually means double the load on the target.
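For the one-shot case from the table above (a backfill you run once), a plain Job is the same manifest without the schedule. A minimal sketch, assuming the same image; the scraper.backfill entry point is hypothetical:

apiVersion: batch/v1
kind: Job
metadata:
  name: catalog108-backfill
spec:
  backoffLimit: 3                   # retry up to 3 times before marking the Job failed
  ttlSecondsAfterFinished: 86400    # garbage-collect the finished Job after a day
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: backfill
        image: myreg/scraper:1.4.2
        command: ["python", "-m", "scraper.backfill"]   # hypothetical one-shot entry point
        resources:
          requests: {cpu: "500m", memory: "512Mi"}
          limits: {cpu: "2", memory: "2Gi"}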
Coordinator + worker pattern in K8s
Standard production setup:
- Deployment: scraper-worker (N replicas, pull from queue).
- CronJob: scraper-coordinator (every 15 min, push URLs to queue).
- Deployment: scraper-reaper (1 replica, recovers stuck jobs).
- Service: scraper-metrics (front for workers' /metrics endpoints).
Each is independently scalable, deployable, and observable.
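As a sketch of the coordinator piece (the 15-minute cadence comes from the list above; the scraper.coordinator module name is an assumption):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: scraper-coordinator
spec:
  schedule: "*/15 * * * *"     # every 15 minutes
  concurrencyPolicy: Forbid    # never enqueue the same batch twice in parallel
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: coordinator
            image: myreg/scraper:1.4.2
            command: ["python", "-m", "scraper.coordinator"]   # hypothetical: pushes URLs onto the queue
            env:
            - name: REDIS_URL
              valueFrom: {configMapKeyRef: {name: scraper-config, key: redis_url}}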
HPA: scale workers by queue depth
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scraper-worker
spec:
  scaleTargetRef:
    kind: Deployment
    name: scraper-worker
  minReplicas: 2
  maxReplicas: 30
  metrics:
  - type: External
    external:
      metric: {name: redis_queue_depth, selector: {matchLabels: {queue: "scrape:default"}}}
      target: {type: AverageValue, averageValue: "100"}
When the queue grows past ~100 items per worker, HPA spawns more pods up to 30. When it drains, it scales back. This requires a metrics adapter (prometheus-adapter or KEDA) that exposes the queue depth metric to the K8s API.
KEDA (Kubernetes Event-Driven Autoscaling) is the idiomatic choice for queue-driven scaling; it has first-class scalers for Redis, RabbitMQ, SQS, and Kafka.
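With KEDA installed, the HPA above collapses into a ScaledObject with a Redis scaler. A sketch, assuming the queue is a Redis list named scrape:default on redis-primary (matching the ConfigMap below); Redis auth is omitted here:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scraper-worker
spec:
  scaleTargetRef:
    name: scraper-worker            # the worker Deployment
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
  - type: redis
    metadata:
      address: redis-primary:6379   # host:port of the queue
      listName: "scrape:default"
      listLength: "100"             # target items per replica, as in the HPA above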
Networking and Services
If workers need a stable in-cluster address (rare; workers typically pull work rather than receive pushes), use a Service:
apiVersion: v1
kind: Service
metadata:
  name: scraper-metrics
spec:
  selector: {app: scraper}
  ports:
  - {name: metrics, port: 8000, targetPort: 8000}
The Service exposes the worker pods at scraper-metrics:8000 inside the cluster; a Prometheus Operator ServiceMonitor can then discover and scrape them.
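A sketch of that ServiceMonitor, assuming the Prometheus Operator CRDs (monitoring.coreos.com) are installed and the Service itself carries an app: scraper label:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: scraper-metrics
spec:
  selector:
    matchLabels: {app: scraper}   # matches labels on the Service, not the pods
  endpoints:
  - port: metrics                 # the named port defined on the Service
    interval: 30s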
ConfigMap and Secret patterns
apiVersion: v1
kind: ConfigMap
metadata: {name: scraper-config}
data:
  redis_url: "redis://redis-primary:6379"
  target_sitemap: "https://practice.scrapingcentral.com/sitemap.xml"
  rate_limit_rps: "10"
---
apiVersion: v1
kind: Secret
metadata: {name: scraper-secrets}
type: Opaque
stringData:
  proxy_password: "..."
  api_key: "sk-..."
For real secret management, integrate with Vault, AWS Secrets Manager, or the External Secrets Operator. Don't commit secret YAML to git.
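With the External Secrets Operator, the Secret above becomes a generated resource instead of hand-written YAML. A sketch, assuming the external-secrets.io/v1beta1 CRDs and a pre-configured SecretStore named vault-backend (both names are assumptions):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: scraper-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend             # hypothetical, pre-configured store
    kind: SecretStore
  target:
    name: scraper-secrets           # the Kubernetes Secret the operator creates and keeps in sync
  data:
  - secretKey: proxy_password
    remoteRef:
      key: scrapers/proxy           # hypothetical path in the backend
      property: password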
When NOT to use K8s for scrapers
- Single scraper, single VM, simple cron.
- You don't already have a K8s cluster.
- Your team doesn't have on-call coverage for K8s itself.
Running K8s for one scraper is like buying a forklift to move one box. docker-compose or systemd + cron is enough.
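At that scale, the docker-compose equivalent of the worker Deployment above is a handful of lines. A minimal sketch, assuming the same image and a local Redis for the queue:

# docker-compose.yml: run with `docker compose up -d --scale worker=4`
services:
  redis:
    image: redis:7
  worker:
    image: myreg/scraper:1.4.2
    command: ["python", "-m", "scraper.worker"]
    environment:
      REDIS_URL: redis://redis:6379
    depends_on: [redis]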
What to try
Spin up a local K8s cluster with kind or k3d. Deploy a minimal scraper worker (your Catalog108 image) with 4 replicas. Watch:
- kubectl get pods: see them running.
- kubectl delete pod <name>: watch Kubernetes recreate it.
- kubectl scale deployment scraper-worker --replicas=8: instantly more pods.
The control-loop pattern (declarative state, reconciler enforces it) is the entire mental model.