Kubernetes for Scraper Workloads (Overview)
What Kubernetes actually gives a scraping team, and when it's worth the operational cost. The minimum vocabulary and a runnable scraper deployment.
What you’ll learn
- Name the core Kubernetes resources used by a scraper.
- Decide between Deployment, StatefulSet, Job, and CronJob for scraper variants.
- Write a minimal manifest for a scraper + worker fleet.
Kubernetes is overkill for a single scraper on one VM. It earns its slot when you have:
- Multiple scraper fleets (different sites, different proxy configs).
- Worker counts that scale up and down (e.g. a daily burst).
- Multi-host deployments.
- A team that already runs K8s for other workloads.
This lesson is a tour of what you need to know to run scrapers on K8s, not a Kubernetes course.
The objects you'll actually use
| Object | What it is | Scraper use |
|---|---|---|
| Pod | One or more co-located containers | The unit of execution; you rarely create these directly |
| Deployment | Declarative spec for N replicas of a Pod | Long-running scrapers / workers |
| StatefulSet | Like Deployment, but with stable identities and persistent volumes | Stateful workers (rare for scrapers) |
| Job | One-shot pods that run to completion | Run a backfill once |
| CronJob | Scheduled Jobs | Daily / hourly scrapes |
| Service | Stable DNS / load-balancer for a set of pods | Expose a scraper's /metrics endpoint |
| ConfigMap | Non-secret config | spider settings, target lists |
| Secret | Sensitive config | API keys, proxy credentials |
| HorizontalPodAutoscaler | Scale replicas based on metrics | Auto-grow worker fleet by queue depth |
90% of scraper workloads fit into Deployment + CronJob + Service + ConfigMap + Secret.
A scraper worker Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-worker
  labels: {app: scraper, role: worker}
spec:
  replicas: 4
  selector:
    matchLabels: {app: scraper, role: worker}
  template:
    metadata:
      labels: {app: scraper, role: worker}
    spec:
      containers:
      - name: worker
        image: myreg/scraper:1.4.2
        command: ["python", "-m", "scraper.worker"]
        resources:
          requests: {cpu: "200m", memory: "256Mi"}
          limits: {cpu: "1", memory: "1Gi"}
        env:
        - name: REDIS_URL
          valueFrom: {configMapKeyRef: {name: scraper-config, key: redis_url}}
        - name: PROXY_PASSWORD
          valueFrom: {secretKeyRef: {name: scraper-secrets, key: proxy_password}}
        ports:
        - {name: metrics, containerPort: 8000}
        livenessProbe:
          httpGet: {path: /health, port: 8000}
          initialDelaySeconds: 30
          periodSeconds: 30
        readinessProbe:
          httpGet: {path: /health, port: 8000}
          periodSeconds: 10
Four workers, each in their own pod, each with metrics on :8000. The cluster restarts unhealthy ones automatically.
Resource requests and limits
| Field | What it means |
|---|---|
| requests | Guaranteed share. Used by the scheduler to place the pod on a node with capacity. |
| limits | Maximum allowed. If memory exceeds the limit, the pod is OOMKilled; if CPU exceeds it, the container is throttled. |
Set both. Without requests, your pod may be scheduled on an overloaded node. Without limits, one buggy worker can take down the whole node.
A daily CronJob
For a scraper that runs once a day:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: catalog108-daily
spec:
  schedule: "0 3 * * *"        # 3am UTC
  concurrencyPolicy: Forbid    # don't start a new run if the previous one is still running
  jobTemplate:
    spec:
      backoffLimit: 2          # retry twice on failure
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: scrape
            image: myreg/scraper:1.4.2
            command: ["python", "-m", "scraper.daily"]
            resources:
              requests: {cpu: "500m", memory: "512Mi"}
              limits: {cpu: "2", memory: "2Gi"}
concurrencyPolicy: Forbid is essential for scrapers: running two instances of the same scrape concurrently usually means double the load on the target.
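For the one-shot case from the table above (a backfill you run once), a plain Job is the same manifest without the schedule. A minimal sketch, assuming the same image; the scraper.backfill entry point is hypothetical:

apiVersion: batch/v1
kind: Job
metadata:
  name: catalog108-backfill
spec:
  backoffLimit: 3                   # retry up to 3 times before marking the Job failed
  ttlSecondsAfterFinished: 86400    # garbage-collect the finished Job after a day
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: backfill
        image: myreg/scraper:1.4.2
        command: ["python", "-m", "scraper.backfill"]   # hypothetical one-shot entry point
        resources:
          requests: {cpu: "500m", memory: "512Mi"}
          limits: {cpu: "2", memory: "2Gi"}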
Coordinator + worker pattern in K8s
Standard production setup:
- Deployment: scraper-worker (N replicas, pull from queue).
- CronJob: scraper-coordinator (every 15 min, push URLs to queue).
- Deployment: scraper-reaper (1 replica, recovers stuck jobs).
- Service: scraper-metrics (front for workers' /metrics endpoints).
Each is independently scalable, deployable, and observable.
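As a sketch of the coordinator piece (the 15-minute cadence comes from the list above; the scraper.coordinator module name is an assumption):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: scraper-coordinator
spec:
  schedule: "*/15 * * * *"     # every 15 minutes
  concurrencyPolicy: Forbid    # never enqueue the same batch twice in parallel
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: coordinator
            image: myreg/scraper:1.4.2
            command: ["python", "-m", "scraper.coordinator"]   # hypothetical: pushes URLs onto the queue
            env:
            - name: REDIS_URL
              valueFrom: {configMapKeyRef: {name: scraper-config, key: redis_url}}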
HPA: scale workers by queue depth
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scraper-worker
spec:
  scaleTargetRef:
    kind: Deployment
    name: scraper-worker
  minReplicas: 2
  maxReplicas: 30
  metrics:
  - type: External
    external:
      metric: {name: redis_queue_depth, selector: {matchLabels: {queue: "scrape:default"}}}
      target: {type: AverageValue, averageValue: "100"}
When the queue grows past ~100 items per worker, HPA spawns more pods up to 30. When it drains, it scales back. This requires a metrics adapter (prometheus-adapter or KEDA) that exposes the queue depth metric to the K8s API.
KEDA (Kubernetes Event-Driven Autoscaling) is the idiomatic choice for queue-driven scaling; it has first-class scalers for Redis, RabbitMQ, SQS, and Kafka.
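With KEDA installed, the HPA above collapses into a ScaledObject with a Redis scaler. A sketch, assuming the queue is a Redis list named scrape:default on redis-primary (matching the ConfigMap below); Redis auth is omitted here:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scraper-worker
spec:
  scaleTargetRef:
    name: scraper-worker            # the worker Deployment
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
  - type: redis
    metadata:
      address: redis-primary:6379   # host:port of the queue
      listName: "scrape:default"
      listLength: "100"             # target items per replica, as in the HPA above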
Networking and Services
If workers need a stable in-cluster address (rare; workers typically pull work rather than receive pushes), use a Service:
apiVersion: v1
kind: Service
metadata:
  name: scraper-metrics
spec:
  selector: {app: scraper}
  ports:
  - {name: metrics, port: 8000, targetPort: 8000}
The Service exposes the worker pods at scraper-metrics:8000 inside the cluster; a Prometheus Operator ServiceMonitor can then discover and scrape them.
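A sketch of that ServiceMonitor, assuming the Prometheus Operator CRDs (monitoring.coreos.com) are installed and the Service itself carries an app: scraper label:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: scraper-metrics
spec:
  selector:
    matchLabels: {app: scraper}   # matches labels on the Service, not the pods
  endpoints:
  - port: metrics                 # the named port defined on the Service
    interval: 30s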
ConfigMap and Secret patterns
apiVersion: v1
kind: ConfigMap
metadata: {name: scraper-config}
data:
  redis_url: "redis://redis-primary:6379"
  target_sitemap: "https://practice.scrapingcentral.com/sitemap.xml"
  rate_limit_rps: "10"
---
apiVersion: v1
kind: Secret
metadata: {name: scraper-secrets}
type: Opaque
stringData:
  proxy_password: "..."
  api_key: "sk-..."
For real secret management, integrate with Vault, AWS Secrets Manager, or the External Secrets Operator. Don't commit secret YAML to git.
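With the External Secrets Operator, the Secret above becomes a generated resource instead of hand-written YAML. A sketch, assuming the external-secrets.io/v1beta1 CRDs and a pre-configured SecretStore named vault-backend (both names are assumptions):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: scraper-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend             # hypothetical, pre-configured store
    kind: SecretStore
  target:
    name: scraper-secrets           # the Kubernetes Secret the operator creates and keeps in sync
  data:
  - secretKey: proxy_password
    remoteRef:
      key: scrapers/proxy           # hypothetical path in the backend
      property: password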
When NOT to use K8s for scrapers
- Single scraper, single VM, simple cron.
- You don't already have a K8s cluster.
- Your team doesn't have on-call coverage for K8s itself.
Running K8s for one scraper is like buying a forklift to move one box. docker-compose or systemd + cron is enough.
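At that scale, the docker-compose equivalent of the worker Deployment above is a handful of lines. A minimal sketch, assuming the same image and a local Redis for the queue:

# docker-compose.yml: run with `docker compose up -d --scale worker=4`
services:
  redis:
    image: redis:7
  worker:
    image: myreg/scraper:1.4.2
    command: ["python", "-m", "scraper.worker"]
    environment:
      REDIS_URL: redis://redis:6379
    depends_on: [redis]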
What to try
Spin up a local K8s cluster with kind or k3d. Deploy a minimal scraper worker (your Catalog108 image) with 4 replicas. Watch:
- kubectl get pods: see them running.
- kubectl delete pod <name>: watch Kubernetes recreate it.
- kubectl scale deployment scraper-worker --replicas=8: instantly more pods.
The control-loop pattern (declarative state, reconciler enforces it) is the entire mental model.