When You've Outgrown a Single Machine
Signals that your scraper needs to become distributed. The architectural patterns and the cost of crossing that line.
What you’ll learn
- Identify the signs that a single-machine scraper is hitting limits.
- Map the distributed-scraping primitives: queue, workers, coordinator.
- Decide whether to scale up (bigger box) vs scale out (more boxes).
Most scrapers don't need to be distributed. A well-tuned single-machine Scrapy or Symfony scraper can handle tens of thousands of pages per hour. But there's a point where the math forces distribution. Recognizing that point, and managing the cost of crossing it, is the topic of this chapter.
Signals you've hit the limit
| Signal | What it means |
|---|---|
| CPU pegged at 100% on parsing | Compute is the bottleneck; multi-core helps |
| RAM exhausted on long crawls | Memory leak or queue/dedup growing past capacity |
| Network bandwidth saturated | Single NIC can't keep up; parallel hosts help |
| Proxy IP per-request limits hit | Many machines = more proxy diversity |
| Wall-clock time too long for daily window | "Must complete in 6 hours" cannot fit a sequential scrape |
| Single-machine downtime = no scraping | Need redundancy |
Not all signals require distribution. Some require better single-machine code first.
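Before deciding, put a number on it. A minimal sketch for tracking items/hour (the class name and API are made up for illustration, not from any library):

```python
import time

class ThroughputMeter:
    """Rolling items/hour counter: call record() once per scraped item."""

    def __init__(self):
        self.start = time.monotonic()
        self.count = 0

    def record(self, n=1):
        self.count += n

    def items_per_hour(self):
        elapsed = time.monotonic() - self.start
        return self.count / elapsed * 3600 if elapsed > 0 else 0.0
```

Pair the number this reports with `top` / `iftop` readings to see whether CPU, RAM, or network saturates first as you raise concurrency.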
Scale up vs scale out
The "scale up" path: bigger box. 32-core, 128 GB RAM, 10 Gbps. Often the cheapest answer for moderate scale, no distributed-systems complexity.
The "scale out" path: many smaller boxes. Required when:
- Network bandwidth needs exceed any single NIC.
- Proxy diversity needs many concurrent egress points.
- Geographic distribution matters (workers in multiple regions).
- Fault tolerance is required.
Rule of thumb: scale up until a single box hits ~$2k/month cost or until the bottleneck is network/redundancy. Then scale out.
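The rule of thumb can be turned into arithmetic. A hedged sketch, assuming a flat ~15% coordination overhead for the queue, monitoring, and deployment infrastructure that scale-out adds (the function name and the overhead figure are illustrative, not from any benchmark):

```python
def cheaper_path(big_box_cost, small_box_cost, boxes_needed,
                 coordination_overhead=1.15):
    """Compare monthly cost of one big box vs N small ones.

    coordination_overhead is an assumed ~15% tax for the extra
    infrastructure that distribution requires.
    """
    scale_out_cost = small_box_cost * boxes_needed * coordination_overhead
    return "scale up" if big_box_cost <= scale_out_cost else "scale out"
```

For example, `cheaper_path(800, 200, 3)` compares one $800/month box against three $200/month boxes (plus overhead, $690) and returns `"scale out"`; at $500 for the big box it returns `"scale up"`.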
The distributed primitives
```
┌────────────┐
│ Coordinator│  schedules and emits URLs to scrape
└──────┬─────┘
       │
       ▼
┌──────────────────────────────┐
│ Queue (Redis / RabbitMQ)     │
└──┬─────────┬──────────┬──────┘
   ▼         ▼          ▼
┌────────┐┌────────┐┌────────┐
│ Worker ││ Worker ││ Worker │  fetch + parse, push results
└────┬───┘└────┬───┘└────┬───┘
     │         │         │
     ▼         ▼         ▼
┌──────────────────────────────┐
│ Result queue / database      │
└──────────────────────────────┘
```
Three roles:
- Coordinator: knows what to scrape. Often a small Python or PHP process running on a cron.
- Queue: buffers URLs awaiting fetch. Redis, RabbitMQ, or a Symfony Messenger transport.
- Workers: pull URLs, fetch, parse, push results. Stateless and replicable.
State (URLs to fetch, dedup, in-flight) lives in the queue and a shared store. Workers can be killed and restarted at will.
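The shared dedup check, for instance, can be a single Redis round-trip. A sketch, assuming `client` is any `redis.Redis`-compatible object; the function name and key are illustrative:

```python
def seen_before(client, url, key="dedup:urls"):
    """Atomically record url in a shared Redis SET.

    SADD returns 1 when the member is new and 0 when it was already
    present, so one round-trip both checks and marks the URL.
    """
    return client.sadd(key, url) == 0
```

Every worker calls this before fetching; whichever worker's SADD lands first wins the URL.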
What does NOT scale by adding workers
Some scrapers can't be parallelized:
- Sequential dependent crawls. A scrape where step N's URL depends on step N-1's output. Can be parallelized at a coarser grain (multiple sequential chains running concurrently), not at the URL level.
- Single-IP rate-limited targets. If the target enforces 1 req/sec per client IP, 100 workers behind one IP don't help; they all queue. Either respect the limit with one worker, or spread requests across proxy IPs.
- Stateful login flows. One logged-in session can be used by one worker at a time. Parallelism requires multiple accounts.
For these, scaling out is wasted effort; scale up or rearchitect.
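The first case, coarse-grain parallelism over sequential chains, can be sketched like this (all names are illustrative; `fetch` and `next_url` stand in for your real fetch-and-extract logic):

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_chain(start_url, fetch, next_url, max_steps=100):
    """Walk one sequential chain: each step's URL comes from the
    previous response, so the chain itself cannot be parallelized."""
    results, url = [], start_url
    for _ in range(max_steps):
        page = fetch(url)
        results.append(page)
        url = next_url(page)
        if url is None:
            break
    return results

def crawl_chains(start_urls, fetch, next_url, workers=8):
    """Run many independent chains concurrently, one thread per chain."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: crawl_chain(u, fetch, next_url),
                             start_urls))
```

Parallelism is capped at the number of independent chains, not the number of URLs.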
The cost of distribution
Things that get harder:
- Dedup. No more local set; needs a shared store (Redis SET, Bloom filter). Network round-trip per dedup check.
- Logging. Logs from many workers need aggregation: Loki, ELK, or file-based logs with a shipper.
- State machine debugging. "Why did this URL disappear?" requires reading queue state, worker logs, result store.
- Idempotency. Retries across workers can re-execute work. Handler must be safely re-runnable.
- Backpressure. Result store filling faster than consumers drain it.
- Deployment. Worker code changes propagate to N machines.
Pick distribution because the gains outweigh these costs. Premature distribution is a common mistake.
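Idempotency, for instance, can lean on an atomic Redis claim. A sketch, assuming `client` is `redis.Redis`-compatible; `run_once` and the key prefix are made up for illustration:

```python
def run_once(client, task_id, handler, ttl=86400):
    """Execute handler only if task_id has not been claimed yet.

    SET NX is an atomic claim: the first worker to set the key wins,
    so retries and duplicate deliveries become no-ops. The TTL stops
    the claim keyspace from growing forever.
    """
    claimed = client.set(f"claim:{task_id}", "1", nx=True, ex=ttl)
    if not claimed:
        return None  # another worker already handled this task
    return handler()
```

Note the trade-off: if the handler crashes after the claim, the task is lost until the TTL expires. A production version would release the claim on failure.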
A minimal architecture
The smallest distributed scraper:
```python
# coordinator.py, runs on cron
import redis

r = redis.Redis()
for url in load_urls():
    r.lpush("queue:scrape", url)

# worker.py, N processes/machines
import json

import redis
import requests

r = redis.Redis()
while True:
    item = r.blpop("queue:scrape", timeout=60)
    if item is None:
        continue  # timed out with nothing queued; keep waiting
    url = item[1].decode()
    response = requests.get(url, timeout=30)
    # payload shape is illustrative; store whatever your pipeline needs
    r.lpush("results", json.dumps({"url": url, "status": response.status_code, "body": response.text}))
```
That's it. The Redis list IS the queue: workers block-pop with BLPOP, the coordinator pushes with LPUSH. The result list grows; a separate consumer drains it to Postgres.
This is enough for many small-to-medium distributed scrapes. The hardest parts (retries, scheduling, monitoring) get added when you actually need them.
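That separate consumer can be very small too. A sketch, with `sqlite3` standing in for Postgres so the example stays self-contained (`drain_results` and the schema are illustrative; `client` is any `redis.Redis`-compatible object):

```python
import json
import sqlite3

def drain_results(client, db_path="results.db", batch=100):
    """Pop scraped results off the Redis list and persist them.

    Workers LPUSH onto "results", so RPOP drains oldest-first.
    sqlite3 keeps the sketch self-contained; swap in psycopg for
    Postgres in production.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (url TEXT, body TEXT)")
    for _ in range(batch):
        raw = client.rpop("results")
        if raw is None:
            break  # list drained
        row = json.loads(raw)
        conn.execute("INSERT INTO results VALUES (?, ?)",
                     (row["url"], row["body"]))
    conn.commit()
    conn.close()
```

Run it on its own cron, independent of workers; the result list buffers whatever accumulates between runs.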
Symfony Messenger as the distribution layer
For Symfony shops, Messenger is the distributed-scraping toolkit:
```yaml
# config/packages/messenger.yaml
framework:
    messenger:
        transports:
            scrape:
                dsn: 'redis://redis-cluster:6379/scrape'
                options:
                    consumer:
                        prefetch_count: 10
```
Workers run on N machines:
```bash
# On each machine
php bin/console messenger:consume scrape --limit=1000 --time-limit=3600
```
Auto-scaling: more workers = more throughput. Each handler is independent. Result writes go through a separate Messenger transport or directly to Postgres.
Scrapy + Redis (scrapy-redis)
For Scrapy, scrapy-redis distributes spiders across machines using a shared Redis queue:
```python
# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = "redis://redis-cluster:6379"

# spider:
from scrapy_redis.spiders import RedisSpider

class ProductSpider(RedisSpider):
    name = "products"
    redis_key = "products:start_urls"
```
Push start URLs to Redis; all instances of the spider pull from the same queue, dedup against a shared filter. Same code on N machines = horizontal scaling.
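Seeding takes a few lines. A sketch, assuming `client` is a `redis.Redis` instance pointed at the same cluster (`seed_spider` is an illustrative name, not part of scrapy-redis):

```python
def seed_spider(client, urls, key="products:start_urls"):
    """Push start URLs onto the shared list that every RedisSpider
    instance with this redis_key pulls from."""
    for url in urls:
        client.lpush(key, url)
```

Run it once from anywhere with network access to Redis; all spider instances on all machines start consuming immediately.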
Migration path
Most projects evolve:
- Single Python script. Works for thousands of pages.
- Scrapy on one machine. Tens of thousands per hour.
- Scrapy + Redis queue. Hundreds of thousands per hour, multi-machine.
- Celery / RQ / Messenger. Multi-stage pipelines with different worker types.
- Kubernetes / autoscaling. Hundreds of thousands per hour, elastic.
Each step adds complexity. Don't skip ahead. Each migration is usually triggered by a specific bottleneck: solve that bottleneck, and don't add abstractions speculatively.
Hands-on lab
If you have a current scraper that runs on one machine:
- Measure throughput (items/hour) and bottleneck (CPU, RAM, network).
- Estimate cost of scaling up (bigger box) vs scaling out (3 boxes).
- List the state your scraper has (queue, dedup, cookies, session) and what would need to move to a shared store.
If "scale up" gets you 3x throughput at 1.5x cost, do that before distributing. The complexity tax of distribution rarely pays back at small-to-medium scale.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.