

When You've Outgrown a Single Machine

Signals that your scraper needs to become distributed. The architectural patterns and the cost of crossing that line.

What you’ll learn

  • Identify the signs that a single-machine scraper is hitting limits.
  • Map the distributed-scraping primitives: queue, workers, coordinator.
  • Decide whether to scale up (bigger box) vs scale out (more boxes).

Most scrapers don't need to be distributed. A well-tuned single-machine Scrapy or Symfony scraper can handle tens of thousands of pages per hour. But there is a point where the math forces distribution. Recognizing that point, and managing the cost of crossing it, is the subject of this article.

Signals you've hit the limit

Signal                                       What it means
CPU pegged at 100% on parsing                Compute is the bottleneck; multi-core helps
RAM exhausted on long crawls                 A memory leak, or queue/dedup state growing past capacity
Network bandwidth saturated                  A single NIC can't keep up; parallel hosts help
Per-IP proxy request limits hit              More machines = more proxy diversity
Wall-clock time exceeds the daily window     A sequential scrape can't finish a "must complete in 6 hours" job
Single-machine downtime = no scraping        Redundancy is needed

Not all signals require distribution. Some require better single-machine code first.
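Before reaching for more hardware, confirm which signal you are actually seeing. One cheap diagnostic: compare CPU time to wall time over a sample run. A stdlib-only sketch (the stand-in workloads and the thresholds are illustrative assumptions, not part of any real scraper):

```python
import time

def cpu_share(fn) -> float:
    """Fraction of wall-clock time the process spent on CPU while fn ran.
    Near 1.0: the work is CPU-bound (parsing); near 0.0: the process is
    mostly waiting on network or disk."""
    wall0, cpu0 = time.monotonic(), time.process_time()
    fn()
    wall = time.monotonic() - wall0
    cpu = time.process_time() - cpu0
    return cpu / wall if wall > 0 else 0.0

# CPU-bound stand-in for parsing: ratio close to 1.0
parse_like = cpu_share(lambda: sum(i * i for i in range(500_000)))
# I/O-bound stand-in for fetching: ratio close to 0.0
fetch_like = cpu_share(lambda: time.sleep(0.1))
```

A high ratio on your parse stage points at scaling up to more cores first; a low ratio points at concurrency or bandwidth, not CPU.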

Scale up vs scale out

The "scale up" path: a bigger box. 32 cores, 128 GB RAM, 10 Gbps. Often the cheapest answer at moderate scale, with none of the distributed-systems complexity.

The "scale out" path: many smaller boxes. Required when:

  • Network bandwidth needs exceed any single NIC.
  • Proxy diversity needs many concurrent egress points.
  • Geographic distribution matters (workers in multiple regions).
  • Fault tolerance is required.

Rule of thumb: scale up until a single box hits ~$2k/month cost or until the bottleneck is network/redundancy. Then scale out.
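The rule is easier to apply with cost per unit of work in front of you. A quick sketch; every number here is a hypothetical quote, not a benchmark:

```python
def cost_per_million(pages_per_hour: float, dollars_per_month: float) -> float:
    """Dollars per million pages, assuming the scraper runs 24/7."""
    pages_per_month = pages_per_hour * 24 * 30
    return dollars_per_month / pages_per_month * 1_000_000

# Hypothetical quotes: one big box vs three smaller boxes.
scale_up = cost_per_million(pages_per_hour=150_000, dollars_per_month=1_500)
scale_out = cost_per_million(pages_per_hour=300_000, dollars_per_month=3 * 700)
print(f"scale up:  ${scale_up:.2f} per million pages")
print(f"scale out: ${scale_out:.2f} per million pages")
```

With these made-up quotes scale-out wins on unit cost, but it also buys the complexity tax covered later; run the same arithmetic with your own quotes before deciding.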

The distributed primitives

┌────────────┐
│ Coordinator│  schedules and emits URLs to scrape
└──────┬─────┘
       │
       ▼
┌──────────────────────────────┐
│  Queue (Redis / RabbitMQ)    │
└───┬──────────┬──────────┬────┘
    ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌────────┐
│ Worker │ │ Worker │ │ Worker │  fetch + parse, push results
└───┬────┘ └───┬────┘ └───┬────┘
    ▼          ▼          ▼
┌──────────────────────────────┐
│  Result queue / database     │
└──────────────────────────────┘

Three roles:

  1. Coordinator: knows what to scrape. Often a small Python or PHP process run from cron.
  2. Queue: buffers URLs awaiting fetch. Redis, RabbitMQ, or a Symfony Messenger transport.
  3. Workers: pull URLs, fetch, parse, push results. Stateless and replicable.

State (URLs to fetch, dedup, in-flight) lives in the queue and a shared store. Workers can be killed and restarted at will.

What does NOT scale by adding workers

Some scrapers can't be parallelized:

  • Sequential dependent crawls. A scrape where step N's URL depends on step N-1's output. Can be parallelized at a coarser grain (multiple sequential chains running concurrently), not at the URL level.

  • Single-IP rate-limited targets. If the target allows 1 req/sec per source IP, 100 workers behind the same IP don't help; they all queue behind the same limit. Reduce to one worker per egress IP, and use proxies to add egress IPs.

  • Stateful login flows. One logged-in session can be used by one worker at a time. Parallelism requires multiple accounts.

For these, scaling out is wasted effort; scale up or rearchitect.
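In the rate-limited case, the single worker has to pace itself. A minimal self-throttling sketch, stdlib only (the 1 req/sec figure and the commented fetch loop are illustrative assumptions):

```python
import time

class RateLimiter:
    """Pace a single worker: at most `rate` requests per second."""
    def __init__(self, rate: float):
        self.min_interval = 1.0 / rate
        self.last = float("-inf")  # first call passes immediately

    def wait(self):
        # Sleep just long enough to honor the minimum interval between calls.
        delay = self.min_interval - (time.monotonic() - self.last)
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()

limiter = RateLimiter(rate=1.0)  # assumed target limit: 1 req/sec
# for url in urls:
#     limiter.wait()
#     fetch(url)
```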

The cost of distribution

Things that get harder:

  • Dedup. No more local set; needs a shared store (Redis SET, Bloom filter). Network round-trip per dedup check.
  • Logging. Logs must be aggregated from many workers: Loki, ELK, or plain files shipped to a central host.
  • State machine debugging. "Why did this URL disappear?" requires reading queue state, worker logs, result store.
  • Idempotency. Retries across workers can re-execute work. Handler must be safely re-runnable.
  • Backpressure. Result store filling faster than consumers drain it.
  • Deployment. Worker code changes propagate to N machines.
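The dedup line item is worth seeing concretely. With Redis, SADD both checks and records membership in one round-trip; a sketch with the client passed in, so any object exposing `sadd` works (the class and key names here are illustrative, not from the article):

```python
class SharedDedup:
    """Cross-worker URL dedup backed by a Redis SET."""
    def __init__(self, client, key: str = "dedup:urls"):
        self.client = client
        self.key = key

    def seen_before(self, url: str) -> bool:
        # SADD returns 1 if the member was newly added, 0 if it already
        # existed, so one network round-trip both checks and records the URL.
        return self.client.sadd(self.key, url) == 0

# In a worker: dedup = SharedDedup(redis.Redis())
# if not dedup.seen_before(url): enqueue(url)
```

At higher volumes the SET's memory cost is what pushes people toward Bloom filters, trading a small false-positive risk for a large RAM saving.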

Pick distribution because the gains outweigh these costs. Premature distribution is a common mistake.

A minimal architecture

The smallest distributed scraper:

# coordinator.py, runs on cron
import redis

r = redis.Redis()
for url in load_urls():
    r.lpush("queue:scrape", url)

# worker.py, N processes/machines
import json
import redis
import requests

r = redis.Redis()
while True:
    item = r.blpop("queue:scrape", timeout=60)
    if item is None:
        continue  # queue was empty for 60s; keep waiting
    url = item[1].decode()
    response = requests.get(url, timeout=30)
    r.lpush("results", json.dumps({"url": url, "status": response.status_code, "body": response.text}))

That's it. The Redis list IS the queue: workers BLPOP (blocking pop), the coordinator LPUSHes. The result list grows; a separate consumer drains it to Postgres.

This is enough for many small-to-medium distributed scrapes. The hardest parts (retries, scheduling, monitoring) get added when you actually need them.
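Retries are often the first of those hard parts you actually need. A small per-fetch retry with exponential backoff; the fetch callable is injected, and the attempt count and delays are assumptions to tune:

```python
import time

def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0):
    """Retry transient failures with exponential backoff (1s, 2s, 4s, ...).
    On the final failure, re-raise so the URL can be re-queued elsewhere."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

Because retries can cross workers, this only stays safe if the handler is idempotent, as noted above.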

Symfony Messenger as the distribution layer

For Symfony shops, Messenger is the distributed-scraping toolkit:

# config/packages/messenger.yaml
framework:
    messenger:
        transports:
            scrape:
                dsn: 'redis://redis-cluster:6379/scrape'
                options:
                    consumer:
                        prefetch_count: 10

Workers run on N machines:

# On each machine
php bin/console messenger:consume scrape --limit=1000 --time-limit=3600

Auto-scaling: more workers = more throughput. Each handler is independent. Result writes go through a separate Messenger transport or directly to Postgres.

Scrapy + Redis (scrapy-redis)

For Scrapy, scrapy-redis distributes spiders across machines using a shared Redis queue:

# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = "redis://redis-cluster:6379"

# spider:
from scrapy_redis.spiders import RedisSpider

class ProductSpider(RedisSpider):
    name = "products"
    redis_key = "products:start_urls"

Push start URLs to Redis; every instance of the spider pulls from the same queue and dedups against a shared filter. The same code on N machines gives horizontal scaling.

Migration path

Most projects evolve:

  1. Single Python script. Works for thousands of pages.
  2. Scrapy on one machine. Tens of thousands per hour.
  3. Scrapy + Redis queue. Hundreds of thousands per hour, multi-machine.
  4. Celery / RQ / Messenger. Multi-stage pipelines with different worker types.
  5. Kubernetes / autoscaling. Elastic: worker count follows the workload.

Each step adds complexity. Don't skip ahead. The migration is usually triggered by a specific bottleneck: solve that bottleneck, and don't add abstractions speculatively.

Hands-on lab

If you have a current scraper that runs on one machine:

  1. Measure throughput (items/hour) and bottleneck (CPU, RAM, network).
  2. Estimate cost of scaling up (bigger box) vs scaling out (3 boxes).
  3. List the state your scraper has (queue, dedup, cookies, session) and what would need to move to a shared store.

If "scale up" gets you 3x throughput at 1.5x cost, do that before distributing. The complexity tax of distribution rarely pays back at small-to-medium scale.
