Coordinator/Worker Patterns
The classical distributed-scraping architecture. Coordinator schedules; workers execute. Three concrete patterns and the right one for your scale.
What you’ll learn
- Implement a coordinator that emits crawl jobs.
- Differentiate pull, push, and hybrid worker models.
- Avoid the coordinator-single-point-of-failure trap.
The coordinator/worker pattern shows up in every distributed scraper. Coordinator decides what work exists; workers execute it. Three variants matter; each is appropriate for a different scale.
The three patterns
1. Pure pull (worker-initiated)
Workers ask the queue for work; queue serves whatever's next.
Workers → BLPOP → Queue ← LPUSH ← Coordinator
- Simple. Most production setups.
- Workers self-throttle (don't pull if busy).
- Queue is the only shared state.
Used in: Redis-list queues, RabbitMQ standard consumers, Celery, Symfony Messenger workers.
2. Pure push (coordinator-initiated)
Coordinator assigns work to specific workers.
Coordinator → POST /work/123 → Worker
Coordinator → POST /work/124 → Worker
- Coordinator tracks worker availability and load.
- Useful when workers have specialized capabilities or sticky state.
- Coordinator is single point of failure.
Used in: some custom scraping orchestrators, batch systems like Slurm.
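In push mode the coordinator needs an assignment policy. A minimal least-loaded sketch; the `Worker` bookkeeping and `assign` helper here are illustrative, not from any particular library:

```python
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    capacity: int
    active: int = 0  # jobs currently assigned

def assign(workers, job_id):
    """Pick the least-loaded worker with spare capacity; None if all busy."""
    candidates = [w for w in workers if w.active < w.capacity]
    if not candidates:
        return None  # push mode must queue or retry when everyone is saturated
    target = min(candidates, key=lambda w: w.active / w.capacity)
    target.active += 1
    # A real coordinator would now POST the job to `target` and
    # decrement `active` when the worker acknowledges completion.
    return target.name
```

Notice how much state this forces into the coordinator: it must know every worker, every capacity, every in-flight job. That bookkeeping is exactly what pure pull avoids.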
3. Hybrid (work-stealing or priority pull)
Workers pull, but coordinator influences priority or rebalances.
Coordinator manages priority queues
↓
Workers pull from the queue matching their capability
- More complex than pure pull.
- Allows priority adjustments without restarting workers.
- Coordinator failure degrades but doesn't kill the system.
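The hybrid idea in miniature, with in-memory deques standing in for real queues (illustrative only):

```python
from collections import deque

def pull(own_queue, shared_queue):
    """Prefer our capability queue; fall back to the shared pool when idle.
    The coordinator influences priority by deciding which queue a job
    lands in; workers keep pulling even if the coordinator is down."""
    for q in (own_queue, shared_queue):
        if q:
            return q.popleft()
    return None
```

If the coordinator dies, jobs already in the queues still drain; only new prioritization stops. That is the "degrades but doesn't kill" property.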
For most scraping, pure pull is enough. The patterns get interesting once you have heterogeneous workers (some can use mobile proxies, some can't) or dynamic priorities (urgent vs background crawls).
Coordinator responsibilities
A coordinator in pull-mode systems is light:
- Discover what to scrape (read sitemaps, schedule recurring jobs).
- Push URLs to queues.
- Maybe drain results, write to DB.
In push-mode, heavier:
- Track which worker is doing what.
- Re-assign on worker failure.
- Balance load.
The lighter the coordinator, the smaller its blast radius if it fails.
Single-coordinator anti-pattern
A common mistake: one beefy coordinator that becomes the bottleneck and SPoF. Symptoms:
- "If the coordinator goes down, the whole system stops."
- "We need to vertically scale the coordinator."
- "Coordinator deploys cause downtime."
Mitigations:
- Make the coordinator stateless. State lives in the queue / DB. Coordinator can be replaced quickly.
- Run multiple coordinator instances, each watching different schedules. Symfony Scheduler with stateful() + Lock = safe multi-instance.
- For very large systems, a leader-elected coordinator (Raft, etcd) is an option, but it's overkill for typical scraping.
The cleanest setup: coordinator is just a cron job that pushes URLs. If the cron host is down, work pauses; queue continues; restart and resume.
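The multi-instance mitigation hinges on a lock. A sketch of the standard Redis pattern (SET with NX and EX), written against any client that exposes those options, e.g. redis-py's `set(key, value, nx=True, ex=ttl)` with `decode_responses=True`:

```python
import uuid

def try_acquire(client, key, ttl=60):
    """Take a short-lived leadership lock; returns a token if acquired, else None."""
    token = uuid.uuid4().hex
    # SET key token NX EX ttl: succeeds only if the key doesn't exist,
    # and expires automatically so a crashed holder can't wedge the lock.
    if client.set(key, token, nx=True, ex=ttl):
        return token
    return None

def release(client, key, token):
    # Only delete if we still hold the lock. This check-then-delete is
    # not atomic; production code uses a Lua script for compare-and-delete.
    if client.get(key) == token:
        client.delete(key)
```

Each coordinator instance tries the lock before running a schedule tick; losers simply skip the tick. The TTL bounds the pause if a lock holder dies mid-tick.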
A complete pattern in Python
```python
# coordinator.py, run periodically (cron, K8s CronJob)
import json
import time

import redis

r = redis.Redis()

def discover_urls():
    # Read sitemaps, query upstream APIs, etc.
    return ["https://..."]

def push_to_queue(urls):
    for url in urls:
        r.lpush("scrape:queue", json.dumps({"url": url, "discovered_at": time.time()}))

if __name__ == "__main__":
    urls = discover_urls()
    push_to_queue(urls)
    print(f"Pushed {len(urls)} URLs")
```
```python
# worker.py, run as N processes (systemd, K8s Deployment)
import json
import time

import httpx
import redis

r = redis.Redis()
client = httpx.Client(timeout=10)

while True:
    # Atomically move the next job into the in_flight list so it
    # isn't lost if this worker dies mid-job
    raw = r.brpoplpush("scrape:queue", "scrape:in_flight", timeout=60)
    if not raw:
        continue
    job = json.loads(raw)
    try:
        resp = client.get(job["url"])
        # write result to DB
        r.lrem("scrape:in_flight", 1, raw)
    except Exception:
        # leave the item in in_flight for the reaper to handle
        time.sleep(1)
```
```python
# reaper.py, runs every few minutes
import json
import time

import redis

r = redis.Redis()

# Move stale in_flight items back to the queue
in_flight = r.lrange("scrape:in_flight", 0, -1)
for raw in in_flight:
    job = json.loads(raw)
    if time.time() - job["discovered_at"] > 600:  # 10 min timeout
        r.lpush("scrape:queue", raw)
        r.lrem("scrape:in_flight", 1, raw)
```
Three small scripts: coordinator, worker, reaper. Together they're a fault-tolerant distributed scraper. Each is independently deployable.
Heterogeneous workers
Sometimes workers differ:
- Some have residential proxies, some datacenter.
- Some run browsers, some don't.
- Some are in EU, some in US.
Pattern: multiple queues, each consumed by appropriate worker fleet.
```python
# Coordinator decides which queue
job = json.dumps({"url": url, "discovered_at": time.time()})
if needs_residential(url):
    r.lpush("scrape:residential", job)
elif needs_browser(url):
    r.lpush("scrape:browser", job)
else:
    r.lpush("scrape:datacenter", job)
```
Workers consume only the queue they're equipped for:
```bash
# On residential-proxy hosts
python worker.py --queue scrape:residential

# On browser-equipped hosts
python worker.py --queue scrape:browser
```
This is essentially routing by capability: clean separation, and each fleet scales independently.
Priority queues
When some URLs are more urgent (e.g. retry of a high-value page), use priority:
Redis-style with multiple queues
```python
def push_with_priority(url, priority):
    # priority: 0 = high, 1 = normal, 2 = low
    queues = ["scrape:high", "scrape:normal", "scrape:low"]
    r.lpush(queues[priority], url)

# Worker
for q in ["scrape:high", "scrape:normal", "scrape:low"]:
    work = r.rpop(q)
    if work:
        process(work)
        break
```
Workers check high before normal before low: strict priority.
Sorted-set priority
```python
r.zadd("scrape:queue", {url: -priority_score})  # lower score pops first

# zpopmin returns a list of (member, score) pairs
url, score = r.zpopmin("scrape:queue")[0]
```
Sorted sets give numeric priority, useful when priorities are computed (e.g. "score = freshness × value").
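One way to compute such a score; the weighting here is illustrative, tune it to your crawl:

```python
def priority_score(value, age_hours):
    """Combine page value and staleness into one number.
    Negated so that ZPOPMIN (lowest score first) pops the
    most valuable, most stale pages first."""
    return -(value * (1.0 + age_hours))
```

Then `r.zadd("scrape:queue", {url: priority_score(value, age_hours)})` replaces the flat `-priority_score` above.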
Cross-region workers
For global scrapes, workers in EU and US (etc.) each pull from a region-tagged queue. Coordinator decides which region:
```python
def queue_for(url):
    if must_be_eu(url):
        return "scrape:eu"
    if must_be_us(url):
        return "scrape:us"
    return "scrape:any"
```
Each region's worker fleet consumes its queue. URL geography matches scraper egress geography.
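On the worker side, a region fleet can watch its own queue and the shared pool in one blocking call: redis-py's `brpop` accepts several keys and checks them left to right, so the key order encodes preference. A sketch:

```python
def queues_for(region):
    """Region-specific queue first, then the shared pool; BRPOP checks
    keys in the order given, so ordering encodes preference."""
    return [f"scrape:{region}", "scrape:any"]

# Worker loop sketch (requires a running Redis):
#   popped = r.brpop(queues_for("eu"), timeout=60)  # (key, value) or None
```

EU workers thus drain scrape:eu before touching scrape:any, while jobs with no geographic constraint are picked up by whichever region has spare capacity.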
Hands-on lab
Build the three-script pattern (coordinator, worker, reaper) for Catalog108:
- Coordinator pushes 100 product URLs.
- Run 4 workers; watch them drain the queue.
- Kill a worker mid-job; verify the reaper picks up the abandoned message.
- Add a priority queue (some URLs go to scrape:high); confirm workers honour it.
You now have a real distributed scraper with fault tolerance and priority. Same architecture scales from 4 workers to 400.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.