Coordinator/Worker Patterns
The classical distributed-scraping architecture. Coordinator schedules; workers execute. Three concrete patterns and the right one for your scale.
What you’ll learn
- Implement a coordinator that emits crawl jobs.
- Differentiate pull, push, and hybrid worker models.
- Avoid the coordinator-single-point-of-failure trap.
The coordinator/worker pattern shows up in every distributed scraper. Coordinator decides what work exists; workers execute it. Three variants matter; each is appropriate for a different scale.
The three patterns
1. Pure pull (worker-initiated)
Workers ask the queue for work; queue serves whatever's next.
Workers → BLPOP → Queue ← LPUSH ← Coordinator
- Simple. Most production setups.
- Workers self-throttle (don't pull if busy).
- Queue is the only shared state.
Used in: Redis-list queues, RabbitMQ standard consumers, Celery, Symfony Messenger workers.
2. Pure push (coordinator-initiated)
Coordinator assigns work to specific workers.
Coordinator → POST /work/123 → Worker
Coordinator → POST /work/124 → Worker
- Coordinator tracks worker availability and load.
- Useful when workers have specialized capabilities or sticky state.
- Coordinator is single point of failure.
Used in: some custom scraping orchestrators, batch systems like Slurm.
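In push mode the coordinator needs an assignment policy. A minimal least-loaded sketch; the `Worker` bookkeeping and `assign` helper here are illustrative, not from any particular library:

```python
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    capacity: int
    active: int = 0  # jobs currently assigned

def assign(workers, job_id):
    """Pick the least-loaded worker with spare capacity; None if all busy."""
    candidates = [w for w in workers if w.active < w.capacity]
    if not candidates:
        return None  # push mode must queue or retry when everyone is saturated
    target = min(candidates, key=lambda w: w.active / w.capacity)
    target.active += 1
    # A real coordinator would now POST the job to `target` and
    # decrement `active` when the worker acknowledges completion.
    return target.name
```

Notice how much state this forces into the coordinator: it must know every worker, every capacity, every in-flight job. That bookkeeping is exactly what pure pull avoids.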
3. Hybrid (work-stealing or priority pull)
Workers pull, but coordinator influences priority or rebalances.
Coordinator manages priority queues
↓
Workers pull from the queue matching their capability
- More complex than pure pull.
- Allows priority adjustments without restarting workers.
- Coordinator failure degrades but doesn't kill the system.
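The hybrid idea in miniature, with in-memory deques standing in for real queues (illustrative only):

```python
from collections import deque

def pull(own_queue, shared_queue):
    """Prefer our capability queue; fall back to the shared pool when idle.
    The coordinator influences priority by deciding which queue a job
    lands in; workers keep pulling even if the coordinator is down."""
    for q in (own_queue, shared_queue):
        if q:
            return q.popleft()
    return None
```

If the coordinator dies, jobs already in the queues still drain; only new prioritization stops. That is the "degrades but doesn't kill" property.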
For most scraping, pure pull is enough. The patterns get interesting once you have heterogeneous workers (some can use mobile proxies, some can't) or dynamic priorities (urgent vs background crawls).
Coordinator responsibilities
A coordinator in pull-mode systems is light:
- Discover what to scrape (read sitemaps, schedule recurring jobs).
- Push URLs to queues.
- Maybe drain results, write to DB.
In push-mode, heavier:
- Track which worker is doing what.
- Re-assign on worker failure.
- Balance load.
The lighter the coordinator, the smaller its blast radius if it fails.
Single-coordinator anti-pattern
A common mistake: one beefy coordinator that becomes the bottleneck and SPoF. Symptoms:
- "If the coordinator goes down, the whole system stops."
- "We need to vertically scale the coordinator."
- "Coordinator deploys cause downtime."
Mitigations:
- Make the coordinator stateless. State lives in the queue / DB. Coordinator can be replaced quickly.
- Run multiple coordinator instances, each watching different schedules. Symfony Scheduler with stateful() + Lock = safe multi-instance.
- For very large systems, a leader-elected coordinator (Raft, etcd) is an option, but it's overkill for typical scraping.
The cleanest setup: coordinator is just a cron job that pushes URLs. If the cron host is down, work pauses; queue continues; restart and resume.
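The multi-instance mitigation hinges on a lock. A sketch of the standard Redis pattern (SET with NX and EX), written against any client that exposes those options, e.g. redis-py's `set(key, value, nx=True, ex=ttl)` with `decode_responses=True`:

```python
import uuid

def try_acquire(client, key, ttl=60):
    """Take a short-lived leadership lock; returns a token if acquired, else None."""
    token = uuid.uuid4().hex
    # SET key token NX EX ttl: succeeds only if the key doesn't exist,
    # and expires automatically so a crashed holder can't wedge the lock.
    if client.set(key, token, nx=True, ex=ttl):
        return token
    return None

def release(client, key, token):
    # Only delete if we still hold the lock. This check-then-delete is
    # not atomic; production code uses a Lua script for compare-and-delete.
    if client.get(key) == token:
        client.delete(key)
```

Each coordinator instance tries the lock before running a schedule tick; losers simply skip the tick. The TTL bounds the pause if a lock holder dies mid-tick.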
A complete pattern in Python
```python
# coordinator.py, run periodically (cron, K8s CronJob)
import json
import time

import redis

r = redis.Redis()

def discover_urls():
    # Read sitemaps, query upstream APIs, etc.
    return ["https://..."]

def push_to_queue(urls):
    for url in urls:
        r.lpush("scrape:queue", json.dumps({"url": url, "discovered_at": time.time()}))

if __name__ == "__main__":
    urls = discover_urls()
    push_to_queue(urls)
    print(f"Pushed {len(urls)} URLs")
```
```python
# worker.py, run as N processes (systemd, K8s Deployment)
import json
import time

import httpx
import redis

r = redis.Redis()
client = httpx.Client(timeout=10)

while True:
    # Atomically move the next job into the in_flight list so it
    # isn't lost if this worker dies mid-job
    raw = r.brpoplpush("scrape:queue", "scrape:in_flight", timeout=60)
    if not raw:
        continue
    job = json.loads(raw)
    try:
        resp = client.get(job["url"])
        # write result to DB
        r.lrem("scrape:in_flight", 1, raw)
    except Exception:
        # leave the item in in_flight for the reaper to handle
        time.sleep(1)
```
```python
# reaper.py, runs every few minutes
import json
import time

import redis

r = redis.Redis()

# Move stale in_flight items back to the queue
in_flight = r.lrange("scrape:in_flight", 0, -1)
for raw in in_flight:
    job = json.loads(raw)
    if time.time() - job["discovered_at"] > 600:  # 10 min timeout
        r.lpush("scrape:queue", raw)
        r.lrem("scrape:in_flight", 1, raw)
```
Three small scripts: coordinator, worker, reaper. Together they're a fault-tolerant distributed scraper. Each is independently deployable.
Heterogeneous workers
Sometimes workers differ:
- Some have residential proxies, some datacenter.
- Some run browsers, some don't.
- Some are in EU, some in US.
Pattern: multiple queues, each consumed by appropriate worker fleet.
```python
# Coordinator decides which queue
job = json.dumps({"url": url, "discovered_at": time.time()})
if needs_residential(url):
    r.lpush("scrape:residential", job)
elif needs_browser(url):
    r.lpush("scrape:browser", job)
else:
    r.lpush("scrape:datacenter", job)
```
Workers consume only the queue they're equipped for:
```bash
# On residential-proxy hosts
python worker.py --queue scrape:residential

# On browser-equipped hosts
python worker.py --queue scrape:browser
```
This is essentially routing by capability: clean separation, and each fleet scales independently.
Priority queues
When some URLs are more urgent (e.g. retry of a high-value page), use priority:
Redis-style with multiple queues
```python
def push_with_priority(url, priority):
    # priority: 0 = high, 1 = normal, 2 = low
    queues = ["scrape:high", "scrape:normal", "scrape:low"]
    r.lpush(queues[priority], url)

# Worker
for q in ["scrape:high", "scrape:normal", "scrape:low"]:
    work = r.rpop(q)
    if work:
        process(work)
        break
```
Workers check high before normal before low: strict priority.
Sorted-set priority
```python
r.zadd("scrape:queue", {url: -priority_score})  # lower score pops first

# zpopmin returns a list of (member, score) pairs
url, score = r.zpopmin("scrape:queue")[0]
```
Sorted sets give numeric priority, useful when priorities are computed (e.g. "score = freshness × value").
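One way to compute such a score; the weighting here is illustrative, tune it to your crawl:

```python
def priority_score(value, age_hours):
    """Combine page value and staleness into one number.
    Negated so that ZPOPMIN (lowest score first) pops the
    most valuable, most stale pages first."""
    return -(value * (1.0 + age_hours))
```

Then `r.zadd("scrape:queue", {url: priority_score(value, age_hours)})` replaces the flat `-priority_score` above.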
Cross-region workers
For global scrapes, workers in EU and US (etc.) each pull from a region-tagged queue. Coordinator decides which region:
```python
def queue_for(url):
    if must_be_eu(url):
        return "scrape:eu"
    if must_be_us(url):
        return "scrape:us"
    return "scrape:any"
```
Each region's worker fleet consumes its queue. URL geography matches scraper egress geography.
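On the worker side, a region fleet can watch its own queue and the shared pool in one blocking call: redis-py's `brpop` accepts several keys and checks them left to right, so the key order encodes preference. A sketch:

```python
def queues_for(region):
    """Region-specific queue first, then the shared pool; BRPOP checks
    keys in the order given, so ordering encodes preference."""
    return [f"scrape:{region}", "scrape:any"]

# Worker loop sketch (requires a running Redis):
#   popped = r.brpop(queues_for("eu"), timeout=60)  # (key, value) or None
```

EU workers thus drain scrape:eu before touching scrape:any, while jobs with no geographic constraint are picked up by whichever region has spare capacity.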
Hands-on lab
Build the three-script pattern (coordinator, worker, reaper) for Catalog108:
- Coordinator pushes 100 product URLs.
- Run 4 workers; watch them drain the queue.
- Kill a worker mid-job; verify the reaper picks up the abandoned message.
- Add a priority queue (some URLs go to scrape:high); confirm workers honour it.
You now have a real distributed scraper with fault tolerance and priority. Same architecture scales from 4 workers to 400.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.