Redis as a Task Queue (rq, custom)
Redis is the simplest queue for distributed scraping. This lesson walks through the patterns from raw LPUSH/BLPOP up to RQ, and where each fits.
What you’ll learn
- Use raw Redis lists for a basic task queue.
- Use python-rq for typed jobs with retries.
- Recognize when Redis isn't enough.
Redis ships on nearly every Linux box. It's fast, simple, and well-understood. For distributed scraping queues, it's the first place to reach. This lesson covers raw Redis lists, then python-rq, then the point where Redis stops being enough.
Raw Redis lists, the 20-line queue
```python
# coordinator.py
import json
import redis

r = redis.Redis()
for url in load_urls():
    r.lpush("scrape:queue", json.dumps({"url": url, "attempts": 0}))
```

```python
# worker.py
import json

import httpx
import redis

r = redis.Redis()
client = httpx.Client(timeout=10)

while True:
    _, raw = r.blpop("scrape:queue", timeout=60) or (None, None)
    if not raw:
        continue
    job = json.loads(raw)
    try:
        resp = client.get(job["url"])
        r.lpush("results", json.dumps({"url": job["url"], "status": resp.status_code}))
    except Exception:
        job["attempts"] += 1
        if job["attempts"] < 3:
            r.lpush("scrape:queue", json.dumps(job))  # retry
        else:
            r.lpush("scrape:dead", json.dumps(job))   # dead letter
```
These operations are atomic. BLPOP blocks for up to the given timeout; multiple workers can pop concurrently, and Redis delivers each message to exactly one consumer.
Capabilities and limits
Works for:
- Fire-and-forget jobs.
- Light retry logic (push back to queue).
- Dead-letter queue (separate list).
- Priorities (multiple lists; BLPOP checks its keys in the order given).
Lacks:
- Built-in retry strategy with backoff.
- Visibility timeouts (if a worker dies mid-job, the message is lost unless you use RPOPLPUSH).
- Per-job metadata storage.
- Monitoring UI out of the box.
For "I want a queue, now," raw Redis is correct. For more, layer on a library.
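The priorities bullet above falls out of BLPOP directly: it accepts several keys and pops from the first non-empty list, so listing queues from highest to lowest priority is the whole scheduling policy. A minimal sketch (the queue names are this example's assumptions, not part of the lesson):

```python
import json

# Highest priority first: BLPOP scans keys left to right and pops
# from the first non-empty list, so key order IS the policy.
PRIORITY_QUEUES = ["scrape:high", "scrape:normal", "scrape:low"]

def pop_next_job(r, queues=PRIORITY_QUEUES, timeout=60):
    """Block until a job arrives on any queue, preferring earlier keys.
    `r` is a redis.Redis client (or anything with the same blpop signature)."""
    item = r.blpop(queues, timeout=timeout)
    if item is None:  # timed out with all queues empty
        return None
    _queue, raw = item
    return json.loads(raw)

if __name__ == "__main__":
    import redis
    print(pop_next_job(redis.Redis()))
```

Enqueueing is unchanged; the coordinator just LPUSHes to whichever priority list fits the job.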
RPOPLPUSH for safer pops
Naive BLPOP loses jobs if the worker crashes after popping. RPOPLPUSH (superseded by BLMOVE on Redis 6.2+, where it is deprecated) moves the job to an "in-flight" list atomically:
```python
job = r.brpoplpush("scrape:queue", "scrape:in_flight", timeout=60)
if job is not None:  # brpoplpush returns None on timeout
    try:
        process(job)
        r.lrem("scrape:in_flight", 1, job)  # remove on success
    except Exception:
        pass  # leave in in_flight; a reaper picks it up
```
A separate reaper script periodically scans scrape:in_flight for stale entries (timestamp older than N minutes) and pushes them back to the main queue.
This pattern (a visibility timeout via an in-flight list plus a reaper) is a well-known Redis idiom for at-least-once delivery.
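A sketch of that reaper, assuming workers record their claim time in a hash (the `scrape:claimed_at` key name is this example's assumption, not part of the lesson) right after BRPOPLPUSH:

```python
import time

STALE_AFTER = 300  # seconds; in-flight longer than this = worker presumed dead

def find_stale(claimed_at, now, stale_after=STALE_AFTER):
    """Pure helper: given {payload: claim_timestamp}, return the payloads
    whose worker has presumably died."""
    return [p for p, t in claimed_at.items() if now - t > stale_after]

def reap(r):
    """Requeue stale in-flight jobs. Assumes each worker did
    HSET scrape:claimed_at <payload> <unix-ts> after claiming a job."""
    claimed = {k.decode(): float(v) for k, v in r.hgetall("scrape:claimed_at").items()}
    for payload in find_stale(claimed, time.time()):
        r.lrem("scrape:in_flight", 1, payload)   # drop the stale claim
        r.hdel("scrape:claimed_at", payload)
        r.lpush("scrape:queue", payload)         # back to the main queue

if __name__ == "__main__":
    import redis
    reap(redis.Redis())
```

Run it from cron or a loop every minute or so; because LREM removes at most one matching entry, a job that was legitimately retried is not double-removed.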
python-rq
rq (Redis Queue) wraps these patterns into a simple, typed API:
```python
# tasks.py
import httpx

def scrape_url(url):
    resp = httpx.get(url, timeout=10)
    return {"url": url, "status": resp.status_code, "body": resp.text}
```

```python
# enqueue.py
from redis import Redis
from rq import Queue

from tasks import scrape_url

q = Queue(connection=Redis())
for url in urls:
    job = q.enqueue(scrape_url, url)
```

```shell
# Run workers (one or many)
rq worker
```
rq handles:
- Function-based jobs (auto-serialized).
- Retries with backoff (`Retry(max=3, interval=[1, 10, 60])`).
- Result storage (`job.result`).
- Failed-job inspection.
- Web dashboard (`rq-dashboard`).

```python
# Retries
from rq import Retry

job = q.enqueue(scrape_url, url, retry=Retry(max=3, interval=[1, 5, 30]))
```

```shell
# Inspect failed jobs
rq info --queue failed
```
For most "I need typed jobs with retries" cases, rq is the right balance of features and simplicity.
Job result handling
Two patterns:
- Synchronous result: the caller waits for `job.result` to populate. Useful for short jobs.
- Asynchronous result: the worker pushes results to a separate queue or store, and the caller doesn't wait. Most production scrapers use this pattern.
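The synchronous pattern is just a poll loop. A minimal sketch; `job` here is anything exposing rq's `job.result` attribute, and the timeout/poll values are illustrative:

```python
import time

def wait_for_result(job, timeout=30.0, poll=0.5):
    """Poll until the worker has stored a result, or give up.
    Works with any object exposing rq's `job.result` attribute."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if job.result is not None:
            return job.result
        time.sleep(poll)
    raise TimeoutError("job did not finish in time")
```

Keep the timeout short: a caller blocked on `wait_for_result` ties up whatever enqueued the job, which is exactly why the asynchronous pattern dominates in production.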
For scraping specifically, results usually go to a database directly inside the job:
```python
import httpx
import psycopg

def scrape_url(url):
    resp = httpx.get(url, timeout=10)
    conn = psycopg.connect(...)
    conn.execute("INSERT INTO products (...) VALUES (...) ON CONFLICT DO UPDATE ...")
    conn.commit()
```

The job doesn't return data; it writes to the database as a side effect, which avoids unbounded result-queue growth.
Limits of Redis-based queues
Redis is in-memory. For a queue of 100k pending URLs, that's maybe ~30 MB, fine. For 100M pending URLs, that's 30 GB, uncomfortable. Consider:
- Redis Cluster for scaling beyond single-node memory.
- Postgres-backed queues (Symfony Messenger's Doctrine transport, or a jobs table with SELECT ... FOR UPDATE SKIP LOCKED) when message volume plus persistence outweigh latency.
- Kafka or RabbitMQ when you need stream semantics or complex routing.
Most scraping never reaches these. Redis on a single $100/month VPS comfortably handles millions of jobs/day for typical workloads.
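The memory estimate above is easy to reproduce. The ~300 bytes/job figure is an assumption covering a short JSON payload plus Redis list and object overhead; tune it for your payloads:

```python
def estimated_queue_memory_mb(n_jobs, bytes_per_job=300):
    """Back-of-envelope Redis memory for a pending-job list.
    bytes_per_job is an assumed average (payload + Redis overhead)."""
    return n_jobs * bytes_per_job / 1_000_000

# 100k jobs  -> ~30 MB (fine on any box)
# 100M jobs  -> ~30,000 MB, i.e. ~30 GB (uncomfortable on one node)
```

When the result of this arithmetic stops fitting in one node's RAM, that is the signal to look at the alternatives above.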
Per-host concurrency
A common need: max N workers fetching from a single host at once. Redis can coordinate via locks:
```python
import time
from urllib.parse import urlparse

import httpx

# assumes r = redis.Redis() from earlier
def fetch_with_host_lock(url, max_per_host=4):
    host = urlparse(url).netloc
    key = f"host:{host}"
    with r.lock(f"hostlock:{host}", timeout=30):  # serialize the check-and-add
        if r.scard(key) >= max_per_host:
            r.lpush("scrape:queue", url)  # back off, re-enqueue
            return None
        member = str(time.time())
        r.sadd(key, member)
    try:
        return httpx.get(url, timeout=10)
    finally:
        r.srem(key, member)  # always release this worker's slot
```
For polite scraping, layered with a RateLimiter, this gives you both concurrency caps and rate caps.
Comparing Redis with Symfony Messenger
For PHP shops:
- Messenger with the redis transport gives you Redis-backed queues (built on Redis Streams) with typed messages and worker management.
- Messenger with the doctrine transport gives you SQL-backed queues (slower, but fully transactional with your application's database).
The decision is the same as for Python: simpler is better; reach for richer queues only when the complexity justifies them.
Hands-on lab
Build a minimal Redis-backed distributed scraper:
- Coordinator script that LPUSHes 1000 URLs to `scrape:queue`.
- Worker script that BLPOPs, fetches, and INSERTs into Postgres.
- Run 4 workers in parallel terminals; watch the queue drain.
- Add visibility timeout via RPOPLPUSH + a reaper.
After this, you have an embarrassingly-parallel scraping platform in ~100 lines of code. That's the leverage Redis gives you.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.