Redis as a Task Queue (rq, custom)
Redis is the simplest queue for distributed scraping. This lesson walks through the patterns from raw LPUSH/BLPOP up to RQ, and where each fits.
What you’ll learn
- Use raw Redis lists for a basic task queue.
- Use python-rq for typed jobs with retries.
- Recognize when Redis isn't enough.
Redis ships on nearly every Linux box. It's fast, simple, and well-understood. For distributed scraping queues, it's the first place to reach. This lesson covers raw Redis lists, then python-rq, then the point where Redis stops being enough.
Raw Redis lists, the 20-line queue
```python
# coordinator.py
import json
import redis

r = redis.Redis()
for url in load_urls():
    r.lpush("scrape:queue", json.dumps({"url": url, "attempts": 0}))
```

```python
# worker.py
import json

import httpx
import redis

r = redis.Redis()
client = httpx.Client(timeout=10)

while True:
    _, raw = r.blpop("scrape:queue", timeout=60) or (None, None)
    if not raw:
        continue
    job = json.loads(raw)
    try:
        resp = client.get(job["url"])
        r.lpush("results", json.dumps({"url": job["url"], "status": resp.status_code}))
    except Exception:
        job["attempts"] += 1
        if job["attempts"] < 3:
            r.lpush("scrape:queue", json.dumps(job))  # retry
        else:
            r.lpush("scrape:dead", json.dumps(job))   # dead letter
```
These operations are atomic. BLPOP blocks for up to the given timeout; multiple workers can pop concurrently, and Redis delivers each message to exactly one consumer.
Capabilities and limits
Works for:
- Fire-and-forget jobs.
- Light retry logic (push back to queue).
- Dead-letter queue (separate list).
- Priorities (multiple lists; BLPOP checks its keys in the order given).
Lacks:
- Built-in retry strategy with backoff.
- Visibility timeouts (if a worker dies mid-job, the message is lost unless you use RPOPLPUSH).
- Per-job metadata storage.
- Monitoring UI out of the box.
For "I want a queue, now," raw Redis is correct. For more, layer on a library.
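The priorities bullet above falls out of BLPOP directly: it accepts several keys and pops from the first non-empty list, so listing queues from highest to lowest priority is the whole scheduling policy. A minimal sketch (the queue names are this example's assumptions, not part of the lesson):

```python
import json

# Highest priority first: BLPOP scans keys left to right and pops
# from the first non-empty list, so key order IS the policy.
PRIORITY_QUEUES = ["scrape:high", "scrape:normal", "scrape:low"]

def pop_next_job(r, queues=PRIORITY_QUEUES, timeout=60):
    """Block until a job arrives on any queue, preferring earlier keys.
    `r` is a redis.Redis client (or anything with the same blpop signature)."""
    item = r.blpop(queues, timeout=timeout)
    if item is None:  # timed out with all queues empty
        return None
    _queue, raw = item
    return json.loads(raw)

if __name__ == "__main__":
    import redis
    print(pop_next_job(redis.Redis()))
```

Enqueueing is unchanged; the coordinator just LPUSHes to whichever priority list fits the job.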
RPOPLPUSH for safer pops
Naive BLPOP loses jobs if the worker crashes after popping. RPOPLPUSH (superseded by BLMOVE on Redis 6.2+, where it is deprecated) moves the job to an "in-flight" list atomically:
```python
job = r.brpoplpush("scrape:queue", "scrape:in_flight", timeout=60)
if job is not None:  # brpoplpush returns None on timeout
    try:
        process(job)
        r.lrem("scrape:in_flight", 1, job)  # remove on success
    except Exception:
        pass  # leave in in_flight; a reaper picks it up
```
A separate reaper script periodically scans scrape:in_flight for stale entries (timestamp older than N minutes) and pushes them back to the main queue.
This pattern (a visibility timeout via an in-flight list plus a reaper) is a well-known Redis idiom for at-least-once delivery.
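A sketch of that reaper, assuming workers record their claim time in a hash (the `scrape:claimed_at` key name is this example's assumption, not part of the lesson) right after BRPOPLPUSH:

```python
import time

STALE_AFTER = 300  # seconds; in-flight longer than this = worker presumed dead

def find_stale(claimed_at, now, stale_after=STALE_AFTER):
    """Pure helper: given {payload: claim_timestamp}, return the payloads
    whose worker has presumably died."""
    return [p for p, t in claimed_at.items() if now - t > stale_after]

def reap(r):
    """Requeue stale in-flight jobs. Assumes each worker did
    HSET scrape:claimed_at <payload> <unix-ts> after claiming a job."""
    claimed = {k.decode(): float(v) for k, v in r.hgetall("scrape:claimed_at").items()}
    for payload in find_stale(claimed, time.time()):
        r.lrem("scrape:in_flight", 1, payload)   # drop the stale claim
        r.hdel("scrape:claimed_at", payload)
        r.lpush("scrape:queue", payload)         # back to the main queue

if __name__ == "__main__":
    import redis
    reap(redis.Redis())
```

Run it from cron or a loop every minute or so; because LREM removes at most one matching entry, a job that was legitimately retried is not double-removed.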
python-rq
rq (Redis Queue) wraps these patterns into a simple, typed API:
```python
# tasks.py
import httpx

def scrape_url(url):
    resp = httpx.get(url, timeout=10)
    return {"url": url, "status": resp.status_code, "body": resp.text}
```

```python
# enqueue.py
from redis import Redis
from rq import Queue

from tasks import scrape_url

q = Queue(connection=Redis())
for url in urls:
    job = q.enqueue(scrape_url, url)
```

```shell
# Run workers (one or many)
rq worker
```
rq handles:
- Function-based jobs (auto-serialized).
- Retries with backoff (`Retry(max=3, interval=[1, 10, 60])`).
- Result storage (`job.result`).
- Failed-job inspection.
- Web dashboard (`rq-dashboard`).

```python
# Retries
from rq import Retry

job = q.enqueue(scrape_url, url, retry=Retry(max=3, interval=[1, 5, 30]))
```

```shell
# Inspect failed jobs
rq info --queue failed
```
For most "I need typed jobs with retries" cases, rq is the right balance of features and simplicity.
Job result handling
Two patterns:
- Synchronous result: the caller waits for `job.result` to populate. Useful for short jobs.
- Asynchronous result: the worker pushes results to a separate queue or store, and the caller doesn't wait. Most production scrapers use this pattern.
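The synchronous pattern is just a poll loop. A minimal sketch; `job` here is anything exposing rq's `job.result` attribute, and the timeout/poll values are illustrative:

```python
import time

def wait_for_result(job, timeout=30.0, poll=0.5):
    """Poll until the worker has stored a result, or give up.
    Works with any object exposing rq's `job.result` attribute."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if job.result is not None:
            return job.result
        time.sleep(poll)
    raise TimeoutError("job did not finish in time")
```

Keep the timeout short: a caller blocked on `wait_for_result` ties up whatever enqueued the job, which is exactly why the asynchronous pattern dominates in production.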
For scraping specifically, results usually go to a database directly inside the job:
```python
import httpx
import psycopg

def scrape_url(url):
    resp = httpx.get(url, timeout=10)
    conn = psycopg.connect(...)
    conn.execute("INSERT INTO products (...) VALUES (...) ON CONFLICT DO UPDATE ...")
    conn.commit()
```

The job doesn't return data; it writes to the database as a side effect, which avoids unbounded result-queue growth.
Limits of Redis-based queues
Redis is in-memory. For a queue of 100k pending URLs, that's maybe ~30 MB, fine. For 100M pending URLs, that's 30 GB, uncomfortable. Consider:
- Redis Cluster for scaling beyond single-node memory.
- Postgres-backed queues (Symfony Messenger's Doctrine transport, or a jobs table with SELECT ... FOR UPDATE SKIP LOCKED) when message volume plus persistence outweigh latency.
- Kafka or RabbitMQ when you need stream semantics or complex routing.
Most scraping never reaches these. Redis on a single $100/month VPS comfortably handles millions of jobs/day for typical workloads.
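The memory estimate above is easy to reproduce. The ~300 bytes/job figure is an assumption covering a short JSON payload plus Redis list and object overhead; tune it for your payloads:

```python
def estimated_queue_memory_mb(n_jobs, bytes_per_job=300):
    """Back-of-envelope Redis memory for a pending-job list.
    bytes_per_job is an assumed average (payload + Redis overhead)."""
    return n_jobs * bytes_per_job / 1_000_000

# 100k jobs  -> ~30 MB (fine on any box)
# 100M jobs  -> ~30,000 MB, i.e. ~30 GB (uncomfortable on one node)
```

When the result of this arithmetic stops fitting in one node's RAM, that is the signal to look at the alternatives above.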
Per-host concurrency
A common need: max N workers fetching from a single host at once. Redis can coordinate via locks:
```python
import time
from urllib.parse import urlparse

import httpx

# assumes r = redis.Redis() from earlier
def fetch_with_host_lock(url, max_per_host=4):
    host = urlparse(url).netloc
    key = f"host:{host}"
    with r.lock(f"hostlock:{host}", timeout=30):  # serialize the check-and-add
        if r.scard(key) >= max_per_host:
            r.lpush("scrape:queue", url)  # back off, re-enqueue
            return None
        member = str(time.time())
        r.sadd(key, member)
    try:
        return httpx.get(url, timeout=10)
    finally:
        r.srem(key, member)  # always release this worker's slot
```
For polite scraping, layered with a RateLimiter, this gives you both concurrency caps and rate caps.
Comparing Redis with Symfony Messenger
For PHP shops:
- Messenger with the redis transport gives you Redis-backed queues (built on Redis Streams) with typed messages and worker management.
- Messenger with the doctrine transport gives you SQL-backed queues (slower, but fully transactional with your application's database).
The decision is the same as for Python: simpler is better; reach for richer queues only when the complexity justifies them.
Hands-on lab
Build a minimal Redis-backed distributed scraper:
- Coordinator script that LPUSHes 1000 URLs to `scrape:queue`.
- Worker script that BLPOPs, fetches, and INSERTs into Postgres.
- Run 4 workers in parallel terminals; watch the queue drain.
- Add visibility timeout via RPOPLPUSH + a reaper.
After this, you have an embarrassingly-parallel scraping platform in ~100 lines of code. That's the leverage Redis gives you.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.