
Intermediate · 5 min read

Redis as a Task Queue (rq, custom)

Redis is the simplest queue for distributed scraping. Here are the patterns, from raw LPUSH/BLPOP up to RQ, and where each fits.

What you’ll learn

  • Use raw Redis lists for a basic task queue.
  • Use python-rq for typed jobs with retries.
  • Recognize when Redis isn't enough.

Redis ships on nearly every Linux box. It's fast, simple, and well understood. For distributed scraping queues, it's the first place to reach. This lesson covers raw Redis lists, then python-rq, then where Redis stops being enough.

Raw Redis lists, the 20-line queue

# coordinator.py
import redis, json
r = redis.Redis()

for url in load_urls():
  r.lpush("scrape:queue", json.dumps({"url": url, "attempts": 0}))
# worker.py
import redis, json, httpx
r = redis.Redis()
client = httpx.Client(timeout=10)

while True:
  _, raw = r.blpop("scrape:queue", timeout=60) or (None, None)  # blpop returns None on timeout
  if not raw:
    continue
  job = json.loads(raw)
  try:
    resp = client.get(job["url"])
    r.lpush("results", json.dumps({"url": job["url"], "status": resp.status_code}))
  except Exception:
    job["attempts"] += 1
    if job["attempts"] < 3:
      r.lpush("scrape:queue", json.dumps(job))  # retry
    else:
      r.lpush("scrape:dead", json.dumps(job))  # dead letter

Atomic. BLPOP blocks for up to N seconds waiting for work. Multiple workers can pop concurrently; Redis delivers each message to exactly one consumer.

Capabilities and limits

Works for:

  • Fire-and-forget jobs.
  • Light retry logic (push back to queue).
  • Dead-letter queue (separate list).
  • Priorities (BLPOP accepts multiple keys and pops from the first non-empty list; see the sketch below).
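
Priorities need no extra machinery because BLPOP scans its keys left to right. A minimal sketch, with illustrative queue names:

# BLPOP checks keys in order, so "scrape:high" always drains before "scrape:normal"
_, raw = r.blpop(["scrape:high", "scrape:normal"], timeout=60) or (None, None)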

Lacks:

  • Built-in retry strategy with backoff.
  • Visibility timeouts (if a worker dies mid-job, the message is lost unless you use RPOPLPUSH).
  • Per-job metadata storage.
  • Monitoring UI out of the box.

For "I want a queue, now," raw Redis is correct. For more, layer on a library.

RPOPLPUSH for safer pops

Naive BLPOP loses jobs if the worker crashes mid-processing. RPOPLPUSH (LMOVE in Redis 6.2+) moves the job to an "in-flight" list atomically:

# inside the worker loop
job = r.brpoplpush("scrape:queue", "scrape:in_flight", timeout=60)
if job is None:
  continue  # timed out; poll again
try:
  process(job)
  r.lrem("scrape:in_flight", 1, job)  # remove on success
except Exception:
  pass  # leave in in_flight; a reaper picks it up

A separate reaper script periodically scans scrape:in_flight for stale entries (timestamp older than N minutes) and pushes them back to the main queue.
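
A minimal reaper sketch, assuming each worker also records its claim time in a hash when it pops a job (the scrape:claimed key and the threshold are illustrative, not part of the snippets above):

# reaper.py
import time
import redis

r = redis.Redis()
STALE_AFTER = 600  # seconds before an in-flight job counts as abandoned

while True:
  for raw in r.lrange("scrape:in_flight", 0, -1):
    claimed = r.hget("scrape:claimed", raw)
    if claimed is None or time.time() - float(claimed) > STALE_AFTER:
      with r.pipeline() as pipe:  # remove + requeue + clear claim together
        pipe.lrem("scrape:in_flight", 1, raw)
        pipe.lpush("scrape:queue", raw)
        pipe.hdel("scrape:claimed", raw)
        pipe.execute()
  time.sleep(60)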

This pattern, visibility timeout via in-flight list + reaper, is a well-known Redis idiom for at-least-once semantics.

python-rq

rq (Redis Queue) wraps these patterns into a simple, typed API:

# tasks.py
import httpx

def scrape_url(url):
  resp = httpx.get(url, timeout=10)
  return {"url": url, "status": resp.status_code, "body": resp.text}
# enqueue
from rq import Queue
from redis import Redis
from tasks import scrape_url

q = Queue(connection=Redis())
for url in urls:
  job = q.enqueue(scrape_url, url)
# Run workers (one or many)
rq worker

rq handles:

  • Function-based jobs (auto-serialized).
  • Retries with backoff (Retry(max=3, interval=[1, 10, 60])).
  • Result storage (job.result).
  • Failed-job inspection.
  • Web dashboard (rq-dashboard).
# Retries
from rq import Retry
job = q.enqueue(scrape_url, url, retry=Retry(max=3, interval=[1, 5, 30]))
# Inspect failed jobs (they land in the queue's FailedJobRegistry)
print(q.failed_job_registry.get_job_ids())

For most "I need typed jobs with retries," rq is the right balance of features and simplicity.

Job result handling

Two patterns:

  1. Synchronous result. Caller waits for job.result to populate. Useful for short jobs.

  2. Asynchronous result. Worker pushes results to a separate queue/store. Caller doesn't wait. Most production scrapers use this pattern.
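
A sketch of pattern 1 with rq, polling until the worker finishes (the 0.5 s interval is arbitrary):

import time

job = q.enqueue(scrape_url, url)
while job.result is None and not job.is_failed:
  time.sleep(0.5)  # result stays None until the worker stores it in Redis
print(job.result)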

For scraping specifically, results usually go to a database directly inside the job:

import httpx, psycopg

def scrape_url(url):
  resp = httpx.get(url, timeout=10)
  conn = psycopg.connect(...)
  conn.execute("INSERT INTO products (...) VALUES (...) ON CONFLICT DO UPDATE...")
  conn.commit()

The job doesn't return data; it writes to the database as a side effect, which avoids unbounded result-queue growth.

Limits of Redis-based queues

Redis is in-memory. At roughly 300 bytes per serialized job, a queue of 100k pending URLs is ~30 MB, fine. 100M pending URLs is ~30 GB, uncomfortable. Consider:

  • Redis Cluster for scaling beyond single-node memory.
  • Postgres-backed queues (e.g. Symfony Messenger's doctrine transport) when message volume and persistence needs outweigh latency.
  • Kafka or RabbitMQ when you need stream semantics or complex routing.

Most scraping never reaches these. Redis on a single $100/month VPS comfortably handles millions of jobs/day for typical workloads.
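
To know when you're approaching those limits, a quick health-check sketch (the key name matches the earlier examples):

# pending depth and Redis memory footprint
print(r.llen("scrape:queue"), "pending jobs")
print(r.info("memory")["used_memory_human"])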

Per-host concurrency

A common need: max N workers fetching from a single host at once. Redis can coordinate via locks:

import time
from urllib.parse import urlparse

def fetch_with_host_lock(url, max_per_host=4):
  host = urlparse(url).netloc
  key = f"host:{host}"
  member = str(time.time())
  # hold a short lock so two workers can't both pass the check-and-add
  with r.lock(f"hostlock:{host}", timeout=30):
    if r.scard(key) >= max_per_host:
      r.lpush("scrape:queue", url)  # host saturated: back off, re-enqueue
      return None
    r.sadd(key, member)
  try:
    return httpx.get(url, timeout=10)
  finally:
    r.srem(key, member)  # free the slot

For polite-scraping purposes, layered with a rate limiter, this gives you both a concurrency cap and a rate cap.
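
For the rate-cap half, a minimal fixed-window counter sketch (key naming and the limit are illustrative):

def allow_request(host, max_per_sec=2):
  # one counter key per host per second; stale windows expire on their own
  key = f"rate:{host}:{int(time.time())}"
  count = r.incr(key)
  r.expire(key, 2)
  return count <= max_per_sec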

Comparing Redis with Symfony Messenger

For PHP shops:

  • Messenger with the redis transport gives you Redis-backed queueing (via Redis Streams) with typed messages and worker management.
  • Messenger with the doctrine transport gives you SQL-backed queues (slower, but with full transactional integration).

The decision is the same as for Python: simpler is better; reach for richer queues only when complexity justifies them.

Hands-on lab

Build a minimal Redis-backed distributed scraper:

  1. Coordinator script that LPUSHes 1000 URLs to scrape:queue.
  2. Worker script that BLPOPs, fetches, INSERTs to Postgres.
  3. Run 4 workers in parallel terminals; watch the queue drain.
  4. Add visibility timeout via RPOPLPUSH + a reaper.

After this, you have an embarrassingly parallel scraping platform in ~100 lines of code. That's the leverage Redis gives you.

Quiz, check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.

Question 1 of 8

Which Redis command is the cornerstone of a basic blocking queue worker?
