Using Message Queues for Scraping (Redis, RabbitMQ) - Deployment

Learn how to use Redis and RabbitMQ message queues to build reliable, distributed web scraping systems.

Message queues decouple URL discovery from scraping. One process finds URLs to scrape, another process does the scraping. This makes your system more reliable, scalable, and easier to maintain.

Why Use a Queue?

Reliability, if a worker crashes, the URL stays in the queue
Scalability, add more workers without changing code
Rate control, workers consume at their own pace
Retry logic, failed URLs go back to the queue
Priority, important URLs get scraped first

Option 1: Redis Queue with rq

rq (Redis Queue) is the simplest Python task queue:

pip install rq redis

# tasks.py
import requests
from bs4 import BeautifulSoup
import json

def scrape_url(url: str) -> dict:
    """Task that scrapes a single URL."""
    response = requests.get(url, timeout=30, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0"
    })
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.find("title")

    result = {
        "url": url,
        "status": response.status_code,
        "title": title.text.strip() if title else "",
    }

    # Save result
    with open(f"/tmp/scrape_{hash(url)}.json", "w") as f:
        json.dump(result, f)

    return result

# enqueue_urls.py
from redis import Redis
from rq import Queue
from tasks import scrape_url

redis_conn = Redis()
q = Queue(connection=redis_conn)

urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]

for url in urls:
    job = q.enqueue(scrape_url, url, retry=3)
    print(f"Enqueued: {url} (job {job.id})")

print(f"Queue length: {len(q)}")

# Start workers (run in separate terminals or as services)
rq worker --with-scheduler
rq worker  # Add more workers for parallel processing

Option 2: Celery with Redis/RabbitMQ

Celery is more powerful, with scheduling, retries, and monitoring built in:

pip install celery[redis]

# celery_app.py
from celery import Celery
import requests
from bs4 import BeautifulSoup

app = Celery("scraper", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")

app.conf.update(
    task_serializer="json",
    result_serializer="json",
    accept_content=["json"],
    task_acks_late=True,       # Re-queue tasks if worker crashes
    worker_prefetch_multiplier=1,  # One task at a time per worker
    task_default_retry_delay=60,
    task_max_retries=3,
)

@app.task(bind=True, max_retries=3)
def scrape_url(self, url: str) -> dict:
    try:
        response = requests.get(url, timeout=30, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0"
        })
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        title = soup.find("title")
        return {
            "url": url,
            "title": title.text.strip() if title else "",
            "status": response.status_code,
        }
    except Exception as exc:
        raise self.retry(exc=exc)

# submit_tasks.py
from celery_app import scrape_url

urls = [f"https://example.com/page/{i}" for i in range(100)]

# Submit all URLs as tasks
results = []
for url in urls:
    result = scrape_url.delay(url)
    results.append(result)

# Check results
for r in results:
    print(r.get(timeout=120))

# Start Celery workers
celery -A celery_app worker --concurrency=4 --loglevel=info

# Monitor with Flower (optional)
pip install flower
celery -A celery_app flower

Option 3: RabbitMQ with pika

For more advanced routing and message guarantees:

import pika
import json

# Producer: enqueue URLs
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="scrape_urls", durable=True)

urls = ["https://example.com/page/1", "https://example.com/page/2"]
for url in urls:
    channel.basic_publish(
        exchange="",
        routing_key="scrape_urls",
        body=json.dumps({"url": url}),
        properties=pika.BasicProperties(delivery_mode=2),  # Persistent
    )
    print(f"Enqueued: {url}")

connection.close()

Comparison

Feature	rq	Celery	RabbitMQ (raw)
Setup complexity	Low	Medium	High
Retries	Basic	Built-in	Manual
Monitoring	rq-dashboard	Flower	Management UI
Scheduling	Via rq-scheduler	Built-in beat	External
Best for	Simple jobs	Production scrapers	Custom routing

Recommendation

Start with rq for simple projects. Move to Celery when you need retries, scheduling, and monitoring. Use RabbitMQ directly only when you need custom message routing patterns.