Using Message Queues for Scraping (Redis, RabbitMQ)

Learn how to use Redis and RabbitMQ message queues to build reliable, distributed web scraping systems.

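Message queues decouple URL discovery from scraping. One process finds URLs and pushes them onto a queue; separate worker processes pull URLs off the queue and scrape them. Because the two sides share nothing but the queue, you can restart, scale, or rewrite either one without touching the other.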
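To make the producer/worker split concrete, here is a minimal in-process sketch that uses only the standard library. It is an illustration of the pattern, not a replacement for Redis or RabbitMQ: `fake_scrape`, the `STOP` sentinel, and the worker count are assumptions made for the example, and no network requests are made.

```python
import queue
import threading

STOP = None  # sentinel value that tells a worker to exit


def discover_urls(url_queue: queue.Queue, pages: int) -> None:
    """Producer: finds URLs and puts them on the queue."""
    for page in range(1, pages + 1):
        url_queue.put(f"https://example.com/page/{page}")


def fake_scrape(url: str) -> dict:
    """Stand-in for a real HTTP fetch so the sketch runs offline."""
    return {"url": url, "status": 200}


def worker(url_queue: queue.Queue, results: queue.Queue) -> None:
    """Consumer: pulls URLs off the queue until it sees the sentinel."""
    while True:
        url = url_queue.get()
        if url is STOP:
            break
        results.put(fake_scrape(url))


url_queue: queue.Queue = queue.Queue()
results: queue.Queue = queue.Queue()

# Start three workers; they block on the empty queue until URLs arrive.
workers = [threading.Thread(target=worker, args=(url_queue, results)) for _ in range(3)]
for thread in workers:
    thread.start()

discover_urls(url_queue, pages=5)

# One sentinel per worker so every thread shuts down.
for _ in workers:
    url_queue.put(STOP)
for thread in workers:
    thread.join()

scraped = sorted(results.get()["url"] for _ in range(results.qsize()))
print(scraped)
```

The producer never calls the scraping code and the workers never know where URLs come from. A Redis or RabbitMQ queue gives you the same shape across processes and machines.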

Why Use a Queue?

  • Reliability: if a worker crashes mid-job, the URL is not lost and can be retried or redelivered
  • Scalability: add more workers without changing code
  • Rate control: workers consume at their own pace, so a burst of new URLs does not become a burst of requests
  • Retry logic: failed URLs go back to the queue
  • Priority: important URLs get scraped first
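The retry and priority points above can be sketched with the standard library's `queue.PriorityQueue`. This is a single-process illustration under stated assumptions: `flaky_fetch`, the example URLs, and `MAX_ATTEMPTS` are hypothetical, and real queue systems implement these features in their own ways.

```python
import queue

MAX_ATTEMPTS = 3  # assumed retry budget per URL


def flaky_fetch(url: str, attempt: int) -> bool:
    """Hypothetical fetch: 'flaky' URLs fail once, 'broken' URLs always fail."""
    if "broken" in url:
        return False
    if "flaky" in url:
        return attempt >= 2
    return True


# Items are (priority, attempt, url); a lower priority number is served first.
pending: queue.PriorityQueue = queue.PriorityQueue()
pending.put((0, 1, "https://example.com/important"))
pending.put((5, 1, "https://example.com/flaky"))
pending.put((5, 1, "https://example.com/broken"))
pending.put((9, 1, "https://example.com/archive"))

done = []  # URLs scraped successfully
dead = []  # URLs that used up all their attempts

while not pending.empty():
    priority, attempt, url = pending.get()
    if flaky_fetch(url, attempt):
        done.append(url)
    elif attempt < MAX_ATTEMPTS:
        # Failed: put the URL back on the queue with an incremented attempt count.
        pending.put((priority, attempt + 1, url))
    else:
        dead.append(url)

print("done:", done)
print("dead:", dead)
```

URLs that exhaust their attempts end up in `dead` instead of cycling forever, which is the same idea as a dead-letter queue in a message broker.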

Option 1: Redis Queue with rq

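rq (Redis Queue) is one of the simplest Python task queues: jobs are ordinary functions, and Redis is the only infrastructure it needs.

pip install rq redis requests beautifulsoup4

# tasks.py
import hashlib
import json

import requests
from bs4 import BeautifulSoup

def scrape_url(url: str) -> dict:
    """Task that scrapes a single URL."""
    response = requests.get(url, timeout=30, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0"
    })
    # Raise on 4xx/5xx so rq marks the job as failed and can retry it
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.find("title")

    result = {
        "url": url,
        "status": response.status_code,
        "title": title.text.strip() if title else "",
    }

    # Save result. hashlib gives a filename that is stable across worker
    # processes; the built-in hash() is salted per process for strings.
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    with open(f"/tmp/scrape_{url_hash}.json", "w") as f:
        json.dump(result, f)

    return result

# enqueue_urls.py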
from redis import Redis
from rq import Queue, Retry
from tasks import scrape_url

redis_conn = Redis()
q = Queue(connection=redis_conn)

urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]

for url in urls:
    # rq expects a Retry object here, not a bare integer
    job = q.enqueue(scrape_url, url, retry=Retry(max=3))
    print(f"Enqueued: {url} (job {job.id})")

print(f"Queue length: {len(q)}")

# Start workers (run in separate terminals or as services)
rq worker --with-scheduler
rq worker  # Add more workers for parallel processing

Option 2: Celery with Redis/RabbitMQ

Celery is more powerful: retries and periodic scheduling (Celery beat) are built in, and monitoring is available through the Flower add-on:

pip install "celery[redis]" requests beautifulsoup4

# celery_app.py
from celery import Celery
import requests
from bs4 import BeautifulSoup

app = Celery("scraper", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")

app.conf.update(
    task_serializer="json",
    result_serializer="json",
    accept_content=["json"],
    task_acks_late=True,           # Acknowledge after the task runs, so an unacknowledged task can be redelivered
    worker_prefetch_multiplier=1,  # Each worker process reserves one task at a time
)

# Retry limits are task options rather than app-level settings, so they go on the decorator
@app.task(bind=True, max_retries=3, default_retry_delay=60)
def scrape_url(self, url: str) -> dict:
    try:
        response = requests.get(url, timeout=30, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0"
        })
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        title = soup.find("title")
        return {
            "url": url,
            "title": title.text.strip() if title else "",
            "status": response.status_code,
        }
    except requests.RequestException as exc:
        # Retry network and HTTP errors; other exceptions fail without retrying
        raise self.retry(exc=exc)

# submit_tasks.py
from celery_app import scrape_url

urls = [f"https://example.com/page/{i}" for i in range(100)]

# Submit all URLs as tasks
results = []
for url in urls:
    result = scrape_url.delay(url)
    results.append(result)

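# Wait for each result. get() blocks until the task finishes and re-raises
# the task's exception if it still failed after all retries.
for r in results:
    print(r.get(timeout=120))

# Start Celery workers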
celery -A celery_app worker --concurrency=4 --loglevel=info

# Monitor with Flower (optional)
pip install flower
celery -A celery_app flower

Option 3: RabbitMQ with pika

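Use RabbitMQ directly through pika when you need custom routing (exchanges and bindings) or explicit control over message durability and acknowledgements. The producer below publishes persistent messages to a durable queue: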

import pika
import json

# Producer: enqueue URLs
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="scrape_urls", durable=True)

urls = ["https://example.com/page/1", "https://example.com/page/2"]
for url in urls:
    channel.basic_publish(
        exchange="",
        routing_key="scrape_urls",
        body=json.dumps({"url": url}),
        properties=pika.BasicProperties(delivery_mode=2),  # Persistent
    )
    print(f"Enqueued: {url}")

connection.close()
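The producer above only covers half of the flow. A matching consumer might look like the sketch below, which assumes the same `scrape_urls` queue and JSON message shape. The `scrape` function is a hypothetical placeholder for your own scraping logic, and rejecting failed messages without requeueing is one possible policy, not the only one. The pika import sits inside `consume()` only so the parsing and callback helpers can be exercised without a broker.

```python
import json

QUEUE_NAME = "scrape_urls"  # same queue name the producer declares


def parse_message(body: bytes) -> str:
    """Extract the URL from a JSON message body of the form {"url": ...}."""
    payload = json.loads(body)
    return payload["url"]


def scrape(url: str) -> None:
    """Hypothetical placeholder for the real scraping logic."""
    print(f"Scraping: {url}")


def on_message(channel, method, properties, body) -> None:
    """Handle one delivery: ack on success, reject without requeue on failure."""
    try:
        scrape(parse_message(body))
    except Exception:
        # requeue=False avoids an endless redelivery loop for a bad message
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
    else:
        channel.basic_ack(delivery_tag=method.delivery_tag)


def consume(host: str = "localhost") -> None:
    """Connect to RabbitMQ and process messages until interrupted."""
    import pika  # imported here so the helpers above work without a broker

    connection = pika.BlockingConnection(pika.ConnectionParameters(host))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE_NAME, durable=True)
    # Deliver at most one unacknowledged message to this worker at a time
    channel.basic_qos(prefetch_count=1)
    channel.basic_consume(queue=QUEUE_NAME, on_message_callback=on_message)
    try:
        channel.start_consuming()
    except KeyboardInterrupt:
        channel.stop_consuming()
    finally:
        connection.close()
```

Call `consume()` from each worker process. Because messages are acknowledged only after `scrape` returns, a message whose worker dies mid-task stays unacknowledged and RabbitMQ can deliver it to another consumer.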

Comparison

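| Feature          | rq               | Celery             | RabbitMQ (raw) |
|------------------|------------------|--------------------|----------------|
| Setup complexity | Low              | Medium             | High           |
| Retries          | Basic            | Built-in           | Manual         |
| Monitoring       | rq-dashboard     | Flower             | Management UI  |
| Scheduling       | Via rq-scheduler | Built-in beat      | External       |
| Best for         | Simple jobs      | Production scrapers | Custom routing |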

Recommendation

Start with rq for simple projects. Move to Celery when you need richer retry policies, periodic scheduling, and monitoring. Use RabbitMQ directly only when you need custom message routing patterns.