Using Message Queues for Scraping (Redis, RabbitMQ)
Learn how to use Redis and RabbitMQ message queues to build reliable, distributed web scraping systems.
Message queues decouple URL discovery from scraping. One process finds URLs to scrape, another process does the scraping. This makes your system more reliable, scalable, and easier to maintain.
Why Use a Queue?
- Reliability, if a worker crashes, the URL stays in the queue
- Scalability, add more workers without changing code
- Rate control, workers consume at their own pace
- Retry logic, failed URLs go back to the queue
- Priority, important URLs get scraped first
Option 1: Redis Queue with rq
rq (Redis Queue) is the simplest Python task queue:
pip install rq redis
# tasks.py
import requests
from bs4 import BeautifulSoup
import json
def scrape_url(url: str) -> dict:
"""Task that scrapes a single URL."""
response = requests.get(url, timeout=30, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0"
})
soup = BeautifulSoup(response.text, "html.parser")
title = soup.find("title")
result = {
"url": url,
"status": response.status_code,
"title": title.text.strip() if title else "",
}
# Save result
with open(f"/tmp/scrape_{hash(url)}.json", "w") as f:
json.dump(result, f)
return result
# enqueue_urls.py
from redis import Redis
from rq import Queue
from tasks import scrape_url
redis_conn = Redis()
q = Queue(connection=redis_conn)
urls = [
"https://example.com/page/1",
"https://example.com/page/2",
"https://example.com/page/3",
]
for url in urls:
job = q.enqueue(scrape_url, url, retry=3)
print(f"Enqueued: {url} (job {job.id})")
print(f"Queue length: {len(q)}")
# Start workers (run in separate terminals or as services)
rq worker --with-scheduler
rq worker # Add more workers for parallel processing
Option 2: Celery with Redis/RabbitMQ
Celery is more powerful, with scheduling, retries, and monitoring built in:
pip install celery[redis]
# celery_app.py
from celery import Celery
import requests
from bs4 import BeautifulSoup
app = Celery("scraper", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")
app.conf.update(
task_serializer="json",
result_serializer="json",
accept_content=["json"],
task_acks_late=True, # Re-queue tasks if worker crashes
worker_prefetch_multiplier=1, # One task at a time per worker
task_default_retry_delay=60,
task_max_retries=3,
)
@app.task(bind=True, max_retries=3)
def scrape_url(self, url: str) -> dict:
try:
response = requests.get(url, timeout=30, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0"
})
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
title = soup.find("title")
return {
"url": url,
"title": title.text.strip() if title else "",
"status": response.status_code,
}
except Exception as exc:
raise self.retry(exc=exc)
# submit_tasks.py
from celery_app import scrape_url
urls = [f"https://example.com/page/{i}" for i in range(100)]
# Submit all URLs as tasks
results = []
for url in urls:
result = scrape_url.delay(url)
results.append(result)
# Check results
for r in results:
print(r.get(timeout=120))
# Start Celery workers
celery -A celery_app worker --concurrency=4 --loglevel=info
# Monitor with Flower (optional)
pip install flower
celery -A celery_app flower
Option 3: RabbitMQ with pika
For more advanced routing and message guarantees:
import pika
import json
# Producer: enqueue URLs
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="scrape_urls", durable=True)
urls = ["https://example.com/page/1", "https://example.com/page/2"]
for url in urls:
channel.basic_publish(
exchange="",
routing_key="scrape_urls",
body=json.dumps({"url": url}),
properties=pika.BasicProperties(delivery_mode=2), # Persistent
)
print(f"Enqueued: {url}")
connection.close()
Comparison
| Feature | rq | Celery | RabbitMQ (raw) |
|---|---|---|---|
| Setup complexity | Low | Medium | High |
| Retries | Basic | Built-in | Manual |
| Monitoring | rq-dashboard | Flower | Management UI |
| Scheduling | Via rq-scheduler | Built-in beat | External |
| Best for | Simple jobs | Production scrapers | Custom routing |
Recommendation
Start with rq for simple projects. Move to Celery when you need retries, scheduling, and monitoring. Use RabbitMQ directly only when you need custom message routing patterns.