Scrapy Middleware and Pipelines
Customize Scrapy's request/response flow with middleware and process scraped data using item pipelines for validation, cleaning, and storage.
Middleware and pipelines are the backbone of Scrapy's extensibility. Middleware intercepts requests and responses, while pipelines process items after extraction.
Downloader Middleware
Downloader middleware sits between the engine and the downloader, letting you modify requests before they are sent and responses before they reach your spider.
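As a rough illustration of the two hooks described above, the sketch below records how long each download takes. The class name `TimingMiddleware` and the meta keys `timing_start` and `download_seconds` are made up for this example; only the `process_request` and `process_response` method names and their return conventions come from Scrapy.

```python
import time


class TimingMiddleware:
    """Hypothetical middleware: measures how long each download takes."""

    def process_request(self, request, spider):
        # Runs before the request reaches the downloader.
        request.meta["timing_start"] = time.monotonic()  # illustrative meta key
        return None  # None tells Scrapy to keep processing this request

    def process_response(self, request, response, spider):
        # Runs after the download, before the response reaches the spider.
        started = request.meta.get("timing_start")
        if started is not None:
            request.meta["download_seconds"] = time.monotonic() - started
        return response  # hand the response on unchanged
```

Because the class only touches `request.meta`, it can be exercised without a running crawl by passing any object that has a `meta` dict.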
Rotating User Agents
# middlewares.py
import random


class RandomUserAgentMiddleware:
    """Assign a randomly chosen User-Agent header to every outgoing request."""

    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ]

    def process_request(self, request, spider):
        # Returning None (implicitly) lets Scrapy continue handling the request.
        request.headers["User-Agent"] = random.choice(self.user_agents)
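A possible variation, sketched here under assumptions, keeps the user-agent pool in settings.py instead of hard-coding it in the class. `USER_AGENT_POOL` is a hypothetical setting name chosen for this example, not something Scrapy defines; `from_crawler` and `settings.getlist` are the standard Scrapy hooks for reading settings.

```python
import random


class SettingsUserAgentMiddleware:
    """Variant that reads its user-agent pool from a project setting."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_POOL is a hypothetical setting name used for illustration.
        return cls(crawler.settings.getlist("USER_AGENT_POOL"))

    def process_request(self, request, spider):
        # Leave the header untouched when no pool is configured.
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None
```

With this approach the list of agents can be changed per project or per environment without editing middlewares.py.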
Integrating ScraperAPI as Middleware
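from urllib.parse import urlencode


class ScraperAPIProxyMiddleware:
    def __init__(self, api_key):
        self.api_key = api_key

    @classmethod
    def from_crawler(cls, crawler):
        return cls(api_key=crawler.settings.get("SCRAPERAPI_KEY"))

    def process_request(self, request, spider):
        # A returned Request is rescheduled and passes through this method
        # again, so skip requests that have already been rewritten.
        if request.meta.get("scraperapi_wrapped"):  # illustrative meta key
            return None
        # URL-encode the target so its own query string survives intact.
        query = urlencode({"api_key": self.api_key, "url": request.url})
        return request.replace(
            url=f"http://api.scraperapi.com/?{query}",
            meta={**request.meta, "scraperapi_wrapped": True},
        )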
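When a target URL is embedded in another URL's query string, any `&` or `?` in the target can be misread as part of the outer URL. The sketch below shows one way to encode it with the standard library; `build_proxy_url` is a hypothetical helper name, and the endpoint is the one used in the middleware above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_proxy_url(api_key, target_url, endpoint="http://api.scraperapi.com/"):
    """Hypothetical helper: wrap target_url in a proxy-API URL, percent-encoded."""
    return f"{endpoint}?{urlencode({'api_key': api_key, 'url': target_url})}"


target = "https://example.com/search?q=scrapy&page=2"
proxied = build_proxy_url("YOUR_API_KEY", target)

# Decoding the outer query string recovers the target URL unchanged.
outer_params = parse_qs(urlsplit(proxied).query)
recovered = outer_params["url"][0]
```

Without the encoding step, `page=2` would be parsed as a parameter of the proxy URL rather than of the target page.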
Enable middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    "quotescraper.middlewares.RandomUserAgentMiddleware": 400,
    "quotescraper.middlewares.ScraperAPIProxyMiddleware": 350,
}
SCRAPERAPI_KEY = "YOUR_API_KEY"
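The numbers in DOWNLOADER_MIDDLEWARES control call order: Scrapy calls `process_request` in increasing order and `process_response` in decreasing order, and a value of `None` disables a middleware. The sketch below only simulates that ordering for the two entries above; the helper names are invented for illustration, and a real project also merges in Scrapy's built-in middlewares.

```python
DOWNLOADER_MIDDLEWARES = {
    "quotescraper.middlewares.RandomUserAgentMiddleware": 400,
    "quotescraper.middlewares.ScraperAPIProxyMiddleware": 350,
}


def request_order(middlewares):
    """Hypothetical helper: order in which process_request hooks would run."""
    active = {path: order for path, order in middlewares.items() if order is not None}
    return sorted(active, key=active.get)


def response_order(middlewares):
    """Hypothetical helper: process_response hooks run in the reverse order."""
    return list(reversed(request_order(middlewares)))


for path in request_order(DOWNLOADER_MIDDLEWARES):
    print("process_request:", path)
```

With the values shown, the ScraperAPI middleware (350) sees each request before the user-agent middleware (400).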
Item Pipelines
Pipelines process every item yielded by your spiders. Use them for validation, cleaning, deduplication, and storage.
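For the storage case, a pipeline can also define `open_spider` and `close_spider`, which Scrapy calls once at the start and end of a crawl. The following is a minimal sketch that writes items to a JSON Lines file; the class name `JsonLinesPipeline` and the default path `items.jsonl` are assumptions made for this example.

```python
import json


class JsonLinesPipeline:
    """Hypothetical storage pipeline: writes each item as one JSON line."""

    def __init__(self, path="items.jsonl"):  # illustrative default path
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline
```

Returning the item from `process_item` matters: a pipeline that returns nothing would hand `None` to the next stage.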
Cleaning and Validating Data
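# pipelines.py
from scrapy.exceptions import DropItem


class CleanPricePipeline:
    """Strip currency symbols and convert the price to a float."""

    def process_item(self, item, spider):
        if item.get("price"):
            raw = item["price"].replace("£", "").replace("$", "").strip()
            try:
                item["price"] = float(raw)
            except ValueError:
                # Drop items whose price cannot be parsed instead of crashing.
                raise DropItem(f"Unparseable price: {raw!r}")
        return item


class DuplicateFilterPipeline:
    """Drop items whose name has already been seen during this crawl."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = item.get("name", "")
        if key in self.seen:
            raise DropItem(f"Duplicate item: {key}")
        self.seen.add(key)
        return item


class RequiredFieldsPipeline:
    """Drop items that are missing any required field."""

    required = ["name", "price"]

    def process_item(self, item, spider):
        for field in self.required:
            # Compare against None and "" so a legitimate price of 0.0 is kept.
            if item.get(field) in (None, ""):
                raise DropItem(f"Missing required field: {field}")
        return item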
Enable pipelines in settings.py:
ITEM_PIPELINES = {
    "quotescraper.pipelines.CleanPricePipeline": 100,
    "quotescraper.pipelines.DuplicateFilterPipeline": 200,
    "quotescraper.pipelines.RequiredFieldsPipeline": 300,
}
The numbers (100, 200, 300) determine the order in which items pass through the pipelines: lower numbers run first. By convention, these values fall in the 0-1000 range.
Pipeline Execution Order
| Priority | Pipeline | Purpose |
|---|---|---|
| 100 | CleanPricePipeline | Clean and normalize data |
| 200 | DuplicateFilterPipeline | Remove duplicate items |
| 300 | RequiredFieldsPipeline | Validate required fields |
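To make the ordering in the table concrete, the sketch below runs three simplified stages in priority order without Scrapy. It is only a simulation: `DropItem` here is a local stand-in for `scrapy.exceptions.DropItem`, and the stage functions and `run_pipelines` are hypothetical names that mirror the pipelines above in reduced form.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem so the sketch runs alone."""


seen_names = set()


def clean_price(item):
    if item.get("price"):
        item["price"] = float(item["price"].replace("£", "").replace("$", "").strip())
    return item


def filter_duplicates(item):
    name = item.get("name", "")
    if name in seen_names:
        raise DropItem(f"Duplicate item: {name}")
    seen_names.add(name)
    return item


def require_fields(item):
    for field in ("name", "price"):
        if not item.get(field):
            raise DropItem(f"Missing required field: {field}")
    return item


# Declared out of order on purpose: the priority, not the position, decides.
PIPELINES = {require_fields: 300, clean_price: 100, filter_duplicates: 200}


def run_pipelines(item):
    """Pass the item through every stage, lowest priority number first."""
    try:
        for stage in sorted(PIPELINES, key=PIPELINES.get):
            item = stage(item)
    except DropItem:
        return None  # a dropped item never reaches the later stages
    return item


first = run_pipelines({"name": "Book", "price": "£12.50"})
second = run_pipelines({"name": "Book", "price": "£12.50"})  # same name again
third = run_pipelines({"name": "Pen"})  # no price
```

Here `first` comes back with a numeric price, while `second` is dropped as a duplicate and `third` is dropped for the missing price.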
Tips
- Keep each pipeline focused on one responsibility.
- Use DropItem to remove bad data early in the pipeline chain.
- Middleware is the right place to add proxy rotation, header management, and retry logic.
- Services like ScrapingAnt can also be integrated via middleware for JavaScript rendering and proxy rotation.
Next Steps
- Build a database storage pipeline to save items to SQLite or PostgreSQL
- Explore spider middleware for filtering or modifying spider output