
Scrapy Middleware and Pipelines

Customize Scrapy's request/response flow with middleware and process scraped data using item pipelines for validation, cleaning, and storage.

Python Scraping · #6 · intermediate · 2 min read

Middleware and pipelines are the backbone of Scrapy's extensibility. Middleware intercepts requests and responses, while pipelines process items after extraction.

Downloader Middleware

Downloader middleware sits between the engine and the downloader, letting you modify requests before they are sent and responses before they reach your spider.
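
As a rough sketch of the two hooks, the hypothetical middleware below stamps each request on its way out and records the elapsed time when the response comes back. The class name and the `timing_start` and `download_seconds` meta keys are made up for illustration; only the `process_request` and `process_response` method names come from Scrapy.

```python
import logging
import time

logger = logging.getLogger(__name__)


class TimingMiddleware:
    """Hypothetical example: measure how long each download takes."""

    def process_request(self, request, spider):
        # Called for each request on its way to the downloader.
        request.meta["timing_start"] = time.monotonic()
        # Returning None tells Scrapy to keep processing this request.
        return None

    def process_response(self, request, response, spider):
        # Called for each response on its way back to the spider.
        start = request.meta.pop("timing_start", None)
        if start is not None:
            elapsed = time.monotonic() - start
            request.meta["download_seconds"] = elapsed
            logger.debug("%s took %.3fs", request.url, elapsed)
        # process_response has to hand back a response (or a new request).
        return response
```

The measured time includes any middleware that runs between this one and the downloader, so the number depends on where the class sits in `DOWNLOADER_MIDDLEWARES`.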

Rotating User Agents

# middlewares.py
import random


class RandomUserAgentMiddleware:
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
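
A possible variation keeps the user-agent pool out of the code and reads it from the project settings instead. This is only a sketch: `USER_AGENT_LIST` is an assumed custom setting name, not a built-in Scrapy setting, and the class name is hypothetical.

```python
import random


class ConfigurableUserAgentMiddleware:
    """Hypothetical variant that reads its user-agent pool from settings."""

    def __init__(self, user_agents):
        if not user_agents:
            raise ValueError("USER_AGENT_LIST must contain at least one entry")
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is an assumed custom setting, not a built-in one.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        # Pick a fresh user agent for every outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None
```

With this version, the list of strings would live in settings.py under `USER_AGENT_LIST`, so it can be changed per project without touching middlewares.py.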

Integrating ScraperAPI as Middleware

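from urllib.parse import urlencode


class ScraperAPIProxyMiddleware:
    endpoint = "http://api.scraperapi.com/"

    def __init__(self, api_key):
        self.api_key = api_key

    @classmethod
    def from_crawler(cls, crawler):
        return cls(api_key=crawler.settings.get("SCRAPERAPI_KEY"))

    def process_request(self, request, spider):
        # A returned Request is rescheduled and passes through this method
        # again, so skip requests that already point at the API endpoint.
        if request.url.startswith(self.endpoint):
            return None
        # Percent-encode the target URL so its own query string survives.
        query = urlencode({"api_key": self.api_key, "url": request.url})
        return request.replace(url=f"{self.endpoint}?{query}")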
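
The sketch below shows why the target URL should be percent-encoded before it is embedded as a query parameter. It uses only the standard library; `build_proxy_url` is a hypothetical helper, and the endpoint and parameter names are copied from the middleware above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "http://api.scraperapi.com/"  # endpoint as used in the middleware above


def build_proxy_url(api_key, target_url):
    """Hypothetical helper: wrap target_url with percent-encoding applied."""
    query = urlencode({"api_key": api_key, "url": target_url})
    return f"{API_ENDPOINT}?{query}"


target = "https://example.com/search?q=scrapy&page=2"

# Naive string formatting: "&page=2" is read as a parameter of the proxy URL.
naive = f"{API_ENDPOINT}?api_key=KEY&url={target}"
print(parse_qs(urlsplit(naive).query)["url"])

# Encoded: the full target URL survives the round trip.
safe = build_proxy_url("KEY", target)
print(parse_qs(urlsplit(safe).query)["url"])
```

The first `print` shows the target URL cut off at its first `&`, while the second returns it intact.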

Enable middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    "quotescraper.middlewares.RandomUserAgentMiddleware": 400,
    "quotescraper.middlewares.ScraperAPIProxyMiddleware": 350,
}
SCRAPERAPI_KEY = "YOUR_API_KEY"
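
Hard-coding a key in settings.py makes it easy to commit by accident. A minimal alternative, assuming the key is exported in an environment variable that is also named `SCRAPERAPI_KEY` (a naming choice made here, not a Scrapy or ScraperAPI convention), could look like this:

```python
# settings.py
import os

DOWNLOADER_MIDDLEWARES = {
    "quotescraper.middlewares.ScraperAPIProxyMiddleware": 350,
    "quotescraper.middlewares.RandomUserAgentMiddleware": 400,
}

# Falls back to an empty string when the variable is not set.
SCRAPERAPI_KEY = os.environ.get("SCRAPERAPI_KEY", "")
```

For downloader middleware, `process_request` methods run in increasing order of these numbers and `process_response` methods run in decreasing order, so the entry at 350 sees each request before the entry at 400.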

Item Pipelines

Pipelines process every item yielded by your spiders. Use them for validation, cleaning, deduplication, and storage.

Cleaning and Validating Data

# pipelines.py
from scrapy.exceptions import DropItem


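class CleanPricePipeline:
    def process_item(self, item, spider):
        price = item.get("price")
        if isinstance(price, str) and price.strip():
            # Strip currency symbols and thousands separators before converting.
            raw = price.replace("£", "").replace("$", "").replace(",", "").strip()
            try:
                item["price"] = float(raw)
            except ValueError:
                raise DropItem(f"Unparseable price: {price!r}")
        return item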


class DuplicateFilterPipeline:
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = item.get("name", "")
        if key in self.seen:
            raise DropItem(f"Duplicate item: {key}")
        self.seen.add(key)
        return item


class RequiredFieldsPipeline:
    required = ["name", "price"]

    def process_item(self, item, spider):
        for field in self.required:
            if not item.get(field):
                raise DropItem(f"Missing required field: {field}")
        return item
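
Price strings in the wild are often messier than a currency symbol followed by digits, and a plain `float()` call raises `ValueError` on input such as "1,299.00". The helper below is one hedged way to parse them more defensively. `parse_price` is a hypothetical name, and the code assumes "." is the decimal separator and "," groups thousands, which does not hold for every locale.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(raw):
    """Hypothetical helper: return a Decimal, or None if raw is not a price.

    Assumes "." is the decimal separator and "," groups thousands.
    """
    if raw is None:
        return None
    # Drop currency symbols, spaces, and thousands separators.
    cleaned = re.sub(r"[^\d.\-]", "", str(raw))
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


for sample in ["£51.77", "$1,299.00", " 20 ", "N/A", None]:
    print(repr(sample), "->", parse_price(sample))
```

`Decimal` is used here instead of `float` to avoid binary rounding in monetary values; a pipeline could call this helper and raise `DropItem` when it returns `None`.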

Enable pipelines in settings.py:

ITEM_PIPELINES = {
    "quotescraper.pipelines.CleanPricePipeline": 100,
    "quotescraper.pipelines.DuplicateFilterPipeline": 200,
    "quotescraper.pipelines.RequiredFieldsPipeline": 300,
}

The number assigned to each pipeline (100, 200, 300) sets the execution order: lower numbers run first. By convention these values fall in the 0-1000 range.

Pipeline Execution Order

Priority   Pipeline                   Purpose
100        CleanPricePipeline         Clean and normalize data
200        DuplicateFilterPipeline    Remove duplicate items
300        RequiredFieldsPipeline     Validate required fields
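
To make the ordering concrete, here is a simplified model of how an item moves through a pipeline chain. It is not Scrapy's actual implementation: `run_pipelines`, `StripName`, and `RequireName` are hypothetical, and `DropItem` is a local stand-in so the sketch runs without Scrapy installed.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class StripName:
    def process_item(self, item, spider):
        item["name"] = item.get("name", "").strip()
        return item


class RequireName:
    def process_item(self, item, spider):
        if not item["name"]:
            raise DropItem("Missing required field: name")
        return item


def run_pipelines(item, pipelines_by_priority, spider=None):
    """Simplified model: apply pipelines in ascending priority order."""
    for _, pipeline in sorted(pipelines_by_priority.items()):
        try:
            item = pipeline.process_item(item, spider)
        except DropItem as exc:
            # A dropped item skips every later pipeline.
            print(f"dropped: {exc}")
            return None
    return item


# The dict order does not matter; only the priority numbers do.
pipelines = {200: RequireName(), 100: StripName()}
print(run_pipelines({"name": "  Widget "}, pipelines))
print(run_pipelines({"name": "   "}, pipelines))
```

The first item comes out with its name stripped; the second is dropped by `RequireName` because `StripName` already reduced it to an empty string, which is why cleaning is scheduled before validation.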

Tips

  • Keep each pipeline focused on one responsibility.
  • Use DropItem to remove bad data early in the pipeline chain.
  • Middleware is the right place to add proxy rotation, header management, and retry logic.
  • Services like ScrapingAnt can also be integrated via middleware for JavaScript rendering and proxy rotation.
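
For retry logic specifically, Scrapy's built-in retry middleware can often be tuned through settings before any custom middleware is written. The values below are illustrative, not recommendations.

```python
# settings.py
RETRY_ENABLED = True  # the built-in retry middleware is enabled by default
RETRY_TIMES = 3  # retries per request, in addition to the first attempt
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]  # statuses treated as retryable
DOWNLOAD_DELAY = 1.0  # seconds to wait between requests to the same site
```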

Next Steps

  • Build a database storage pipeline to save items to SQLite or PostgreSQL
  • Explore spider middleware for filtering or modifying spider output
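
As a starting point for the storage pipeline, here is a minimal SQLite sketch built on the standard-library `sqlite3` module. The class name, the `items.db` path, and the two-column table are assumptions made for illustration; the method signatures follow the ones used in this article.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical storage pipeline: one row per item with name and price."""

    def __init__(self, db_path="items.db"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (name TEXT PRIMARY KEY, price REAL)"
        )

    def process_item(self, item, spider):
        # INSERT OR REPLACE keeps the latest price for a repeated name.
        self.conn.execute(
            "INSERT OR REPLACE INTO items (name, price) VALUES (?, ?)",
            (item.get("name"), item.get("price")),
        )
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes; persist and release the file.
        self.conn.commit()
        self.conn.close()
```

A pipeline like this would typically be registered in `ITEM_PIPELINES` with a higher number than the cleaning and validation pipelines (for example 400), so that only items that pass those checks reach the database.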