

Custom Middlewares for Headers, Proxies, Cookies

Three middleware patterns every production Scrapy project ships: User-Agent rotation, proxy injection, cookie session management.

What you’ll learn

  • Write a downloader middleware that rotates User-Agents per request.
  • Inject a proxy from a pool with retry-on-failure logic.
  • Manage per-session cookie jars cleanly.

Downloader middleware is the layer where Scrapy's network behavior becomes pluggable. Three patterns cover 90% of what production scrapers do here.

The middleware contract

class MyMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        ...

    def process_request(self, request, spider):
        # Called for every outbound request.
        # Return None → continue to next middleware
        # Return Response → short-circuit (cached response)
        # Return Request → replace the request
        # Raise IgnoreRequest → drop
        return None

    def process_response(self, request, response, spider):
        # Called for every response.
        # Return response → continue
        # Return Request → re-schedule (e.g. retry)
        # Raise IgnoreRequest → drop
        return response

    def process_exception(self, request, exception, spider):
        # Called on download exceptions.
        ...

Enable in settings.py:

DOWNLOADER_MIDDLEWARES = {
  "myproject.middlewares.UserAgentRotator": 400,
  "myproject.middlewares.ProxyMiddleware": 410,
  "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,  # disable default
}

Setting a middleware to `None` disables it, which is handy when you're replacing the built-in UA middleware.
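If only one spider needs the override, the same dict can live on the spider class via `custom_settings`. A quick sketch; the spider itself is hypothetical:

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"  # illustrative spider

    # Same dict shape as in settings.py, scoped to this spider only.
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "myproject.middlewares.UserAgentRotator": 400,
            "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
        }
    }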

User-Agent rotation

The naive version picks randomly per request:

import random

UAS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

class UserAgentRotator:
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(UAS)

This is worse than nothing against serious anti-bot systems. A random Chrome 120 UA combined with Safari headers and a Linux TLS fingerprint screams "scraper." Better: pin one UA per session.

class StickyUserAgentMiddleware:
    """One UA per cookiejar/session, consistent fingerprint."""

    def __init__(self):
        self.session_ua = {}

    def process_request(self, request, spider):
        key = request.meta.get("cookiejar", "default")
        if key not in self.session_ua:
            self.session_ua[key] = random.choice(UAS)
        request.headers["User-Agent"] = self.session_ua[key]

Now Session A is "Safari on Mac" for its entire lifecycle. Session B is "Chrome on Windows." Sites see consistent fingerprints.

We cover full header fingerprinting (Sec-CH-UA, Accept-Language, Sec-Fetch-*) in §4.33.
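As a preview of the idea, here's a minimal sketch that pins a whole header profile per session rather than a lone UA string. The profile contents below are illustrative, not verified browser captures; build real ones from recorded browser traffic.

import random

# Illustrative profiles only; capture real ones from actual browser traffic.
PROFILES = [
    {
        # Safari on Mac sends no Sec-CH-UA client-hint headers at all.
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-CH-UA": '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
        "Sec-CH-UA-Mobile": "?0",
        "Sec-CH-UA-Platform": '"Windows"',
    },
]

class StickyProfileMiddleware:
    def __init__(self):
        self.session_profile = {}

    def process_request(self, request, spider):
        key = request.meta.get("cookiejar", "default")
        # Pick a profile once per session, then apply every header from it.
        profile = self.session_profile.setdefault(key, random.choice(PROFILES))
        for name, value in profile.items():
            request.headers[name] = value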

Proxy injection from a pool

The basic version leans on Scrapy's built-in HttpProxyMiddleware, which honors request.meta["proxy"]; your middleware only has to set it:

class ProxyMiddleware:
    def __init__(self, settings):
        self.pool = settings.getlist("PROXY_POOL")
        self.idx = 0

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        if "proxy" not in request.meta:
            request.meta["proxy"] = self.pool[self.idx % len(self.pool)]
            self.idx += 1

For authenticated proxies, use the http://user:pass@host:port form.
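For instance, the pool this middleware consumes might look like this in settings.py. PROXY_POOL is this lesson's own setting name, and the endpoints are placeholders:

# settings.py
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]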

Retry on proxy failure

When a proxy returns 407, 502, or times out, swap it and retry:

import random

BAD_STATUS = {407, 502, 503}

class ProxyFailoverMiddleware:
    def __init__(self, settings):
        self.pool = list(settings.getlist("PROXY_POOL"))
        self.dead = set()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def _live(self):
        # Fall back to the full pool rather than crash when every proxy is marked dead.
        return [p for p in self.pool if p not in self.dead] or self.pool

    def process_request(self, request, spider):
        if "proxy" not in request.meta:
            request.meta["proxy"] = random.choice(self._live())

    def _retire_and_retry(self, request):
        proxy = request.meta.pop("proxy", None)
        if proxy:
            self.dead.add(proxy)
        request.dont_filter = True
        return request  # re-schedule with a fresh proxy

    def process_response(self, request, response, spider):
        if response.status in BAD_STATUS:
            return self._retire_and_retry(request)
        return response

    def process_exception(self, request, exception, spider):
        return self._retire_and_retry(request)

Setting dont_filter = True bypasses the dupefilter so the retried request actually goes through. In production the dead set should also expire entries after N minutes rather than blacklist a proxy forever; §4.30 builds a full pool manager.
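As a taste of that, here's a minimal sketch of a time-based dead set, assuming a fixed TTL. The class and its default are ours, not Scrapy's:

import time

class ExpiringDeadSet:
    """Blacklist whose entries are quarantined only for ttl_seconds."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._marked = {}  # proxy -> time it was declared dead

    def add(self, proxy):
        self._marked[proxy] = time.time()

    def __contains__(self, proxy):
        marked = self._marked.get(proxy)
        if marked is None:
            return False
        if time.time() - marked > self.ttl:
            del self._marked[proxy]  # quarantine over, eligible again
            return False
        return True

Because it exposes add and __contains__, it can replace the plain set() in ProxyFailoverMiddleware without touching any other line.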

Cookie session management

Scrapy's built-in CookiesMiddleware handles cookies. The piece you usually customize is which session a request belongs to. Use meta["cookiejar"]:

def start_requests(self):
    for i in range(5):
        yield scrapy.Request(
            "https://practice.scrapingcentral.com/login",
            meta={"cookiejar": f"session_{i}"},
            callback=self.login,
        )

Each cookiejar value gets its own jar. Subsequent requests carrying the same cookiejar reuse the cookies. This is the cleanest way to run multiple logged-in identities in one spider.
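The callback just has to keep passing the jar along. A sketch of the login step; the form field names and credentials are placeholders:

def login(self, response):
    yield scrapy.FormRequest.from_response(
        response,
        formdata={"username": "user", "password": "secret"},  # placeholders
        # Reuse the same jar so the session cookie lands in this identity.
        meta={"cookiejar": response.meta["cookiejar"]},
        callback=self.after_login,
    )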

Refreshing expired sessions

A middleware can detect logged-out responses and re-login:

import scrapy

class SessionRefreshMiddleware:
    def process_response(self, request, response, spider):
        if "Please log in" in response.text and not request.meta.get("re_login"):
            # Must *return* (not yield) the Request: Scrapy then re-schedules
            # it in place of this response.
            return scrapy.Request(
                "https://target.com/login",
                meta={
                    "cookiejar": request.meta.get("cookiejar"),
                    "re_login": True,
                    "next_url": request.url,
                },
                callback=self._login_again,  # re-auth flow elided, see below
                dont_filter=True,
            )
        return response

The full pattern (re-auth flow, queueing the original request, retrying once the cookies are refreshed) is more involved; we cover it in the auth chapters of Sub-Path 4's distributed lessons.

Middleware ordering

The order matters. Scrapy's defaults run at known priorities:

Middleware                   Priority
HttpAuthMiddleware           300
DownloadTimeoutMiddleware    350
UserAgentMiddleware          500
RetryMiddleware              550
HttpCompressionMiddleware    590
CookiesMiddleware            700
HttpProxyMiddleware          750

Your custom UA rotator slots in at 400 or 500, replacing the disabled default. Your proxy middleware goes near 750, beside HttpProxyMiddleware. If your middleware needs to see the final outbound state of a request, give it a higher number so its process_request runs after the others'.

Hands-on lab

Against /challenges/antibot/header-fingerprint at Catalog108:

  1. Write a StickyUserAgentMiddleware that pins one UA per session.
  2. Write a matching Accept-Language and Sec-CH-UA setter so the fingerprint is consistent (Safari on Mac doesn't send Sec-CH-UA; Chrome does; make sure your headers agree with your UA).
  3. Run the spider and observe: the challenge endpoint passes when fingerprints are coherent, fails when they conflict.

Coherent fingerprints beat random ones, always.

