Scrapy Architecture: Engine, Scheduler, Spiders, Pipelines, Middlewares
The six pieces inside Scrapy and how a request flows through them. Once you can draw this diagram, every Scrapy mystery becomes debuggable.
What you’ll learn
- Name the six core components of Scrapy and what each does.
- Trace a request from spider through engine, downloader, and back.
- Distinguish downloader middleware from spider middleware.
- Identify the right component to extend when you need to add a feature.
Most Scrapy bugs and most "how do I add X?" questions evaporate once you have the architecture diagram in your head. Twenty minutes to internalize this saves weeks of confusion later.
The six components
┌──────────────────────────────────────────────┐
│                    ENGINE                    │
│  (the controller, moves data between parts)  │
└──────────────────────────────────────────────┘
        │               │               │
        ▼               ▼               ▼
   SCHEDULER         SPIDERS        DOWNLOADER
 (request queue)   (your code)    (HTTP client)
        │               │               │
        │               ▼               │
        │        ITEM PIPELINES         │
        │       (post-processing)       │
        │               │               │
        │               ▼               │
        │         (FEEDS / DB)          │
        │                               │
        ▼                               ▼
 SPIDER MIDDLEWARES          DOWNLOADER MIDDLEWARES
 (between engine ↔ spider)   (between engine ↔ downloader)
Six moving parts:
- Engine, the controller. Owns the event loop, hands requests around, fires signals.
- Scheduler, the request queue. Decides what to fetch next. Can be in-memory or persistent (JOBDIR).
- Downloader, the HTTP client. Sends requests, receives responses. Built on Twisted.
- Spiders, your code. Generate initial requests, parse responses, yield more requests or items (a minimal example follows this list).
- Item Pipelines, post-processing. Validate, dedupe, enrich, store. Run sequentially.
- Middlewares, pluggable interceptors. Two flavors: downloader middlewares (engine ↔ downloader) and spider middlewares (engine ↔ spider).
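To make the "your code" component concrete, here is a minimal spider in the shape Scrapy expects. The selectors assume the quotes.toscrape.com demo site; swap them for your own target.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each parsed quote becomes an item headed for the pipelines.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # The "next page" link becomes a new request headed for the scheduler.
        yield from response.follow_all(css="li.next a", callback=self.parse)

Everything else in the diagram (engine, scheduler, downloader, middlewares) is what Scrapy wraps around this class.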
The request lifecycle
Trace one URL from start_urls to a row in your database:
1. The spider yields a Request (or start_requests() does).
2. Spider middlewares (output side) see the request.
3. The engine hands it to the scheduler.
4. The scheduler stores it (in memory or on disk). The dupefilter rejects it if it has been seen before.
5. The engine pulls the next request from the scheduler when capacity allows.
6. Downloader middlewares (request side) see it; they can modify it, drop it, or swap in a cached response. Proxy and User-Agent injection happen here.
7. The downloader sends the HTTP request and receives the response.
8. Downloader middlewares (response side) see it; they can retry, redirect, or pass it on. RetryMiddleware and RedirectMiddleware live here.
9. The engine sends the response back to the spider's callback.
10. Spider middlewares (input side) see the response on its way in.
11. The spider parses, yielding items and/or more requests.
12. Spider middlewares (output side) see the yielded items and requests.
13. Items flow into the item pipelines.
14. Pipelines run in order: validation → dedup → enrichment → storage.
That's the entire system. Every Scrapy feature plugs in somewhere on this path.
Downloader vs spider middleware, the constant confusion
Both intercept things. The difference:
- Downloader middleware sits between the engine and the HTTP downloader. It sees raw Request and Response objects, before any spider code runs. Use it for: proxies, headers, cookies, retries, caching, fingerprint randomization.
- Spider middleware sits between the engine and the spider's callbacks. It sees what the spider yields (items, requests) and what the spider receives (responses). Use it for: filtering output, manipulating start_requests, handling spider exceptions.
Rule of thumb: if your concern is HTTP-level (headers, proxies, retries), it's downloader middleware. If your concern is item-level or spider-output filtering, it's spider middleware.
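As a sketch of the distinction, here is one of each. The class names, the USER_AGENTS list, and the required "name" field are illustrative, not Scrapy built-ins; the hook signatures (process_request, process_spider_output) are the real ones. Both classes would be enabled through the DOWNLOADER_MIDDLEWARES and SPIDER_MIDDLEWARES settings, shown later in this section.

import random

class RotateUserAgentMiddleware:
    # Downloader middleware: an HTTP-level concern, runs before the request hits the network.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    def process_request(self, request, spider):
        # Mutate the outgoing request in place; returning None lets it continue downstream.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None

class DropUnnamedItemsMiddleware:
    # Spider middleware: filters what the spider yields before it reaches the engine.
    def process_spider_output(self, response, result, spider):
        for obj in result:
            # Requests pass through untouched; dict items missing "name" are dropped.
            if isinstance(obj, dict) and not obj.get("name"):
                spider.logger.debug("dropping unnamed item from %s", response.url)
                continue
            yield obj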
Where to extend for common features
| You want to... | Extend... |
|---|---|
| Rotate User-Agent | Downloader middleware |
| Inject proxies | Downloader middleware |
| Add retry on a custom 429 pattern | Downloader middleware |
| Cache responses to disk | Downloader middleware (already built-in) |
| Skip items missing a required field | Item pipeline |
| Deduplicate items by SKU | Item pipeline |
| Write items to Postgres | Item pipeline |
| Filter start_urls based on a DB | Spider middleware (or override start_requests) |
| Add per-spider stats | Extension (signals) |
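For the two "Item pipeline" rows in the middle of the table, a sketch of what they might look like. The field names (list_price, sku) and class names are illustrative; DropItem and the process_item/open_spider hooks are Scrapy's.

from scrapy.exceptions import DropItem

class RequiredFieldsPipeline:
    # Skip items missing a required field.
    def process_item(self, item, spider):
        if not item.get("list_price"):
            raise DropItem("missing list_price")
        return item

class DedupeBySkuPipeline:
    # Deduplicate items by SKU within one crawl.
    def open_spider(self, spider):
        self.seen_skus = set()

    def process_item(self, item, spider):
        sku = item.get("sku")
        if sku in self.seen_skus:
            raise DropItem(f"duplicate sku: {sku}")
        self.seen_skus.add(sku)
        return item

Raising DropItem stops the item; returning it passes it to the next pipeline in ITEM_PIPELINES order.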
Items, requests, and the things spiders yield
A spider callback returns an iterable. Each yielded value is one of:
- A scrapy.Request, which goes to the scheduler for fetching.
- A scrapy.Item, which goes to the item pipelines.
- A dict, treated the same as an Item.
- None, which is ignored.
Mixed yields are normal. A product listing page yields product detail requests, a "next page" request, and possibly summary items.
def parse_listing(self, response):
    for card in response.css(".product-card"):
        yield {  # summary item, goes to the pipelines
            "url": card.css("a::attr(href)").get(),
            "list_price": card.css(".price::text").get(),
        }
        yield response.follow(  # detail request, goes to the scheduler
            card.css("a::attr(href)").get(), self.parse_detail
        )
    if next_page := response.css("a.next::attr(href)").get():
        yield response.follow(next_page, self.parse_listing)
The engine sorts them out.
The settings.py file is configuration for this architecture
When you set DOWNLOADER_MIDDLEWARES, you're injecting code into steps 6 and 8 of the lifecycle above. When you set ITEM_PIPELINES, you're configuring step 13. CONCURRENT_REQUESTS controls how aggressively the engine pulls from the scheduler. Every setting maps to one of the six components.
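A sketch of that configuration, reusing the hypothetical module paths from the earlier examples. The numbers are ordering priorities: for middlewares, lower numbers sit closer to the engine; for pipelines, lower numbers run first.

# settings.py (module paths are hypothetical)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotateUserAgentMiddleware": 400,   # lifecycle steps 6 and 8
}

SPIDER_MIDDLEWARES = {
    "myproject.middlewares.DropUnnamedItemsMiddleware": 500,  # steps 2, 10, and 12
}

ITEM_PIPELINES = {
    "myproject.pipelines.RequiredFieldsPipeline": 300,        # step 13: runs first
    "myproject.pipelines.DedupeBySkuPipeline": 400,
}

CONCURRENT_REQUESTS = 16  # how many requests the engine keeps in flight at once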
Signals, the back channel
Signals are the way to react to engine events without subclassing anything: spider_opened, item_scraped, response_received, spider_closed. Register a handler in an extension and Scrapy calls you. This is how you build custom metrics, send notifications, or rotate state.
from scrapy import signals

class MyExtension:
    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler when building the extension; hook into signals here.
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        spider.logger.info(f"closed: {reason}")
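Scrapy only instantiates the extension if it is registered in the EXTENSIONS setting (the module path here is again hypothetical):

EXTENSIONS = {
    "myproject.extensions.MyExtension": 500,
}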
Hands-on lab
In a Scrapy project, set LOG_LEVEL = "DEBUG" and run a small crawl. Watch the log lines:
- Crawled (200) <GET ...>: the downloader returned a response.
- Filtered duplicate request: the dupefilter caught a repeat.
- Scraped from <200 ...>: an item left the spider, headed to the pipelines.
Each log line corresponds to a transition between two of the six components above. When you can read the log as "engine just moved X from scheduler to downloader" you've internalized the architecture.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.