Scrapy Architecture: Engine, Scheduler, Spiders, Pipelines, Middlewares
The six pieces inside Scrapy and how a request flows through them. Once you can draw this diagram, every Scrapy mystery becomes debuggable.
What you’ll learn
- Name the six core components of Scrapy and what each does.
- Trace a request from spider through engine, downloader, and back.
- Distinguish downloader middleware from spider middleware.
- Identify the right component to extend when you need to add a feature.
Most Scrapy bugs and most "how do I add X?" questions evaporate once you have the architecture diagram in your head. Twenty minutes to internalize this saves weeks of confusion later.
The six components
┌──────────────────────────────────────────────┐
│                    ENGINE                    │
│  (the controller, moves data between parts)  │
└──────────────────────────────────────────────┘
        │               │               │
        ▼               ▼               ▼
   SCHEDULER         SPIDERS        DOWNLOADER
 (request queue)   (your code)    (HTTP client)
        │               │               │
        │               ▼               │
        │        ITEM PIPELINES         │
        │       (post-processing)       │
        │               │               │
        │               ▼               │
        │         (FEEDS / DB)          │
        │                               │
        ▼                               ▼
 SPIDER MIDDLEWARES          DOWNLOADER MIDDLEWARES
 (between engine ↔ spider)   (between engine ↔ downloader)
Six moving parts:
- Engine, the controller. Owns the event loop, hands requests around, fires signals.
- Scheduler, the request queue. Decides what to fetch next. Can be in-memory or persistent (JOBDIR).
- Downloader, the HTTP client. Sends requests, receives responses. Built on Twisted.
- Spiders, your code. Generate initial requests, parse responses, yield more requests or items (a minimal example follows this list).
- Item Pipelines, post-processing. Validate, dedupe, enrich, store. Run sequentially.
- Middlewares, pluggable interceptors. Two flavors: downloader middlewares (engine ↔ downloader) and spider middlewares (engine ↔ spider).
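To make the "your code" component concrete, here is a minimal spider in the shape Scrapy expects. The selectors assume the quotes.toscrape.com demo site; swap them for your own target.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each parsed quote becomes an item headed for the pipelines.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # The "next page" link becomes a new request headed for the scheduler.
        yield from response.follow_all(css="li.next a", callback=self.parse)

Everything else in the diagram (engine, scheduler, downloader, middlewares) is what Scrapy wraps around this class.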
The request lifecycle
Trace one URL from start_urls to a row in your database:
1. The spider yields a Request (or start_requests() does).
2. Spider middlewares (output side) see the request.
3. The engine hands it to the scheduler.
4. The scheduler stores it (in memory or on disk). The dupefilter rejects it if it has been seen before.
5. The engine pulls the next request from the scheduler when capacity allows.
6. Downloader middlewares (request side) see it; they can modify it, drop it, or swap in a cached response. Proxy and User-Agent injection happen here.
7. The downloader sends the HTTP request and receives the response.
8. Downloader middlewares (response side) see it; they can retry, redirect, or pass it on. RetryMiddleware and RedirectMiddleware live here.
9. The engine sends the response back to the spider's callback.
10. Spider middlewares (input side) see the response on its way in.
11. The spider parses, yielding items and/or more requests.
12. Spider middlewares (output side) see the yielded items and requests.
13. Items flow into the item pipelines.
14. Pipelines run in order: validation → dedup → enrichment → storage.
That's the entire system. Every Scrapy feature plugs in somewhere on this path.
Downloader vs spider middleware, the constant confusion
Both intercept things. The difference:
- Downloader middleware sits between the engine and the HTTP downloader. It sees raw Request and Response objects, before any spider code runs. Use it for: proxies, headers, cookies, retries, caching, fingerprint randomization.
- Spider middleware sits between the engine and the spider's callbacks. It sees what the spider yields (items, requests) and what the spider receives (responses). Use it for: filtering output, manipulating start_requests, handling spider exceptions.
Rule of thumb: if your concern is HTTP-level (headers, proxies, retries), it's downloader middleware. If your concern is item-level or spider-output filtering, it's spider middleware.
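As a sketch of the distinction, here is one of each. The class names, the USER_AGENTS list, and the required "name" field are illustrative, not Scrapy built-ins; the hook signatures (process_request, process_spider_output) are the real ones. Both classes would be enabled through the DOWNLOADER_MIDDLEWARES and SPIDER_MIDDLEWARES settings, shown later in this section.

import random

class RotateUserAgentMiddleware:
    # Downloader middleware: an HTTP-level concern, runs before the request hits the network.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    def process_request(self, request, spider):
        # Mutate the outgoing request in place; returning None lets it continue downstream.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None

class DropUnnamedItemsMiddleware:
    # Spider middleware: filters what the spider yields before it reaches the engine.
    def process_spider_output(self, response, result, spider):
        for obj in result:
            # Requests pass through untouched; dict items missing "name" are dropped.
            if isinstance(obj, dict) and not obj.get("name"):
                spider.logger.debug("dropping unnamed item from %s", response.url)
                continue
            yield obj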
Where to extend for common features
| You want to... | Extend... |
|---|---|
| Rotate User-Agent | Downloader middleware |
| Inject proxies | Downloader middleware |
| Add retry on a custom 429 pattern | Downloader middleware |
| Cache responses to disk | Downloader middleware (already built-in) |
| Skip items missing a required field | Item pipeline |
| Deduplicate items by SKU | Item pipeline |
| Write items to Postgres | Item pipeline |
| Filter start_urls based on a DB | Spider middleware (or override start_requests) |
| Add per-spider stats | Extension (signals) |
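For the two "Item pipeline" rows in the middle of the table, a sketch of what they might look like. The field names (list_price, sku) and class names are illustrative; DropItem and the process_item/open_spider hooks are Scrapy's.

from scrapy.exceptions import DropItem

class RequiredFieldsPipeline:
    # Skip items missing a required field.
    def process_item(self, item, spider):
        if not item.get("list_price"):
            raise DropItem("missing list_price")
        return item

class DedupeBySkuPipeline:
    # Deduplicate items by SKU within one crawl.
    def open_spider(self, spider):
        self.seen_skus = set()

    def process_item(self, item, spider):
        sku = item.get("sku")
        if sku in self.seen_skus:
            raise DropItem(f"duplicate sku: {sku}")
        self.seen_skus.add(sku)
        return item

Raising DropItem stops the item; returning it passes it to the next pipeline in ITEM_PIPELINES order.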
Items, requests, and the things spiders yield
A spider callback returns an iterable. Each yielded value is one of:
- A scrapy.Request, which goes to the scheduler for fetching.
- A scrapy.Item, which goes to the item pipelines.
- A dict, treated the same as an Item.
- None, which is ignored.
Mixed yields are normal. A product listing page yields product detail requests, a "next page" request, and possibly summary items.
def parse_listing(self, response):
    for card in response.css(".product-card"):
        yield {  # summary item, goes to the pipelines
            "url": card.css("a::attr(href)").get(),
            "list_price": card.css(".price::text").get(),
        }
        yield response.follow(  # detail request, goes to the scheduler
            card.css("a::attr(href)").get(), self.parse_detail
        )
    if next_page := response.css("a.next::attr(href)").get():
        yield response.follow(next_page, self.parse_listing)
The engine sorts them out.
The settings.py file is configuration for this architecture
When you set DOWNLOADER_MIDDLEWARES, you're injecting code into steps 6 and 8 of the lifecycle above. When you set ITEM_PIPELINES, you're configuring step 13. CONCURRENT_REQUESTS controls how aggressively the engine pulls from the scheduler. Every setting maps to one of the six components.
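A sketch of that configuration, reusing the hypothetical module paths from the earlier examples. The numbers are ordering priorities: for middlewares, lower numbers sit closer to the engine; for pipelines, lower numbers run first.

# settings.py (module paths are hypothetical)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotateUserAgentMiddleware": 400,   # lifecycle steps 6 and 8
}

SPIDER_MIDDLEWARES = {
    "myproject.middlewares.DropUnnamedItemsMiddleware": 500,  # steps 2, 10, and 12
}

ITEM_PIPELINES = {
    "myproject.pipelines.RequiredFieldsPipeline": 300,        # step 13: runs first
    "myproject.pipelines.DedupeBySkuPipeline": 400,
}

CONCURRENT_REQUESTS = 16  # how many requests the engine keeps in flight at once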
Signals, the back channel
Signals are the way to react to engine events without subclassing anything: spider_opened, item_scraped, response_received, spider_closed. Register a handler in an extension and Scrapy calls you. This is how you build custom metrics, send notifications, or rotate state.
from scrapy import signals

class MyExtension:
    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler when building the extension; hook into signals here.
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        spider.logger.info(f"closed: {reason}")
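Scrapy only instantiates the extension if it is registered in the EXTENSIONS setting (the module path here is again hypothetical):

EXTENSIONS = {
    "myproject.extensions.MyExtension": 500,
}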
Hands-on lab
In a Scrapy project, set LOG_LEVEL = "DEBUG" and run a small crawl. Watch the log lines:
- Crawled (200) <GET ...>: the downloader returned a response.
- Filtered duplicate request: the dupefilter caught a repeat.
- Scraped from <200 ...>: an item left the spider, headed to the pipelines.
Each log line corresponds to a transition between two of the six components above. When you can read the log as "engine just moved X from scheduler to downloader" you've internalized the architecture.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.