Why Scrapy Beats Hand-Rolled Scripts
When a scraper outgrows a single file, Scrapy gives you the architecture for free. The case for adopting a framework, and when not to.
What you’ll learn
- Name three production concerns Scrapy handles that hand-rolled `requests` scripts don't.
- Compare a 50-line `requests` scraper with the equivalent Scrapy project structure.
- Decide between Scrapy, plain `requests`, and `httpx` async for a given scale.
You've spent four sub-paths writing scrapers with requests, BeautifulSoup, Playwright, and raw API calls. They work. So why does most production scraping happen inside Scrapy?
Because at scale, the things that break aren't the parsing; they're concurrency, retries, throttling, deduplication, and graceful shutdown. A framework that solves those problems once, and well, beats a thousand bespoke scripts that each solve them badly.
The pain Scrapy was built to remove
Imagine you've shipped a requests-based scraper. It runs nightly, hits 50k product pages. Six weeks in, the requirements list looks like this:
- Run 20 requests in parallel, but no more than 4 to any single host.
- Retry transient 5xx errors up to 3 times with exponential backoff.
- Honour `robots.txt`.
- Rotate User-Agent on every request.
- Skip URLs you've already scraped (a persistent fingerprint queue).
- Validate every item against a schema. Drop invalid items, log why.
- Write results to Postgres and a CSV simultaneously.
- Stop cleanly on Ctrl+C, finishing in-flight requests but draining the queue.
- Resume from where you left off after a crash.
- Expose live metrics on a `/stats` endpoint.
Every one of those is a feature you'd otherwise hand-build, test, and maintain. Scrapy ships all of them as configuration or one-line middleware additions.
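To make the cost concrete, here's roughly what just the retry requirement looks like when you hand-roll it around `requests` (a minimal sketch; `get_with_retries` and its parameters are illustrative, not part of any library):
# Hand-rolled retry with exponential backoff -- one bullet from the list above,
# and you still own the testing and maintenance.
import time
import requests

def get_with_retries(url, retries=3, backoff=1.0, timeout=10):
    for attempt in range(retries + 1):
        try:
            resp = requests.get(url, timeout=timeout)
            if resp.status_code >= 500 and attempt < retries:
                time.sleep(backoff * 2 ** attempt)  # 1s, 2s, 4s, ...
                continue
            resp.raise_for_status()
            return resp
        except (requests.ConnectionError, requests.Timeout):
            if attempt == retries:
                raise
            time.sleep(backoff * 2 ** attempt)
Multiply that by the other nine bullets and the maintenance bill writes itself.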
The architectural gap
Here's the same problem, side by side.
Hand-rolled
import requests
from bs4 import BeautifulSoup

seen = set()
for url in urls:
    if url in seen:
        continue
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
    except requests.RequestException:
        continue
    soup = BeautifulSoup(r.text, "lxml")
    item = {"title": soup.select_one("h1").text}
    seen.add(url)
    print(item)
This is fine for 100 URLs. It is not fine for 50k. There is no concurrency, no retry policy, no per-host throttling, no persistent dedup, no schema validation, no clean output pipeline. Add those one by one and you've reinvented Scrapy badly.
Scrapy
# spider.py
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://practice.scrapingcentral.com/products"]

    def parse(self, response):
        for href in response.css("a.product-card::attr(href)").getall():
            yield response.follow(href, self.parse_product)
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_product(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }
That's the whole spider. Concurrency, retries, throttling, deduplication, JSON/CSV output, `robots.txt`, logs, and stats are configured in `settings.py`:
# settings.py
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.25
AUTOTHROTTLE_ENABLED = True
RETRY_TIMES = 3
ROBOTSTXT_OBEY = True
HTTPCACHE_ENABLED = True
FEEDS = {"products.jsonl": {"format": "jsonlines"}}
A few lines of settings turn the spider from "works on my laptop" to "production-grade." No framework gives you something for nothing, but the trade Scrapy asks (learn its conventions) returns more leverage than almost anything else in the scraping world.
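With those settings in place, running the spider, and resuming it after a crash, is one command each (the JOBDIR path below is just an example):
scrapy crawl products
scrapy crawl products -s JOBDIR=crawls/products-run1
The second form persists the scheduler queue and dupefilter to disk, so a Ctrl+C or crash picks up where it left off.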
What Scrapy actually gives you
| Concern | Scrapy's answer |
|---|---|
| Concurrency | Twisted reactor; configurable parallelism per spider, per domain, per IP |
| Politeness | DOWNLOAD_DELAY, AUTOTHROTTLE_ENABLED, and ROBOTSTXT_OBEY, all plain settings |
| Retries | RetryMiddleware, configurable codes and backoff |
| Caching | HttpCacheMiddleware for offline development |
| Throttling | AutoThrottle adapts request rate to response latency |
| Items | Typed Item classes or simple dicts; ItemLoader for normalization |
| Pipelines | Chain processors: dedup → validate → enrich → store (sketched below) |
| Output | Built-in feed exporters for JSON, JSONL, CSV, XML, S3, GCS |
| Resume | A persistent JOBDIR checkpoints the scheduler and dupefilter to disk |
| Observability | Built-in stats collector, Telnet console, Scrapyd UI |
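The pipelines row deserves a concrete example. A minimal validation stage might look like this (a sketch assuming the dict items yielded by the spider above; the class name is illustrative):
# pipelines.py -- drop items without a price, pass the rest down the chain
from scrapy.exceptions import DropItem

class ValidatePricePipeline:
    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem(f"missing price: {item.get('url')}")
        item["price"] = item["price"].strip()
        return item
Enable it with `ITEM_PIPELINES = {"catalog108.pipelines.ValidatePricePipeline": 300}` in `settings.py` (the module path assumes the `catalog108` project created below); lower numbers run earlier, so dedup and validation typically sit in front of storage.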
When NOT to use Scrapy
Scrapy isn't always the answer:
- One-off scripts. If you'll only run it a couple of times, `requests` plus a script is fine.
- Pure browser automation. If 100% of pages need real JavaScript, Playwright is more direct (though `scrapy-playwright` bridges the gap; we cover it in §4.7).
- Pure API hammering. If you're just hitting JSON endpoints at high concurrency with no HTML parsing, `httpx` + `asyncio` (covered in §4.21) is leaner; see the sketch after this list.
- You hate Twisted. Scrapy's async model is older than `asyncio` and shows it in places. The friction is real, but small compared to the wins.
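For that pure-API case, the leaner alternative looks roughly like this (a sketch; the endpoint URLs and the `fetch_all` helper are illustrative):
# Pure JSON hammering with httpx + asyncio -- no HTML parsing, no framework
import asyncio
import httpx

async def fetch_all(urls, concurrency=20):
    sem = asyncio.Semaphore(concurrency)  # cap in-flight requests

    async def fetch(client, url):
        async with sem:
            resp = await client.get(url)
            resp.raise_for_status()
            return resp.json()

    async with httpx.AsyncClient(timeout=10) as client:
        return await asyncio.gather(*(fetch(client, u) for u in urls))

# results = asyncio.run(fetch_all([f"https://api.example.com/items/{i}" for i in range(100)]))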
What we'll build through §4.1–§4.7
By the end of the Scrapy lessons you'll have:
- A multi-spider project against Catalog108's `/products` catalog.
- Item pipelines that validate, deduplicate, and store to Postgres.
- Custom middleware for proxies, headers, and cookie sessions.
- A `scrapy-playwright` spider for the SPA challenge at `/challenges/dynamic/spa-pure`.
Then in §4.8–§4.20 we'll cover the equivalent in PHP (Symfony, Roach, Goutte), because production scraping in 2026 is multi-language.
What to try
Install Scrapy (`pip install scrapy`) and run:
scrapy startproject catalog108
cd catalog108
scrapy genspider products practice.scrapingcentral.com
Open the generated `spiders/products.py`. Notice how much is there already: middlewares, settings, items, pipelines, all stubbed in. The "blank project" already encodes more architectural wisdom than most hand-rolled scrapers ever acquire.
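For reference, the generated stub looks roughly like this (the exact template varies by Scrapy version):
# spiders/products.py, as generated by `scrapy genspider`
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["practice.scrapingcentral.com"]
    start_urls = ["https://practice.scrapingcentral.com"]

    def parse(self, response):
        pass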
Hands-on lab
Practice this lesson on Catalog108, our first-party scraping sandbox.
Open lab target → `/products`
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you'll see the explanation right after.