
§4.1 · Intermediate · 4 min read

Why Scrapy Beats Hand-Rolled Scripts

When a scraper outgrows a single file, Scrapy gives you the architecture for free. The case for adopting a framework, and when not to.

What you’ll learn

  • Name three production concerns Scrapy handles that hand-rolled `requests` scripts don't.
  • Compare a 50-line `requests` scraper with the equivalent Scrapy project structure.
  • Decide between Scrapy, plain `requests`, and `httpx` async for a given scale.

You've spent four sub-paths writing scrapers with requests, BeautifulSoup, Playwright, and raw API calls. They work. So why does most production scraping happen inside Scrapy?

Because at scale, the things that break aren't the parsing: they're concurrency, retries, throttling, deduplication, and graceful shutdown. A framework that solves each of those once, and solves it well, beats a thousand bespoke scripts that each solve them badly.

The pain Scrapy was built to remove

Imagine you've shipped a requests-based scraper. It runs nightly, hits 50k product pages. Six weeks in, the requirements list looks like this:

  • Run 20 requests in parallel, but no more than 4 to any single host.
  • Retry transient 5xx errors up to 3 times with exponential backoff.
  • Honour robots.txt.
  • Rotate User-Agent on every request.
  • Skip URLs you've already scraped (a persistent fingerprint queue).
  • Validate every item against a schema. Drop invalid items, log why.
  • Write results to Postgres and a CSV simultaneously.
  • Stop cleanly on Ctrl+C, finishing in-flight requests but draining the queue.
  • Resume from where you left off after a crash.
  • Expose live metrics on a /stats endpoint.

Every one of those is a feature you'd otherwise hand-build, test, and maintain. Scrapy ships all of them as configuration or one-line middleware additions.
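To make one wish-list item concrete, here is a sketch of the "validate every item, drop invalid ones, log why" requirement as a Scrapy-style item pipeline. In a real project you would raise `scrapy.exceptions.DropItem`; a local stand-in exception is defined here so the sketch runs without Scrapy installed, and the field names are illustrative.

```python
import logging

class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem (used so the sketch is self-contained)."""

class ValidationPipeline:
    """Drop items missing required fields; Scrapy calls process_item for every yielded item."""

    REQUIRED = ("url", "title", "price")

    def process_item(self, item, spider=None):
        missing = [f for f in self.REQUIRED if not item.get(f)]
        if missing:
            logging.warning("Dropping %s: missing %s", item.get("url"), missing)
            raise DropItem(f"missing fields: {missing}")
        return item
```

In a real project you would enable it with one settings entry, e.g. `ITEM_PIPELINES = {"myproject.pipelines.ValidationPipeline": 300}`, where the lower the number, the earlier it runs in the chain.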

The architectural gap

Here's the same problem, side by side.

Hand-rolled

import requests
from bs4 import BeautifulSoup

seen = set()
for url in urls:
    if url in seen:
        continue
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
    except requests.RequestException:
        continue
    soup = BeautifulSoup(r.text, "lxml")
    item = {"title": soup.select_one("h1").text}
    seen.add(url)
    print(item)

This is fine for 100 URLs. It is not fine for 50k. There is no concurrency, no retry policy, no per-host throttling, no persistent dedup, no schema validation, no clean output pipeline. Add those one by one and you've reinvented Scrapy badly.
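To see what "reinventing Scrapy badly" costs, here is just one of those missing pieces, retries with exponential backoff, built by hand. The `fetch` parameter is an injected callable (in a `requests` script it would wrap `requests.get`); this one function is still only a fraction of what Scrapy's RetryMiddleware handles for free.

```python
import time

def fetch_with_retry(url, fetch, retries=3, backoff=0.5):
    """Retry transient failures with exponential backoff.

    `fetch` is any callable that returns a response or raises on failure.
    Even this toy version needs decisions Scrapy has already made:
    which errors are transient, how long to wait, when to give up.
    """
    delay = backoff
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            time.sleep(delay)
            delay *= 2  # exponential backoff: 0.5s, 1s, 2s, ...
```

And that's one bullet from the list. Multiply by ten and you have a maintenance burden, not a scraper.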

Scrapy

# spider.py
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://practice.scrapingcentral.com/products"]

    def parse(self, response):
        for href in response.css("a.product-card::attr(href)").getall():
            yield response.follow(href, self.parse_product)
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_product(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }

That's the whole spider. Concurrency, retries, throttling, deduplication, JSON/CSV output, robots.txt, logs, and stats are configured in settings.py:

# settings.py
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.25
AUTOTHROTTLE_ENABLED = True
RETRY_TIMES = 3
ROBOTSTXT_OBEY = True
HTTPCACHE_ENABLED = True
FEEDS = {"products.jsonl": {"format": "jsonlines"}}

Eight lines of settings turn the spider from "works on my laptop" to "production-grade." No framework gives you something for nothing, but the trade Scrapy asks (learn its conventions) returns more leverage than almost anything else in the scraping world.
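The "write to Postgres and a CSV simultaneously" requirement from the earlier list splits the same way: feed exporters handle file outputs, so one dict entry per feed covers the CSV side. A sketch of what that looks like in `settings.py` (Postgres still needs a custom item pipeline, since feeds only cover file-like exports):

```python
# settings.py — one entry per output; Scrapy writes all feeds in the same run
FEEDS = {
    "products.jsonl": {"format": "jsonlines", "encoding": "utf-8"},
    "products.csv": {"format": "csv", "encoding": "utf-8"},
}
```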

What Scrapy actually gives you

  • Concurrency: Twisted reactor; configurable parallelism per spider, per domain, and per IP.
  • Politeness: `DOWNLOAD_DELAY`, `AUTOTHROTTLE_ENABLED`, `ROBOTSTXT_OBEY`, all plain settings.
  • Retries: RetryMiddleware with configurable status codes and backoff.
  • Caching: HttpCacheMiddleware for offline development.
  • Throttling: AutoThrottle adapts the request rate to response latency.
  • Items: typed Item classes or simple dicts; ItemLoader for normalization.
  • Pipelines: chained processors: dedup → validate → enrich → store.
  • Output: built-in feed exporters for JSON, JSONL, CSV, XML, S3, GCS.
  • Resume: a persistent `JOBDIR` checkpoints the scheduler and dupefilter to disk.
  • Observability: built-in stats collector, Telnet console, Scrapyd UI.
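The resume entry above is worth seeing, because it maps to a single flag rather than any code. Pointing a crawl at a job directory persists the scheduler queue and dupefilter to disk:

```shell
# First run: checkpoint state into crawls/products-run1.
# Re-running the exact same command after a crash or Ctrl+C
# resumes from where the previous run stopped.
scrapy crawl products -s JOBDIR=crawls/products-run1
```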

When NOT to use Scrapy

Scrapy isn't always the answer:

  • One-off scripts. If you'll run it twice, requests + a script is fine.
  • Pure browser automation. If 100% of pages need real JavaScript, Playwright is more direct (though scrapy-playwright bridges the gap; we cover it in §4.7).
  • Pure API hammering. If you're just hitting JSON endpoints at high concurrency with no HTML parsing, httpx + asyncio (covered in §4.21) is leaner.
  • You hate Twisted. Scrapy's async model is older than asyncio and shows it in places. The friction is real, but small compared to the wins.
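For the "pure API hammering" case, the core pattern is small enough to sketch with stdlib `asyncio` alone: a semaphore caps how many requests are in flight, which is the moral equivalent of Scrapy's `CONCURRENT_REQUESTS`. The `fetch` callable is injected here to keep the sketch self-contained; in a real script it would be a method of an `httpx.AsyncClient`.

```python
import asyncio

async def gather_limited(urls, fetch, limit=20):
    """Fetch all URLs with at most `limit` requests in flight at once.

    `fetch` is any async callable taking a URL; with httpx you would
    pass a wrapper around client.get from an httpx.AsyncClient.
    """
    sem = asyncio.Semaphore(limit)

    async def one(url):
        async with sem:  # blocks while `limit` fetches are already running
            return await fetch(url)

    # gather preserves input order in its results
    return await asyncio.gather(*(one(u) for u in urls))
```

Note what's missing compared to Scrapy: no per-host limits, no retries, no dedup. For JSON endpoints that's often fine; for HTML crawls it's the start of the wish list again.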

What we'll build through §4.1–§4.7

By the end of the Scrapy lessons you'll have:

  • A multi-spider project against Catalog108 /products.
  • Item pipelines that validate, deduplicate, and store to Postgres.
  • Custom middleware for proxies, headers, and cookie sessions.
  • A scrapy-playwright spider for the SPA challenge at /challenges/dynamic/spa-pure.

Then in §4.8–§4.20 we'll cover the PHP equivalents (Symfony components, Roach, Goutte), because production scraping in 2026 is multi-language.

What to try

Install Scrapy (`pip install scrapy`) and run:

scrapy startproject catalog108
cd catalog108
scrapy genspider products practice.scrapingcentral.com

Open the generated spiders/products.py. Notice how much is there already: middlewares, settings, items, pipelines, all stubbed in. The "blank project" already encodes more architectural wisdom than most hand-rolled scrapers ever acquire.

Hands-on lab

Practice this lesson on Catalog108, our first-party scraping sandbox.

Open lab target → /products
