Why Scrapy Beats Hand-Rolled Scripts
When a scraper outgrows a single file, Scrapy gives you the architecture for free. The case for adopting a framework, and when not to.
What you’ll learn
- Name three production concerns Scrapy handles that hand-rolled `requests` scripts don't.
- Compare a 50-line `requests` scraper with the equivalent Scrapy project structure.
- Decide between Scrapy, plain `requests`, and `httpx` async for a given scale.
You've spent four sub-paths writing scrapers with requests, BeautifulSoup, Playwright, and raw API calls. They work. So why does most production scraping happen inside Scrapy?
Because at scale, the things that break aren't the parsing; they're concurrency, retries, throttling, deduplication, and graceful shutdown. A framework that solves those problems once, and well, beats a thousand bespoke scripts that each solve them badly.
The pain Scrapy was built to remove
Imagine you've shipped a requests-based scraper. It runs nightly, hits 50k product pages. Six weeks in, the requirements list looks like this:
- Run 20 requests in parallel, but no more than 4 to any single host.
- Retry transient 5xx errors up to 3 times with exponential backoff.
- Honour `robots.txt`.
- Rotate User-Agent on every request.
- Skip URLs you've already scraped (a persistent fingerprint queue).
- Validate every item against a schema. Drop invalid items, log why.
- Write results to Postgres and a CSV simultaneously.
- Stop cleanly on Ctrl+C, finishing in-flight requests but draining the queue.
- Resume from where you left off after a crash.
- Expose live metrics on a `/stats` endpoint.
Every one of those is a feature you'd otherwise hand-build, test, and maintain. Scrapy ships all of them as configuration or one-line middleware additions.
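To make the cost concrete, here's roughly what just the retry requirement looks like when you hand-roll it around `requests` (a minimal sketch; `get_with_retries` and its parameters are illustrative, not part of any library):
# Hand-rolled retry with exponential backoff -- one bullet from the list above,
# and you still own the testing and maintenance.
import time
import requests

def get_with_retries(url, retries=3, backoff=1.0, timeout=10):
    for attempt in range(retries + 1):
        try:
            resp = requests.get(url, timeout=timeout)
            if resp.status_code >= 500 and attempt < retries:
                time.sleep(backoff * 2 ** attempt)  # 1s, 2s, 4s, ...
                continue
            resp.raise_for_status()
            return resp
        except (requests.ConnectionError, requests.Timeout):
            if attempt == retries:
                raise
            time.sleep(backoff * 2 ** attempt)
Multiply that by the other nine bullets and the maintenance bill writes itself.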
The architectural gap
Here's the same problem, side by side.
Hand-rolled
import requests
from bs4 import BeautifulSoup

seen = set()
for url in urls:
    if url in seen:
        continue
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
    except requests.RequestException:
        continue
    soup = BeautifulSoup(r.text, "lxml")
    item = {"title": soup.select_one("h1").text}
    seen.add(url)
    print(item)
This is fine for 100 URLs. It is not fine for 50k. There is no concurrency, no retry policy, no per-host throttling, no persistent dedup, no schema validation, no clean output pipeline. Add those one by one and you've reinvented Scrapy badly.
Scrapy
# spider.py
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://practice.scrapingcentral.com/products"]

    def parse(self, response):
        for href in response.css("a.product-card::attr(href)").getall():
            yield response.follow(href, self.parse_product)
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_product(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }
That's the whole spider. Concurrency, retries, throttling, deduplication, JSON/CSV output, `robots.txt`, logs, and stats are configured in `settings.py`:
# settings.py
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.25
AUTOTHROTTLE_ENABLED = True
RETRY_TIMES = 3
ROBOTSTXT_OBEY = True
HTTPCACHE_ENABLED = True
FEEDS = {"products.jsonl": {"format": "jsonlines"}}
A few lines of settings turn the spider from "works on my laptop" to "production-grade." No framework gives you something for nothing, but the trade Scrapy asks (learn its conventions) returns more leverage than almost anything else in the scraping world.
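With those settings in place, running the spider, and resuming it after a crash, is one command each (the JOBDIR path below is just an example):
scrapy crawl products
scrapy crawl products -s JOBDIR=crawls/products-run1
The second form persists the scheduler queue and dupefilter to disk, so a Ctrl+C or crash picks up where it left off.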
What Scrapy actually gives you
| Concern | Scrapy's answer |
|---|---|
| Concurrency | Twisted reactor; configurable parallelism per spider, per domain, per IP |
| Politeness | DOWNLOAD_DELAY, AUTOTHROTTLE_ENABLED, and ROBOTSTXT_OBEY, all plain settings |
| Retries | RetryMiddleware, configurable codes and backoff |
| Caching | HttpCacheMiddleware for offline development |
| Throttling | AutoThrottle adapts request rate to response latency |
| Items | Typed Item classes or simple dicts; ItemLoader for normalization |
| Pipelines | Chain processors: dedup → validate → enrich → store (sketched below) |
| Output | Built-in feed exporters for JSON, JSONL, CSV, XML, S3, GCS |
| Resume | A persistent JOBDIR checkpoints the scheduler and dupefilter to disk |
| Observability | Built-in stats collector, Telnet console, Scrapyd UI |
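The pipelines row deserves a concrete example. A minimal validation stage might look like this (a sketch assuming the dict items yielded by the spider above; the class name is illustrative):
# pipelines.py -- drop items without a price, pass the rest down the chain
from scrapy.exceptions import DropItem

class ValidatePricePipeline:
    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem(f"missing price: {item.get('url')}")
        item["price"] = item["price"].strip()
        return item
Enable it with `ITEM_PIPELINES = {"catalog108.pipelines.ValidatePricePipeline": 300}` in `settings.py` (the module path assumes the `catalog108` project created below); lower numbers run earlier, so dedup and validation typically sit in front of storage.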
When NOT to use Scrapy
Scrapy isn't always the answer:
- One-off scripts. If you'll only run it a couple of times, `requests` plus a script is fine.
- Pure browser automation. If 100% of pages need real JavaScript, Playwright is more direct (though `scrapy-playwright` bridges the gap; we cover it in §4.7).
- Pure API hammering. If you're just hitting JSON endpoints at high concurrency with no HTML parsing, `httpx` + `asyncio` (covered in §4.21) is leaner; see the sketch after this list.
- You hate Twisted. Scrapy's async model is older than `asyncio` and shows it in places. The friction is real, but small compared to the wins.
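For that pure-API case, the leaner alternative looks roughly like this (a sketch; the endpoint URLs and the `fetch_all` helper are illustrative):
# Pure JSON hammering with httpx + asyncio -- no HTML parsing, no framework
import asyncio
import httpx

async def fetch_all(urls, concurrency=20):
    sem = asyncio.Semaphore(concurrency)  # cap in-flight requests

    async def fetch(client, url):
        async with sem:
            resp = await client.get(url)
            resp.raise_for_status()
            return resp.json()

    async with httpx.AsyncClient(timeout=10) as client:
        return await asyncio.gather(*(fetch(client, u) for u in urls))

# results = asyncio.run(fetch_all([f"https://api.example.com/items/{i}" for i in range(100)]))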
What we'll build through §4.1–§4.7
By the end of the Scrapy lessons you'll have:
- A multi-spider project against Catalog108's `/products` catalog.
- Item pipelines that validate, deduplicate, and store to Postgres.
- Custom middleware for proxies, headers, and cookie sessions.
- A `scrapy-playwright` spider for the SPA challenge at `/challenges/dynamic/spa-pure`.
Then in §4.8–§4.20 we'll cover the equivalent in PHP (Symfony, Roach, Goutte), because production scraping in 2026 is multi-language.
What to try
Install Scrapy (`pip install scrapy`) and run:
scrapy startproject catalog108
cd catalog108
scrapy genspider products practice.scrapingcentral.com
Open the generated `spiders/products.py`. Notice how much is there already: middlewares, settings, items, pipelines, all stubbed in. The "blank project" already encodes more architectural wisdom than most hand-rolled scrapers ever acquire.
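For reference, the generated stub looks roughly like this (the exact template varies by Scrapy version):
# spiders/products.py, as generated by `scrapy genspider`
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["practice.scrapingcentral.com"]
    start_urls = ["https://practice.scrapingcentral.com"]

    def parse(self, response):
        pass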
Hands-on lab
Practice this lesson on Catalog108, our first-party scraping sandbox.
Open lab target → `/products`
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you'll see the explanation right after.