CrawlSpider, SitemapSpider, and Other Specialized Spiders
Scrapy ships specialized spider classes for whole-site crawls, sitemap traversal, and CSV/XML feed parsing. Knowing which to pick saves dozens of lines.
What you’ll learn
- Use CrawlSpider rules to traverse an entire site without manual link-following.
- Use SitemapSpider to crawl from sitemap.xml entries.
- Recognize when these helpers help and when plain Spider is cleaner.
Plain scrapy.Spider is the manual transmission. Specialized subclasses are the automatic. Each is a good fit for a specific crawl shape.
The four built-in spider classes
| Class | What it's for |
|---|---|
| `scrapy.Spider` | Generic, you control link-following entirely |
| `CrawlSpider` | Whole-site crawls via declarative rules |
| `SitemapSpider` | Crawl from sitemap.xml entries |
| `XMLFeedSpider` / `CSVFeedSpider` | Parse XML or CSV feeds directly |
CrawlSpider, rule-based traversal
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CatalogCrawler(CrawlSpider):
    name = "catalog_crawler"
    allowed_domains = ["practice.scrapingcentral.com"]
    start_urls = ["https://practice.scrapingcentral.com/products"]

    rules = (
        # Follow listing pagination, don't extract items from list pages
        Rule(LinkExtractor(restrict_css="a.next"), follow=True),
        # Follow links to product detail pages, parse with parse_product
        Rule(
            LinkExtractor(restrict_css=".product-card a"),
            callback="parse_product",
            follow=False,
        ),
    )

    def parse_product(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }
```
That's a complete pagination + detail crawler in roughly 20 lines. Behind the scenes, CrawlSpider:
- Fetches start_urls.
- Runs every Rule's LinkExtractor on the response.
- For each matching link, schedules a request. If the Rule has a callback, the response is parsed by it; if `follow=True`, the rules are applied to that response in turn (a single Rule can do both).
Rule parameters worth knowing
- `link_extractor`: usually a `LinkExtractor` instance. Filter by `allow`, `deny`, `restrict_css`, `restrict_xpaths`, `tags`, `attrs`.
- `callback`: string name of the parser method. Missing means "just follow, don't extract."
- `follow`: whether to also follow links found in the response from this rule.
- `process_request`: modify the request before scheduling.
- `process_links`: filter the extracted links.
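A minimal sketch of the last two hooks, reusing the practice site from above; the discontinued-link filter and the meta tagging are illustrative assumptions, and `process_request` uses the two-argument signature from Scrapy 2.0+:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TunedCrawler(CrawlSpider):
    name = "tuned_crawler"
    allowed_domains = ["practice.scrapingcentral.com"]
    start_urls = ["https://practice.scrapingcentral.com/products"]

    rules = (
        Rule(
            LinkExtractor(restrict_css=".product-card a"),
            callback="parse_product",
            process_links="drop_discontinued",  # filter the extracted links
            process_request="tag_request",      # modify requests before scheduling
        ),
    )

    def drop_discontinued(self, links):
        # process_links receives the extracted Link objects; return the keepers
        return [link for link in links if "/discontinued/" not in link.url]

    def tag_request(self, request, response):
        # process_request receives each request plus the response it came from
        return request.replace(meta={**request.meta, "source_page": response.url})

    def parse_product(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```

Both hooks accept a callable or, as here, the string name of a spider method.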
When CrawlSpider is wrong
For complex flows (multi-step logins, AJAX-driven pagination, conditional traversal based on item content), CrawlSpider's declarative model fights you. Drop back to scrapy.Spider and yield requests manually.
Rule of thumb: if you can describe the crawl as "follow link patterns A and B, parse C with callback X," CrawlSpider wins. If there's state or conditional logic, plain Spider wins.
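A minimal sketch of that manual style, with per-item conditional traversal that a Rule can't express; the stock check, selectors, and URL are assumptions for illustration:

```python
import scrapy


class ConditionalSpider(scrapy.Spider):
    name = "conditional"
    start_urls = ["https://practice.scrapingcentral.com/products"]

    def parse(self, response):
        for card in response.css(".product-card"):
            # Decide per item whether to crawl deeper; a declarative Rule can't do this
            if card.css(".stock::text").get() == "In stock":
                yield response.follow(
                    card.css("a::attr(href)").get(), callback=self.parse_product
                )
        # Pagination, followed manually
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```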
SitemapSpider, start from sitemap.xml
```python
from scrapy.spiders import SitemapSpider


class CatalogSitemapSpider(SitemapSpider):
    name = "sitemap_crawler"
    sitemap_urls = ["https://practice.scrapingcentral.com/sitemap.xml"]
    sitemap_rules = [
        (r"/products/", "parse_product"),
        (r"/category/", "parse_category"),
    ]

    def parse_product(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}

    def parse_category(self, response):
        ...
```
SitemapSpider reads sitemap.xml, follows sitemap-index references (nested sitemaps), and dispatches each URL to the first callback whose regex matches.
Why use it:
- Skips link discovery entirely, the sitemap is the URL list.
- Far fewer requests than crawling the site.
- Easier to parallelize and resume.
Practical pattern: when a site publishes a real sitemap, prefer SitemapSpider over CrawlSpider. The sitemap is the site owner saying "here are the URLs I want indexed." Use it.
Sitemap with filters
```python
sitemap_follow = [r"/products-"]  # only follow sub-sitemaps matching these patterns
sitemap_alternate_links = True    # also crawl alternate-language versions
```
If the sitemap is gzip-compressed (.xml.gz), SitemapSpider decompresses it automatically.
XMLFeedSpider, parse XML feeds
For RSS, Atom, or product XML feeds:
```python
from scrapy.spiders import XMLFeedSpider


class FeedSpider(XMLFeedSpider):
    name = "feed"
    start_urls = ["https://example.com/feed.xml"]
    iterator = "iternodes"  # or 'xml', 'html'
    itertag = "item"

    def parse_node(self, response, node):
        yield {
            "title": node.xpath("title/text()").get(),
            "url": node.xpath("link/text()").get(),
        }
```
Less common in modern scraping (most feeds have moved to JSON APIs), but useful for legacy targets.
CSVFeedSpider, parse CSV feeds
```python
from scrapy.spiders import CSVFeedSpider


class CSVSpider(CSVFeedSpider):
    name = "csv"
    start_urls = ["https://example.com/data.csv"]
    delimiter = ","
    headers = ["sku", "title", "price"]

    def parse_row(self, response, row):
        yield {"sku": row["sku"], "title": row["title"], "price": float(row["price"])}
```
Niche, but if a target hands you a CSV, why parse HTML?
Mixing specialized spiders with middleware
Specialized spiders are just subclasses of Spider. All your middleware, pipelines, and settings work identically. Proxy injection still happens, dedup still runs, item loaders still apply.
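For instance, per-spider settings attach to a CrawlSpider exactly as they would to a plain Spider; the delay and throttle values below are illustrative, not recommendations:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class PoliteCrawler(CrawlSpider):
    name = "polite_crawler"
    start_urls = ["https://practice.scrapingcentral.com/products"]

    # custom_settings works on any Spider subclass, specialized or not
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,
        "AUTOTHROTTLE_ENABLED": True,
    }

    rules = (
        Rule(LinkExtractor(restrict_css=".product-card a"), callback="parse_product"),
    )

    def parse_product(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```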
Anti-patterns
- Using CrawlSpider for a flat list page. Just use Spider with `response.follow_all`; CrawlSpider is overkill for a single rule (see the sketch after this list).
- Using SitemapSpider when the sitemap is incomplete. If the sitemap covers 30% of the catalog, you'll miss the other 70% silently. Verify coverage before trusting it.
- Conditional logic inside Rules. If a Rule callback needs to inspect the page to decide whether to crawl further, you've outgrown CrawlSpider's declarative model.
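A sketch of the flat-list case, assuming the same practice site and selectors as the catalog example above:

```python
import scrapy


class FlatListSpider(scrapy.Spider):
    name = "flat_list"
    start_urls = ["https://practice.scrapingcentral.com/products"]

    def parse(self, response):
        # One call replaces a single-Rule CrawlSpider
        yield from response.follow_all(
            css=".product-card a", callback=self.parse_product
        )

    def parse_product(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```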
Hands-on lab
Against Catalog108, our first-party scraping sandbox:
- Write a `SitemapSpider` that crawls only `/products/{slug}` URLs from `/sitemap.xml`. Count items.
- Write a `CrawlSpider` that traverses `/products` listing pages by following `a.next`, then parses each product card. Count items.
- Compare: which crawl visits more pages? Which finishes faster? Sitemap-based is usually 5–10x cheaper because it skips listing pages entirely.