Scraping Central is reader-supported. When you buy through links on our site, we may earn an affiliate commission.


CrawlSpider, SitemapSpider, and Other Specialized Spiders

Scrapy ships specialized spider classes for whole-site crawls, sitemap traversal, and CSV/XML feed parsing. Knowing which to pick saves dozens of lines.

What you’ll learn

  • Use CrawlSpider rules to traverse an entire site without manual link-following.
  • Use SitemapSpider to crawl from sitemap.xml entries.
  • Recognize when these helpers help and when plain Spider is cleaner.

Plain scrapy.Spider is the manual transmission. Specialized subclasses are the automatic. Each is a good fit for a specific crawl shape.

The four built-in spider classes

  • scrapy.Spider: generic; you control link-following entirely
  • CrawlSpider: whole-site crawls via declarative rules
  • SitemapSpider: crawl from sitemap.xml entries
  • XMLFeedSpider / CSVFeedSpider: parse XML or CSV feeds directly

CrawlSpider, rule-based traversal

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CatalogCrawler(CrawlSpider):
    name = "catalog_crawler"
    allowed_domains = ["practice.scrapingcentral.com"]
    start_urls = ["https://practice.scrapingcentral.com/products"]

    rules = (
        # Follow listing pagination, don't extract items from list pages
        Rule(LinkExtractor(restrict_css="a.next"), follow=True),
        # Follow links to product detail pages, parse with parse_product
        Rule(
            LinkExtractor(restrict_css=".product-card a"),
            callback="parse_product",
            follow=False,
        ),
    )

    def parse_product(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }

That's a complete pagination + detail crawler in under 20 lines. Behind the scenes, CrawlSpider:

  1. Fetches start_urls.
  2. Runs every Rule's LinkExtractor on the response.
  3. For each matching link, schedules a request. The Rule's callback (if any) parses the response, and if follow=True the response's links are extracted recursively; a Rule can do both.
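The steps above can be sketched without Scrapy at all. This is a dependency-free illustration of the dispatch loop, not CrawlSpider's actual internals; FakeRule and dispatch are made-up names, and real LinkExtractors do far more than a single regex. One detail it does mirror: the first rule that matches a link claims it.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeRule:
    allow: str                # regex a link URL must match
    callback: Optional[str]   # parser method name, or None for "just follow"
    follow: bool              # recurse into pages matched by this rule?

def dispatch(links, rules):
    """Pair each link with the action a CrawlSpider-style loop would take."""
    scheduled = []
    for url in links:
        for rule in rules:
            if re.search(rule.allow, url):
                action = rule.callback or ("follow" if rule.follow else "skip")
                scheduled.append((url, action))
                break  # first matching rule wins, as in CrawlSpider
    return scheduled

rules = [
    FakeRule(allow=r"\?page=", callback=None, follow=True),
    FakeRule(allow=r"/products/", callback="parse_product", follow=False),
]
links = [
    "https://practice.scrapingcentral.com/products?page=2",
    "https://practice.scrapingcentral.com/products/blue-mug",
]
print(dispatch(links, rules))
```

The pagination link gets followed with no callback, while the detail link is routed to parse_product, matching the two Rules in the spider above.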

Rule parameters worth knowing

  • link_extractor: usually a LinkExtractor instance. Filter by allow, deny, restrict_css, restrict_xpaths, tags, attrs.
  • callback: string name of the parser method. Omitted means "just follow, don't extract."
  • follow: whether to also follow links found in the responses this rule matched. Defaults to True when callback is omitted, False otherwise.
  • process_request: modify each request before it's scheduled.
  • process_links: filter or transform the extracted links.

When CrawlSpider is wrong

For complex flows (multi-step logins, AJAX-driven pagination, conditional traversal based on item content), CrawlSpider's declarative model fights you. Drop back to scrapy.Spider and yield requests manually.

Rule of thumb: if you can describe the crawl as "follow link patterns A and B, parse C with callback X," CrawlSpider wins. If there's state or conditional logic, plain Spider wins.

SitemapSpider, start from sitemap.xml

from scrapy.spiders import SitemapSpider

class CatalogSitemapSpider(SitemapSpider):
    name = "sitemap_crawler"
    sitemap_urls = ["https://practice.scrapingcentral.com/sitemap.xml"]
    sitemap_rules = [
        (r"/products/", "parse_product"),
        (r"/category/", "parse_category"),
    ]

    def parse_product(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}

    def parse_category(self, response):
        ...

SitemapSpider reads sitemap.xml, follows sitemap-index references (nested sitemaps), and dispatches each URL to the callback whose regex matches.
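The matching itself is plain regex search. Here's a rough, stdlib-only illustration of the dispatch (pick_callback is a made-up name, and URLs matching no rule are simply skipped):

```python
import re

sitemap_rules = [
    (r"/products/", "parse_product"),
    (r"/category/", "parse_category"),
]

def pick_callback(url, rules):
    """Return the callback name for the first pattern that matches the URL."""
    for pattern, callback in rules:
        if re.search(pattern, url):
            return callback
    return None  # no rule matched: the URL is not requested at all

print(pick_callback("https://practice.scrapingcentral.com/products/blue-mug", sitemap_rules))
print(pick_callback("https://practice.scrapingcentral.com/about", sitemap_rules))
```

Because rules are tried in order, put the more specific pattern first if two patterns can match the same URL.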

Why use it:

  • Skips link discovery entirely: the sitemap is the URL list.
  • Far fewer requests than crawling the site.
  • Easier to parallelize and resume.

Practical pattern: when a site publishes a real sitemap, prefer SitemapSpider over CrawlSpider. The sitemap is the site owner saying "here are the URLs I want indexed." Use it.

Sitemap with filters

sitemap_follow = [r"/products-"]  # only follow sub-sitemaps whose URL matches
sitemap_alternate_links = True  # also crawl alternate-language (hreflang) versions

If the sitemap is gzip encoded (.xml.gz), SitemapSpider decompresses automatically.

XMLFeedSpider, parse XML feeds

For RSS, Atom, or product XML feeds:

from scrapy.spiders import XMLFeedSpider

class FeedSpider(XMLFeedSpider):
    name = "feed"
    start_urls = ["https://example.com/feed.xml"]
    iterator = "iternodes"  # or "xml", "html"
    itertag = "item"

    def parse_node(self, response, node):
        yield {
            "title": node.xpath("title/text()").get(),
            "url": node.xpath("link/text()").get(),
        }

Less common in modern scraping (most feeds have moved to JSON APIs), but useful for legacy targets.

CSVFeedSpider, parse CSV feeds

from scrapy.spiders import CSVFeedSpider

class CSVSpider(CSVFeedSpider):
    name = "csv"
    start_urls = ["https://example.com/data.csv"]
    delimiter = ","
    headers = ["sku", "title", "price"]

    def parse_row(self, response, row):
        yield {"sku": row["sku"], "title": row["title"], "price": float(row["price"])}

Niche, but if a target hands you a CSV, why parse HTML?

Mixing specialized spiders with middleware

Specialized spiders are just subclasses of Spider. All your middleware, pipelines, and settings work identically. Proxy injection still happens, dedup still runs, item loaders still apply.

Anti-patterns

  • Using CrawlSpider for a flat list page. Just use Spider with response.follow_all. CrawlSpider is overkill for a single rule.
  • Using SitemapSpider when the sitemap is incomplete. If the sitemap covers 30% of the catalog, you'll miss 70% silently. Verify coverage before trusting.
  • Conditional logic inside Rules. If a Rule callback needs to inspect the page to decide whether to crawl further, you've outgrown CrawlSpider's declarative model.

Hands-on lab

Against Catalog108:

  1. Write a SitemapSpider that crawls only /products/{slug} URLs from /sitemap.xml. Count items.
  2. Write a CrawlSpider that traverses /products listing pages by following a.next, then parses each product card. Count items.
  3. Compare: which crawl visits more pages? Which finishes faster? Sitemap-based is usually 5–10x cheaper because it skips listing pages entirely.


Quiz, check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.

Question 1 of 8

A CrawlSpider Rule has `callback='parse_product'` and `follow=False`. What happens?
