CrawlSpider, SitemapSpider, and Other Specialized Spiders
Scrapy ships specialized spider classes for whole-site crawls, sitemap traversal, and CSV/XML feed parsing. Knowing which to pick saves dozens of lines.
What you’ll learn
- Use CrawlSpider rules to traverse an entire site without manual link-following.
- Use SitemapSpider to crawl from sitemap.xml entries.
- Recognize when these helpers help and when plain Spider is cleaner.
Plain scrapy.Spider is the manual transmission. Specialized subclasses are the automatic. Each is a good fit for a specific crawl shape.
The four built-in spider classes
| Class | What it's for |
|---|---|
| `scrapy.Spider` | Generic, you control link-following entirely |
| `CrawlSpider` | Whole-site crawls via declarative rules |
| `SitemapSpider` | Crawl from sitemap.xml entries |
| `XMLFeedSpider` / `CSVFeedSpider` | Parse XML or CSV feeds directly |
CrawlSpider, rule-based traversal
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CatalogCrawler(CrawlSpider):
    name = "catalog_crawler"
    allowed_domains = ["practice.scrapingcentral.com"]
    start_urls = ["https://practice.scrapingcentral.com/products"]

    rules = (
        # Follow listing pagination, don't extract items from list pages
        Rule(LinkExtractor(restrict_css="a.next"), follow=True),
        # Follow links to product detail pages, parse with parse_product
        Rule(
            LinkExtractor(restrict_css=".product-card a"),
            callback="parse_product",
            follow=False,
        ),
    )

    def parse_product(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }
```
That's a complete pagination + detail crawler in roughly 20 lines. Behind the scenes, CrawlSpider:
- Fetches start_urls.
- Runs every Rule's LinkExtractor on the response.
- For each matching link, schedules a request. If the Rule has a callback, the response is parsed by it; if `follow=True`, the rules are applied to that response in turn (a single Rule can do both).
Rule parameters worth knowing
- `link_extractor`: usually a `LinkExtractor` instance. Filter by `allow`, `deny`, `restrict_css`, `restrict_xpaths`, `tags`, `attrs`.
- `callback`: string name of the parser method. Missing means "just follow, don't extract."
- `follow`: whether to also follow links found in the response from this rule.
- `process_request`: modify the request before scheduling.
- `process_links`: filter the extracted links.
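A minimal sketch of the last two hooks, reusing the practice site from above; the discontinued-link filter and the meta tagging are illustrative assumptions, and `process_request` uses the two-argument signature from Scrapy 2.0+:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TunedCrawler(CrawlSpider):
    name = "tuned_crawler"
    allowed_domains = ["practice.scrapingcentral.com"]
    start_urls = ["https://practice.scrapingcentral.com/products"]

    rules = (
        Rule(
            LinkExtractor(restrict_css=".product-card a"),
            callback="parse_product",
            process_links="drop_discontinued",  # filter the extracted links
            process_request="tag_request",      # modify requests before scheduling
        ),
    )

    def drop_discontinued(self, links):
        # process_links receives the extracted Link objects; return the keepers
        return [link for link in links if "/discontinued/" not in link.url]

    def tag_request(self, request, response):
        # process_request receives each request plus the response it came from
        return request.replace(meta={**request.meta, "source_page": response.url})

    def parse_product(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```

Both hooks accept a callable or, as here, the string name of a spider method.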
When CrawlSpider is wrong
For complex flows (multi-step logins, AJAX-driven pagination, conditional traversal based on item content), CrawlSpider's declarative model fights you. Drop back to scrapy.Spider and yield requests manually.
Rule of thumb: if you can describe the crawl as "follow link patterns A and B, parse C with callback X," CrawlSpider wins. If there's state or conditional logic, plain Spider wins.
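A minimal sketch of that manual style, with per-item conditional traversal that a Rule can't express; the stock check, selectors, and URL are assumptions for illustration:

```python
import scrapy


class ConditionalSpider(scrapy.Spider):
    name = "conditional"
    start_urls = ["https://practice.scrapingcentral.com/products"]

    def parse(self, response):
        for card in response.css(".product-card"):
            # Decide per item whether to crawl deeper; a declarative Rule can't do this
            if card.css(".stock::text").get() == "In stock":
                yield response.follow(
                    card.css("a::attr(href)").get(), callback=self.parse_product
                )
        # Pagination, followed manually
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```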
SitemapSpider, start from sitemap.xml
```python
from scrapy.spiders import SitemapSpider


class CatalogSitemapSpider(SitemapSpider):
    name = "sitemap_crawler"
    sitemap_urls = ["https://practice.scrapingcentral.com/sitemap.xml"]
    sitemap_rules = [
        (r"/products/", "parse_product"),
        (r"/category/", "parse_category"),
    ]

    def parse_product(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}

    def parse_category(self, response):
        ...
```
SitemapSpider reads sitemap.xml, follows sitemap-index references (nested sitemaps), and dispatches each URL to the first callback whose regex matches.
Why use it:
- Skips link discovery entirely, the sitemap is the URL list.
- Far fewer requests than crawling the site.
- Easier to parallelize and resume.
Practical pattern: when a site publishes a real sitemap, prefer SitemapSpider over CrawlSpider. The sitemap is the site owner saying "here are the URLs I want indexed." Use it.
Sitemap with filters
```python
sitemap_follow = [r"/products-"]  # only follow sub-sitemaps matching these patterns
sitemap_alternate_links = True    # also crawl alternate-language versions
```
If the sitemap is gzip-compressed (.xml.gz), SitemapSpider decompresses it automatically.
XMLFeedSpider, parse XML feeds
For RSS, Atom, or product XML feeds:
```python
from scrapy.spiders import XMLFeedSpider


class FeedSpider(XMLFeedSpider):
    name = "feed"
    start_urls = ["https://example.com/feed.xml"]
    iterator = "iternodes"  # or 'xml', 'html'
    itertag = "item"

    def parse_node(self, response, node):
        yield {
            "title": node.xpath("title/text()").get(),
            "url": node.xpath("link/text()").get(),
        }
```
Less common in modern scraping (most feeds have moved to JSON APIs), but useful for legacy targets.
CSVFeedSpider, parse CSV feeds
```python
from scrapy.spiders import CSVFeedSpider


class CSVSpider(CSVFeedSpider):
    name = "csv"
    start_urls = ["https://example.com/data.csv"]
    delimiter = ","
    headers = ["sku", "title", "price"]

    def parse_row(self, response, row):
        yield {"sku": row["sku"], "title": row["title"], "price": float(row["price"])}
```
Niche, but if a target hands you a CSV, why parse HTML?
Mixing specialized spiders with middleware
Specialized spiders are just subclasses of Spider. All your middleware, pipelines, and settings work identically. Proxy injection still happens, dedup still runs, item loaders still apply.
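For instance, per-spider settings attach to a CrawlSpider exactly as they would to a plain Spider; the delay and throttle values below are illustrative, not recommendations:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class PoliteCrawler(CrawlSpider):
    name = "polite_crawler"
    start_urls = ["https://practice.scrapingcentral.com/products"]

    # custom_settings works on any Spider subclass, specialized or not
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,
        "AUTOTHROTTLE_ENABLED": True,
    }

    rules = (
        Rule(LinkExtractor(restrict_css=".product-card a"), callback="parse_product"),
    )

    def parse_product(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```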
Anti-patterns
- Using CrawlSpider for a flat list page. Just use Spider with `response.follow_all`; CrawlSpider is overkill for a single rule (see the sketch after this list).
- Using SitemapSpider when the sitemap is incomplete. If the sitemap covers 30% of the catalog, you'll miss the other 70% silently. Verify coverage before trusting it.
- Conditional logic inside Rules. If a Rule callback needs to inspect the page to decide whether to crawl further, you've outgrown CrawlSpider's declarative model.
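A sketch of the flat-list case, assuming the same practice site and selectors as the catalog example above:

```python
import scrapy


class FlatListSpider(scrapy.Spider):
    name = "flat_list"
    start_urls = ["https://practice.scrapingcentral.com/products"]

    def parse(self, response):
        # One call replaces a single-Rule CrawlSpider
        yield from response.follow_all(
            css=".product-card a", callback=self.parse_product
        )

    def parse_product(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```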
Hands-on lab
Against Catalog108, our first-party scraping sandbox:
- Write a `SitemapSpider` that crawls only `/products/{slug}` URLs from `/sitemap.xml`. Count items.
- Write a `CrawlSpider` that traverses `/products` listing pages by following `a.next`, then parses each product card. Count items.
- Compare: which crawl visits more pages? Which finishes faster? Sitemap-based is usually 5–10x cheaper because it skips listing pages entirely.