Scrapy Spiders and Items

Define structured data with Scrapy Items and build advanced spiders with CrawlSpider, SitemapSpider, and custom parsing logic.

Python Scraping · #5 · intermediate · 2 min read

Scrapy Items give your scraped data a clear structure, while different spider types handle various crawling patterns. Together, they make your scraping projects organized and maintainable.

Defining Items

Items declare the fields your scraper will extract. Edit items.py:

import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
    rating = scrapy.Field()
    description = scrapy.Field()
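
The value of declaring fields up front is that a misspelled or unexpected field fails loudly instead of silently producing a malformed record. As a minimal sketch of the same idea using only the standard library, here is a dataclass with the same five fields. The name ProductRecord is hypothetical and not part of the original project; recent Scrapy releases also accept dataclass objects as items, but check the documentation for the version you run.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ProductRecord:
    """Hypothetical dataclass mirror of ProductItem (name is illustrative)."""

    name: Optional[str] = None
    price: Optional[str] = None
    url: Optional[str] = None
    rating: Optional[str] = None
    description: Optional[str] = None


# Declared fields can be set by keyword; unset fields default to None.
record = ProductRecord(name="A Light in the Attic", price="£51.77")
print(asdict(record))

# An undeclared field is rejected at construction time.
try:
    ProductRecord(colour="red")
except TypeError as exc:
    print("rejected:", exc)
```

A scrapy.Item behaves similarly: assigning to a key that was not declared as a Field raises an error rather than adding the key.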

Using Items in a Spider

import scrapy
# Assumes the Scrapy project package is named "quotescraper"; adjust to your project.
from quotescraper.items import ProductItem


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            item = ProductItem()
            item["name"] = book.css("h3 a::attr(title)").get()
            item["price"] = book.css("p.price_color::text").get()
            item["url"] = response.urljoin(
                book.css("h3 a::attr(href)").get()
            )
            # Raw class string such as "star-rating Three"; convert it later if needed.
            item["rating"] = book.css("p.star-rating::attr(class)").get()
            yield item

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
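
The rating selector above returns the whole class attribute rather than a number. The sketch below shows one way to turn that string into an integer, assuming the attribute looks like "star-rating Three" as it does on books.toscrape.com. The helper name parse_rating and the RATING_WORDS table are illustrative, not part of Scrapy.

```python
# Hypothetical lookup table: the site encodes the rating as an English word.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(class_attr):
    """Return the star count found in a class string, or None if there is none."""
    if not class_attr:
        return None
    for token in class_attr.split():
        if token in RATING_WORDS:
            return RATING_WORDS[token]
    return None


print(parse_rating("star-rating Three"))  # 3
print(parse_rating(None))                 # None
```

In the spider, this could wrap the existing selector call, for example item["rating"] = parse_rating(book.css("p.star-rating::attr(class)").get()).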

Item Loaders for Cleaner Data

Item Loaders apply input and output processors to values as they are extracted, which keeps whitespace stripping, type conversion, and formatting out of your parse method.

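import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose

# Assumes the Scrapy project package is named "quotescraper"; adjust to your project.
from quotescraper.items import ProductItem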


class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    price_in = MapCompose(str.strip, lambda x: x.replace("£", ""))
    name_in = MapCompose(str.strip)


class ProductSpider(scrapy.Spider):
    name = "products_clean"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            loader = ProductLoader(item=ProductItem(), selector=book)
            loader.add_css("name", "h3 a::attr(title)")
            loader.add_css("price", "p.price_color::text")
            loader.add_value("url", response.urljoin(
                book.css("h3 a::attr(href)").get()
            ))
            yield loader.load_item()
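
The loader above strips the currency symbol but still leaves the price as a string. If you want an actual number, a small function can do the conversion; the sketch below is one possible version, with parse_price as a hypothetical name and the set of stripped currency symbols as an assumption.

```python
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Strip whitespace and a leading currency symbol, then return a Decimal.

    Returns None when the remaining text is not a valid number.
    """
    cleaned = text.strip().lstrip("£$€").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(parse_price(" £51.77 "))  # 51.77
print(parse_price("n/a"))       # None
```

A function like this can be passed to MapCompose in place of the lambda, for example price_in = MapCompose(parse_price). Decimal is used here instead of float so that prices are not subject to binary rounding.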

CrawlSpider for Rule-Based Crawling

CrawlSpider follows links automatically based on rules you define:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BookCrawler(CrawlSpider):
    name = "book_crawler"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    rules = (
        Rule(LinkExtractor(restrict_css="li.next")),  # Follow pagination
        Rule(
            LinkExtractor(restrict_css="article.product_pod h3"),
            callback="parse_book",
        ),
    )

    def parse_book(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
            "description": response.css("#product_description ~ p::text").get(),
        }
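
The rules above select links by where they appear on the page. Link extractors can also filter by URL pattern (the allow argument takes regular expressions). To make that idea concrete without running a crawl, here is a standard-library sketch that resolves relative links and keeps only those matching a pattern. The base URL, the sample hrefs, and the detail-page pattern are assumptions about the site's URL layout, and matching_links is a hypothetical helper, not a Scrapy function.

```python
import re
from urllib.parse import urljoin

# Assumed page the links were found on.
BASE_URL = "https://books.toscrape.com/catalogue/page-1.html"
# Assumed shape of a book detail URL: /catalogue/<slug>_<id>/index.html
DETAIL_PATTERN = re.compile(r"/catalogue/[^/]+_\d+/index\.html$")


def matching_links(hrefs, base_url=BASE_URL, pattern=DETAIL_PATTERN):
    """Resolve each href against base_url and keep URLs the pattern matches."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    return [url for url in absolute if pattern.search(url)]


hrefs = [
    "a-light-in-the-attic_1000/index.html",  # book detail page
    "page-2.html",                           # pagination link
    "../index.html",                         # home page
]
print(matching_links(hrefs))
```

Only the first link survives the filter. In a CrawlSpider, a comparable effect would come from passing a pattern like this to LinkExtractor through its allow argument.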

Tips

  • Always define Items; they document your data schema and catch misspelled field names early.
  • Use Item Loaders to keep parsing code clean and reusable.
  • CrawlSpider is ideal for sites where you want to follow links matching specific patterns.
  • For large crawls, consider a proxy rotation service such as ScraperAPI to reduce the risk of IP bans.

Next Steps

  • Learn about Scrapy middleware and pipelines for processing and storing data
  • Explore Scrapy signals for monitoring spider lifecycle events