Scrapy Spiders and Items - Python Scraping

Define structured data with Scrapy Items and build advanced spiders with CrawlSpider, SitemapSpider, and custom parsing logic.

Scrapy Items give your scraped data a clear structure, while different spider types handle various crawling patterns. Together, they make your scraping projects organized and maintainable.

Defining Items

Items declare the fields your scraper will extract. Edit items.py:

import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
    rating = scrapy.Field()
    description = scrapy.Field()

Using Items in a Spider

import scrapy
from quotescraper.items import ProductItem  # "quotescraper" is the project package; use your own name


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            item = ProductItem()
            item["name"] = book.css("h3 a::attr(title)").get()
            item["price"] = book.css("p.price_color::text").get()
            item["url"] = response.urljoin(
                book.css("h3 a::attr(href)").get()
            )
            # Raw class attribute, e.g. "star-rating Three"
            item["rating"] = book.css("p.star-rating::attr(class)").get()
            yield item

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
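
The rating field above stores the raw class attribute, which on books.toscrape.com looks like "star-rating Three" rather than a number. A minimal sketch of turning it into an integer follows. The helper name rating_to_int and the word table are illustrative assumptions, not part of Scrapy.

```python
# Hypothetical helper: not part of Scrapy or the original spider.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(class_attr):
    """Map a class string such as 'star-rating Three' to 3, or None if unknown."""
    if not class_attr:
        return None
    # The attribute holds several class names; look for the one that is a rating word.
    for token in class_attr.split():
        if token in RATING_WORDS:
            return RATING_WORDS[token]
    return None
```

Inside parse, the value could then be stored as item["rating"] = rating_to_int(book.css("p.star-rating::attr(class)").get()).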

Item Loaders for Cleaner Data

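Item Loaders clean values as they are extracted. Input processors run on each raw value (stripping whitespace, converting types, normalizing formats), and an output processor decides what is finally stored in the field.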

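import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader

from quotescraper.items import ProductItem


class ProductLoader(ItemLoader):
    # Each field collects a list of values; TakeFirst() outputs the first non-empty one
    default_output_processor = TakeFirst()
    # Input processors run on every value extracted for the field
    price_in = MapCompose(str.strip, lambda x: x.replace("£", ""))
    name_in = MapCompose(str.strip)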


class ProductSpider(scrapy.Spider):
    name = "products_clean"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            loader = ProductLoader(item=ProductItem(), selector=book)
            loader.add_css("name", "h3 a::attr(title)")
            loader.add_css("price", "p.price_color::text")
            loader.add_value("url", response.urljoin(
                book.css("h3 a::attr(href)").get()
            ))
            yield loader.load_item()
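
To make the processor chain concrete, here is a plain-Python imitation of what MapCompose and TakeFirst do to the values collected for price. The functions map_compose and take_first are simplified stand-ins written for illustration, not the library implementations. The sample input string and the extra float step are assumptions added to show type conversion.

```python
def map_compose(*functions):
    """Simplified stand-in for MapCompose: apply each function to every value in turn."""
    def process(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is None:
                    continue  # a None result drops the value
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)  # list results are flattened
                else:
                    next_values.append(result)
            values = next_values
        return values
    return process


def take_first(values):
    """Simplified stand-in for TakeFirst: return the first value that is not None or ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Same steps as price_in in the loader, plus a float conversion (an assumption).
price_in = map_compose(str.strip, lambda x: x.replace("£", ""), float)

collected = price_in(["  £51.77 "])  # input processor runs on the extracted list
price = take_first(collected)        # output processor picks a single value
```

The split mirrors the loader: the input step cleans every extracted value, and the output step reduces the list to the single value stored on the item.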

CrawlSpider for Rule-Based Crawling

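CrawlSpider follows links automatically based on rules you define. Give callbacks a name other than parse, because CrawlSpider uses parse internally to apply its rules: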

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BookCrawler(CrawlSpider):
    name = "book_crawler"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

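    rules = (
        # No callback: links are only followed (follow defaults to True)
        Rule(LinkExtractor(restrict_css="li.next")),  # Follow pagination
        # With a callback: each book page is parsed (follow defaults to False)
        Rule(
            LinkExtractor(restrict_css="article.product_pod h3"),
            callback="parse_book",
        ),
    )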

    def parse_book(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
            "description": response.css("#product_description ~ p::text").get(),
        }
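
The rules above select links by page region with restrict_css. LinkExtractor also accepts allow and deny regular expressions that are tested against each link's absolute URL. The filtering idea can be sketched with the standard re module. The patterns, the sample URLs, and the filter_links helper below are illustrative assumptions, not Scrapy code.

```python
import re

# Hypothetical patterns: keep listing pages, skip category pages.
ALLOW = [re.compile(r"/catalogue/page-\d+\.html$")]
DENY = [re.compile(r"/category/")]


def filter_links(urls, allow=ALLOW, deny=DENY):
    """Keep URLs that match any allow pattern and no deny pattern."""
    kept = []
    for url in urls:
        # An empty allow list means every URL is allowed.
        if allow and not any(p.search(url) for p in allow):
            continue
        if any(p.search(url) for p in deny):
            continue
        kept.append(url)
    return kept


sample = [
    "https://books.toscrape.com/catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
]
followed = filter_links(sample)
```

In a spider, a comparable rule might be written as Rule(LinkExtractor(allow=r"/catalogue/page-\d+\.html$")), with the exact pattern depending on the site's URL scheme.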

Tips

Always define Items. They document your data schema and reject undeclared field names, which catches typos early.
Use Item Loaders to keep parsing code clean and reusable.
CrawlSpider is ideal for sites where you want to follow links matching specific patterns.
For large crawls across many pages, a proxy rotation service like ScraperAPI can reduce the risk of IP bans.

Next Steps

Learn about Scrapy middleware and pipelines for processing and storing data.
Explore Scrapy signals for monitoring spider lifecycle events.