Scrapy Spiders and Items
Define structured data with Scrapy Items and build advanced spiders with CrawlSpider, SitemapSpider, and custom parsing logic.
Scrapy Items give your scraped data a clear structure, while different spider types handle various crawling patterns. Together, they make your scraping projects organized and maintainable.
Defining Items
Items declare the fields your scraper will extract. Edit items.py:
import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
    rating = scrapy.Field()
    description = scrapy.Field()
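Scrapy 2.2 and later also accept plain dataclass objects as items, alongside scrapy.Item subclasses and dicts. The sketch below shows roughly what the same schema could look like in that style. The class name ProductRecord and the sample values are hypothetical; the snippet uses only the standard library, so it runs without a Scrapy project.

```python
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class ProductRecord:
    """Hypothetical dataclass counterpart of ProductItem (same five fields)."""

    name: Optional[str] = None
    price: Optional[str] = None
    url: Optional[str] = None
    rating: Optional[str] = None
    description: Optional[str] = None


# Sample values are made up for illustration.
record = ProductRecord(name="Example Book", price="£10.00")

# The declared fields double as documentation of the schema.
print([f.name for f in fields(ProductRecord)])

# asdict() gives a plain dict, with unset fields left as None.
print(asdict(record))
```

As with scrapy.Item, a misspelled field name fails immediately (here with a TypeError at construction time) instead of silently creating a new key, which is the main advantage over yielding bare dicts.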
Using Items in a Spider
import scrapy

from quotescraper.items import ProductItem


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # Each book on a listing page sits in its own <article class="product_pod">
        for book in response.css("article.product_pod"):
            item = ProductItem()
            item["name"] = book.css("h3 a::attr(title)").get()
            item["price"] = book.css("p.price_color::text").get()
            # Links on the page are relative, so resolve them against the page URL
            item["url"] = response.urljoin(
                book.css("h3 a::attr(href)").get()
            )
            # Raw class attribute, e.g. "star-rating Three"
            item["rating"] = book.css("p.star-rating::attr(class)").get()
            yield item

        # After the loop, follow the "next" link until the last page
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
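The spider above stores the rating as the raw class attribute and the price as text with its currency symbol. One way to normalize those values is sketched below. It assumes the site writes ratings as the words One to Five inside the class string and always prices in pounds. The helper names parse_rating and parse_price are hypothetical and not part of Scrapy.

```python
from decimal import Decimal
from typing import Optional

# Assumption: the rating appears as one of these words in the class attribute.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(class_attr: Optional[str]) -> Optional[int]:
    """Turn a class string such as 'star-rating Three' into the integer 3."""
    if not class_attr:
        return None
    for token in class_attr.split():
        if token in RATING_WORDS:
            return RATING_WORDS[token]
    return None  # no recognizable rating word found


def parse_price(text: Optional[str]) -> Optional[Decimal]:
    """Turn a price string such as '£51.77' into Decimal('51.77')."""
    if not text:
        return None
    cleaned = text.strip().lstrip("£")
    return Decimal(cleaned)


print(parse_rating("star-rating Three"))
print(parse_price(" £51.77 "))
```

Decimal is used instead of float so that prices keep their exact two-decimal value when stored or summed later.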
Item Loaders for Cleaner Data
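Item Loaders clean values as they are collected: input processors run on each extracted value as it is added, and output processors run when load_item() builds the final item. The loader below strips whitespace, removes the £ sign from prices, and keeps the first match for each field.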
import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader

from quotescraper.items import ProductItem


class ProductLoader(ItemLoader):
    # Output: keep the first non-empty value instead of a list
    default_output_processor = TakeFirst()
    # Input: run each function on every extracted value, in order
    price_in = MapCompose(str.strip, lambda x: x.replace("£", ""))
    name_in = MapCompose(str.strip)


class ProductSpider(scrapy.Spider):
    name = "products_clean"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            # Selectors passed to add_css() are evaluated relative to `book`
            loader = ProductLoader(item=ProductItem(), selector=book)
            loader.add_css("name", "h3 a::attr(title)")
            loader.add_css("price", "p.price_color::text")
            loader.add_value("url", response.urljoin(
                book.css("h3 a::attr(href)").get()
            ))
            yield loader.load_item()
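To make the processor chain less opaque, here is a simplified pure-Python sketch of what MapCompose and TakeFirst do to a list of extracted strings. The functions map_compose and take_first are hypothetical stand-ins written for illustration, not the itemloaders implementation, and the sample prices are made up.

```python
def map_compose(*functions):
    """Rough stand-in for MapCompose: run each function over every value in turn."""

    def process(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is None:
                    continue  # a None result drops the value from the chain
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)  # flatten multi-value results
                else:
                    next_values.append(result)
            values = next_values
        return values

    return process


def take_first(values):
    """Rough stand-in for TakeFirst: first value that is neither None nor ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Same chain as price_in in ProductLoader
price_in = map_compose(str.strip, lambda x: x.replace("£", ""))

cleaned = price_in(["  £51.77 ", " £12.00"])
print(cleaned)
print(take_first(cleaned))
```

The split matters in practice: input processors see every matched value, while the output processor decides how the collected list collapses into the final field.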
CrawlSpider for Rule-Based Crawling
CrawlSpider follows links automatically based on rules you define:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BookCrawler(CrawlSpider):
    name = "book_crawler"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    rules = (
        # No callback: pagination links are only followed
        Rule(LinkExtractor(restrict_css="li.next")),
        # With a callback: each product page is passed to parse_book
        Rule(
            LinkExtractor(restrict_css="article.product_pod h3"),
            callback="parse_book",
        ),
    )

    # Do not name a callback `parse`; CrawlSpider uses that method internally
    def parse_book(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
            "description": response.css("#product_description ~ p::text").get(),
        }
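LinkExtractor can also select links by URL with its allow parameter, which takes regular expressions, instead of by CSS region. Before putting patterns into a Rule, it can help to check them against a few sample URLs. The sketch below does that with the standard re module. The two patterns and the sample URLs are assumptions about how books.toscrape.com lays out its pages, and classify is a hypothetical helper.

```python
import re

# Assumed URL shapes for listing pages and product pages on the demo site.
PAGINATION = re.compile(r"/catalogue/page-\d+\.html$")
PRODUCT = re.compile(r"/catalogue/[^/]+_\d+/index\.html$")


def classify(url: str) -> str:
    """Report which pattern, if any, a URL would match."""
    if PAGINATION.search(url):
        return "pagination"
    if PRODUCT.search(url):
        return "product"
    return "ignore"


samples = [
    "https://books.toscrape.com/catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
]

for url in samples:
    print(classify(url), url)
```

If the patterns behave as intended, they could be passed as LinkExtractor(allow=...) in place of the restrict_css arguments shown earlier.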
Tips
- Always define Items; they serve as documentation for your data schema.
- Use Item Loaders to keep parsing code clean and reusable.
- CrawlSpider is ideal for sites where you want to follow links matching specific patterns.
- For large crawls, consider a proxy rotation service such as ScraperAPI to reduce the risk of IP bans.
Next Steps
- Learn about Scrapy middleware and pipelines for processing and storing data
- Explore Scrapy signals for monitoring spider lifecycle events