Items, ItemLoaders, Selectors
The three Scrapy primitives that make scraped data clean and consistent: typed Items, ItemLoaders for normalization, and Selectors for extraction.
What you’ll learn
- Define an Item class with typed fields.
- Use ItemLoader processors to strip, normalize, and coerce values at load time.
- Combine CSS and XPath selectors fluently.
Three primitives. Each one solves a different problem on the path from raw HTML to clean records.
Selectors, the extraction layer
Every Scrapy response wraps a Selector. You query it with CSS or XPath:
title = response.css("h1::text").get()
price = response.xpath("//span[@class='price']/text()").get()
all_skus = response.css(".sku::text").getall()
.get() returns the first match or None. .getall() returns a list. Use ::text (CSS) or /text() (XPath) to extract text nodes; use ::attr(href) (CSS) or /@href (XPath) for attributes.
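As a quick illustration, both selector families can pull the same attribute, and .get() accepts a default so you avoid None-handling downstream (the a.buy selector here is hypothetical):

# CSS and XPath extracting the same attribute ("a.buy" is a made-up selector)
href_css = response.css("a.buy::attr(href)").get()
href_xpath = response.xpath("//a[@class='buy']/@href").get()
# .get() takes a default, which saves a None check later
title = response.css("h1::text").get(default="")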
Selectors chain. Once you scope to a card, sub-queries are relative:
for card in response.css(".product-card"):
    yield {
        "title": card.css("h3::text").get(),
        "price": card.css(".price::text").get(),
        "url": card.css("a::attr(href)").get(),
    }
This pattern, an outer iterator with relative inner queries, is the workhorse of list-page parsing. Mistake to avoid: using absolute queries inside the loop; you'll get the first match on the whole page every iteration, as the contrast below shows.
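Here is the same loop written wrong and right:

for card in response.css(".product-card"):
    # Wrong: absolute query; matches the first h3 on the whole page,
    # so every iteration yields the same title
    title = response.css("h3::text").get()
    # Right: relative query, scoped to this card's subtree
    title = card.css("h3::text").get()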
CSS vs XPath in Scrapy
CSS is more readable for class/id selection. XPath is more powerful for axis traversal (following-sibling::, ancestor::, text()[contains(., "foo")]). Most production Scrapy code uses CSS by default and reaches for XPath when CSS can't express the query.
# CSS: simpler
response.css("div.price::text").get()
# XPath: handles "the dt with text 'SKU' and its following dd"
response.xpath("//dt[normalize-space()='SKU']/following-sibling::dd[1]/text()").get()
Items, typed records
An Item is a dict with a schema. You declare fields:
import scrapy

class ProductItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    sku = scrapy.Field()
    description = scrapy.Field()
    in_stock = scrapy.Field()
    scraped_at = scrapy.Field()
In your spider you can yield either a plain dict or an Item. The advantage of Items: pipelines can use isinstance(item, ProductItem) to dispatch, and you get clear documentation of what fields exist.
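A minimal sketch of that dispatch in a pipeline (ProductPipeline is a hypothetical name):

from myproject.items import ProductItem

class ProductPipeline:
    def process_item(self, item, spider):
        # Only handle ProductItem; pass everything else through untouched
        if not isinstance(item, ProductItem):
            return item
        if item.get("title"):
            item["title"] = item["title"].strip()
        return item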
For typed validation, the modern alternative is attrs or pydantic models. Scrapy supports dataclass, attrs, and pydantic items directly (via itemadapter):
from dataclasses import dataclass

@dataclass
class ProductItem:
    url: str
    title: str
    price: float
    sku: str = ""
    description: str = ""
    in_stock: bool = True
Yield a ProductItem(...) and pipelines see a typed object. Type hints become real documentation.
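A minimal usage sketch, reusing the selectors from earlier; the inline price cleanup mirrors the parse_price helper introduced in the next section:

def parse_detail(self, response):
    price_text = response.css(".price::text").get(default="0")
    yield ProductItem(
        url=response.url,
        title=(response.css("h1::text").get() or "").strip(),
        # float() raises on malformed prices, surfacing bad data early
        price=float(price_text.replace("$", "").replace(",", "").strip()),
    )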
ItemLoaders, the normalization layer
Raw HTML is dirty: leading whitespace, currency symbols, "In Stock" vs "in stock", mixed None/"". ItemLoader is the place to clean.
from itemloaders.processors import TakeFirst, MapCompose, Join
from scrapy.loader import ItemLoader

def parse_price(text):
    return float(text.replace("$", "").replace(",", "").strip())

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    title_in = MapCompose(str.strip)
    price_in = MapCompose(parse_price)
    description_out = Join(" ")
def parse_product(self, response):
    loader = ProductLoader(item=ProductItem(), selector=response)
    loader.add_css("title", "h1::text")
    loader.add_css("price", ".price::text")
    loader.add_css("description", ".description p::text")
    loader.add_value("url", response.url)
    yield loader.load_item()
Key concepts:
- _in processors run on each value as it's added. MapCompose(str.strip) strips every input.
- _out processors run when you call load_item(). TakeFirst() picks the first non-empty value.
- MapCompose chains functions: MapCompose(str.strip, str.lower, parse_price).
- Join(" ") concatenates a list of strings into one.
The win: normalization logic lives in one place, not scattered across spiders. Add a new field, add its in/out processors, done.
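For instance, a hypothetical in_stock field could be normalized to a boolean with one extra processor line (the "in stock" wording the lambda matches is an assumption about the site):

from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # New field: map "In Stock" / "in stock" / " IN STOCK " to a bool
    # (exact site wording is an assumption)
    in_stock_in = MapCompose(str.strip, str.lower, lambda s: s == "in stock")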
Selectors against JSON-LD
Modern e-commerce sites embed schema.org data in <script type="application/ld+json">. Scrapy handles this:
import json

def parse_product(self, response):
    raw = response.css("script[type='application/ld+json']::text").get()
    data = json.loads(raw)
    yield {
        "title": data.get("name"),
        "price": data.get("offers", {}).get("price"),
        "sku": data.get("sku"),
    }
Always check for JSON-LD before writing 30 lines of selectors; it often hands you the entire item.
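Real pages sometimes ship several ld+json blocks, or wrap objects in an @graph list, so a defensive variant helps. A sketch, assuming schema.org Product markup:

import json

def find_product_ld(response):
    # Scan every JSON-LD block; return the first Product object found
    for raw in response.css("script[type='application/ld+json']::text").getall():
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        # Unwrap @graph containers and bare lists into a flat sequence
        candidates = data.get("@graph", [data]) if isinstance(data, dict) else data
        for node in candidates:
            if isinstance(node, dict) and node.get("@type") == "Product":
                return node
    return None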
A complete listing → detail pattern
import scrapy

from myproject.items import ProductItem
from myproject.loaders import ProductLoader

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://practice.scrapingcentral.com/products"]

    def parse(self, response):
        for href in response.css(".product-card a::attr(href)").getall():
            yield response.follow(href, self.parse_detail)
        if next_page := response.css("a.next::attr(href)").get():
            yield response.follow(next_page, self.parse)

    def parse_detail(self, response):
        loader = ProductLoader(item=ProductItem(), selector=response)
        loader.add_value("url", response.url)
        loader.add_css("title", "h1::text")
        loader.add_css("price", ".price::text")
        loader.add_css("sku", "[data-sku]::attr(data-sku)")
        loader.add_css("description", ".description p::text")
        yield loader.load_item()
The spider stays small. Loader handles cleanup. Items document the schema. This is the idiomatic shape of a Scrapy spider.
Hands-on lab
Against /products at Catalog108:
- Define a ProductItem with url, title, price (as float), sku, description.
- Write a ProductLoader that strips whitespace from title, parses price to float, and joins multi-paragraph description with spaces.
- Run the spider. Check that 100% of items have all fields populated with the right types.
If anything is None, or a string where it should be a float, your loader is missing a processor. Fix it there, never in the pipeline.
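One way to verify the lab's success criterion is a tiny type check over the exported feed (output.json is an assumed feed file; adjust to however you export):

import json

# e.g. scrapy crawl products -O output.json  (JSON array feed)
with open("output.json") as f:
    items = json.load(f)

for item in items:
    # Every field populated, and price already a float
    assert all(item.get(k) for k in ("url", "title", "sku", "description")), item
    assert isinstance(item["price"], float), item
print(f"{len(items)} items, all fields populated and typed")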