Items, ItemLoaders, Selectors
The three Scrapy primitives that make scraped data clean and consistent: typed Items, ItemLoaders for normalization, and Selectors for extraction.
What you’ll learn
- Define an Item class with typed fields.
- Use ItemLoader processors to strip, normalize, and coerce values at load time.
- Combine CSS and XPath selectors fluently.
Three primitives. Each one solves a different problem on the path from raw HTML to clean records.
Selectors, the extraction layer
Every Scrapy response wraps a Selector. You query it with CSS or XPath:
title = response.css("h1::text").get()
price = response.xpath("//span[@class='price']/text()").get()
all_skus = response.css(".sku::text").getall()
.get() returns the first match or None. .getall() returns a list. Use ::text (CSS) or /text() (XPath) to extract text nodes; use ::attr(href) (CSS) or /@href (XPath) for attributes.
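As a quick illustration, both selector families can pull the same attribute, and .get() accepts a default so you avoid None-handling downstream (the a.buy selector here is hypothetical):

# CSS and XPath extracting the same attribute ("a.buy" is a made-up selector)
href_css = response.css("a.buy::attr(href)").get()
href_xpath = response.xpath("//a[@class='buy']/@href").get()
# .get() takes a default, which saves a None check later
title = response.css("h1::text").get(default="")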
Selectors chain. Once you scope to a card, sub-queries are relative:
for card in response.css(".product-card"):
    yield {
        "title": card.css("h3::text").get(),
        "price": card.css(".price::text").get(),
        "url": card.css("a::attr(href)").get(),
    }
This pattern, an outer iterator with relative inner queries, is the workhorse of list-page parsing. Mistake to avoid: using absolute queries inside the loop; you'll get the first match on the whole page every iteration, as the contrast below shows.
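Here is the same loop written wrong and right:

for card in response.css(".product-card"):
    # Wrong: absolute query; matches the first h3 on the whole page,
    # so every iteration yields the same title
    title = response.css("h3::text").get()
    # Right: relative query, scoped to this card's subtree
    title = card.css("h3::text").get()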
CSS vs XPath in Scrapy
CSS is more readable for class/id selection. XPath is more powerful for axis traversal (following-sibling::, ancestor::, text()[contains(., "foo")]). Most production Scrapy code uses CSS by default and reaches for XPath when CSS can't express the query.
# CSS: simpler
response.css("div.price::text").get()
# XPath: handles "the dt with text 'SKU' and its following dd"
response.xpath("//dt[normalize-space()='SKU']/following-sibling::dd[1]/text()").get()
Items, typed records
An Item is a dict with a schema. You declare fields:
import scrapy

class ProductItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    sku = scrapy.Field()
    description = scrapy.Field()
    in_stock = scrapy.Field()
    scraped_at = scrapy.Field()
In your spider you can yield either a plain dict or an Item. The advantage of Items: pipelines can use isinstance(item, ProductItem) to dispatch, and you get clear documentation of what fields exist.
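A minimal sketch of that dispatch in a pipeline (ProductPipeline is a hypothetical name):

from myproject.items import ProductItem

class ProductPipeline:
    def process_item(self, item, spider):
        # Only handle ProductItem; pass everything else through untouched
        if not isinstance(item, ProductItem):
            return item
        if item.get("title"):
            item["title"] = item["title"].strip()
        return item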
For typed validation, the modern alternative is attrs or pydantic models. Scrapy supports dataclass, attrs, and pydantic items directly (via itemadapter):
from dataclasses import dataclass

@dataclass
class ProductItem:
    url: str
    title: str
    price: float
    sku: str = ""
    description: str = ""
    in_stock: bool = True
Yield a ProductItem(...) and pipelines see a typed object. Type hints become real documentation.
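A minimal usage sketch, reusing the selectors from earlier; the inline price cleanup mirrors the parse_price helper introduced in the next section:

def parse_detail(self, response):
    price_text = response.css(".price::text").get(default="0")
    yield ProductItem(
        url=response.url,
        title=(response.css("h1::text").get() or "").strip(),
        # float() raises on malformed prices, surfacing bad data early
        price=float(price_text.replace("$", "").replace(",", "").strip()),
    )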
ItemLoaders, the normalization layer
Raw HTML is dirty: leading whitespace, currency symbols, "In Stock" vs "in stock", mixed None/"". ItemLoader is the place to clean.
from itemloaders.processors import TakeFirst, MapCompose, Join
from scrapy.loader import ItemLoader

def parse_price(text):
    return float(text.replace("$", "").replace(",", "").strip())

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    title_in = MapCompose(str.strip)
    price_in = MapCompose(parse_price)
    description_out = Join(" ")
def parse_product(self, response):
    loader = ProductLoader(item=ProductItem(), selector=response)
    loader.add_css("title", "h1::text")
    loader.add_css("price", ".price::text")
    loader.add_css("description", ".description p::text")
    loader.add_value("url", response.url)
    yield loader.load_item()
Key concepts:
- _in processors run on each value as it's added. MapCompose(str.strip) strips every input.
- _out processors run when you call load_item(). TakeFirst() picks the first non-empty value.
- MapCompose chains functions: MapCompose(str.strip, str.lower, parse_price).
- Join(" ") concatenates a list of strings into one.
The win: normalization logic lives in one place, not scattered across spiders. Add a new field, add its in/out processors, done.
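For instance, a hypothetical in_stock field could be normalized to a boolean with one extra processor line (the "in stock" wording the lambda matches is an assumption about the site):

from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # New field: map "In Stock" / "in stock" / " IN STOCK " to a bool
    # (exact site wording is an assumption)
    in_stock_in = MapCompose(str.strip, str.lower, lambda s: s == "in stock")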
Selectors against JSON-LD
Modern e-commerce sites embed schema.org data in <script type="application/ld+json">. Scrapy handles this:
import json

def parse_product(self, response):
    raw = response.css("script[type='application/ld+json']::text").get()
    data = json.loads(raw)
    yield {
        "title": data.get("name"),
        "price": data.get("offers", {}).get("price"),
        "sku": data.get("sku"),
    }
Always check for JSON-LD before writing 30 lines of selectors; it often hands you the entire item.
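Real pages sometimes ship several ld+json blocks, or wrap objects in an @graph list, so a defensive variant helps. A sketch, assuming schema.org Product markup:

import json

def find_product_ld(response):
    # Scan every JSON-LD block; return the first Product object found
    for raw in response.css("script[type='application/ld+json']::text").getall():
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        # Unwrap @graph containers and bare lists into a flat sequence
        candidates = data.get("@graph", [data]) if isinstance(data, dict) else data
        for node in candidates:
            if isinstance(node, dict) and node.get("@type") == "Product":
                return node
    return None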
A complete listing → detail pattern
import scrapy

from myproject.items import ProductItem
from myproject.loaders import ProductLoader

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://practice.scrapingcentral.com/products"]

    def parse(self, response):
        for href in response.css(".product-card a::attr(href)").getall():
            yield response.follow(href, self.parse_detail)
        if next_page := response.css("a.next::attr(href)").get():
            yield response.follow(next_page, self.parse)

    def parse_detail(self, response):
        loader = ProductLoader(item=ProductItem(), selector=response)
        loader.add_value("url", response.url)
        loader.add_css("title", "h1::text")
        loader.add_css("price", ".price::text")
        loader.add_css("sku", "[data-sku]::attr(data-sku)")
        loader.add_css("description", ".description p::text")
        yield loader.load_item()
The spider stays small. Loader handles cleanup. Items document the schema. This is the idiomatic shape of a Scrapy spider.
Hands-on lab
Against /products at Catalog108:
- Define a ProductItem with url, title, price (as float), sku, description.
- Write a ProductLoader that strips whitespace from title, parses price to float, and joins multi-paragraph description with spaces.
- Run the spider. Check that 100% of items have all fields populated with the right types.
If anything is None, or a string where it should be a float, your loader is missing a processor. Fix it there, never in the pipeline.
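One way to verify the lab's success criterion is a tiny type check over the exported feed (output.json is an assumed feed file; adjust to however you export):

import json

# e.g. scrapy crawl products -O output.json  (JSON array feed)
with open("output.json") as f:
    items = json.load(f)

for item in items:
    # Every field populated, and price already a float
    assert all(item.get(k) for k in ("url", "title", "sku", "description")), item
    assert isinstance(item["price"], float), item
print(f"{len(items)} items, all fields populated and typed")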