Choosing Between CSS Selectors and XPath
When to use which. A short decision framework with concrete examples, not 'XPath is more powerful so always use XPath.'
What you’ll learn
- Default to CSS for common cases, readability, library support, ease.
- Recognise the three situations where XPath wins decisively.
- Resist the mistake of using XPath for everything just because you read it's more powerful.
- Combine both in one scraper without it becoming a mess.
You learned both. Now: which do you reach for first?
The default: CSS
Reach for CSS by default. Reasons:
- More readable.
  `article.product .price` is obvious; `//article[@class="product"]//span[@class="price"]` says the same thing in more than twice the characters.
- Better library support. Every major scraping library supports CSS; some support XPath only as an extension or with significant warts.
- Same syntax everywhere. The CSS you write in BeautifulSoup is the CSS you write in Playwright, browser DevTools, and stylesheets. XPath syntax is identical across libraries, but the quirks (XPath 1.0 vs 2.0, namespace handling, whitespace) are not.
- Browsers think in CSS. When you copy a selector from DevTools, you get CSS. The mental model matches.
If you can do it cleanly in CSS, do it in CSS.
The three cases where XPath wins
There are exactly three. Memorize them.
1. Walking up the tree
When the stable anchor is inside what you want to extract:
```python
# You found the in-stock badge. You want the whole product card.
card_elements = tree.xpath('//span[@class="in-stock-badge"]/ancestor::article[@class="product"]')
# CSS equivalent: impossible in one expression. You'd have to find every
# article first, then check each one for the badge. Two steps.
```
This is the single most common XPath-only case. Real example: extracting only the product cards that contain a specific badge, or only the table rows where a specific cell has a specific class.
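To make the axis concrete, here is a minimal runnable sketch using lxml and a made-up two-card snippet (the HTML and class names are illustrative, not from a real page):

```python
from lxml import html

# Hypothetical markup: two product cards, only one carries the badge.
doc = html.fromstring("""
<div>
  <article class="product"><h2>Mug</h2>
    <span class="in-stock-badge">In stock</span></article>
  <article class="product"><h2>Plate</h2></article>
</div>
""")

# Anchor on the badge, then walk UP to the whole card in one expression.
cards = doc.xpath('//span[@class="in-stock-badge"]/ancestor::article[@class="product"]')
titles = [card.findtext('h2') for card in cards]  # only the badged card survives
```

Only `Mug` comes back: the un-badged card is never matched, which is exactly the filtering you cannot express in one CSS selector.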
2. Matching by text content
When the only stable hook is what the element says:
```python
# The "Add to cart" button has no stable class/id, but the text is constant
button = tree.xpath('//button[normalize-space(.)="Add to cart"]')

# CSS in BeautifulSoup: soup.select('button:-soup-contains("Add to cart")')
# CSS in lxml: not supported, fall back to XPath
# Playwright: page.locator("button", has_text="Add to cart") also works
```
Playwright's has_text is the new portable way; for pure HTML parsing libraries, XPath is still cleaner.
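As a self-contained sketch of the text-matching case (hypothetical markup with auto-generated class names, the kind that breaks CSS anchors):

```python
from lxml import html

# Hypothetical page: the build tool generates random class names,
# so the button text is the only stable hook.
doc = html.fromstring("""
<div>
  <button class="btn-x93kq1"> Add to cart </button>
  <button class="btn-a77zq0">Checkout</button>
</div>
""")

# normalize-space(.) trims and collapses whitespace before comparing,
# so the stray spaces inside the first button don't break the match.
buttons = doc.xpath('//button[normalize-space(.)="Add to cart"]')
```

Note that a plain `text()="Add to cart"` comparison would fail here because of the surrounding spaces; `normalize-space(.)` is the robust form.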
3. Complex sibling navigation
When the data you want is "the <p> right after the <h2> that says 'Reviews'":
```python
review_paragraph = tree.xpath('//h2[normalize-space(.)="Reviews"]/following-sibling::p[1]')
# CSS approximation: h2 + p, but you'd first have to verify the h2's text. Two steps.
```
The following-sibling:: axis lets you say "the next sibling that matches X" in one expression.
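Here is the same idea as a runnable sketch against an invented page fragment (section layout and text are assumptions for illustration):

```python
from lxml import html

# Hypothetical product page: several h2-headed sections, no useful classes.
doc = html.fromstring("""
<section>
  <h2>Description</h2><p>A sturdy mug.</p>
  <h2>Reviews</h2><p>Love it.</p><p>Chipped on arrival.</p>
</section>
""")

# The [1] picks only the FIRST <p> that follows the matching <h2>;
# without it you'd get every later <p> sibling, including other sections'.
first_review = doc.xpath(
    '//h2[normalize-space(.)="Reviews"]/following-sibling::p[1]'
)[0]
```

The anchor (`Reviews`), the direction (`following-sibling`), and the limit (`[1]`) all live in one expression, which is the whole appeal.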
When XPath does NOT win
Don't reach for XPath when you can write CSS. Anti-patterns:
- `//div[@id="main"]` instead of `#main`: same thing, worse readability.
- `//*[@class="price"]` instead of `.price`.
- `//ul/li[position()=1]` instead of `ul li:first-child`.
- `//article//span` for "find a span inside an article": `article span` is half the characters.
A scraper that uses XPath for everything reads like Latin for no benefit.
The hybrid approach
In practice, scrapers mix both. A typical Scrapy spider:
```python
def parse(self, response):
    # CSS for the easy cases
    for card in response.css('article.product'):
        title = card.css('h2::text').get()
        price = card.css('.price::text').get()
        # XPath for the awkward 10%
        in_stock = card.xpath('.//span[contains(@class, "in-stock")]').get() is not None
        yield {'title': title, 'price': price, 'in_stock': in_stock}
```
Note the `.//` in the XPath: the leading `.` says "search from this card's subtree," not from the document root. Forgetting it is the most common Scrapy XPath bug.
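The same rule applies to lxml element objects, which makes the bug easy to demonstrate without a running spider (hypothetical two-card markup):

```python
from lxml import html

doc = html.fromstring("""
<div>
  <article class="product"><span class="in-stock">yes</span></article>
  <article class="product"></article>
</div>
""")

# Take the SECOND card, the one without a badge.
card = doc.xpath('//article[@class="product"]')[1]

# BUG: a leading // always searches from the document root,
# so this finds the OTHER card's badge.
wrong = card.xpath('//span[@class="in-stock"]')

# FIX: .// restricts the search to this card's subtree.
right = card.xpath('.//span[@class="in-stock"]')
```

`wrong` contains one element while `right` is empty, which in a real spider would silently mark every product as in stock.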
Library-specific notes
| Library | CSS support | XPath support |
|---|---|---|
| BeautifulSoup | `.select()`, fully featured (via soupsieve) | Not built in; drop to lxml directly for XPath steps |
| lxml | `.cssselect()`, solid | Native `.xpath()`, top-class |
| Scrapy | `.css()` selectors | `.xpath()` selectors, both equally first-class |
| Playwright | `page.locator(css)` | `page.locator("xpath=//...")` |
| Selenium | `By.CSS_SELECTOR` | `By.XPATH` |
| Symfony DomCrawler | `.filter()` for CSS | `.filterXPath()` for XPath |
| PHP DOMDocument | Not built in | `DOMXPath`, native |
Notable: BeautifulSoup is CSS-first; if you need XPath in a BS4 pipeline, switch to lxml directly for that step. Mixing is fine; your scraper isn't worse for using two libraries.
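A minimal sketch of that BS4-plus-lxml handoff (the markup is invented; the point is that the same raw HTML string feeds both parsers):

```python
from bs4 import BeautifulSoup
from lxml import html

raw = ('<article class="product"><h2>Mug</h2>'
       '<span class="in-stock-badge">In stock</span></article>')

# CSS-first pass with BeautifulSoup for the easy extraction...
soup = BeautifulSoup(raw, 'html.parser')
title = soup.select_one('article.product h2').get_text()

# ...then hand the same markup to lxml for the XPath-only step
# (walking up from the badge to its ancestor card).
tree = html.fromstring(raw)
card = tree.xpath('//span[@class="in-stock-badge"]/ancestor::article')[0]
```

Parsing the document twice costs a little CPU, but for most scrapers the network dominates and the clarity is worth it.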
A practical mental flowchart
Need to find an element?
│
├─ Stable id, class, or data-* attribute? → CSS
├─ Position-based (nth child, first/last)? → CSS
├─ "All <span> inside .product"? → CSS (descendant)
├─ Need to walk UP to a parent? → XPath
├─ Need to match by text content? → XPath (or library helper)
└─ Need "next sibling matching X"? → XPath
When you find yourself reaching for XPath, ask once: "Can I anchor on something more stable and avoid this?" If yes, refactor. If no, XPath.
Hands-on lab
Open practice.scrapingcentral.com/products/1-yellow-ceramic-mug and try to extract:
- The product title
- The price
- Every review where the rating is exactly 5 stars
- The total review count (which is written as text on the page near the reviews heading)
For each, try CSS first. Where you fail (or it gets ugly), switch to XPath. By the end you'll have a strong intuition for which tool to reach for in which situation: hands-on, not theoretical.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.