Beginner · 5 min read

Choosing Between CSS Selectors and XPath

When to use which. A short decision framework with concrete examples, not 'XPath is more powerful so always use XPath.'

What you’ll learn

  • Default to CSS for common cases: readability, library support, ease.
  • Recognise the three situations where XPath wins decisively.
  • Resist the mistake of using XPath for everything just because you read it's more powerful.
  • Combine both in one scraper without it becoming a mess.

You learned both. Now: which do you reach for first?

The default: CSS

Reach for CSS by default. Reasons:

  1. More readable. article.product .price is obvious. //article[@class="product"]//span[@class="price"] expresses the same intent in more than twice the characters.
  2. Better library support. Every major scraping library supports CSS; some support XPath only as an extension or with significant warts.
  3. Same syntax everywhere. The CSS you write in BeautifulSoup is the CSS you write in Playwright, browser DevTools, and stylesheets. XPath syntax is identical across libraries, but the quirks (XPath 1.0 vs 2.0, namespace handling, whitespace) are not.
  4. Browsers think in CSS. When you copy a selector from DevTools, you get CSS. The mental model matches.

If you can do it cleanly in CSS, do it in CSS.
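
To make the default concrete, here is a minimal sketch in BeautifulSoup; the HTML, class names, and price are invented for illustration:

from bs4 import BeautifulSoup

# Invented sample markup, standing in for a real product page.
html_doc = """
<article class="product">
  <h2>Yellow Ceramic Mug</h2>
  <span class="price">$14.99</span>
</article>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# One readable selector does the whole job.
price = soup.select_one("article.product .price").get_text()
print(price)  # $14.99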

The three cases where XPath wins

There are exactly three. Memorize them.

1. Walking up the tree

When the stable anchor is inside what you want to extract:

# You found the in-stock badge. You want the whole product card.
product_cards = tree.xpath('//span[@class="in-stock-badge"]/ancestor::article[@class="product"]')

# CSS equivalent: impossible in one expression; you'd select every article first,
# then check each one for the badge in Python. Two steps.

This is the single most common XPath-only case. Real example: extracting only the product cards that contain a specific badge, or only the table rows where a specific cell has a specific class.
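
As a runnable comparison, here is a sketch assuming lxml plus the cssselect package, with invented two-card HTML:

from lxml import html

# Invented sample: one card has the badge, one does not.
doc = html.fromstring("""
<article class="product"><h2>Mug</h2><span class="in-stock-badge">In stock</span></article>
<article class="product"><h2>Bowl</h2></article>
""")

# XPath: anchor on the badge, walk up to the card. One expression.
xpath_cards = doc.xpath('//span[@class="in-stock-badge"]/ancestor::article[@class="product"]')

# The CSS workaround: select every card, then filter in Python. Two steps.
css_cards = [c for c in doc.cssselect('article.product') if c.cssselect('.in-stock-badge')]

assert xpath_cards == css_cards  # both contain only the "Mug" card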

2. Matching by text content

When the only stable hook is what the element says:

# The "Add to cart" button has no stable class/id, but the text is constant
button = tree.xpath('//button[normalize-space(.)="Add to cart"]')

# CSS in BeautifulSoup: soup.select('button:-soup-contains("Add to cart")')
# CSS in lxml: not supported, fall back to XPath
# CSS in Playwright: page.locator("button", has_text="Add to cart"), also works

Playwright's has_text covers this neatly in browser automation; for pure HTML parsing libraries, XPath is still the cleaner option.
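
For the pure-parsing case, here is a small lxml sketch (invented HTML) contrasting exact matching via normalize-space() with looser substring matching via contains():

from lxml import html

doc = html.fromstring("""
<button>  Add to cart  </button>
<button>Add to wishlist</button>
""")

# normalize-space(.) trims and collapses whitespace, so padded text still matches exactly.
exact = doc.xpath('//button[normalize-space(.)="Add to cart"]')

# contains() matches substrings; broader, so reach for it only when the full text varies.
partial = doc.xpath('//button[contains(normalize-space(.), "Add to")]')

print(len(exact), len(partial))  # 1 2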

3. Complex sibling navigation

When the data you want is "the <p> right after the <h2> that says 'Reviews'":

review_paragraph = tree.xpath('//h2[normalize-space(.)="Reviews"]/following-sibling::p[1]')

# CSS approximation: h2 + p, but you'd have to first verify the h2's text. Two steps.

The following-sibling:: axis lets you say "the next sibling that matches X" in one expression.
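
A minimal runnable version, again with invented HTML:

from lxml import html

doc = html.fromstring("""
<h2>Specs</h2><p>Ceramic, 350 ml.</p>
<h2>Reviews</h2><p>Great mug, keeps coffee warm.</p>
""")

# Find the right heading by its text, then hop to the first <p> after it. One expression.
review = doc.xpath('//h2[normalize-space(.)="Reviews"]/following-sibling::p[1]')
print(review[0].text)  # Great mug, keeps coffee warm.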

When XPath does NOT win

Don't reach for XPath when you can write CSS. Anti-patterns:

  • //div[@id="main"] instead of #main: same thing, worse readability
  • //*[@class="price"] instead of .price
  • //ul/li[position()=1] instead of ul li:first-child
  • //article//span for "find a span inside an article": article span is half the characters

A scraper that uses XPath for everything reads like Latin for no benefit.
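
You can check these equivalences yourself. On the invented sample below all three pairs select the same elements; one caveat is that .price matches by class token while @class="price" matches the exact attribute value, so they only coincide when the attribute holds a single class:

from lxml import html

doc = html.fromstring('<div id="main"><ul><li class="price">$5</li><li>other</li></ul></div>')

# Each XPath anti-pattern next to its CSS equivalent.
assert doc.xpath('//div[@id="main"]') == doc.cssselect('#main')
assert doc.xpath('//*[@class="price"]') == doc.cssselect('.price')
assert doc.xpath('//ul/li[position()=1]') == doc.cssselect('ul li:first-child')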

The hybrid approach

In practice, scrapers mix both. A typical Scrapy spider:

def parse(self, response):
    # CSS for the easy cases
    for card in response.css('article.product'):
        title = card.css('h2::text').get()
        price = card.css('.price::text').get()

        # XPath for the awkward 10%
        in_stock = card.xpath('.//span[contains(@class, "in-stock")]').get() is not None

        yield {'title': title, 'price': price, 'in_stock': in_stock}

Note the .// in the XPath: that leading . says "from this card's subtree," not "from the document root." Forgetting it is the most common Scrapy XPath bug.
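
Here is the bug in isolation, using Scrapy's Selector directly on an invented two-card page:

from scrapy.selector import Selector

page = Selector(text="""
<article class="product"><span class="in-stock">yes</span></article>
<article class="product"></article>
""")

for card in page.css('article.product'):
    # Wrong: '//' restarts from the document root, so BOTH cards appear in stock.
    wrong = card.xpath('//span[contains(@class, "in-stock")]').get() is not None
    # Right: './/' stays inside this card's subtree.
    right = card.xpath('.//span[contains(@class, "in-stock")]').get() is not None
    print(wrong, right)

# Prints: True True, then True False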

Library-specific notes

Library              CSS support                        XPath support
BeautifulSoup        .select(), fully featured          None; drop to lxml for XPath (soupsieve is CSS-only)
lxml                 .cssselect() (via cssselect)       Native .xpath(), first-class
Scrapy               .css() selectors                   .xpath() selectors; both equally first-class
Playwright           page.locator(css)                  page.locator("xpath=//...")
Selenium             By.CSS_SELECTOR                    By.XPATH
Symfony DomCrawler   .filter() for CSS                  .filterXPath() for XPath
PHP DOMDocument      Not built in                       DOMXPath, native

Notable: BeautifulSoup is CSS-first; if you need XPath in a BS4 pipeline, switch to lxml directly for that step. Mixing is fine; your scraper isn't worse for using two libraries.
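
A sketch of that mixing, parsing the same (invented) HTML twice: BeautifulSoup for the CSS-friendly step, plain lxml for the one step that needs XPath's ancestor axis:

from bs4 import BeautifulSoup
from lxml import html

raw = """
<article class="product">
  <h2>Yellow Ceramic Mug</h2>
  <span class="in-stock-badge">In stock</span>
</article>
"""

# BeautifulSoup handles the readable CSS part.
soup = BeautifulSoup(raw, "html.parser")
title = soup.select_one("article.product h2").get_text()

# lxml handles the upward walk BS4 can't express.
tree = html.fromstring(raw)
in_stock_cards = tree.xpath('//span[@class="in-stock-badge"]/ancestor::article[@class="product"]')

print(title, len(in_stock_cards))  # Yellow Ceramic Mug 1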

A practical mental flowchart

Need to find an element?
│
├─ Stable id, class, or data-* attribute?  → CSS
├─ Position-based (nth child, first/last)?  → CSS
├─ "All <span> inside .product"?  → CSS (descendant)
├─ Need to walk UP to a parent?  → XPath
├─ Need to match by text content?  → XPath (or library helper)
└─ Need "next sibling matching X"?  → XPath

When you find yourself reaching for XPath, ask once: "Can I anchor on something more stable and avoid this?" If yes, refactor. If no, XPath.

Hands-on lab

Open practice.scrapingcentral.com/products/1-yellow-ceramic-mug on Catalog108, our first-party scraping sandbox, and try to extract:

  1. The product title
  2. The price
  3. Every review where the rating is exactly 5 stars
  4. The total review count (which is written as text on the page near the reviews heading)

For each, try CSS first. Where you fail (or it gets ugly), switch to XPath. By the end you'll have a strong intuition, built hands-on rather than from theory, for which tool to reach for in which situation.

