Choosing Between CSS Selectors and XPath
When to use which. A short decision framework with concrete examples, not 'XPath is more powerful so always use XPath.'
What you’ll learn
- Default to CSS for common cases, readability, library support, ease.
- Recognise the three situations where XPath wins decisively.
- Resist the mistake of using XPath for everything just because you read it's more powerful.
- Combine both in one scraper without it becoming a mess.
You learned both. Now: which do you reach for first?
The default: CSS
Reach for CSS by default. Reasons:
- More readable.
  `article.product .price` is obvious; `//article[@class="product"]//span[@class="price"]` says the same thing in more than twice the characters.
- Better library support. Every major scraping library supports CSS; some support XPath only as an extension or with significant warts.
- Same syntax everywhere. The CSS you write in BeautifulSoup is the CSS you write in Playwright, browser DevTools, and stylesheets. XPath syntax is identical across libraries, but the quirks (XPath 1.0 vs 2.0, namespace handling, whitespace) are not.
- Browsers think in CSS. When you copy a selector from DevTools, you get CSS. The mental model matches.
If you can do it cleanly in CSS, do it in CSS.
The three cases where XPath wins
There are exactly three. Memorize them.
1. Walking up the tree
When the stable anchor is inside what you want to extract:
```python
# You found the in-stock badge. You want the whole product card.
card_elements = tree.xpath('//span[@class="in-stock-badge"]/ancestor::article[@class="product"]')
# CSS equivalent: impossible in one expression. You'd have to find every
# article first, then check each one for the badge. Two steps.
```
This is the single most common XPath-only case. Real example: extracting only the product cards that contain a specific badge, or only the table rows where a specific cell has a specific class.
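To make the axis concrete, here is a minimal runnable sketch using lxml and a made-up two-card snippet (the HTML and class names are illustrative, not from a real page):

```python
from lxml import html

# Hypothetical markup: two product cards, only one carries the badge.
doc = html.fromstring("""
<div>
  <article class="product"><h2>Mug</h2>
    <span class="in-stock-badge">In stock</span></article>
  <article class="product"><h2>Plate</h2></article>
</div>
""")

# Anchor on the badge, then walk UP to the whole card in one expression.
cards = doc.xpath('//span[@class="in-stock-badge"]/ancestor::article[@class="product"]')
titles = [card.findtext('h2') for card in cards]  # only the badged card survives
```

Only `Mug` comes back: the un-badged card is never matched, which is exactly the filtering you cannot express in one CSS selector.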
2. Matching by text content
When the only stable hook is what the element says:
```python
# The "Add to cart" button has no stable class/id, but the text is constant
button = tree.xpath('//button[normalize-space(.)="Add to cart"]')

# CSS in BeautifulSoup: soup.select('button:-soup-contains("Add to cart")')
# CSS in lxml: not supported, fall back to XPath
# Playwright: page.locator("button", has_text="Add to cart") also works
```
Playwright's has_text is the new portable way; for pure HTML parsing libraries, XPath is still cleaner.
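As a self-contained sketch of the text-matching case (hypothetical markup with auto-generated class names, the kind that breaks CSS anchors):

```python
from lxml import html

# Hypothetical page: the build tool generates random class names,
# so the button text is the only stable hook.
doc = html.fromstring("""
<div>
  <button class="btn-x93kq1"> Add to cart </button>
  <button class="btn-a77zq0">Checkout</button>
</div>
""")

# normalize-space(.) trims and collapses whitespace before comparing,
# so the stray spaces inside the first button don't break the match.
buttons = doc.xpath('//button[normalize-space(.)="Add to cart"]')
```

Note that a plain `text()="Add to cart"` comparison would fail here because of the surrounding spaces; `normalize-space(.)` is the robust form.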
3. Complex sibling navigation
When the data you want is "the <p> right after the <h2> that says 'Reviews'":
```python
review_paragraph = tree.xpath('//h2[normalize-space(.)="Reviews"]/following-sibling::p[1]')
# CSS approximation: h2 + p, but you'd first have to verify the h2's text. Two steps.
```
The following-sibling:: axis lets you say "the next sibling that matches X" in one expression.
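Here is the same idea as a runnable sketch against an invented page fragment (section layout and text are assumptions for illustration):

```python
from lxml import html

# Hypothetical product page: several h2-headed sections, no useful classes.
doc = html.fromstring("""
<section>
  <h2>Description</h2><p>A sturdy mug.</p>
  <h2>Reviews</h2><p>Love it.</p><p>Chipped on arrival.</p>
</section>
""")

# The [1] picks only the FIRST <p> that follows the matching <h2>;
# without it you'd get every later <p> sibling, including other sections'.
first_review = doc.xpath(
    '//h2[normalize-space(.)="Reviews"]/following-sibling::p[1]'
)[0]
```

The anchor (`Reviews`), the direction (`following-sibling`), and the limit (`[1]`) all live in one expression, which is the whole appeal.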
When XPath does NOT win
Don't reach for XPath when you can write CSS. Anti-patterns:
- `//div[@id="main"]` instead of `#main`: same thing, worse readability.
- `//*[@class="price"]` instead of `.price`.
- `//ul/li[position()=1]` instead of `ul li:first-child`.
- `//article//span` for "find a span inside an article": `article span` is half the characters.
A scraper that uses XPath for everything reads like Latin for no benefit.
The hybrid approach
In practice, scrapers mix both. A typical Scrapy spider:
```python
def parse(self, response):
    # CSS for the easy cases
    for card in response.css('article.product'):
        title = card.css('h2::text').get()
        price = card.css('.price::text').get()
        # XPath for the awkward 10%
        in_stock = card.xpath('.//span[contains(@class, "in-stock")]').get() is not None
        yield {'title': title, 'price': price, 'in_stock': in_stock}
```
Note the `.//` in the XPath: the leading `.` says "search from this card's subtree," not from the document root. Forgetting it is the most common Scrapy XPath bug.
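The same rule applies to lxml element objects, which makes the bug easy to demonstrate without a running spider (hypothetical two-card markup):

```python
from lxml import html

doc = html.fromstring("""
<div>
  <article class="product"><span class="in-stock">yes</span></article>
  <article class="product"></article>
</div>
""")

# Take the SECOND card, the one without a badge.
card = doc.xpath('//article[@class="product"]')[1]

# BUG: a leading // always searches from the document root,
# so this finds the OTHER card's badge.
wrong = card.xpath('//span[@class="in-stock"]')

# FIX: .// restricts the search to this card's subtree.
right = card.xpath('.//span[@class="in-stock"]')
```

`wrong` contains one element while `right` is empty, which in a real spider would silently mark every product as in stock.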
Library-specific notes
| Library | CSS support | XPath support |
|---|---|---|
| BeautifulSoup | `.select()`, fully featured (via soupsieve) | Not built in; drop to lxml directly for XPath steps |
| lxml | `.cssselect()`, solid | Native `.xpath()`, top-class |
| Scrapy | `.css()` selectors | `.xpath()` selectors, both equally first-class |
| Playwright | `page.locator(css)` | `page.locator("xpath=//...")` |
| Selenium | `By.CSS_SELECTOR` | `By.XPATH` |
| Symfony DomCrawler | `.filter()` for CSS | `.filterXPath()` for XPath |
| PHP DOMDocument | Not built in | `DOMXPath`, native |
Notable: BeautifulSoup is CSS-first; if you need XPath in a BS4 pipeline, switch to lxml directly for that step. Mixing is fine; your scraper isn't worse for using two libraries.
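A minimal sketch of that BS4-plus-lxml handoff (the markup is invented; the point is that the same raw HTML string feeds both parsers):

```python
from bs4 import BeautifulSoup
from lxml import html

raw = ('<article class="product"><h2>Mug</h2>'
       '<span class="in-stock-badge">In stock</span></article>')

# CSS-first pass with BeautifulSoup for the easy extraction...
soup = BeautifulSoup(raw, 'html.parser')
title = soup.select_one('article.product h2').get_text()

# ...then hand the same markup to lxml for the XPath-only step
# (walking up from the badge to its ancestor card).
tree = html.fromstring(raw)
card = tree.xpath('//span[@class="in-stock-badge"]/ancestor::article')[0]
```

Parsing the document twice costs a little CPU, but for most scrapers the network dominates and the clarity is worth it.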
A practical mental flowchart
Need to find an element?
│
├─ Stable id, class, or data-* attribute? → CSS
├─ Position-based (nth child, first/last)? → CSS
├─ "All <span> inside .product"? → CSS (descendant)
├─ Need to walk UP to a parent? → XPath
├─ Need to match by text content? → XPath (or library helper)
└─ Need "next sibling matching X"? → XPath
When you find yourself reaching for XPath, ask once: "Can I anchor on something more stable and avoid this?" If yes, refactor. If no, XPath.
Hands-on lab
Open practice.scrapingcentral.com/products/1-yellow-ceramic-mug and try to extract:
- The product title
- The price
- Every review where the rating is exactly 5 stars
- The total review count (which is written as text on the page near the reviews heading)
For each, try CSS first. Where you fail (or it gets ugly), switch to XPath. By the end you'll have a strong intuition for which tool to reach for in which situation: hands-on, not theoretical.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.