XPath, Complete Reference
The more powerful, less familiar query language for DOM nodes. When CSS runs out, XPath keeps going.
What you’ll learn
- Read and write XPath queries that target by tag, attribute, position, and text.
- Use XPath axes to walk up, sideways, or across the tree (something CSS literally cannot do).
- Combine predicates to express conditions CSS can't express in one selector.
- Know when XPath beats CSS and when it doesn't.
XPath is a query language for trees. It's less familiar than CSS selectors but more powerful: it can walk up the tree, match by text content, and express conditions CSS can't. Most scrapers use both: CSS for the common case, XPath for the awkward 10%.
The basic syntax
XPath queries look like Unix paths:
`/html/body/div[1]/p[2]`
Read left to right: "from the root, into `html`, into `body`, into the first `div`, get the second `p`." A `/` is a child step; a `//` is a descendant step (any depth).
Two forms you'll write 95% of the time:
- `//article[@class="product"]` every `<article class="product">` in the document
- `//a[contains(@href, "/products/")]` every `<a>` whose `href` contains `/products/`
Note: XPath is 1-indexed: `[1]` is the first match. CSS's `:nth-child(1)` is also 1-based.
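Both forms can be tried directly with Python's lxml library (which this lesson mentions later); the HTML snippet here is invented for illustration:

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <article class="product"><a href="/products/mug-1">Mug</a></article>
  <article class="product"><a href="/products/mug-2">Travel mug</a></article>
  <article class="promo"><a href="/about">About us</a></article>
</body></html>
""")

# Every <article class="product"> in the document.
products = doc.xpath('//article[@class="product"]')

# Every <a> whose href contains "/products/".
links = doc.xpath('//a[contains(@href, "/products/")]')
hrefs = [a.get("href") for a in links]

# 1-indexed: [1] is the FIRST match, not the second.
first = doc.xpath('(//article[@class="product"])[1]')[0]

print(len(products))               # 2
print(hrefs)                       # ['/products/mug-1', '/products/mug-2']
print(first.xpath('a/text()')[0])  # Mug
```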
How XPath differs from CSS
| Capability | CSS | XPath |
|---|---|---|
| Find by tag, class, id, attribute | ✅ | ✅ |
| Find by position in parent | ✅ (`:nth-child`) | ✅ (`[N]`) |
| Walk up to a parent / ancestor | ❌ | ✅ (axes) |
| Match by text content | Awkward (library-specific) | ✅ (`contains(text(), "...")`) |
| Walk to following / preceding siblings | ✅ (`+`, `~`) | ✅ (axes, both directions) |
| Boolean combinations (`and`, `or`) | Limited | ✅ |
| Compute on values | ❌ | ✅ (`count()`, `sum()`, arithmetic) |
The killer feature is going up. `//span[@class="price"]/ancestor::article` says "from the price span, walk up to the enclosing article." CSS has no analogue: you have to find the article first and then descend.
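A quick sketch of the walk-up in lxml, with an invented product card:

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <article class="product" id="card-7">
    <h2>Yellow mug</h2>
    <span class="price">$14.99</span>
  </article>
</body></html>
""")

# Start at the price span, then walk UP to the enclosing article.
card = doc.xpath('//span[@class="price"]/ancestor::article')[0]
print(card.get("id"))  # card-7
```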
The axes (what makes XPath powerful)
An axis tells XPath which direction to walk from the current node:
| Axis | Direction | Example |
|---|---|---|
| `child::` | Default, direct children | `/html/child::body` (same as `/html/body`) |
| `descendant::` | All descendants, any depth | `//article/descendant::span` (same as `//article//span`) |
| `parent::` | Walk up one level | `//span[@class="price"]/parent::*` |
| `ancestor::` | All the way up | `//span[@class="price"]/ancestor::article` |
| `following-sibling::` | Next siblings | `//h2/following-sibling::p[1]` |
| `preceding-sibling::` | Previous siblings | `//p/preceding-sibling::h2[1]` |
| `following::` | Everything later in document order | rarely needed |
| `preceding::` | Everything earlier | rarely needed |
| `self::` | The current node | `//p[self::p]` (redundant, for explicit clarity) |
Most queries don't need explicit axes: `/` and `//` cover children and descendants. You reach for axes when the data you want is next to something stable, not inside it.
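For example, data often sits next to a stable heading rather than inside it. A minimal lxml sketch (the HTML is invented):

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <h2>Specs</h2>
  <p>Weight: 300g</p>
  <p>Color: yellow</p>
  <h2>Shipping</h2>
  <p>Ships in 2 days</p>
</body></html>
""")

# The <p> elements carry no distinguishing class, but the headings are
# stable anchors: take the first <p> sibling AFTER the "Specs" heading.
first_spec = doc.xpath('//h2[text()="Specs"]/following-sibling::p[1]')[0]
print(first_spec.text)  # Weight: 300g

# And the first <p> after "Shipping".
ships = doc.xpath('//h2[text()="Shipping"]/following-sibling::p[1]')[0]
print(ships.text)  # Ships in 2 days
```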
Predicates: filtering matches
The square brackets after a step filter the matches:
- `//article[@class="product"]` by attribute
- `//article[@class="product" and @data-stock="in-stock"]` by multiple attributes
- `//article[@class="product"][2]` the second matching child within each parent (use `(//article[@class="product"])[2]` for the second in the whole document)
- `//article[@class="product"][position()=2]` same thing, explicit
- `//article[@class="product"][last()]` the last one
- `//a[contains(@href, "/products/")]` `href` contains substring
- `//a[starts-with(@href, "https://")]` `href` starts with prefix
- `//p[contains(text(), "Free shipping")]` text content contains
- `//*[@id="main"]` any element with `id="main"`
Predicates chain. `//tr[td[1]="2024"][td[2]="Active"]` says "any `<tr>` whose first `<td>` is '2024' AND whose second `<td>` is 'Active'." That's expressive in a way CSS can't match.
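That row-filtering predicate, run through lxml against an invented table:

```python
from lxml import html

doc = html.fromstring("""
<html><body><table>
  <tr><td>2023</td><td>Active</td><td>Alpha</td></tr>
  <tr><td>2024</td><td>Active</td><td>Beta</td></tr>
  <tr><td>2024</td><td>Closed</td><td>Gamma</td></tr>
</table></body></html>
""")

# Chained predicates: first cell is "2024" AND second cell is "Active".
rows = doc.xpath('//tr[td[1]="2024"][td[2]="Active"]')
names = [r.xpath('td[3]/text()')[0] for r in rows]
print(names)  # ['Beta']
```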
Text matching, the real superpower
CSS can't match on text content portably. XPath can:
- `//button[normalize-space(text())="Add to cart"]` exact match (with whitespace normalized)
- `//button[contains(., "Add to cart")]` text anywhere inside the button
- `//button[contains(text(), "Add to")]` text in a direct text-node child
`normalize-space()` collapses runs of whitespace to a single space and trims the ends. It's essential, because real HTML is full of stray newlines and tabs.
`text()` vs `.`: `text()` selects only direct child text nodes; `.` is the string value of the whole subtree (text from all descendants concatenated). Use `.` when the text is wrapped in a `<span>` or `<strong>`.
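The `text()` vs `.` distinction in action (lxml, with invented HTML that splits the label across a `<strong>`):

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <button>
    Add to <strong>cart</strong>
  </button>
</body></html>
""")

# text() sees only the DIRECT text children of <button>, i.e. the text
# before and after <strong> -- so "cart" is invisible to it:
by_text = doc.xpath('//button[contains(text(), "cart")]')

# "." is the string value of the whole subtree ("Add to cart"):
by_dot = doc.xpath('//button[contains(., "Add to cart")]')

# normalize-space(.) trims and collapses whitespace for exact matches:
exact = doc.xpath('//button[normalize-space(.)="Add to cart"]')

print(len(by_text), len(by_dot), len(exact))  # 0 1 1
```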
Common gotchas
- Relative vs. absolute paths inside predicates. `.//span` is evaluated relative to the candidate node, so `//div[.//span="$14.99"]` matches a `<div>` that contains a `<span>` with that text, as expected. The trap is writing `//span` (no leading `.`) inside the predicate: that jumps back to the document root, so the predicate is true for every `<div>` as long as such a span exists anywhere.
- `text()` is whitespace-sensitive. `<p> Hello </p>` has the text ` Hello `, spaces included. Wrap with `normalize-space()`.
- XPath 1.0 vs 2.0. Most libraries (lxml, browsers, Playwright) implement XPath 1.0. Some features in references and tutorials are 2.0-only and silently fail in your scraper.
- Namespaces. XML namespaces complicate XPath. For HTML you usually don't care: lxml's `etree.fromstring(html, parser=etree.HTMLParser())` produces a tree without namespaces.
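The first gotcha, relative vs. absolute paths inside a predicate, is easy to demonstrate in lxml (invented HTML):

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div id="a"><span>$14.99</span></div>
  <div id="b"><span>$9.99</span></div>
</body></html>
""")

# Relative .//span searches inside EACH candidate <div>:
relative = [d.get("id") for d in doc.xpath('//div[.//span="$14.99"]')]
print(relative)  # ['a']

# Absolute //span restarts from the document root, so the predicate is
# true for every <div> as long as a matching span exists ANYWHERE:
absolute = [d.get("id") for d in doc.xpath('//div[//span="$14.99"]')]
print(absolute)  # ['a', 'b']
```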
Side-by-side with CSS
Same intent, both selectors:
| Goal | CSS | XPath |
|---|---|---|
| Every `<article>` of class `product` | `article.product` | `//article[@class="product"]` |
| Every product with `data-stock="in-stock"` | `article.product[data-stock="in-stock"]` | `//article[@class="product" and @data-stock="in-stock"]` |
| Third product on page | `article.product:nth-of-type(3)` | `(//article[@class="product"])[3]` |
| Price element near a specific product name | (multi-step) | `//h2[contains(text(), "Yellow mug")]/following-sibling::p[@class="price"]` |
| Walk from price up to its product card | impossible | `//span[@class="price"]/ancestor::article[@class="product"]` |
The last two are the XPath wins. The first three are equally easy in CSS.
Hands-on lab
Open practice.scrapingcentral.com/challenges/static/tables/nested, a deliberately gnarly table with merged cells and a nested sub-table. Write an XPath that extracts the data from the inner table without including the outer table's headers, then write the equivalent CSS to compare difficulty. The nested-table case is where XPath's `ancestor::` and predicate chaining noticeably win over CSS.
Practice this lesson on Catalog108, our first-party scraping sandbox, by opening the lab target at /challenges/static/tables/nested.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.