XPath, Complete Reference
The more powerful, less familiar query language for DOM nodes. When CSS runs out, XPath keeps going.
What you’ll learn
- Read and write XPath queries that target by tag, attribute, position, and text.
- Use XPath axes to walk up, sideways, or across the tree (something CSS literally cannot do).
- Combine predicates to express conditions CSS can't express in one selector.
- Know when XPath beats CSS and when it doesn't.
XPath is a query language for trees. It's less familiar than CSS selectors but more powerful: it can walk up the tree, match by text content, and express conditions CSS can't. Most scrapers use both: CSS for the common case, XPath for the awkward 10%.
The basic syntax
XPath queries look like Unix paths:
`/html/body/div[1]/p[2]`
Read left to right: "from the root, into `html`, into `body`, into the first `div`, get the second `p`." A `/` is a child step; a `//` is a descendant step (any depth).
Two forms you'll write 95% of the time:
- `//article[@class="product"]` every `<article class="product">` in the document
- `//a[contains(@href, "/products/")]` every `<a>` whose `href` contains `/products/`
Note: XPath is 1-indexed: `[1]` is the first match. CSS's `:nth-child(1)` is also 1-based.
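Both forms can be tried directly with Python's lxml library (which this lesson mentions later); the HTML snippet here is invented for illustration:

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <article class="product"><a href="/products/mug-1">Mug</a></article>
  <article class="product"><a href="/products/mug-2">Travel mug</a></article>
  <article class="promo"><a href="/about">About us</a></article>
</body></html>
""")

# Every <article class="product"> in the document.
products = doc.xpath('//article[@class="product"]')

# Every <a> whose href contains "/products/".
links = doc.xpath('//a[contains(@href, "/products/")]')
hrefs = [a.get("href") for a in links]

# 1-indexed: [1] is the FIRST match, not the second.
first = doc.xpath('(//article[@class="product"])[1]')[0]

print(len(products))               # 2
print(hrefs)                       # ['/products/mug-1', '/products/mug-2']
print(first.xpath('a/text()')[0])  # Mug
```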
How XPath differs from CSS
| Capability | CSS | XPath |
|---|---|---|
| Find by tag, class, id, attribute | ✅ | ✅ |
| Find by position in parent | ✅ (`:nth-child`) | ✅ (`[N]`) |
| Walk up to a parent / ancestor | ❌ | ✅ (axes) |
| Match by text content | Awkward (library-specific) | ✅ (`contains(text(), "...")`) |
| Walk to following / preceding siblings | ✅ (`+`, `~`) | ✅ (axes, both directions) |
| Boolean combinations (`and`, `or`) | Limited | ✅ |
| Compute on values | ❌ | ✅ (`count()`, `sum()`, arithmetic) |
The killer feature is going up. `//span[@class="price"]/ancestor::article` says "from the price span, walk up to the enclosing article." CSS has no analogue: you have to find the article first and then descend.
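A quick sketch of the walk-up in lxml, with an invented product card:

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <article class="product" id="card-7">
    <h2>Yellow mug</h2>
    <span class="price">$14.99</span>
  </article>
</body></html>
""")

# Start at the price span, then walk UP to the enclosing article.
card = doc.xpath('//span[@class="price"]/ancestor::article')[0]
print(card.get("id"))  # card-7
```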
The axes (what makes XPath powerful)
An axis tells XPath which direction to walk from the current node:
| Axis | Direction | Example |
|---|---|---|
| `child::` | Default, direct children | `/html/child::body` (same as `/html/body`) |
| `descendant::` | All descendants, any depth | `//article/descendant::span` (same as `//article//span`) |
| `parent::` | Walk up one level | `//span[@class="price"]/parent::*` |
| `ancestor::` | All the way up | `//span[@class="price"]/ancestor::article` |
| `following-sibling::` | Next siblings | `//h2/following-sibling::p[1]` |
| `preceding-sibling::` | Previous siblings | `//p/preceding-sibling::h2[1]` |
| `following::` | Everything later in document order | rarely needed |
| `preceding::` | Everything earlier | rarely needed |
| `self::` | The current node | `//p[self::p]` (redundant, for explicit clarity) |
Most queries don't need explicit axes: `/` and `//` cover children and descendants. You reach for axes when the data you want is next to something stable, not inside it.
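For example, data often sits next to a stable heading rather than inside it. A minimal lxml sketch (the HTML is invented):

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <h2>Specs</h2>
  <p>Weight: 300g</p>
  <p>Color: yellow</p>
  <h2>Shipping</h2>
  <p>Ships in 2 days</p>
</body></html>
""")

# The <p> elements carry no distinguishing class, but the headings are
# stable anchors: take the first <p> sibling AFTER the "Specs" heading.
first_spec = doc.xpath('//h2[text()="Specs"]/following-sibling::p[1]')[0]
print(first_spec.text)  # Weight: 300g

# And the first <p> after "Shipping".
ships = doc.xpath('//h2[text()="Shipping"]/following-sibling::p[1]')[0]
print(ships.text)  # Ships in 2 days
```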
Predicates: filtering matches
The square brackets after a step filter the matches:
- `//article[@class="product"]` by attribute
- `//article[@class="product" and @data-stock="in-stock"]` by multiple attributes
- `//article[@class="product"][2]` the second matching child within each parent (use `(//article[@class="product"])[2]` for the second in the whole document)
- `//article[@class="product"][position()=2]` same thing, explicit
- `//article[@class="product"][last()]` the last one
- `//a[contains(@href, "/products/")]` `href` contains substring
- `//a[starts-with(@href, "https://")]` `href` starts with prefix
- `//p[contains(text(), "Free shipping")]` text content contains
- `//*[@id="main"]` any element with `id="main"`
Predicates chain. `//tr[td[1]="2024"][td[2]="Active"]` says "any `<tr>` whose first `<td>` is '2024' AND whose second `<td>` is 'Active'." That's expressive in a way CSS can't match.
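That row-filtering predicate, run through lxml against an invented table:

```python
from lxml import html

doc = html.fromstring("""
<html><body><table>
  <tr><td>2023</td><td>Active</td><td>Alpha</td></tr>
  <tr><td>2024</td><td>Active</td><td>Beta</td></tr>
  <tr><td>2024</td><td>Closed</td><td>Gamma</td></tr>
</table></body></html>
""")

# Chained predicates: first cell is "2024" AND second cell is "Active".
rows = doc.xpath('//tr[td[1]="2024"][td[2]="Active"]')
names = [r.xpath('td[3]/text()')[0] for r in rows]
print(names)  # ['Beta']
```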
Text matching, the real superpower
CSS can't match on text content portably. XPath can:
- `//button[normalize-space(text())="Add to cart"]` exact match (with whitespace normalized)
- `//button[contains(., "Add to cart")]` text anywhere inside the button
- `//button[contains(text(), "Add to")]` text in a direct text-node child
`normalize-space()` collapses runs of whitespace to a single space and trims the ends. It's essential, because real HTML is full of stray newlines and tabs.
`text()` vs `.`: `text()` selects only direct child text nodes; `.` is the string value of the whole subtree (text from all descendants concatenated). Use `.` when the text is wrapped in a `<span>` or `<strong>`.
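The `text()` vs `.` distinction in action (lxml, with invented HTML that splits the label across a `<strong>`):

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <button>
    Add to <strong>cart</strong>
  </button>
</body></html>
""")

# text() sees only the DIRECT text children of <button>, i.e. the text
# before and after <strong> -- so "cart" is invisible to it:
by_text = doc.xpath('//button[contains(text(), "cart")]')

# "." is the string value of the whole subtree ("Add to cart"):
by_dot = doc.xpath('//button[contains(., "Add to cart")]')

# normalize-space(.) trims and collapses whitespace for exact matches:
exact = doc.xpath('//button[normalize-space(.)="Add to cart"]')

print(len(by_text), len(by_dot), len(exact))  # 0 1 1
```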
Common gotchas
- Relative vs. absolute paths inside predicates. `.//span` is evaluated relative to the candidate node, so `//div[.//span="$14.99"]` matches a `<div>` that contains a `<span>` with that text, as expected. The trap is writing `//span` (no leading `.`) inside the predicate: that jumps back to the document root, so the predicate is true for every `<div>` as long as such a span exists anywhere.
- `text()` is whitespace-sensitive. `<p> Hello </p>` has the text ` Hello `, spaces included. Wrap with `normalize-space()`.
- XPath 1.0 vs 2.0. Most libraries (lxml, browsers, Playwright) implement XPath 1.0. Some features in references and tutorials are 2.0-only and silently fail in your scraper.
- Namespaces. XML namespaces complicate XPath. For HTML you usually don't care: lxml's `etree.fromstring(html, parser=etree.HTMLParser())` produces a tree without namespaces.
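The first gotcha, relative vs. absolute paths inside a predicate, is easy to demonstrate in lxml (invented HTML):

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div id="a"><span>$14.99</span></div>
  <div id="b"><span>$9.99</span></div>
</body></html>
""")

# Relative .//span searches inside EACH candidate <div>:
relative = [d.get("id") for d in doc.xpath('//div[.//span="$14.99"]')]
print(relative)  # ['a']

# Absolute //span restarts from the document root, so the predicate is
# true for every <div> as long as a matching span exists ANYWHERE:
absolute = [d.get("id") for d in doc.xpath('//div[//span="$14.99"]')]
print(absolute)  # ['a', 'b']
```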
Side-by-side with CSS
Same intent, both selectors:
| Goal | CSS | XPath |
|---|---|---|
| Every `<article>` of class `product` | `article.product` | `//article[@class="product"]` |
| Every product with `data-stock="in-stock"` | `article.product[data-stock="in-stock"]` | `//article[@class="product" and @data-stock="in-stock"]` |
| Third product on page | `article.product:nth-of-type(3)` | `(//article[@class="product"])[3]` |
| Price element near a specific product name | (multi-step) | `//h2[contains(text(), "Yellow mug")]/following-sibling::p[@class="price"]` |
| Walk from price up to its product card | impossible | `//span[@class="price"]/ancestor::article[@class="product"]` |
The last two are the XPath wins. The first three are equally easy in CSS.
Hands-on lab
Open practice.scrapingcentral.com/challenges/static/tables/nested, a deliberately gnarly table with merged cells and a nested sub-table. Write an XPath that extracts the data from the inner table without including the outer table's headers, then write the equivalent CSS to compare difficulty. The nested-table case is where XPath's `ancestor::` and predicate chaining noticeably win over CSS.
Practice this lesson on Catalog108, our first-party scraping sandbox, by opening the lab target at /challenges/static/tables/nested.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.