
F7 · beginner · 5 min read

XPath, Complete Reference

The more powerful, less familiar query language for DOM nodes. When CSS runs out, XPath keeps going.

What you’ll learn

  • Read and write XPath queries that target by tag, attribute, position, and text.
  • Use XPath axes to walk up, sideways, or across the tree (something CSS literally cannot do).
  • Combine predicates to express conditions CSS can't express in one selector.
  • Know when XPath beats CSS and when it doesn't.

XPath is a query language for trees. It's older than CSS selectors and more powerful: it can walk up the tree, match by text content, and express conditions CSS can't. Most scrapers use both: CSS for the common case, XPath for the awkward 10%.

The basic syntax

XPath queries look like Unix paths:

/html/body/div[1]/p[2]

Read left to right: "from the root, into html, into body, into the first div, get the second p." A / is a child step; a // is a descendant step (any depth).

Two forms you'll write 95% of the time:

//article[@class="product"]  every <article class="product"> in the document
//a[contains(@href, "/products/")]  every <a> whose href contains /products/

Note: XPath is 1-indexed. [1] is the first match. CSS uses :nth-child(1); both are 1-based.
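To see these forms run, here is a minimal sketch using lxml. The page markup, URLs, and variable names are made up for illustration; only the three XPath expressions come from the text above.

```python
from lxml import html

# Hypothetical page, invented for this example.
PAGE = """
<html><body>
  <div>
    <p>intro</p>
    <p>second paragraph</p>
    <article class="product"><a href="/products/mug">Mug</a></article>
    <article class="product"><a href="/products/bowl">Bowl</a></article>
    <article class="review"><a href="/about">About</a></article>
  </div>
</body></html>
"""

tree = html.fromstring(PAGE)

# Absolute path: root -> html -> body -> first div -> second p.
second_p = tree.xpath("/html/body/div[1]/p[2]")

# Descendant step: every <article class="product"> at any depth.
products = tree.xpath('//article[@class="product"]')

# Attribute substring test, then select the href attribute itself.
product_links = tree.xpath('//a[contains(@href, "/products/")]/@href')

print(second_p[0].text)  # second paragraph
print(len(products))     # 2
print(product_links)     # ['/products/mug', '/products/bowl']
```

Note that xpath() always returns a list, even when the query can only match one node.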

How XPath differs from CSS

| Capability | CSS | XPath |
|---|---|---|
| Find by tag, class, id, attribute | ✅ | ✅ |
| Find by position in parent | ✅ (:nth-child) | ✅ ([N]) |
| Walk up to a parent / ancestor | ❌ | ✅ (axes) |
| Match by text content | Awkward (library-specific) | ✅ (contains(text(),"...")) |
| Walk to following / preceding siblings | ✅ (+, ~; following only) | ✅ (axes, both directions) |
| Boolean combinations (and, or) | Limited | ✅ |
| Compute on values | ❌ | ✅ (XPath 2.0+) |

The killer feature is going up. //span[@class="price"]/ancestor::article says "from the price span, walk up to the enclosing article." CSS has no analogue: you have to find the article first and then descend.
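A small sketch of that upward walk with lxml, assuming a made-up product listing (the data-sku attribute and the prices are hypothetical). It starts from a price and climbs to the card that contains it.

```python
from lxml import html

# Hypothetical listing; data-sku is invented so the result is easy to check.
PAGE = """
<html><body>
  <article class="product" data-sku="A1">
    <h2>Yellow mug</h2>
    <div class="meta"><span class="price">$14.99</span></div>
  </article>
  <article class="product" data-sku="B2">
    <h2>Blue bowl</h2>
    <div class="meta"><span class="price">$9.50</span></div>
  </article>
</body></html>
"""

tree = html.fromstring(PAGE)

# Find the span whose text is $9.50, then walk up to its enclosing <article>.
cards = tree.xpath('//span[@class="price"][.="$9.50"]/ancestor::article')

print(len(cards))                # 1
print(cards[0].get("data-sku"))  # B2
```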

The axes (what makes XPath powerful)

An axis tells XPath which direction to walk from the current node:

| Axis | Direction | Example |
|---|---|---|
| child:: | Default, direct children | /html/child::body (same as /html/body) |
| descendant:: | All descendants, any depth | //article//span (same as //article/descendant::span) |
| parent:: | Walk up one level | //span[@class="price"]/parent::* |
| ancestor:: | All the way up | //span[@class="price"]/ancestor::article |
| following-sibling:: | Next siblings | //h2/following-sibling::p[1] |
| preceding-sibling:: | Previous siblings | //p/preceding-sibling::h2[1] |
| following:: | Everything later in document order | (rare) |
| preceding:: | Everything earlier | (rare) |
| self:: | The current node | //p[self::p] (redundant, for explicit clarity) |

Most queries don't need explicit axes; / and // cover children and descendants. You reach for axes when the data you want is next to something stable, not inside it.
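The "next to something stable" case might look like the sketch below: headings are the stable landmarks, and the paragraphs beside them are the data. The markup is hypothetical; the axes are the ones from the table.

```python
from lxml import html

# Hypothetical flat layout: headings and paragraphs are siblings.
PAGE = """
<html><body>
  <h2>Shipping</h2>
  <p>Ships in 2 days.</p>
  <p>Free over $50.</p>
  <h2>Returns</h2>
  <p>30-day returns.</p>
</body></html>
"""

tree = html.fromstring(PAGE)

# First <p> that follows the "Returns" heading.
first_after = tree.xpath('//h2[.="Returns"]/following-sibling::p[1]')

# On a reverse axis, [1] means the nearest node, so this is the
# closest <h2> above the paragraph.
nearest_heading = tree.xpath('//p[.="Free over $50."]/preceding-sibling::h2[1]')

# parent::* walks up exactly one level, whatever the tag is.
parent = tree.xpath('//p[1]/parent::*')

print(first_after[0].text)      # 30-day returns.
print(nearest_heading[0].text)  # Shipping
print(parent[0].tag)            # body
```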

Predicates: filtering matches

The square brackets after a step filter the matches:

//article[@class="product"]  by attribute
//article[@class="product" and @data-stock="in-stock"]  by multiple attributes
//article[@class="product"][2]  the second product within each parent (not the second on the page)
//article[@class="product"][position()=2]  same thing, explicit
//article[@class="product"][last()]  the last product within each parent
//a[contains(@href, "/products/")]  href contains substring
//a[starts-with(@href, "https://")]  href starts with prefix
//p[contains(text(), "Free shipping")]  text content contains
//*[@id="main"]  any element with id="main"

Predicates chain. //tr[td[1]="2024"][td[2]="Active"] says "any <tr> whose first <td> is '2024' AND whose second <td> is 'Active'." That's expressive in a way CSS can't match.
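Here is the chained <tr> predicate run against a small table with lxml. The rows and the third "name" column are invented for the example.

```python
from lxml import html

# Hypothetical table: year, status, name.
PAGE = """
<html><body><table>
  <tr><td>2023</td><td>Active</td><td>Alpha</td></tr>
  <tr><td>2024</td><td>Inactive</td><td>Beta</td></tr>
  <tr><td>2024</td><td>Active</td><td>Gamma</td></tr>
</table></body></html>
"""

tree = html.fromstring(PAGE)

# Two chained predicates: both must hold for the same <tr>.
chained = tree.xpath('//tr[td[1]="2024"][td[2]="Active"]/td[3]/text()')

# The same filter written as one predicate with "and".
combined = tree.xpath('//tr[td[1]="2024" and td[2]="Active"]/td[3]/text()')

print(chained)   # ['Gamma']
print(combined)  # ['Gamma']
```

Chaining and "and" give the same result here because neither predicate is positional; once a predicate like [2] is involved, the order of the brackets starts to matter.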

Text matching, the real superpower

CSS can't match on text content portably. XPath can:

//button[normalize-space(text())="Add to cart"]  exact match (with whitespace normalized)
//button[contains(., "Add to cart")]  text anywhere inside the button
//button[contains(text(), "Add to")]  text in a direct text-node child

normalize-space() collapses runs of whitespace to a single space and trims both ends. That is essential because real HTML is full of stray newlines and tabs.

text() vs .: text() selects only direct child text nodes; . is the string value of the whole subtree (text from all descendants concatenated). Use . when the text is wrapped in a <span> or <strong>.
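The difference is easiest to see on two buttons that look identical in a browser but are built differently. This is a sketch with hypothetical markup and ids; the three queries are the ones listed above.

```python
from lxml import html

# Button "a" has plain text with stray whitespace.
# Button "b" wraps part of its label in <strong>.
PAGE = """
<html><body>
  <button id="a">
    Add to cart
  </button>
  <button id="b"><strong>Add</strong> to cart</button>
  <button id="c">Remove</button>
</body></html>
"""

tree = html.fromstring(PAGE)

# text() sees only direct text-node children, so "b" (whose direct
# text is just " to cart") is missed.
exact = tree.xpath('//button[normalize-space(text())="Add to cart"]/@id')
direct = tree.xpath('//button[contains(text(), "Add to")]/@id')

# "." is the string value of the whole subtree, so both buttons match.
anywhere = tree.xpath('//button[contains(., "Add to cart")]/@id')

print(exact)     # ['a']
print(direct)    # ['a']
print(anywhere)  # ['a', 'b']
```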

Common gotchas

  1. // inside a predicate restarts from the document root. //div[.//span="$14.99"] matches a <div> that contains a <span> with that text, because the leading . makes .//span relative to the current <div>. The trap is dropping the dot: //span (with no .) inside a predicate searches from the document root, so the predicate is true for every <div> as soon as any <span> on the page matches.

  2. text() is whitespace-sensitive. <p> Hello </p> has the text " Hello " (spaces included), so an exact comparison with "Hello" fails. Wrap it with normalize-space().

  3. XPath 1.0 vs 2.0. Most libraries (lxml, browsers, Playwright) implement XPath 1.0. Some features in references and tutorials are 2.0-only and will not work in your scraper; lxml, for example, rejects an unknown function with an evaluation error.

  4. Namespaces. XML namespaces complicate XPath. For HTML you usually don't need to care: lxml's etree.fromstring(html, parser=etree.HTMLParser()) produces a tree without namespaces.
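The first three gotchas can be reproduced in a few lines. This is a sketch against hypothetical markup; lower-case() stands in for "some XPath 2.0 function", and the example only shows how lxml reacts to it.

```python
from lxml import etree, html

# Hypothetical page for the gotchas above.
PAGE = """
<html><body>
  <div id="one"><span>$14.99</span></div>
  <div id="two"><span>$20.00</span></div>
  <p> Hello </p>
</body></html>
"""

tree = html.fromstring(PAGE)

# Gotcha 1: with the dot, the predicate looks inside the current <div>.
relative = tree.xpath('//div[.//span="$14.99"]/@id')
# Without the dot, //span searches the whole document, so every <div> passes.
absolute = tree.xpath('//div[//span="$14.99"]/@id')

# Gotcha 2: the raw text node is " Hello ", not "Hello".
raw = tree.xpath('//p[text()="Hello"]')
trimmed = tree.xpath('//p[normalize-space(text())="Hello"]')

# Gotcha 3: lower-case() is an XPath 2.0 function; lxml is XPath 1.0.
try:
    tree.xpath('//p[lower-case(.)="hello"]')
    xpath2_supported = True
except etree.XPathError:
    xpath2_supported = False

print(relative)               # ['one']
print(absolute)               # ['one', 'two']
print(len(raw), len(trimmed)) # 0 1
print(xpath2_supported)       # False
```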

Side-by-side with CSS

Same intent, both selectors:

| Goal | CSS | XPath |
|---|---|---|
| Every <article> of class product | article.product | //article[@class="product"] |
| Every product with data-stock="in-stock" | article.product[data-stock="in-stock"] | //article[@class="product" and @data-stock="in-stock"] |
| Third product on page | article.product:nth-of-type(3) (counts <article> siblings within one parent) | (//article[@class="product"])[3] |
| Price element near a specific product name | (multi-step) | //h2[contains(text(),"Yellow mug")]/following-sibling::p[@class="price"] |
| Walk from price up to its product card | impossible | //span[@class="price"]/ancestor::article[@class="product"] |

The last two are the XPath wins. The first three are equally easy in CSS.
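Two of the XPath rows deserve a closer look in code: the parenthesised (…)[3] and the sibling lookup from a product name. The sketch below uses a hypothetical page where products are split across two sections, which is exactly where the parentheses matter.

```python
from lxml import html

# Hypothetical page: three products spread over two parent elements.
PAGE = """
<html><body>
  <section>
    <article class="product"><h2>Yellow mug</h2><p class="price">$14.99</p></article>
    <article class="product"><h2>Blue bowl</h2><p class="price">$9.50</p></article>
  </section>
  <section>
    <article class="product"><h2>Red plate</h2><p class="price">$7.00</p></article>
  </section>
</body></html>
"""

tree = html.fromstring(PAGE)

# Parentheses first collect every product on the page, then pick the third.
third_on_page = tree.xpath('(//article[@class="product"])[3]/h2/text()')

# Without parentheses, [3] is applied per parent; no parent has three products.
third_in_parent = tree.xpath('//article[@class="product"][3]/h2/text()')

# Price next to a known product name, via the following-sibling axis.
mug_price = tree.xpath(
    '//h2[contains(text(),"Yellow mug")]/following-sibling::p[@class="price"]/text()'
)

print(third_on_page)    # ['Red plate']
print(third_in_parent)  # []
print(mug_price)        # ['$14.99']
```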

Hands-on lab

Open practice.scrapingcentral.com/challenges/static/tables/nested, a deliberately gnarly table with merged cells and a nested sub-table. Write an XPath that extracts the data from the inner table without including the outer's headers, and then write the equivalent CSS to compare difficulty. The nested-table case is where XPath's ancestor:: and predicate-chaining noticeably win over CSS.
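If you want a starting point before opening the lab, here is a sketch on invented markup. The real page's structure is not shown in this lesson, so the table layout, column names, and values below are assumptions; only the idea (use ancestor:: in a predicate to isolate the inner table) carries over.

```python
from lxml import html

# Hypothetical nested table; the lab page will differ.
PAGE = """
<html><body>
  <table id="outer">
    <tr><th>Region</th><th>Details</th></tr>
    <tr><td>North</td><td>
      <table>
        <tr><th>SKU</th><th>Qty</th></tr>
        <tr><td>A1</td><td>3</td></tr>
        <tr><td>B2</td><td>5</td></tr>
      </table>
    </td></tr>
  </table>
</body></html>
"""

tree = html.fromstring(PAGE)

# A table that has a table ancestor is an inner table.
# tr[td] keeps data rows and skips header rows made only of <th>.
inner_rows = tree.xpath('//table[ancestor::table]//tr[td]')

# Relative query per row: direct <td> children only.
data = [row.xpath('./td/text()') for row in inner_rows]

print(data)  # [['A1', '3'], ['B2', '5']]
```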
