XPath Expressions for Web Scraping - Data Parsing

Master XPath expressions for precise element selection in web scraping. Learn axes, predicates, functions, and advanced patterns.

XPath is a query language for selecting nodes in XML/HTML documents. It can express things CSS selectors cannot: matching on text content, arithmetic and comparisons inside predicates, and traversal up the tree (to parents and ancestors) as well as down.

Setup

pip install lxml requests

from lxml import html
import requests

response = requests.get("https://quotes.toscrape.com/", timeout=15)
tree = html.fromstring(response.text)

Essential XPath Syntax

# All div elements anywhere in the document
tree.xpath("//div")

# div elements that are direct children of body (absolute path from the root)
tree.xpath("/html/body/div")

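# Element whose class attribute is exactly "quote"
tree.xpath('//div[@class="quote"]')

# Element whose class attribute contains the substring "quote"
# (this also matches values such as "blockquote" or "quotes")
tree.xpath('//div[contains(@class, "quote")]')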

# Get text content
tree.xpath('//small[@class="author"]/text()')

# Get attribute value
tree.xpath('//a[@class="tag"]/@href')
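
The difference between an exact class match and a substring match is easy to miss, so here is a minimal sketch of it. The markup below is hypothetical (it is not taken from quotes.toscrape.com), and the third expression is a common XPath 1.0 idiom for matching one whole class token rather than something the examples above use.

```python
from lxml import html

# Hypothetical markup for illustration only
doc = html.fromstring("""
<div>
  <div class="quote">exact</div>
  <div class="quote featured">multi</div>
  <div class="blockquote">lookalike</div>
</div>
""")

# Exact match: the whole class attribute must equal "quote"
exact = doc.xpath('//div[@class="quote"]/text()')

# Substring match: any class attribute containing the characters "quote"
partial = doc.xpath('//div[contains(@class, "quote")]/text()')

# Token match: pad the class list with spaces, then look for " quote "
token = doc.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " quote ")]/text()'
)

print(exact)    # ['exact']
print(partial)  # ['exact', 'multi', 'lookalike']
print(token)    # ['exact', 'multi']
```

The token form is the closest equivalent to the CSS selector `div.quote`, which is usually what you want when elements carry more than one class.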

Predicates (Filtering)

# First element (1-indexed)
tree.xpath('(//div[@class="quote"])[1]')

# Last element
tree.xpath('(//div[@class="quote"])[last()]')

# Position range
tree.xpath('(//div[@class="quote"])[position() <= 3]')

# Quote that has a descendant <small> whose text is exactly "Albert Einstein"
tree.xpath('//div[@class="quote"][.//small[text()="Albert Einstein"]]')

# Element with attribute value containing text
tree.xpath('//a[contains(@href, "page")]/@href')
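
The parentheses in the positional examples above matter. A minimal sketch on hypothetical markup: without parentheses, a positional predicate counts within each parent; with them, it counts over the whole result set.

```python
from lxml import html

# Hypothetical markup: two lists with two items each
doc = html.fromstring("""
<div>
  <ul><li>a1</li><li>a2</li></ul>
  <ul><li>b1</li><li>b2</li></ul>
</div>
""")

# [1] applies per parent: the first li inside each ul
per_parent = doc.xpath('//li[1]/text()')

# Parentheses collect all li nodes first, then [1] picks one overall
overall_first = doc.xpath('(//li)[1]/text()')
overall_last = doc.xpath('(//li)[last()]/text()')

print(per_parent)     # ['a1', 'b1']
print(overall_first)  # ['a1']
print(overall_last)   # ['b2']
```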

XPath Axes (Direction)

Axes let you navigate in any direction from the current node:

from lxml import html

doc = html.fromstring("""
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>ScraperAPI</td><td>$49</td></tr>
  <tr><td>ScrapingAnt</td><td>$29</td></tr>
</table>
""")

# child:: is the default axis, so this is the same as //table/tr
doc.xpath('//table/child::tr')

# parent:: go up one level
doc.xpath('//td[text()="ScraperAPI"]/parent::tr/td[2]/text()')
# Returns: ['$49']

# ancestor:: go up to any level
doc.xpath('//td/ancestor::table')

# following-sibling:: next siblings
doc.xpath('//th[text()="Name"]/following-sibling::th/text()')
# Returns: ['Price']

# preceding-sibling:: previous siblings
doc.xpath('//td[text()="$49"]/preceding-sibling::td/text()')
# Returns: ['ScraperAPI']
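
Sibling axes are handy for label/value lookups in tables. The sketch below wraps one in a small helper, reusing the table from above. The helper name `price_for` is hypothetical, and the `$name` placeholder relies on lxml binding XPath variables from keyword arguments, which avoids building the expression with string formatting.

```python
from lxml import html

doc = html.fromstring("""
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>ScraperAPI</td><td>$49</td></tr>
  <tr><td>ScrapingAnt</td><td>$29</td></tr>
</table>
""")

def price_for(name):
    """Return the cell immediately to the right of the cell whose text equals name."""
    # $name is an XPath variable; lxml fills it from the keyword argument
    cells = doc.xpath(
        '//td[text()=$name]/following-sibling::td[1]/text()', name=name
    )
    return cells[0] if cells else None

print(price_for("ScraperAPI"))   # $49
print(price_for("ScrapingAnt"))  # $29
print(price_for("Missing"))      # None
```

The `[1]` after `following-sibling::td` selects the nearest following sibling, so the helper would keep working if more columns were added to the right.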

Useful XPath Functions

from lxml import html
import requests

response = requests.get("https://quotes.toscrape.com/", timeout=15)
tree = html.fromstring(response.text)

# contains() - partial text match
tree.xpath('//span[contains(text(), "world")]/text()')

# starts-with()
tree.xpath('//a[starts-with(@href, "/author")]/text()')

# normalize-space() - trim and collapse whitespace
# Pass the element, not text(): the heading text sits inside a nested <a>
tree.xpath('normalize-space(//h1)')

# count() - count matching nodes
count = tree.xpath('count(//div[@class="quote"])')
print(f"Found {int(count)} quotes")

# string-length()
tree.xpath('//span[@class="text"][string-length(text()) > 100]/text()')

# not() - negation
tree.xpath('//div[not(@class)]')
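
Functions that wrap a whole expression return a plain Python value rather than a list of nodes. The sketch below shows this on hypothetical markup shaped loosely like a page header with two quotes. It also shows that a string function applied to a node-set uses only the first node, which is why `text()` can give an empty result when the visible text is inside a child element.

```python
from lxml import html

# Hypothetical markup for illustration only
doc = html.fromstring("""
<div>
  <h1>
    <a href="/">Quotes to Scrape</a>
  </h1>
  <div class="quote"><span class="text">Hello world</span></div>
  <div class="quote"><span class="text">Goodbye</span></div>
</div>
""")

# h1 has no direct text of its own apart from whitespace, so this is empty
title_from_text_node = doc.xpath('normalize-space(//h1/text())')

# The string value of the h1 element includes text from its descendants
title_from_element = doc.xpath('normalize-space(//h1)')

# count() returns a float, boolean() returns a bool
n_quotes = doc.xpath('count(//div[@class="quote"])')
has_world = doc.xpath('boolean(//span[contains(text(), "world")])')

print(repr(title_from_text_node))  # ''
print(title_from_element)          # Quotes to Scrape
print(n_quotes, has_world)         # 2.0 True
```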

Practical Scraping Example

from lxml import html
import requests

response = requests.get("https://quotes.toscrape.com/", timeout=15)
tree = html.fromstring(response.text)

# Extract structured data using XPath
quotes = tree.xpath('//div[@class="quote"]')

for q in quotes[:5]:
    text = q.xpath('.//span[@class="text"]/text()')[0]
    author = q.xpath('.//small[@class="author"]/text()')[0]
    tags = q.xpath('.//a[@class="tag"]/text()')
    print(f"{author}: {text[:60]}...")
    print(f"  Tags: {', '.join(tags)}\n")
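
Indexing with `[0]` as in the loop above raises `IndexError` when an element is missing. A minimal sketch of a more defensive variant follows. The helper `first` and the inline markup are hypothetical, chosen so that the second quote has no author or tags.

```python
from lxml import html

# Hypothetical markup: the second quote is missing its author and tags
doc = html.fromstring("""
<div>
  <div class="quote">
    <span class="text">Complete quote.</span>
    <small class="author">Jane Doe</small>
    <a class="tag" href="/tag/a/">a</a>
    <a class="tag" href="/tag/b/">b</a>
  </div>
  <div class="quote">
    <span class="text">No author here.</span>
  </div>
</div>
""")

def first(node, path, default=None):
    """Return the first XPath result with whitespace stripped, or default if none."""
    results = node.xpath(path)
    return results[0].strip() if results else default

# The leading dot keeps each query relative to the current quote element
records = [
    {
        "text": first(q, './/span[@class="text"]/text()'),
        "author": first(q, './/small[@class="author"]/text()', default="Unknown"),
        "tags": q.xpath('.//a[@class="tag"]/text()'),
    }
    for q in doc.xpath('//div[@class="quote"]')
]

for record in records:
    print(record)
```

Collecting dictionaries rather than printing inside the loop also makes it straightforward to write the results to JSON or CSV later.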

XPath Quick Reference

Expression	Selects
`//div`	All divs anywhere
`//div[@id="main"]`	Div with specific id
`//div[contains(@class, "item")]`	Div with class containing "item"
`//a/text()`	Text inside all links
`//a/@href`	All href attributes
`(//div)[position()<=5]`	First 5 divs in the document
`//td[2]`	Second td in each row
`//*[contains(text(), "price")]`	Any element whose first text node contains "price"

Next Steps

Parse JSON responses in Python
Use XPath with Scrapy for large-scale scraping
Combine XPath with lxml for maximum performance