XPath Expressions for Web Scraping
Master XPath expressions for precise element selection in web scraping. Learn axes, predicates, functions, and advanced patterns.
XPath is a query language for selecting nodes in XML and HTML documents. It can express queries that CSS selectors cannot: matching on text content, numeric comparisons and arithmetic, and traversal in both directions (up to parents and ancestors as well as down to descendants).
Setup
pip install lxml requests
from lxml import html
import requests
response = requests.get("https://quotes.toscrape.com/", timeout=15)
tree = html.fromstring(response.text)
Essential XPath Syntax
# All div elements anywhere in the document
tree.xpath("//div")
# div elements that are direct children of body (absolute path from the root)
tree.xpath("/html/body/div")
# Element whose class attribute is exactly "quote"
tree.xpath('//div[@class="quote"]')
# Element whose class attribute contains the substring "quote"
tree.xpath('//div[contains(@class, "quote")]')
# Get text content
tree.xpath('//small[@class="author"]/text()')
# Get attribute value
tree.xpath('//a[@class="tag"]/@href')
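The two class selectors above behave differently once an element carries more than one class. The sketch below uses a small made-up snippet (the markup and the class names "featured" and "quotes-list" are assumptions for illustration, not taken from the site) to compare an exact match, a substring match, and a common idiom that matches a whole class token.

```python
from lxml import html

# Hypothetical markup: three inner divs whose class attributes differ slightly.
snippet = html.fromstring("""
<div>
  <div class="quote">A</div>
  <div class="quote featured">B</div>
  <div class="quotes-list">C</div>
</div>
""")

# Exact match compares the whole attribute string, so "quote featured" is missed.
exact = snippet.xpath('//div[@class="quote"]/text()')

# contains() is a substring test, so "quotes-list" is matched as well.
loose = snippet.xpath('//div[contains(@class, "quote")]/text()')

# Padding the attribute with spaces matches "quote" only as a whole class token.
token = snippet.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " quote ")]/text()'
)

print(exact)  # ['A']
print(loose)  # ['A', 'B', 'C']
print(token)  # ['A', 'B']
```

The token form is verbose, but it is probably the closest XPath 1.0 equivalent to the CSS selector div.quote.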
Predicates (Filtering)
# First element (1-indexed)
tree.xpath('(//div[@class="quote"])[1]')
# Last element
tree.xpath('(//div[@class="quote"])[last()]')
# Position range
tree.xpath('(//div[@class="quote"])[position() <= 3]')
# Quote divs that contain a descendant small element with this exact text
tree.xpath('//div[@class="quote"][.//small[text()="Albert Einstein"]]')
# Element with attribute value containing text
tree.xpath('//a[contains(@href, "page")]/@href')
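The parentheses in the positional examples above matter. As a rough illustration, the sketch below runs the same predicate with and without them on a made-up two-list snippet (the markup is an assumption for demonstration only).

```python
from lxml import html

# Hypothetical markup: two lists with two items each.
page = html.fromstring("""
<div>
  <ul><li>a1</li><li>a2</li></ul>
  <ul><li>b1</li><li>b2</li></ul>
</div>
""")

# Without parentheses, [1] is evaluated per parent: the first li of each ul.
per_parent = page.xpath('//li[1]/text()')

# With parentheses, [1] applies to the whole result set: the first li overall.
overall = page.xpath('(//li)[1]/text()')

# last() follows the same rule.
last_overall = page.xpath('(//li)[last()]/text()')

print(per_parent)    # ['a1', 'b1']
print(overall)       # ['a1']
print(last_overall)  # ['b2']
```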
XPath Axes (Direction)
Axes let you navigate in any direction from the current node:
from lxml import html
doc = html.fromstring("""
<table>
<tr><th>Name</th><th>Price</th></tr>
<tr><td>ScraperAPI</td><td>$49</td></tr>
<tr><td>ScrapingAnt</td><td>$29</td></tr>
</table>
""")
# child:: is the default axis, so this is equivalent to //table/tr
doc.xpath('//table/child::tr')
# parent:: go up one level
doc.xpath('//td[text()="ScraperAPI"]/parent::tr/td[2]/text()')
# Returns: ['$49']
# ancestor:: go up to any level
doc.xpath('//td/ancestor::table')
# following-sibling:: next siblings
doc.xpath('//th[text()="Name"]/following-sibling::th/text()')
# Returns: ['Price']
# preceding-sibling:: previous siblings
doc.xpath('//td[text()="$49"]/preceding-sibling::td/text()')
# Returns: ['ScraperAPI']
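Sibling axes are handy for label/value lookups. The sketch below reuses the same table and wraps the lookup in a helper; the function name price_for is hypothetical, and the name is passed as an XPath variable (a keyword argument to lxml's xpath()) so it does not have to be quoted inside the expression.

```python
from lxml import html

doc = html.fromstring("""
<table>
<tr><th>Name</th><th>Price</th></tr>
<tr><td>ScraperAPI</td><td>$49</td></tr>
<tr><td>ScrapingAnt</td><td>$29</td></tr>
</table>
""")

def price_for(table, name):
    """Return the text of the cell right after the cell equal to `name`, or None."""
    cells = table.xpath(
        '//td[text()=$name]/following-sibling::td[1]/text()', name=name
    )
    return cells[0] if cells else None

# Build a name -> price mapping from every row that has td cells (skips the header).
prices = {
    row.xpath('td[1]/text()')[0]: row.xpath('td[2]/text()')[0]
    for row in doc.xpath('//tr[td]')
}

print(price_for(doc, "ScrapingAnt"))  # $29
print(prices)  # {'ScraperAPI': '$49', 'ScrapingAnt': '$29'}
```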
Useful XPath Functions
from lxml import html
import requests
response = requests.get("https://quotes.toscrape.com/", timeout=15)
tree = html.fromstring(response.text)
# contains() - partial text match
tree.xpath('//span[contains(text(), "world")]/text()')
# starts-with()
tree.xpath('//a[starts-with(@href, "/author")]/text()')
# normalize-space() - trim and collapse whitespace (pass the element so text in nested tags is included)
tree.xpath('normalize-space(//h1)')
# count() - count matching nodes
count = tree.xpath('count(//div[@class="quote"])')
print(f"Found {int(count)} quotes")
# string-length()
tree.xpath('//span[@class="text"][string-length(text()) > 100]/text()')
# not() - negation
tree.xpath('//div[not(@class)]')
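Two details of these functions are easy to trip over: count() and normalize-space() return a number and a string rather than a list, and text() inside a string function only contributes its first text node. The sketch below shows both on a small made-up fragment (the markup is an assumption for illustration).

```python
from lxml import html

# Hypothetical fragment: a heading whose text sits in a nested link,
# and a span whose text is split by an inline <b> element.
frag = html.fromstring("""
<div>
  <h1>
    <a href="/">Quotes to Scrape</a>
  </h1>
  <span class="text">Hello <b>big</b> world</span>
</div>
""")

# count() returns a float, not a list.
n = frag.xpath('count(//span)')

# The first text node of the h1 is just whitespace, so this yields ''.
from_text_node = frag.xpath('normalize-space(//h1/text())')

# Passing the element uses its full string value, nested tags included.
from_element = frag.xpath('normalize-space(//h1)')

# text() is reduced to the first text node ("Hello "), so "world" is not found.
first_node_only = frag.xpath('//span[contains(text(), "world")]')

# "." is the whole string value ("Hello big world"), so the span matches.
whole_string = frag.xpath('//span[contains(., "world")]')

print(n)                     # 1.0
print(repr(from_text_node))  # ''
print(from_element)          # Quotes to Scrape
print(len(first_node_only), len(whole_string))  # 0 1
```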
Practical Scraping Example
from lxml import html
import requests
response = requests.get("https://quotes.toscrape.com/", timeout=15)
tree = html.fromstring(response.text)
# Extract structured data using XPath
quotes = tree.xpath('//div[@class="quote"]')
for q in quotes[:5]:
    text = q.xpath('.//span[@class="text"]/text()')[0]
    author = q.xpath('.//small[@class="author"]/text()')[0]
    tags = q.xpath('.//a[@class="tag"]/text()')
    print(f"{author}: {text[:60]}...")
    print(f"  Tags: {', '.join(tags)}\n")
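Indexing with [0] raises IndexError as soon as one quote is missing a field. One possible way to make the loop more forgiving is a small helper that returns a default instead; the helper name first and the sample markup below are assumptions for illustration, not part of the site or of lxml.

```python
from lxml import html

# Hypothetical markup: the second quote has no author and no tags.
page = html.fromstring("""
<div>
  <div class="quote">
    <span class="text">"Quote one."</span>
    <small class="author">Author A</small>
    <a class="tag" href="/tag/x/">x</a>
    <a class="tag" href="/tag/y/">y</a>
  </div>
  <div class="quote">
    <span class="text">"Quote two."</span>
  </div>
</div>
""")

def first(node, path, default=None):
    """Return the first string result of `path`, stripped, or `default` if empty."""
    results = node.xpath(path)
    return results[0].strip() if results else default

# Paths start with ".//" so each lookup stays inside the current quote.
records = [
    {
        "text": first(q, './/span[@class="text"]/text()'),
        "author": first(q, './/small[@class="author"]/text()', default="unknown"),
        "tags": q.xpath('.//a[@class="tag"]/text()'),
    }
    for q in page.xpath('//div[@class="quote"]')
]

for record in records:
    print(record)
```

The helper assumes the path selects text nodes or attributes; for element results, .strip() would not apply.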
XPath Quick Reference
| Expression | Selects |
|---|---|
| //div | All div elements anywhere in the document |
| //div[@id="main"] | Div with a specific id |
| //div[contains(@class, "item")] | Div whose class attribute contains "item" |
| //a/text() | Text nodes directly inside all links |
| //a/@href | All href attributes of links |
| (//div)[position()<=5] | First 5 divs in document order |
| //td[2] | Second td in each row |
| //*[contains(text(), "price")] | Any element whose first text node contains "price" |
Next Steps
- Parse JSON responses in Python
- Use XPath with Scrapy for large-scale scraping
- Combine XPath with lxml for maximum performance