Scraping Central is reader-supported. When you buy through links on our site, we may earn an affiliate commission.

XPath Expressions for Web Scraping

Master XPath expressions for precise element selection in web scraping. Learn axes, predicates, functions, and advanced patterns.

Data Parsing · #3intermediate3 min read
Share:WhatsAppLinkedIn

XPath is a query language for selecting nodes in XML/HTML documents. It is more powerful than CSS selectors, supporting text matching, mathematical operations, and bidirectional tree traversal.

Setup

pip install lxml requests
from lxml import html
import requests

response = requests.get("https://quotes.toscrape.com/", timeout=15)
tree = html.fromstring(response.text)

Essential XPath Syntax

# All div elements anywhere in the document
tree.xpath("//div")

# Direct children of body
tree.xpath("/html/body/div")

# Element with specific class
tree.xpath('//div[@class="quote"]')

# Element with partial class match
tree.xpath('//div[contains(@class, "quote")]')

# Get text content
tree.xpath('//small[@class="author"]/text()')

# Get attribute value
tree.xpath('//a[@class="tag"]/@href')

Predicates (Filtering)

# First element (1-indexed)
tree.xpath('(//div[@class="quote"])[1]')

# Last element
tree.xpath('(//div[@class="quote"])[last()]')

# Position range
tree.xpath('(//div[@class="quote"])[position() <= 3]')

# Element with specific child text
tree.xpath('//div[@class="quote"][.//small[text()="Albert Einstein"]]')

# Element with attribute value containing text
tree.xpath('//a[contains(@href, "page")]/@href')

XPath Axes (Direction)

Axes let you navigate in any direction from the current node:

from lxml import html

doc = html.fromstring("""
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>ScraperAPI</td><td>$49</td></tr>
  <tr><td>ScrapingAnt</td><td>$29</td></tr>
</table>
""")

# child:: (default)
doc.xpath('//table/child::tr')

# parent:: go up one level
doc.xpath('//td[text()="ScraperAPI"]/parent::tr/td[2]/text()')
# Returns: ['$49']

# ancestor:: go up to any level
doc.xpath('//td/ancestor::table')

# following-sibling:: next siblings
doc.xpath('//th[text()="Name"]/following-sibling::th/text()')
# Returns: ['Price']

# preceding-sibling:: previous siblings
doc.xpath('//td[text()="$49"]/preceding-sibling::td/text()')
# Returns: ['ScraperAPI']

Useful XPath Functions

from lxml import html
import requests

response = requests.get("https://quotes.toscrape.com/", timeout=15)
tree = html.fromstring(response.text)

# contains() - partial text match
tree.xpath('//span[contains(text(), "world")]/text()')

# starts-with()
tree.xpath('//a[starts-with(@href, "/author")]/text()')

# normalize-space() - trim whitespace
tree.xpath('normalize-space(//h1/text())')

# count() - count matching nodes
count = tree.xpath('count(//div[@class="quote"])')
print(f"Found {int(count)} quotes")

# string-length()
tree.xpath('//span[@class="text"][string-length(text()) > 100]/text()')

# not() - negation
tree.xpath('//div[not(@class)]')

Practical Scraping Example

from lxml import html
import requests

response = requests.get("https://quotes.toscrape.com/", timeout=15)
tree = html.fromstring(response.text)

# Extract structured data using XPath
quotes = tree.xpath('//div[@class="quote"]')

for q in quotes[:5]:
    text = q.xpath('.//span[@class="text"]/text()')[0]
    author = q.xpath('.//small[@class="author"]/text()')[0]
    tags = q.xpath('.//a[@class="tag"]/text()')
    print(f"{author}: {text[:60]}...")
    print(f"  Tags: {', '.join(tags)}\n")

XPath Quick Reference

Expression Selects
//div All divs anywhere
//div[@id="main"] Div with specific id
//div[contains(@class, "item")] Div with class containing "item"
//a/text() Text inside all links
//a/@href All href attributes
//div[position()<=5] First 5 divs
//td[2] Second td in each row
//node()[contains(text(), "price")] Any node with "price" in text

Next Steps

  • Parse JSON responses in Python
  • Use XPath with Scrapy for large-scale scraping
  • Combine XPath with lxml for maximum performance