Web Scraping with lxml and XPath
Use lxml and XPath expressions for fast, powerful HTML parsing. Learn XPath syntax, axes, and functions for precise data extraction.
lxml is one of the fastest HTML/XML parsers available for Python, largely because it is a binding to the C libraries libxml2 and libxslt. Combined with XPath, a query language for navigating XML/HTML trees, it gives you precise control over data extraction.
Installation
pip install lxml requests
Basic lxml + XPath Scraping
import requests
from lxml import html
response = requests.get("https://quotes.toscrape.com/")
tree = html.fromstring(response.content)
# Extract quotes using XPath
quotes = tree.xpath('//div[@class="quote"]')
for quote in quotes:
    text = quote.xpath('.//span[@class="text"]/text()')[0]
    author = quote.xpath('.//small[@class="author"]/text()')[0]
    tags = quote.xpath('.//a[@class="tag"]/text()')
    print(f"{author}: {text[:50]}... [{', '.join(tags)}]")
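Indexing an XPath result with [0] raises IndexError whenever a block lacks the expected element. The following is a minimal sketch of a more defensive pattern. The helper names first_text and parse_quotes are hypothetical, and the HTML snippet is made up for illustration rather than fetched from the live site.

```python
from lxml import html

# Made-up snippet standing in for a fetched page; the second quote has no author.
SAMPLE = """
<div>
  <div class="quote">
    <span class="text">Simple is better than complex.</span>
    <small class="author">Tim Peters</small>
    <a class="tag">python</a><a class="tag">zen</a>
  </div>
  <div class="quote">
    <span class="text">No author here.</span>
  </div>
</div>
"""


def first_text(node, expr, default=""):
    """Return the first XPath text result, stripped, or a default (hypothetical helper)."""
    results = node.xpath(expr)
    return results[0].strip() if results else default


def parse_quotes(source):
    """Parse quote blocks into dictionaries without assuming every field exists."""
    tree = html.fromstring(source)
    rows = []
    for quote in tree.xpath('//div[@class="quote"]'):
        rows.append({
            "text": first_text(quote, './/span[@class="text"]/text()'),
            "author": first_text(quote, './/small[@class="author"]/text()', "unknown"),
            "tags": quote.xpath('.//a[@class="tag"]/text()'),
        })
    return rows


rows = parse_quotes(SAMPLE)
for row in rows:
    print(row["author"], "|", row["text"], "|", row["tags"])
```

In a real scraper, the same parse_quotes function could be given response.content in place of SAMPLE.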
XPath Syntax Cheat Sheet
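| Expression | Meaning |
|---|---|
| //div | All div elements anywhere in the document |
| /html/body/div | div children of body, via an absolute path from the root |
| //div[@class="item"] | div whose class attribute is exactly "item" |
| //a/@href | href attribute of all links |
| //h2/text() | Direct text nodes of all h2 elements |
| //div[1] | Every div that is the first div child of its parent (1-indexed); use (//div)[1] for the first div in the document |
| //div[last()] | Every div that is the last div child of its parent |
| //div[position()<=3] | The first three div children of each parent |
| //div[contains(@class, "product")] | class attribute contains the substring "product" |
| //a[starts-with(@href, "/products")] | href starts with "/products" |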
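Positional predicates are the easiest part of this cheat sheet to misread, because they are evaluated per parent rather than across the whole document. The sketch below runs several of the expressions against a small made-up page (li elements are used for the positional cases; the same rule applies to div).

```python
from lxml import html

# Made-up page for illustration only.
doc = html.fromstring("""
<html><body>
  <ul><li>apple</li><li>banana</li></ul>
  <ul><li>carrot</li><li>daikon</li></ul>
  <div class="product featured">Widget</div>
  <div class="sidebar">Ads</div>
  <a href="/products/1">One</a>
  <a href="/about">About</a>
</body></html>
""")

# //li[1] matches every li that is the first li child of its own parent.
first_per_parent = doc.xpath('//li[1]/text()')      # ['apple', 'carrot']

# (//li)[1] collects all li elements first, then takes the first overall.
first_overall = doc.xpath('(//li)[1]/text()')       # ['apple']

# last() is evaluated per parent in the same way.
last_per_parent = doc.xpath('//li[last()]/text()')  # ['banana', 'daikon']

# @class="product" is an exact comparison, so "product featured" does not match.
exact = doc.xpath('//div[@class="product"]/text()')               # []

# contains() does substring matching on the attribute value.
partial = doc.xpath('//div[contains(@class, "product")]/text()')  # ['Widget']

# starts-with() filters on an attribute prefix.
links = doc.xpath('//a[starts-with(@href, "/products")]/@href')   # ['/products/1']

print(first_per_parent, first_overall, last_per_parent, exact, partial, links)
```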
Advanced XPath Expressions
from lxml import html
import requests
response = requests.get("https://books.toscrape.com/")
tree = html.fromstring(response.content)
# Get all book titles
titles = tree.xpath('//article[@class="product_pod"]//h3/a/@title')
# Get prices (text content)
prices = tree.xpath('//p[@class="price_color"]/text()')
# Get links that start with "catalogue/"
links = tree.xpath('//a[starts-with(@href, "catalogue/")]/@href')
# Get books with "Three" rating
three_star = tree.xpath(
    '//article[.//p[contains(@class, "Three")]]//h3/a/@title'
)
print(f"Total books: {len(titles)}")
print(f"Three-star books: {len(three_star)}")
for title, price in zip(titles[:5], prices[:5]):
    print(f" {title}: {price}")
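XPath functions can also return plain values instead of node lists, which helps when cleaning up extracted prices. The sketch below uses count(), string(), and normalize-space() on a made-up fragment loosely modeled on a book listing; the markup and values are assumptions for illustration, not data from the live site.

```python
from lxml import html

# Made-up fragment; the second price is surrounded by extra whitespace.
fragment = html.fromstring("""
<div>
  <article class="product_pod">
    <h3><a href="catalogue/book-one/index.html" title="Book One">Book O...</a></h3>
    <p class="price_color">£51.77</p>
  </article>
  <article class="product_pod">
    <h3><a href="catalogue/book-two/index.html" title="Book Two">Book T...</a></h3>
    <p class="price_color">
      £13.99
    </p>
  </article>
</div>
""")

# count() returns a float in lxml, so convert it for display.
book_count = int(fragment.xpath('count(//article[@class="product_pod"])'))

books = []
for article in fragment.xpath('//article[@class="product_pod"]'):
    # string() returns a single string (empty if nothing matches) instead of a list.
    title = article.xpath('string(.//h3/a/@title)')
    # normalize-space() trims and collapses whitespace inside the element text.
    price_text = article.xpath('normalize-space(.//p[@class="price_color"])')
    # Strip the currency symbol before converting to a number.
    books.append((title, float(price_text.lstrip("£"))))

print(book_count, books)
```

Because string() and normalize-space() return an empty string when nothing matches, they avoid the IndexError that indexing a result list with [0] can raise.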
XPath Axes for Complex Navigation
Axes let you traverse the document relative to the current node.
from lxml import html
page = """
<div class="product">
  <h3>Widget Pro</h3>
  <div class="details">
    <span class="price">$29.99</span>
    <span class="stock">In Stock</span>
  </div>
  <div class="reviews">
    <p>Great product!</p>
  </div>
</div>
"""
tree = html.fromstring(page)
# parent:: moves up to the parent element
price_parent = tree.xpath('//span[@class="price"]/parent::div/@class')
print(f"Price parent: {price_parent}")  # ['details']
# following-sibling:: selects all later siblings (here, only span elements)
after_price = tree.xpath(
    '//span[@class="price"]/following-sibling::span/text()'
)
print(f"After price: {after_price}")  # ['In Stock']
# ancestor:: selects every ancestor (here, only div elements), in document order
ancestors = tree.xpath('//span[@class="price"]/ancestor::div/@class')
print(f"Ancestors: {ancestors}")  # ['product', 'details']
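A few other axes follow the same pattern. As a short sketch on the same sample markup (repeated here so the snippet runs on its own), preceding-sibling::, descendant::, and following:: can be used like this, and axes can be chained to go up and back down the tree:

```python
from lxml import html

page = """
<div class="product">
  <h3>Widget Pro</h3>
  <div class="details">
    <span class="price">$29.99</span>
    <span class="stock">In Stock</span>
  </div>
  <div class="reviews">
    <p>Great product!</p>
  </div>
</div>
"""
tree = html.fromstring(page)

# preceding-sibling:: selects earlier siblings of the stock span
before_stock = tree.xpath(
    '//span[@class="stock"]/preceding-sibling::span/text()'
)
print(before_stock)  # ['$29.99']

# descendant:: selects matching elements at any depth below the product div
span_classes = tree.xpath('//div[@class="product"]/descendant::span/@class')
print(span_classes)  # ['price', 'stock']

# following:: selects nodes that come after the details div in document order
after_details = tree.xpath('//div[@class="details"]/following::p/text()')
print(after_details)  # ['Great product!']

# Chained axes: climb from the price to the product div, then down to its heading
heading = tree.xpath(
    '//span[@class="price"]/ancestor::div[@class="product"]/h3/text()'
)
print(heading)  # ['Widget Pro']
```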
lxml vs BeautifulSoup
| Feature | lxml | BeautifulSoup |
|---|---|---|
| Speed | Very fast (C-based) | Slower (pure Python) |
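| Query language | XPath, plus CSS selectors via the cssselect package | CSS selectors and find()/find_all() |
| Broken HTML handling | Good (recovers from most malformed markup) | Better, depending on the parser (html5lib is the most lenient) |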
| Learning curve | Steeper (XPath) | Easier |
Tips
- Use html.fromstring(response.content) (bytes) instead of response.text (string) to avoid encoding issues.
- XPath is 1-indexed, not 0-indexed: the first element is [1], not [0]. The Python list that xpath() returns is still 0-indexed.
- Pair ScraperAPI for fetching pages with lxml for fast parsing, a useful combination for high-volume scraping.
- Use tree.xpath('//meta[@name="description"]/@content') to extract meta tags for SEO data scraping.
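The encoding and meta-tag tips can be combined in one short sketch. It assumes the page is known to be UTF-8 and passes that encoding to the parser explicitly instead of relying on auto-detection; the page bytes are made up and stand in for response.content.

```python
from lxml import html

# Made-up page bytes for illustration; in a scraper this would be response.content.
raw = (
    '<html><head>'
    '<meta name="description" content="Café guides and reviews">'
    '<title>Café</title>'
    '</head><body><p>Hello</p></body></html>'
).encode("utf-8")

# Assumption: the encoding is known to be UTF-8, so tell the parser directly.
parser = html.HTMLParser(encoding="utf-8")
tree = html.fromstring(raw, parser=parser)

# The attribute query returns a list; it is empty when the tag is missing.
description = tree.xpath('//meta[@name="description"]/@content')

# string() returns the title text as a single string.
title = tree.xpath('string(//title)')

print(description, title)
```

When the encoding is not known in advance, passing the raw bytes without a custom parser leaves detection to lxml, as the first tip suggests.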
Next Steps
- Learn to extract data from HTML tables with lxml
- Build scrapers that combine XPath for extraction and databases for storage