Web Scraping with lxml and XPath

Use lxml and XPath expressions for fast, powerful HTML parsing. Learn XPath syntax, axes, and functions for precise data extraction.

lxml is one of the fastest HTML/XML parsers available for Python, largely because it is built on the C libraries libxml2 and libxslt. Combined with XPath, a query language for navigating XML/HTML trees, it gives you precise control over data extraction.

Installation

pip install lxml requests

Basic lxml + XPath Scraping

import requests
from lxml import html

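response = requests.get("https://quotes.toscrape.com/", timeout=10)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page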
tree = html.fromstring(response.content)

# Extract quotes using XPath
quotes = tree.xpath('//div[@class="quote"]')

for quote in quotes:
    text = quote.xpath('.//span[@class="text"]/text()')[0]
    author = quote.xpath('.//small[@class="author"]/text()')[0]
    tags = quote.xpath('.//a[@class="tag"]/text()')
    print(f"{author}: {text[:50]}... [{', '.join(tags)}]")
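
Indexing a result with [0] raises IndexError whenever an element is missing from the page. The following is a minimal sketch of a more defensive pattern. The helper name first() is hypothetical (it is not part of lxml), and the inline HTML is made-up sample data standing in for a fetched page:

```python
from lxml import html

# Made-up sample standing in for a fetched page; the second quote has no author.
sample = """
<div>
  <div class="quote">
    <span class="text">"Simple is better than complex."</span>
    <small class="author">Tim Peters</small>
  </div>
  <div class="quote">
    <span class="text">"No author here."</span>
  </div>
</div>
"""


def first(node, expr, default=None):
    """Return the first XPath result with whitespace stripped, or default if none."""
    results = node.xpath(expr)
    return results[0].strip() if results else default


tree = html.fromstring(sample)

# Build (author, text) pairs, falling back to defaults for missing elements.
rows = [
    (
        first(quote, './/small[@class="author"]/text()', default="Unknown"),
        first(quote, './/span[@class="text"]/text()', default=""),
    )
    for quote in tree.xpath('//div[@class="quote"]')
]

for author, text in rows:
    print(f"{author}: {text}")
```

Returning a default keeps one incomplete record from aborting the whole run; whether to skip, log, or substitute a placeholder is a choice that depends on the scraper.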

XPath Syntax Cheat Sheet

Expression	Meaning
`//div`	All `div` elements anywhere
`/html/body/div`	Absolute path from root
`//div[@class="item"]`	`div` whose class attribute is exactly "item"
`//a/@href`	`href` attribute of all links
`//h2/text()`	Text content of all `h2`
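`//div[1]`	Every `div` that is the first `div` child of its parent (1-indexed)
`(//div)[1]`	First `div` in the whole document
`//div[last()]`	Every `div` that is the last `div` child of its parent
`(//div)[position()<=3]`	First three `div` elements in the document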
`//div[contains(@class, "product")]`	Class attribute contains the substring "product" (also matches "product-list")
`//a[starts-with(@href, "/products")]`	Href starts with "/products"
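
Positional predicates are easy to misread: a predicate such as [1] applies per parent, while wrapping the path in parentheses applies it to the whole result set. A small sketch on made-up markup (the list items are sample data for this illustration) shows the difference:

```python
from lxml import html

# Two lists, so "first li" is ambiguous without parentheses.
tree = html.fromstring(
    "<html><body>"
    "<ul><li>a1</li><li>a2</li></ul>"
    "<ul><li>b1</li><li>b2</li></ul>"
    "</body></html>"
)

# [1] is evaluated per parent: the first li inside each ul.
first_per_parent = tree.xpath("//li[1]/text()")
print(first_per_parent)  # ['a1', 'b1']

# Parentheses collect all li elements first, then pick by position.
first_overall = tree.xpath("(//li)[1]/text()")
print(first_overall)  # ['a1']

last_per_parent = tree.xpath("//li[last()]/text()")
print(last_per_parent)  # ['a2', 'b2']

first_three = tree.xpath("(//li)[position()<=3]/text()")
print(first_three)  # ['a1', 'a2', 'b1']
```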

Advanced XPath Expressions

from lxml import html
import requests

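response = requests.get("https://books.toscrape.com/", timeout=10)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page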
tree = html.fromstring(response.content)

# Get all book titles
titles = tree.xpath('//article[@class="product_pod"]//h3/a/@title')

# Get prices (text content)
prices = tree.xpath('//p[@class="price_color"]/text()')

# Get links that start with "catalogue/"
links = tree.xpath('//a[starts-with(@href, "catalogue/")]/@href')

# Get books with "Three" rating
three_star = tree.xpath(
    '//article[.//p[contains(@class, "Three")]]//h3/a/@title'
)

print(f"Total books: {len(titles)}")
print(f"Three-star books: {len(three_star)}")

for title, price in zip(titles[:5], prices[:5]):
    print(f"  {title}: {price}")
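
Because contains(@class, ...) is a plain substring test, it can match more than intended. One common workaround, sketched here on made-up markup, pads the class attribute with spaces so that only whole class tokens match; normalize-space() also helps when text nodes carry stray whitespace:

```python
from lxml import html

# Made-up sample: "product-list" should NOT count as a product.
tree = html.fromstring(
    "<html><body>"
    '<div class="product featured"><h3>  Widget Pro </h3></div>'
    '<div class="product-list"><h3>Not a product</h3></div>'
    '<div class="product"><h3>Gadget</h3></div>'
    "</body></html>"
)

# Substring match: all three divs qualify, including "product-list".
substring_hits = tree.xpath('//div[contains(@class, "product")]')
print(len(substring_hits))  # 3

# Token match: pad the attribute with spaces and look for " product ".
token_expr = (
    '//div[contains(concat(" ", normalize-space(@class), " "), " product ")]'
)
token_hits = tree.xpath(token_expr)

# normalize-space(h3) returns the heading text as a trimmed string.
titles = [div.xpath("normalize-space(h3)") for div in token_hits]
print(titles)  # ['Widget Pro', 'Gadget']
```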

XPath Axes for Complex Navigation

Axes let you traverse the document relative to the current node.

from lxml import html

page = """
<div class="product">
    <h3>Widget Pro</h3>
    <div class="details">
        <span class="price">$29.99</span>
        <span class="stock">In Stock</span>
    </div>
    <div class="reviews">
        <p>Great product!</p>
    </div>
</div>
"""

tree = html.fromstring(page)

# parent:: goes up to the parent element
price_parent = tree.xpath('//span[@class="price"]/parent::div/@class')
print(f"Price parent: {price_parent}")  # ['details']

# following-sibling:: selects all later siblings (here, the span elements after the price)
after_price = tree.xpath(
    '//span[@class="price"]/following-sibling::span/text()'
)
print(f"After price: {after_price}")  # ['In Stock']

# ancestor:: selects every matching ancestor, returned in document order
ancestors = tree.xpath('//span[@class="price"]/ancestor::div/@class')
print(f"Ancestors: {ancestors}")  # ['product', 'details']
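
A few more axes are worth knowing. The sketch below reuses the same sample markup to show preceding-sibling::, descendant::, and how a positional predicate behaves on a reverse axis such as ancestor::, where [1] means the nearest ancestor:

```python
from lxml import html

page = """
<div class="product">
    <h3>Widget Pro</h3>
    <div class="details">
        <span class="price">$29.99</span>
        <span class="stock">In Stock</span>
    </div>
    <div class="reviews">
        <p>Great product!</p>
    </div>
</div>
"""

tree = html.fromstring(page)

# preceding-sibling:: selects earlier siblings of the context node
before_stock = tree.xpath(
    '//span[@class="stock"]/preceding-sibling::span/text()'
)
print(before_stock)  # ['$29.99']

# following-sibling::div[1] is the closest later div sibling
next_block = tree.xpath(
    '//div[@class="details"]/following-sibling::div[1]/@class'
)
print(next_block)  # ['reviews']

# descendant:: selects matching elements at any depth below the context node
span_classes = tree.xpath('//div[@class="product"]/descendant::span/@class')
print(span_classes)  # ['price', 'stock']

# On a reverse axis, [1] counts outward from the context node
nearest = tree.xpath('//span[@class="price"]/ancestor::div[1]/@class')
print(nearest)  # ['details']
```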

lxml vs BeautifulSoup

Feature	lxml	BeautifulSoup
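Speed	Very fast (C-based, built on libxml2)	Slower with its default pure-Python parser; can use lxml as its parser backend
Query language	XPath; CSS selectors via the cssselect package	CSS selectors, plus find()/find_all()
Broken HTML handling	Good	Better, especially with the html5lib parser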
Learning curve	Steeper (XPath)	Easier
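
How well either library copes with malformed markup depends on the input and, for BeautifulSoup, on the parser it is configured to use. As a small illustration of lxml's recovery on its own, here is a sketch with a made-up snippet whose paragraph tags are never closed:

```python
from lxml import html

# Made-up malformed input: neither <p> tag is closed.
broken = "<div><p>First paragraph<p>Second paragraph</div>"

tree = html.fromstring(broken)

# The parser closes each open <p> when the next one starts.
paragraphs = tree.xpath("//p/text()")
print(paragraphs)  # ['First paragraph', 'Second paragraph']

# Serialize the repaired tree to inspect what the parser produced.
print(html.tostring(tree, encoding="unicode"))
```

Printing the serialized tree is a quick way to check how a particular page was repaired before writing XPath expressions against it.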

Tips

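Pass response.content (bytes) to html.fromstring() rather than response.text (a string): with bytes, lxml can detect the encoding from the document itself, which avoids garbled characters when requests guesses the encoding wrong.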
XPath positions are 1-indexed, not 0-indexed: the first element is [1], not [0]. The Python list returned by xpath() is still 0-indexed.
Pair ScraperAPI for fetching pages with lxml for fast parsing when scraping at high volume.
Use tree.xpath('//meta[@name="description"]/@content') to extract meta tags for SEO data scraping.
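
The encoding and meta-tag tips can be combined in one sketch. It assumes a page that declares no charset in its markup and that is known from elsewhere (for example the HTTP headers) to be UTF-8, in which case a parser with an explicit encoding can be passed to html.fromstring(). The sample bytes are made up for this illustration:

```python
from lxml import html

# Made-up bytes as a server might send them, with no charset in the markup.
raw = (
    "<html><head>"
    '<meta name="description" content="Café menu and prices">'
    "</head><body><p>Crème brûlée</p></body></html>"
).encode("utf-8")

# Assumption: the encoding is known to be UTF-8 from outside the document.
parser = html.HTMLParser(encoding="utf-8")
tree = html.fromstring(raw, parser=parser)

# xpath() returns a list, so guard against pages that omit the tag.
description = tree.xpath('//meta[@name="description"]/@content')
meta_description = description[0] if description else None
print(meta_description)  # Café menu and prices

body_text = tree.xpath("//p/text()")[0]
print(body_text)  # Crème brûlée
```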

Next Steps

Learn to extract data from HTML tables with lxml
Build scrapers that combine XPath for extraction and databases for storage