
Web Scraping with lxml and XPath

Use lxml and XPath expressions for fast, powerful HTML parsing. Learn XPath syntax, axes, and functions for precise data extraction.

Python Scraping · #17 · intermediate · 3 min read

lxml is one of the fastest HTML/XML parsers available for Python, built on the C libraries libxml2 and libxslt. Combined with XPath, a query language for navigating XML/HTML trees, it gives you precise control over data extraction.

Installation

pip install lxml requests

Basic lxml + XPath Scraping

import requests
from lxml import html

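# A timeout keeps the request from hanging; raise_for_status() stops on HTTP errors
response = requests.get("https://quotes.toscrape.com/", timeout=10)
response.raise_for_status()
tree = html.fromstring(response.content)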

# Extract quotes using XPath
quotes = tree.xpath('//div[@class="quote"]')

for quote in quotes:
    text = quote.xpath('.//span[@class="text"]/text()')[0]
    author = quote.xpath('.//small[@class="author"]/text()')[0]
    tags = quote.xpath('.//a[@class="tag"]/text()')
    print(f"{author}: {text[:50]}... [{', '.join(tags)}]")
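Indexing a result with [0] raises IndexError whenever an element is missing from a block. A minimal sketch of a more forgiving pattern follows; the inline markup and the first_text helper are assumptions made for illustration and are not part of lxml.

```python
from lxml import html

# Hypothetical sample markup so the sketch runs without a network call.
page = """
<div>
  <div class="quote">
    <span class="text">Simple is better than complex.</span>
    <small class="author">Tim Peters</small>
  </div>
  <div class="quote">
    <span class="text">A quote with no author element.</span>
  </div>
</div>
"""

tree = html.fromstring(page)


def first_text(node, expr, default=""):
    """Return the first stripped string result of expr, or default if there is none."""
    results = node.xpath(expr)
    return results[0].strip() if results else default


rows = []
for quote in tree.xpath('//div[@class="quote"]'):
    text = first_text(quote, './/span[@class="text"]/text()')
    author = first_text(quote, './/small[@class="author"]/text()', default="Unknown")
    rows.append((author, text))

for author, text in rows:
    print(f"{author}: {text}")
```

The second quote has no author element, so the helper falls back to "Unknown" instead of crashing the loop.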

XPath Syntax Cheat Sheet

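Expression                              Meaning
//div                                   All div elements anywhere in the document
/html/body/div                          Absolute path from the root
//div[@class="item"]                    div whose class attribute is exactly "item"
//a/@href                               href attribute of every link
//h2/text()                             Direct text nodes of every h2
//div[1]                                Every div that is the first div child of its parent (1-indexed)
(//div)[1]                              First div in the whole document
//div[last()]                           Every div that is the last div child of its parent
//div[position()<=3]                    Up to the first three div children of each parent
//div[contains(@class, "product")]      class attribute contains the substring "product"
//a[starts-with(@href, "/products")]    href starts with "/products"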
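Positional predicates are easy to misread, because a predicate such as [1] is evaluated relative to each parent, not the whole document. The sketch below uses made-up inline markup to show the difference between //li[1] and (//li)[1].

```python
from lxml import html

# Hypothetical markup: two lists with two items each.
page = """
<div id="root">
  <ul><li>a1</li><li>a2</li></ul>
  <ul><li>b1</li><li>b2</li></ul>
</div>
"""

tree = html.fromstring(page)

# [1] applies per parent: the first li inside each ul
first_per_parent = tree.xpath('//li[1]/text()')
print(first_per_parent)  # ['a1', 'b1']

# Parentheses collect all li elements first, then pick the first overall
first_in_document = tree.xpath('(//li)[1]/text()')
print(first_in_document)  # ['a1']

# The same distinction holds for last()
last_per_parent = tree.xpath('//li[last()]/text()')
print(last_per_parent)  # ['a2', 'b2']

last_in_document = tree.xpath('(//li)[last()]/text()')
print(last_in_document)  # ['b2']

# The Python list returned by xpath() is 0-indexed, as usual
first_via_python = tree.xpath('//li/text()')[0]
print(first_via_python)  # a1
```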

Advanced XPath Expressions

from lxml import html
import requests

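# A timeout keeps the request from hanging; raise_for_status() stops on HTTP errors
response = requests.get("https://books.toscrape.com/", timeout=10)
response.raise_for_status()
tree = html.fromstring(response.content)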

# Get all book titles
titles = tree.xpath('//article[@class="product_pod"]//h3/a/@title')

# Get prices (text content)
prices = tree.xpath('//p[@class="price_color"]/text()')

# Get links that start with "catalogue/"
links = tree.xpath('//a[starts-with(@href, "catalogue/")]/@href')

# Get books with "Three" rating
three_star = tree.xpath(
    '//article[.//p[contains(@class, "Three")]]//h3/a/@title'
)

print(f"Total books: {len(titles)}")
print(f"Three-star books: {len(three_star)}")

for title, price in zip(titles[:5], prices[:5]):
    print(f"  {title}: {price}")
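Zipping two separately extracted lists assumes every book has both a title and a price; if one price is missing, later pairs shift out of alignment. A sketch of a per-item alternative using the XPath functions string(), normalize-space(), and count() follows. The inline markup is a simplified, hypothetical imitation of the page structure, not a copy of the live site.

```python
from lxml import html

# Hypothetical markup: the second book has no price element.
page = """
<div>
  <article class="product_pod">
    <p class="star-rating Three"></p>
    <h3><a href="catalogue/a.html" title="Book A">Book A</a></h3>
    <p class="price_color">  £10.00 </p>
  </article>
  <article class="product_pod">
    <p class="star-rating One"></p>
    <h3><a href="catalogue/b.html" title="Book B">Book B</a></h3>
  </article>
</div>
"""

tree = html.fromstring(page)

books = []
for article in tree.xpath('//article[@class="product_pod"]'):
    # string() returns a plain str ('' when nothing matches), so no [0] is needed
    title = article.xpath('string(.//h3/a/@title)')
    # normalize-space() trims and collapses whitespace in the matched text
    price = article.xpath('normalize-space(.//p[@class="price_color"])')
    books.append({"title": title, "price": price or None})

# count() returns a float in lxml
three_star_count = tree.xpath(
    'count(//article[.//p[contains(@class, "Three")]])'
)

print(books)
print(int(three_star_count))
```

Because each field is looked up relative to its own article, a missing price becomes None for that book only and the other records stay intact.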

XPath Axes for Complex Navigation

Axes let you traverse the document relative to the current node.

from lxml import html

page = """
<div class="product">
    <h3>Widget Pro</h3>
    <div class="details">
        <span class="price">$29.99</span>
        <span class="stock">In Stock</span>
    </div>
    <div class="reviews">
        <p>Great product!</p>
    </div>
</div>
"""

tree = html.fromstring(page)

# parent:: goes up to the parent element
price_parent = tree.xpath('//span[@class="price"]/parent::div/@class')
print(f"Price parent: {price_parent}")  # ['details']

# following-sibling:: selects later siblings that share the same parent
after_price = tree.xpath(
    '//span[@class="price"]/following-sibling::span/text()'
)
print(f"After price: {after_price}")  # ['In Stock']

# ancestor:: selects every matching ancestor (results come back in document order)
ancestors = tree.xpath('//span[@class="price"]/ancestor::div/@class')
print(f"Ancestors: {ancestors}")  # ['product', 'details']
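A few more axes round out the picture: preceding-sibling, descendant, following, and child. The sketch below reuses the same product markup; the variable names are illustrative only.

```python
from lxml import html

page = """
<div class="product">
    <h3>Widget Pro</h3>
    <div class="details">
        <span class="price">$29.99</span>
        <span class="stock">In Stock</span>
    </div>
    <div class="reviews">
        <p>Great product!</p>
    </div>
</div>
"""

tree = html.fromstring(page)

# preceding-sibling:: selects earlier siblings under the same parent
before_stock = tree.xpath(
    '//span[@class="stock"]/preceding-sibling::span/text()'
)
print(before_stock)  # ['$29.99']

# descendant:: selects matching elements at any depth below the context node
span_classes = tree.xpath('//div[@class="product"]/descendant::span/@class')
print(span_classes)  # ['price', 'stock']

# following:: selects nodes that start after the context node ends
later_paragraphs = tree.xpath('//div[@class="details"]/following::p/text()')
print(later_paragraphs)  # ['Great product!']

# child::* selects direct element children only
child_tags = [el.tag for el in tree.xpath('//div[@class="product"]/child::*')]
print(child_tags)  # ['h3', 'div', 'div']
```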

lxml vs BeautifulSoup

Feature                 lxml                             BeautifulSoup
Speed                   Very fast (C-based)              Slower (pure Python; can use lxml as its parser)
Query language          XPath, plus CSS via cssselect    CSS selectors, find()/find_all()
Broken HTML handling    Good                             Better (varies with the parser chosen)
Learning curve          Steeper (XPath)                  Easier
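To give a sense of what "good" handling of broken HTML means in practice, here is a small sketch that feeds lxml a fragment with unclosed li and p tags. The markup is made up, and the exact tree the parser builds can vary between libxml2 versions, so treat the printed tree as something to inspect rather than a guaranteed shape.

```python
from lxml import html

# Hypothetical malformed fragment: neither li is closed, nor is the p.
broken = "<div><ul><li>One<li>Two</ul><p>Unclosed paragraph</div>"

tree = html.fromstring(broken)

# The parser recovers instead of raising, so XPath still finds the text
items = tree.xpath('//li/text()')
paragraph = tree.xpath('//p/text()')

print(items)
print(paragraph)

# Inspect the repaired tree to see how the parser closed the tags
print(html.tostring(tree, encoding="unicode"))
```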

Tips

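  • Pass response.content (bytes) to html.fromstring() rather than response.text (str), so lxml can detect the encoding from the document itself instead of relying on the encoding requests guessed.
  • XPath positions are 1-indexed: the first element is [1], not [0]. The Python list that tree.xpath() returns is still 0-indexed.
  • Pair lxml with a fetching service such as ScraperAPI: let the service retrieve the pages and use lxml for fast parsing, a practical split for high-volume scraping.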
  • Use tree.xpath('//meta[@name="description"]/@content') to extract meta tags for SEO data scraping.
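The meta tag tip can be sketched as follows. The raw bytes stand in for response.content and are an assumption made so the example runs offline; the variable names are illustrative.

```python
from lxml import html

# Stand-in for response.content (bytes); hypothetical page for illustration.
raw = (
    b'<html><head><meta charset="utf-8">'
    b'<meta name="description" content="A guide to XPath scraping.">'
    b'<title>Demo</title></head>'
    b'<body><h1>Demo</h1></body></html>'
)

tree = html.fromstring(raw)

# xpath() always returns a list for node selections, so guard against an empty one
descriptions = tree.xpath('//meta[@name="description"]/@content')
description = descriptions[0] if descriptions else None
print(description)  # A guide to XPath scraping.

# A tag that is absent yields an empty list, not an exception
keywords = tree.xpath('//meta[@name="keywords"]/@content')
print(keywords)  # []
```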

Next Steps

  • Learn to extract data from HTML tables with lxml
  • Build scrapers that combine XPath for extraction and databases for storage