
lxml and XPath in Python, 10x Faster

When BeautifulSoup is too slow or the structure too irregular, drop down to lxml directly. XPath gives you axes and predicates BeautifulSoup can't match.

What you’ll learn

  • Parse HTML directly with `lxml.html` and traverse the tree.
  • Write XPath expressions for the common patterns: axes, predicates, position, text matching.
  • Combine XPath with CSS via `lxml.cssselect`.
  • Decide when lxml beats BeautifulSoup and when it doesn't.

BeautifulSoup is comfortable. lxml is fast and powerful. They're not mutually exclusive; BeautifulSoup actually uses lxml under the hood when you ask for the "lxml" parser. But for performance-critical scrapers, and for tasks where XPath wins on expressiveness, working directly with lxml.html is the right call.

Install

pip install lxml cssselect

cssselect enables CSS selectors on lxml elements (so you don't HAVE to write XPath everywhere).

Parsing HTML

import lxml.html

html_text = "<html><body><h1>Hi</h1></body></html>"
tree = lxml.html.fromstring(html_text)

print(tree.tag)  # 'html'
print(tree.xpath("//h1/text()"))  # ['Hi']

fromstring parses a fragment or a full document and returns the root element. For full-document semantics (doctype handling etc.) use lxml.html.parse(file_or_url), which returns an ElementTree rather than an element; call .getroot() on it to get the root element.
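A quick sketch of the difference, assuming a locally saved saved_page.html (hypothetical file name):

import lxml.html

# parse() takes a file path, file object, or URL and returns an
# ElementTree; getroot() gives you the root element.
doc = lxml.html.parse("saved_page.html")  # hypothetical local file
root = doc.getroot()
print(root.tag)  # 'html'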

From a requests response:

import requests, lxml.html
r = requests.get("https://practice.scrapingcentral.com/products")
tree = lxml.html.fromstring(r.content)

Use r.content (bytes), not r.text (str): lxml is faster on bytes and respects the document's own encoding declaration.
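A minimal sketch of why that matters, using a page that declares a non-UTF-8 charset in a meta tag:

import lxml.html

# Bytes encoded as ISO-8859-1, with the charset declared in the markup itself
raw = ('<html><head><meta http-equiv="Content-Type" '
       'content="text/html; charset=iso-8859-1"></head>'
       '<body><p>caf\xe9</p></body></html>').encode("iso-8859-1")

tree = lxml.html.fromstring(raw)  # bytes in: lxml honours the declared charset
print(tree.xpath("//p/text()"))   # ['café']

If you decode the bytes yourself with the wrong codec before parsing, the document's own declaration can no longer save you.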

XPath in 60 seconds

XPath is a query language for XML/HTML trees:

tree.xpath("//h1")  # every h1, anywhere
tree.xpath("//div[@class='card']")  # divs with exact class='card'
tree.xpath("//a/@href")  # extract href attribute values directly
tree.xpath("//p/text()")  # text nodes inside <p>
tree.xpath("//li[1]")  # FIRST li at each level (XPath is 1-indexed!)
tree.xpath("//li[last()]")  # last li at each level
tree.xpath("//div[contains(@class, 'card')]")  # class CONTAINS 'card'
tree.xpath("//div[contains(., 'Yellow')]")  # text-contains
tree.xpath("//tr[td[contains(., 'Price')]]")  # tr that has a td containing 'Price'

Three key gotchas:

  1. XPath is 1-indexed, not 0. //li[1] is the first.
  2. //li[1] means "first li at each level," not "the first li in the document." Use (//li)[1] to mean the latter.
  3. @class='card' matches the FULL attribute string, not one of its tokens. Use contains(concat(' ', normalize-space(@class), ' '), ' card ') to match a single class token reliably.
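A minimal sketch demonstrating all three gotchas on a toy fragment:

import lxml.html

html = """
<ul><li>a</li><li>b</li></ul>
<ul><li>c</li></ul>
<div class="product card">x</div>
"""
tree = lxml.html.fromstring(html)

# Gotchas 1 and 2: //li[1] is the first li under EACH parent...
print([li.text for li in tree.xpath("//li[1]")])    # ['a', 'c']
# ...while (//li)[1] is the first li in the whole document
print([li.text for li in tree.xpath("(//li)[1]")])  # ['a']

# Gotcha 3: exact match fails on a multi-class attribute
print(tree.xpath("//div[@class='card']"))  # []
# The token-match idiom finds it
print(len(tree.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' card ')]"
)))  # 1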

Axes, what XPath does that CSS can't

XPath has 13 axes; the useful ones for scraping:

Axis              | Example                   | Meaning
parent            | .. or parent::div         | Parent element
ancestor          | ancestor::article         | Any ancestor
following-sibling | following-sibling::dd[1]  | Next sibling of type
preceding-sibling | preceding-sibling::h2     | Previous sibling
following         | following::*[1]           | Next anywhere in document order
descendant        | descendant::a             | Any descendant (default for //)

Example: the <dd> that follows a <dt> with text "Brand":

tree.xpath("//dt[normalize-space(text())='Brand']/following-sibling::dd[1]/text()")

CSS can do a basic version with +/~, but it can't filter on text content. XPath can.
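A runnable version of the same pattern against a toy definition list:

import lxml.html

html = """
<dl>
  <dt>Brand</dt><dd>Acme</dd>
  <dt>Weight</dt><dd>2 kg</dd>
</dl>
"""
tree = lxml.html.fromstring(html)

# Find the dt labelled 'Brand', then hop to its next dd sibling
print(tree.xpath(
    "//dt[normalize-space(text())='Brand']/following-sibling::dd[1]/text()"
))  # ['Acme']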

text() vs string() vs .text_content()

Three ways to extract text, with different behaviour:

tree.xpath("//p[1]/text()")
# Returns the DIRECT text children of <p>, as a list of strings
# Doesn't include text inside nested tags

tree.xpath("string(//p[1])")
# Returns the concatenated string of <p>, INCLUDING all descendant text

el = tree.xpath("//p[1]")[0]
el.text_content()
# lxml method, same as string(); recursive text

For "give me the visible text of this element": prefer .text_content() or string(...). text() is useful only when you specifically want to skip nested elements.
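A toy fragment makes the difference concrete:

import lxml.html

tree = lxml.html.fromstring("<p>Hello <b>world</b>!</p>")

print(tree.xpath("//p[1]/text()"))             # ['Hello ', '!'] -- skips <b>
print(tree.xpath("string(//p[1])"))            # Hello world!
print(tree.xpath("//p[1]")[0].text_content())  # Hello world!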

A real product-card scrape with lxml + XPath

import requests, lxml.html

r = requests.get("https://practice.scrapingcentral.com/products")
tree = lxml.html.fromstring(r.content)

products = []
for card in tree.xpath("//article[contains(@class, 'product-card')]"):
    products.append({
        "name":  card.xpath(".//h2/text()")[0].strip(),
        "price": card.xpath(".//*[contains(@class, 'price')]/text()")[0].strip(),
        "url":   card.xpath(".//a/@href")[0],
    })

print(products[:3])

Notice the leading . in each per-card XPath: .//h2 means "anywhere inside this card." Without the ., //h2 would search the entire document, ignoring the card scope. This is the #1 lxml bug.
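Continuing from the snippet above, the contrast looks like this:

# Relative vs absolute XPath evaluated from an element (uses `tree` from above)
first_card = tree.xpath("//article[contains(@class, 'product-card')]")[0]

inside = first_card.xpath(".//h2")     # h2 elements inside THIS card only
everywhere = first_card.xpath("//h2")  # every h2 in the whole document
print(len(inside), len(everywhere))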

CSS selectors on lxml

You don't have to write XPath for everything:

from lxml.cssselect import CSSSelector

sel = CSSSelector("article.product-card h2")
for h2 in sel(tree):
  print(h2.text_content().strip())

Or the shortcut method:

for card in tree.cssselect("article.product-card"):
  print(card.cssselect("h2")[0].text_content().strip())

cssselect is implemented by translating CSS into XPath under the hood. Use it for simple selectors; switch to native XPath for axes, predicates, or text-content matching.
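You can inspect that translation yourself: a CSSSelector exposes the generated XPath on its .path attribute (the exact string varies by lxml version):

from lxml.cssselect import CSSSelector

sel = CSSSelector("article.product-card h2")
print(sel.css)   # article.product-card h2
print(sel.path)  # roughly: descendant-or-self::article[... ' product-card ' ...]/descendant-or-self::*/h2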

Performance: when lxml actually wins

For tiny pages, BeautifulSoup with the "lxml" parser and direct lxml.html are within milliseconds of each other. For large pages or large batches, direct lxml is meaningfully faster, often 3-10x, because BeautifulSoup wraps every element in its own Python objects.

Rough rule of thumb:

  • Scraping < 100 pages, BeautifulSoup is fine.
  • Scraping thousands of pages OR pages with thousands of elements each, drop to lxml directly.

Even within BeautifulSoup, ALWAYS pass "lxml" as the parser argument (BeautifulSoup(html, "lxml")) instead of the default html.parser. That alone is typically a 2-5x speedup.
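A rough benchmark sketch; big_page.html is a hypothetical saved page, and the numbers depend entirely on the page and your machine:

import timeit
import lxml.html
from bs4 import BeautifulSoup

html = open("big_page.html", "rb").read()  # hypothetical large saved page

def parse_bs4():
    soup = BeautifulSoup(html, "lxml")
    return [h2.get_text() for h2 in soup.find_all("h2")]

def parse_lxml():
    tree = lxml.html.fromstring(html)
    return [h2.text_content() for h2 in tree.xpath("//h2")]

print("bs4 :", timeit.timeit(parse_bs4, number=100))
print("lxml:", timeit.timeit(parse_lxml, number=100))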

Error tolerance

lxml.html is HTML-aware: it fixes mismatched tags, missing closing tags, and badly nested elements; its tolerance for truly malformed input is comparable to BeautifulSoup's. For XML-strict requirements (e.g. parsing a sitemap), use lxml.etree instead; it refuses broken markup loudly, which is exactly what you want for XML feeds.
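A minimal sketch of the contrast:

import lxml.html
from lxml import etree

broken = "<ul><li>one<li>two</ul>"

# lxml.html silently repairs the unclosed <li> tags
fixed = lxml.html.fromstring(broken)
print(lxml.html.tostring(fixed))  # b'<ul><li>one</li><li>two</li></ul>'

# lxml.etree refuses the same markup loudly
try:
    etree.fromstring(broken)
except etree.XMLSyntaxError as exc:
    print("XML parse failed:", exc)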

Combining lxml and BeautifulSoup

There's no rule against using both in one project. A common pattern: parse with lxml for speed, then hand specific subtrees to BeautifulSoup for ergonomic extraction. They share no internal state, but you can serialize a sub-element back to an HTML string and re-parse:

import lxml.html
from bs4 import BeautifulSoup

tree = lxml.html.fromstring(html_bytes)  # html_bytes: e.g. r.content from requests
card_el = tree.xpath("//article[contains(@class, 'product-card')]")[0]

# Serialize the sub-element to an HTML string, then re-parse with BeautifulSoup
card_html = lxml.html.tostring(card_el, encoding="unicode")
card_soup = BeautifulSoup(card_html, "lxml")

In practice, pick one and stick to it per project. Mixing is a debugging headache.

Hands-on lab

The /challenges/static/tables/nested page on Catalog108 (our first-party scraping sandbox) contains tables-within-tables. Use lxml XPath to extract ONLY the outer-table rows, ignoring the nested inner tables. Then write the same query with BeautifulSoup's recursive=False for comparison. Time both implementations across 100 iterations and observe the difference.

