lxml and XPath in Python, 10x Faster
When BeautifulSoup is too slow or the structure too irregular, drop down to lxml directly. XPath gives you axes and predicates BeautifulSoup can't match.
What you’ll learn
- Parse HTML directly with `lxml.html` and traverse the tree.
- Write XPath expressions for the common patterns: axes, predicates, position, text matching.
- Combine XPath with CSS via `lxml.cssselect`.
- Decide when lxml beats BeautifulSoup and when it doesn't.
BeautifulSoup is comfortable. lxml is fast and powerful. They're not mutually exclusive: BeautifulSoup actually uses lxml under the hood when you ask for the "lxml" parser. But for performance-critical scrapers, and for tasks where XPath wins on expressiveness, working directly with `lxml.html` is the right call.
Install
pip install lxml cssselect
cssselect enables CSS selectors on lxml elements (so you don't HAVE to write XPath everywhere).
Parsing HTML
import lxml.html
html_text = "<html><body><h1>Hi</h1></body></html>"
tree = lxml.html.fromstring(html_text)
print(tree.tag) # 'html'
print(tree.xpath("//h1/text()")) # ['Hi']
`fromstring` parses a fragment or a full document and returns the root element. For full-document semantics (doctype, encoding declaration, etc.) use `lxml.html.parse(file_or_url)`, which returns an `ElementTree`; call `.getroot()` to get the root element.
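The difference in return types trips people up, so here is a minimal sketch (parsing from an in-memory string via `io.StringIO` to keep it self-contained):

```python
import io
import lxml.html

html_text = "<!DOCTYPE html><html><body><h1>Hi</h1></body></html>"

# fromstring() returns the root *element* directly.
root = lxml.html.fromstring(html_text)

# parse() returns an ElementTree; call .getroot() to reach the element.
tree = lxml.html.parse(io.StringIO(html_text))
doc_root = tree.getroot()

print(root.tag)      # 'html'
print(doc_root.tag)  # 'html'
```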
From a requests response:
import requests, lxml.html
r = requests.get("https://practice.scrapingcentral.com/products")
tree = lxml.html.fromstring(r.content)
Use `r.content` (bytes), not `r.text` (str): lxml is faster on bytes and respects the document's own encoding declaration.
XPath in 60 seconds
XPath is a query language for XML/HTML trees:
tree.xpath("//h1") # every h1, anywhere
tree.xpath("//div[@class='card']") # divs with exact class='card'
tree.xpath("//a/@href") # extract href attribute values directly
tree.xpath("//p/text()") # text nodes inside <p>
tree.xpath("//li[1]") # FIRST li at each level (XPath is 1-indexed!)
tree.xpath("//li[last()]") # last li at each level
tree.xpath("//div[contains(@class, 'card')]") # class CONTAINS 'card'
tree.xpath("//div[contains(., 'Yellow')]") # text-contains
tree.xpath("//tr[td[contains(., 'Price')]]") # tr that has a td containing 'Price'
Three key gotchas:
- XPath is 1-indexed, not 0-indexed: `//li[1]` is the first.
- `//li[1]` means "first li at each level," not "the first li in the document." Use `(//li)[1]` to mean the latter.
- `@class='card'` matches the FULL attribute string, not one of its tokens. Use `contains(concat(' ', normalize-space(@class), ' '), ' card ')` to match a single class token reliably.
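All three gotchas can be seen in a few lines (the markup here is made up for illustration):

```python
import lxml.html

html = "<div><ul><li>a</li><li>b</li></ul><ul><li>c</li><li>d</li></ul></div>"
tree = lxml.html.fromstring(html)

# //li[1] -> the first <li> of EACH <ul>
print(tree.xpath("//li[1]/text()"))   # ['a', 'c']

# (//li)[1] -> the first <li> in the whole document
print(tree.xpath("(//li)[1]/text()"))  # ['a']

# Exact-match @class misses multi-class elements...
t2 = lxml.html.fromstring('<div class="card featured">x</div>')
print(t2.xpath("//div[@class='card']"))  # [] -- no exact match

# ...the token-safe idiom finds them:
expr = "//div[contains(concat(' ', normalize-space(@class), ' '), ' card ')]"
print(len(t2.xpath(expr)))  # 1
```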
Axes: what XPath does that CSS can't
XPath has 13 axes; the useful ones for scraping:
| Axis | Example | Meaning |
|---|---|---|
| `parent` | `..` or `parent::div` | Parent element |
| `ancestor` | `ancestor::article` | Any ancestor |
| `following-sibling` | `following-sibling::dd[1]` | Next sibling of a given type |
| `preceding-sibling` | `preceding-sibling::h2` | Previous sibling |
| `following` | `following::*[1]` | Next node anywhere in document order |
| `descendant` | `descendant::a` | Any descendant (the default for `//`) |
Example: the `<dd>` that follows a `<dt>` with text "Brand":
tree.xpath("//dt[normalize-space(text())='Brand']/following-sibling::dd[1]/text()")
CSS can do a basic version with +/~, but it can't filter on text content. XPath can.
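A self-contained version of that sibling lookup, using a made-up `<dl>` spec list standing in for a real product page:

```python
import lxml.html

html = """
<dl>
  <dt>Colour</dt><dd>Yellow</dd>
  <dt>Brand</dt><dd>Acme</dd>
</dl>
"""
tree = lxml.html.fromstring(html)

# Find the <dt> by its text, then step to the next <dd> sibling.
brand = tree.xpath(
    "//dt[normalize-space(text())='Brand']/following-sibling::dd[1]/text()"
)
print(brand)  # ['Acme']
```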
text() vs string() vs .text_content()
Three ways to extract text, with different behaviour:
tree.xpath("//p[1]/text()")
# Returns the DIRECT text children of <p>, as a list of strings
# Doesn't include text inside nested tags
tree.xpath("string(//p[1])")
# Returns the concatenated string of <p>, INCLUDING all descendant text
el = tree.xpath("//p[1]")[0]
el.text_content()
# lxml method, same as string(); recursive text
For "give me the visible text of this element": prefer .text_content() or string(...). text() is useful only when you specifically want to skip nested elements.
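The difference is easiest to see on one element with a nested tag:

```python
import lxml.html

p = lxml.html.fromstring("<p>Price: <b>$9.99</b> each</p>")

print(p.xpath("text()"))     # ['Price: ', ' each'] -- direct text only, <b> skipped
print(p.xpath("string(.)"))  # 'Price: $9.99 each'  -- all descendant text
print(p.text_content())      # 'Price: $9.99 each'  -- same, via the element API
```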
A real product-card scrape with lxml + XPath
import requests, lxml.html
r = requests.get("https://practice.scrapingcentral.com/products")
tree = lxml.html.fromstring(r.content)
products = []
for card in tree.xpath("//article[contains(@class, 'product-card')]"):
    products.append({
        "name": card.xpath(".//h2/text()")[0].strip(),
        "price": card.xpath(".//*[contains(@class, 'price')]/text()")[0].strip(),
        "url": card.xpath(".//a/@href")[0],
    })
print(products[:3])
Notice the leading `.` in each per-card XPath: `.//h2` means "anywhere inside this card." Without the `.`, `//h2` would search the entire document, ignoring the card scope. This is the #1 lxml bug.
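You can watch the bug happen with two cards of made-up markup:

```python
import lxml.html

html = """
<article class="product-card"><h2>First</h2></article>
<article class="product-card"><h2>Second</h2></article>
"""
tree = lxml.html.fromstring(html)
cards = tree.xpath("//article[contains(@class, 'product-card')]")

second = cards[1]
# BUG: // restarts at the document root, so every card "finds" the first h2.
print(second.xpath("//h2/text()")[0])   # 'First' -- wrong!
# FIX: .// searches only within this card.
print(second.xpath(".//h2/text()")[0])  # 'Second'
```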
CSS selectors on lxml
You don't have to write XPath for everything:
from lxml.cssselect import CSSSelector
sel = CSSSelector("article.product-card h2")
for h2 in sel(tree):
    print(h2.text_content().strip())
Or the shortcut method:
for card in tree.cssselect("article.product-card"):
    print(card.cssselect("h2")[0].text_content().strip())
cssselect is implemented by translating CSS into XPath under the hood. Use it for simple selectors; switch to native XPath for axes, predicates, or text-content matching.
Performance: when lxml actually wins
For tiny pages, BeautifulSoup with the "lxml" parser and direct lxml.html are within milliseconds of each other. For large pages or large batches, direct lxml is meaningfully faster, often 3-10x, because BeautifulSoup wraps every element in its own Python objects.
Rough rule of thumb:
- Scraping < 100 pages, BeautifulSoup is fine.
- Scraping thousands of pages, or pages with thousands of elements each: drop to `lxml` directly.
Even within BeautifulSoup, ALWAYS pass "lxml" as the parser (`BeautifulSoup(html, "lxml")`) instead of relying on the default `html.parser`. That alone is often a 2-5x speedup.
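A rough benchmark sketch on a synthetic page (absolute numbers depend on your machine; the point is the ratio, not the milliseconds):

```python
import timeit

import lxml.html
from bs4 import BeautifulSoup

# Synthetic listing page with a few thousand elements.
html = (
    "<html><body>"
    + "<div class='row'><a href='/x'>item</a></div>" * 2000
    + "</body></html>"
)

def with_bs4():
    soup = BeautifulSoup(html, "lxml")
    return [a["href"] for a in soup.find_all("a")]

def with_lxml():
    tree = lxml.html.fromstring(html)
    return tree.xpath("//a/@href")

assert with_bs4() == with_lxml()  # same results either way
print("bs4 :", timeit.timeit(with_bs4, number=20))
print("lxml:", timeit.timeit(with_lxml, number=20))
```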
Error tolerance
lxml.html is HTML-aware: it fixes mismatched tags, missing closing tags, and badly nested elements; even on truly malformed input, its tolerance is comparable to BeautifulSoup's. For XML-strict requirements (e.g. parsing a sitemap), use `lxml.etree` instead: it refuses broken markup loudly, which is what you want for XML feeds.
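The contrast in one snippet: the HTML parser repairs a mismatched tag silently, while `lxml.etree` raises `XMLSyntaxError`.

```python
import lxml.etree
import lxml.html

broken = "<root><item>unclosed</root>"

# lxml.html repairs the markup and carries on.
print(lxml.html.fromstring(broken).xpath("//item/text()"))  # ['unclosed']

# lxml.etree refuses it -- the right behaviour for XML feeds.
try:
    lxml.etree.fromstring(broken)
    strict_ok = True
except lxml.etree.XMLSyntaxError as e:
    strict_ok = False
    print("rejected:", e)
```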
Combining lxml and BeautifulSoup
There's no rule against using both in one project. A common pattern: parse with lxml for speed, then hand specific subtrees to BeautifulSoup for ergonomic extraction. They share no internal state, but you can serialize a sub-element back to an HTML string and re-parse:
import lxml.html
from bs4 import BeautifulSoup

tree = lxml.html.fromstring(html_bytes)  # html_bytes: the raw response body
card_el = tree.xpath("//article")[0]     # some sub-element worth re-parsing
card_html = lxml.html.tostring(card_el, encoding="unicode")
card_soup = BeautifulSoup(card_html, "lxml")
In practice, pick one and stick to it per project. Mixing is a debugging headache.
Hands-on lab
The /challenges/static/tables/nested page contains tables-within-tables. Use lxml XPath to extract ONLY the outer-table rows, ignoring the nested inner tables. Then write the same query with BeautifulSoup's recursive=False for comparison. Time both implementations across 100 iterations and observe the difference.
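A starting-point sketch for the lab, using made-up nested-table markup in place of the live page (note that lxml's HTML parser does not insert `<tbody>` the way browsers do, so `tr` elements sit directly under `table` here):

```python
import lxml.html
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the lab page's shape.
html = """
<table id="outer">
  <tr><td>outer 1</td></tr>
  <tr><td><table><tr><td>inner</td></tr></table></td></tr>
  <tr><td>outer 2</td></tr>
</table>
"""

# lxml: the outer table is the one with no <table> ancestor;
# ./tr keeps only ITS direct rows, excluding the nested table's row.
tree = lxml.html.fromstring(html)
outer_rows = tree.xpath("//table[not(ancestor::table)]/tr")
print(len(outer_rows))  # 3

# BeautifulSoup equivalent: recursive=False stops at direct children.
soup = BeautifulSoup(html, "lxml")
bs_rows = soup.find("table").find_all("tr", recursive=False)
print(len(bs_rows))  # 3
```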