HTML Parsing with BeautifulSoup - Complete Guide
Master HTML parsing with BeautifulSoup4 in Python. Learn to navigate the DOM, find elements, extract text, and handle attributes.
BeautifulSoup is one of the most widely used Python libraries for parsing HTML and XML. It turns messy web pages into a navigable tree of Python objects.
Installation
pip install beautifulsoup4 lxml
Use lxml as the parser; it is generally faster than the built-in html.parser.
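If lxml might not be installed everywhere your code runs, one option is to fall back to the standard-library parser. The sketch below assumes a hypothetical helper named make_soup (not part of BeautifulSoup) and relies on the FeatureNotFound exception that bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup: str) -> BeautifulSoup:
    """Hypothetical helper: parse with lxml if present, else html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='note'>Hello</p>")
print(soup.p.get_text())
```

The two parsers can build slightly different trees for malformed markup, so it is worth testing against real pages with whichever parser you deploy.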
Basic Parsing
from bs4 import BeautifulSoup
html = """
<html>
<body>
<h1 class="title">Scraping Central</h1>
<div class="products">
<div class="product" data-id="1">
<span class="name">Proxy Service</span>
<span class="price">$29.99</span>
</div>
<div class="product" data-id="2">
<span class="name">Scraping API</span>
<span class="price">$49.99</span>
</div>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, "lxml")
Finding Elements
# Single element
title = soup.find("h1", class_="title")
print(title.text) # "Scraping Central"
# All matching elements
products = soup.find_all("div", class_="product")
for product in products:
    name = product.find("span", class_="name").text
    price = product.find("span", class_="price").text
    data_id = product["data-id"]
    print(f"[{data_id}] {name}: {price}")
# Output:
# [1] Proxy Service: $29.99
# [2] Scraping API: $49.99
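In practice you usually want the values in a data structure instead of printed. A minimal sketch of the same find and find_all lookups, collected into a list of dicts; the records name and the dict keys are illustrative choices, and html.parser is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<div class="products">
  <div class="product" data-id="1">
    <span class="name">Proxy Service</span>
    <span class="price">$29.99</span>
  </div>
  <div class="product" data-id="2">
    <span class="name">Scraping API</span>
    <span class="price">$49.99</span>
  </div>
</div>
"""

# html.parser is an assumption here so the sketch has no lxml dependency.
soup = BeautifulSoup(html, "html.parser")

records = []
for product in soup.find_all("div", class_="product"):
    records.append({
        "id": product["data-id"],  # attribute values are strings
        "name": product.find("span", class_="name").get_text(strip=True),
        "price": product.find("span", class_="price").get_text(strip=True),
    })

print(records)
```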
CSS Selectors
The select method supports CSS selectors, which are often more concise:
# Select by class
prices = soup.select(".product .price")
for p in prices:
    print(p.text)
# Select by attribute
products = soup.select("div[data-id]")
# Select the name inside the first child product
first_product = soup.select_one(".product:first-child .name")
# Combined selectors (the sample HTML has no <nav>, so this returns an empty list)
links = soup.select("nav > ul > li > a[href]")
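To see what these calls return, here is a small self-contained sketch run against a trimmed copy of the sample markup. It assumes html.parser so that it works without lxml; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="products">
  <div class="product" data-id="1">
    <span class="name">Proxy Service</span>
    <span class="price">$29.99</span>
  </div>
  <div class="product" data-id="2">
    <span class="name">Scraping API</span>
    <span class="price">$49.99</span>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() always returns a list-like result, even when nothing matches.
prices = [el.get_text() for el in soup.select(".product .price")]

# An attribute selector can also match on a specific value.
second = soup.select_one('div[data-id="2"] .name')

# No <nav> in this markup, so the result is empty, not an error.
nav_links = soup.select("nav > ul > li > a[href]")

print(prices, second.get_text(), nav_links)
```

Note the asymmetry: select returns an empty list when nothing matches, while select_one returns None.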
Extracting Data
from bs4 import BeautifulSoup
import requests
response = requests.get("https://quotes.toscrape.com/", timeout=15)
response.raise_for_status()  # stop early on HTTP errors instead of parsing an error page
soup = BeautifulSoup(response.text, "lxml")
quotes = []
for q in soup.select(".quote"):
    quotes.append({
        "text": q.select_one(".text").get_text(strip=True),
        "author": q.select_one(".author").get_text(strip=True),
        "tags": [tag.text for tag in q.select(".tag")],
    })
for quote in quotes[:3]:
    print(f'"{quote["text"][:60]}..." - {quote["author"]}')
    print(f' Tags: {", ".join(quote["tags"])}')
Common Methods Reference
| Method | Purpose | Example |
|---|---|---|
| find() | First matching element | soup.find("a", href=True) |
| find_all() | All matching elements | soup.find_all("p") |
| select_one() | First CSS match | soup.select_one(".price") |
| select() | All CSS matches | soup.select("table tr") |
| .text / .get_text() | Extract text content | el.get_text(strip=True) |
| .attrs | Get all attributes | el.attrs -> {"href": "..."} |
| ["attr"] | Get specific attribute | el["href"] |
| .get("attr") | Safe attribute access | el.get("href", "") |
| .parent | Parent element | el.parent |
| .children | Direct children | list(el.children) |
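The attribute and navigation entries in the table can be tried on a tiny document. This is a sketch with made-up markup (the #menu list and its links are assumptions for illustration), parsed with html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one link with attributes, one without an href.
html = (
    '<ul id="menu">'
    '<li><a href="/docs" title="Docs">Read the docs</a></li>'
    '<li><a>No link</a></li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a", href=True)   # first <a> that has an href
all_attrs = first_link.attrs              # dict of every attribute
href = first_link["href"]                 # raises KeyError if missing
parent_name = first_link.parent.name      # tag name of the enclosing element

bare_link = soup.find_all("a")[1]
missing_href = bare_link.get("href", "")  # default instead of KeyError

menu = soup.select_one("#menu")
child_names = [child.name for child in menu.children]

print(all_attrs, href, parent_name, repr(missing_href), child_names)
```

On markup with whitespace between tags, .children also yields the text nodes in between, so filter on child.name if you only want tags.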
Handling Missing Elements
Always guard against elements that might not exist:
# Bad: crashes if element not found
title = soup.find("h2").text # AttributeError if no h2
# Good: safe access
el = soup.find("h2")
title = el.text if el else "No title"
# Compact: getattr supplies the default when select_one returns None
title = getattr(soup.select_one("h2"), "text", "No title")
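When the same guard is needed for many fields, it can be wrapped in a small function. The helper below, text_or_default, is a hypothetical name and not part of BeautifulSoup; it is a sketch of one way to centralize the None check.

```python
from bs4 import BeautifulSoup


def text_or_default(root, selector, default=""):
    """Hypothetical helper: text of the first CSS match, or a default."""
    node = root.select_one(selector)
    if node is None:
        return default
    return node.get_text(strip=True)


# html.parser is used so the sketch runs without lxml installed.
soup = BeautifulSoup("<div><h1> Scraping Central </h1></div>", "html.parser")

heading = text_or_default(soup, "h1", "No title")
subtitle = text_or_default(soup, "h2", "No title")
print(heading, "|", subtitle)
```

Because root can be any tag, the same helper works inside a loop over product or quote elements.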
BeautifulSoup only parses the HTML it is given; it does not execute JavaScript. When a page builds its content with JavaScript, use ScrapingAnt to get the fully rendered HTML first, then parse it as usual.
Next Steps
- Learn CSS selectors vs XPath for targeting elements
- Parse HTML tables into pandas DataFrames
- Handle malformed and broken HTML