HTML Parsing with BeautifulSoup - Complete Guide

Master HTML parsing with BeautifulSoup4 in Python. Learn to navigate the DOM, find elements, extract text, and handle attributes.

BeautifulSoup is one of the most widely used Python libraries for parsing HTML and XML. It turns messy web pages into a navigable tree of Python objects.

Installation

pip install beautifulsoup4 lxml

Use lxml as the parser; it is typically much faster than the built-in html.parser.
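
If lxml might not be installed everywhere the code runs, one option is to fall back to the standard-library parser. The following is a minimal sketch, not a required pattern: `make_soup` is a hypothetical helper name, and the fallback relies on bs4 raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup: str) -> BeautifulSoup:
    """Parse markup with lxml when available, else with html.parser.

    make_soup is a hypothetical helper name, not part of bs4.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; the standard-library parser always is.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='greeting'>Hello</p>")
print(soup.p.get_text())  # Hello
```

The two parsers can build slightly different trees from broken markup, so it is worth testing against the parser that will run in production.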

Basic Parsing

from bs4 import BeautifulSoup

html = """
<html>
<body>
  <h1 class="title">Scraping Central</h1>
  <div class="products">
    <div class="product" data-id="1">
      <span class="name">Proxy Service</span>
      <span class="price">$29.99</span>
    </div>
    <div class="product" data-id="2">
      <span class="name">Scraping API</span>
      <span class="price">$49.99</span>
    </div>
  </div>
</body>
</html>
"""

soup = BeautifulSoup(html, "lxml")

Finding Elements

# Single element
title = soup.find("h1", class_="title")
print(title.text)  # "Scraping Central"

# All matching elements
products = soup.find_all("div", class_="product")
for product in products:
    name = product.find("span", class_="name").text
    price = product.find("span", class_="price").text
    data_id = product["data-id"]
    print(f"[{data_id}] {name}: {price}")

Output:

[1] Proxy Service: $29.99
[2] Scraping API: $49.99

CSS Selectors

The select method supports CSS selectors, which are often more concise:

# Select by class
prices = soup.select(".product .price")
for p in prices:
    print(p.text)

# Select by attribute
products = soup.select("div[data-id]")

# Select the first child (use :nth-child(n) for other positions)
first_product = soup.select_one(".product:first-child .name")

# Combined selectors (returns an empty list for the sample HTML, which has no <nav>)
links = soup.select("nav > ul > li > a[href]")
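
To see what these selectors return, here is a self-contained sketch run against a trimmed copy of the sample markup. It uses html.parser only so that it runs without lxml installed; that choice and the variable names are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="products">
  <div class="product" data-id="1">
    <span class="name">Proxy Service</span>
    <span class="price">$29.99</span>
  </div>
  <div class="product" data-id="2">
    <span class="name">Scraping API</span>
    <span class="price">$49.99</span>
  </div>
</div>
"""

# html.parser is used here only to avoid the lxml dependency.
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: every .price inside a .product
price_texts = [p.get_text() for p in soup.select(".product .price")]

# Attribute selector: only divs that carry a data-id attribute
ids = [div["data-id"] for div in soup.select("div[data-id]")]

# select_one returns the first match, or None when nothing matches
first = soup.select_one(".product:first-child .name")

# select returns an empty list when nothing matches
missing = soup.select("nav > ul > li > a[href]")

print(price_texts)       # ['$29.99', '$49.99']
print(ids)               # ['1', '2']
print(first.get_text())  # Proxy Service
print(missing)           # []
```

Note the difference in the no-match case: select gives an empty list that is safe to loop over, while select_one gives None and needs a check before use.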

Extracting Data

from bs4 import BeautifulSoup
import requests

response = requests.get("https://quotes.toscrape.com/", timeout=15)
response.raise_for_status()  # stop early on HTTP errors such as 404 or 500
soup = BeautifulSoup(response.text, "lxml")

quotes = []
for q in soup.select(".quote"):
    quotes.append({
        "text": q.select_one(".text").get_text(strip=True),
        "author": q.select_one(".author").get_text(strip=True),
        "tags": [tag.text for tag in q.select(".tag")],
    })

for quote in quotes[:3]:
    print(f'"{quote["text"][:60]}..." - {quote["author"]}')
    print(f'  Tags: {", ".join(quote["tags"])}')

Common Methods Reference

Method	Purpose	Example
`find()`	First matching element	`soup.find("a", href=True)`
`find_all()`	All matching elements	`soup.find_all("p")`
`select_one()`	First CSS match	`soup.select_one(".price")`
`select()`	All CSS matches	`soup.select("table tr")`
`.text` / `.get_text()`	Extract text content	`el.get_text(strip=True)`
`.attrs`	Get all attributes	`el.attrs` -> `{"href": "..."}`
`["attr"]`	Get specific attribute	`el["href"]`
`.get("attr")`	Safe attribute access	`el.get("href", "")`
`.parent`	Parent element	`el.parent`
`.children`	Direct children	`list(el.children)`
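
The attribute and navigation entries in the table can be tried on a small fragment. This is an illustrative sketch: the markup and variable names are made up for the example, and html.parser is used so that it runs without lxml.

```python
from bs4 import BeautifulSoup, Tag

html = (
    '<ul id="menu">'
    '<li><a href="/home" class="nav link">Home</a></li>'
    '<li><a>About</a></li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

first_link, second_link = soup.find_all("a")

# .attrs is a dict; multi-valued attributes such as class come back as lists
print(first_link.attrs)             # {'href': '/home', 'class': ['nav', 'link']}

# ["attr"] raises KeyError when the attribute is missing; .get() does not
print(first_link["href"])           # /home
print(second_link.get("href", ""))  # prints an empty line

# .parent walks up one level
print(first_link.parent.name)       # li

# .children can also yield text nodes (for example whitespace between tags),
# so filter on Tag when only elements are wanted
menu = soup.find("ul")
child_tags = [child.name for child in menu.children if isinstance(child, Tag)]
print(child_tags)                   # ['li', 'li']
```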

Handling Missing Elements

Always guard against elements that might not exist:

# Bad: crashes if element not found
title = soup.find("h2").text  # AttributeError if no h2

# Good: safe access
el = soup.find("h2")
title = el.text if el else "No title"

# Alternative: a one-liner using getattr with a default
title = getattr(soup.select_one("h2"), "text", "No title")
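
When the same guard is needed for many fields, it can be wrapped in a small function. A minimal sketch, assuming a hypothetical helper named `text_or_default` (it is not part of bs4):

```python
from bs4 import BeautifulSoup


def text_or_default(node, selector, default=""):
    """Return the stripped text of the first CSS match under node, or default.

    text_or_default is a hypothetical helper, not a bs4 function.
    """
    match = node.select_one(selector)
    return match.get_text(strip=True) if match is not None else default


html = '<div class="product"><span class="name"> Proxy Service </span></div>'
soup = BeautifulSoup(html, "html.parser")
product = soup.select_one(".product")

print(text_or_default(product, ".name"))          # Proxy Service
print(text_or_default(product, ".price", "n/a"))  # n/a
```

Keeping the check in one place makes each extraction a single line and makes the chosen default explicit for every field.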

BeautifulSoup only parses the HTML it is given and does not execute JavaScript. For pages that need JavaScript rendering, use ScrapingAnt to get the fully rendered HTML first, then parse it with BeautifulSoup.

Next Steps

Learn CSS selectors vs XPath for targeting elements
Parse HTML tables into pandas DataFrames
Handle malformed and broken HTML