
HTML Parsing with BeautifulSoup - Complete Guide

Master HTML parsing with BeautifulSoup4 in Python. Learn to navigate the DOM, find elements, extract text, and handle attributes.


BeautifulSoup is one of the most widely used Python libraries for parsing HTML and XML. It turns messy web pages into a navigable tree of Python objects.

Installation

pip install beautifulsoup4 lxml

Use lxml as the parser; it is generally much faster than the built-in html.parser, especially on large documents.
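If a script has to run where lxml may not be installed, a fallback keeps it working. The sketch below is one possible approach, not a required pattern; make_soup is a hypothetical helper name.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup: str) -> BeautifulSoup:
    """Parse with lxml when available, otherwise fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # BeautifulSoup raises FeatureNotFound when the requested parser
        # is not installed; html.parser ships with Python itself.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.p.get_text())  # Hello, world
```

The two parsers can build slightly different trees from broken markup, so results on malformed pages may differ depending on which one is picked.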

Basic Parsing

from bs4 import BeautifulSoup

html = """
<html>
<body>
  <h1 class="title">Scraping Central</h1>
  <div class="products">
    <div class="product" data-id="1">
      <span class="name">Proxy Service</span>
      <span class="price">$29.99</span>
    </div>
    <div class="product" data-id="2">
      <span class="name">Scraping API</span>
      <span class="price">$49.99</span>
    </div>
  </div>
</body>
</html>
"""

soup = BeautifulSoup(html, "lxml")
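Once the document is parsed, the tree can be walked directly. The following is a minimal sketch on a shortened copy of the sample markup; it uses the built-in html.parser only so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 class="title">Scraping Central</h1>
  <div class="products">
    <div class="product" data-id="1"><span class="name">Proxy Service</span></div>
    <div class="product" data-id="2"><span class="name">Scraping API</span></div>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

heading = soup.h1            # dotted access returns the first <h1> in the tree
print(heading.name)          # h1
print(heading["class"])      # ['title'] (class is multi-valued, so it is a list)
print(heading.get_text())    # Scraping Central

container = soup.find("div", class_="products")
# recursive=False limits the search to direct children of the container
direct = container.find_all("div", recursive=False)
print(len(direct))           # 2
```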

Finding Elements

# Single element
title = soup.find("h1", class_="title")
print(title.text)  # "Scraping Central"

# All matching elements
products = soup.find_all("div", class_="product")
for product in products:
    name = product.find("span", class_="name").text
    price = product.find("span", class_="price").text
    data_id = product["data-id"]
    print(f"[{data_id}] {name}: {price}")
# Output:
# [1] Proxy Service: $29.99
# [2] Scraping API: $49.99
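In practice the extracted strings usually need converting into typed values. The sketch below collects the same products into dictionaries and shows two find_all keyword filters; the record layout and the "$" stripping are assumptions for this sample markup, not a general price parser.

```python
from bs4 import BeautifulSoup

html = """
<div class="products">
  <div class="product" data-id="1">
    <span class="name">Proxy Service</span><span class="price">$29.99</span>
  </div>
  <div class="product" data-id="2">
    <span class="name">Scraping API</span><span class="price">$49.99</span>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for product in soup.find_all("div", class_="product"):
    price_text = product.find("span", class_="price").get_text(strip=True)
    records.append({
        "id": int(product["data-id"]),            # attribute values are strings
        "name": product.find("span", class_="name").get_text(strip=True),
        "price": float(price_text.lstrip("$")),   # assumes a leading "$" only
    })

# attrs matches on an attribute value; limit caps the number of results
second = soup.find_all("div", attrs={"data-id": "2"}, limit=1)

print(records[0])
print(second[0].find("span", class_="name").text)  # Scraping API
```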

CSS Selectors

The select() and select_one() methods accept CSS selectors, which are often more concise than chained find() calls:

# Select by class
prices = soup.select(".product .price")
for p in prices:
    print(p.text)

# Select by attribute
products = soup.select("div[data-id]")

# Select the first child (use :nth-child(n) for other positions)
first_product = soup.select_one(".product:first-child .name")

# Combined selectors
links = soup.select("nav > ul > li > a[href]")
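To make the selector behaviour concrete, here is a small sketch run against the product markup from earlier. It assumes the same class names and data-id values; the nav selector is included only to show that a non-matching select() returns an empty list rather than raising.

```python
from bs4 import BeautifulSoup

html = """
<div class="products">
  <div class="product" data-id="1">
    <span class="name">Proxy Service</span><span class="price">$29.99</span>
  </div>
  <div class="product" data-id="2">
    <span class="name">Scraping API</span><span class="price">$49.99</span>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute selector with an exact value
by_id = soup.select_one('div[data-id="2"] .name').get_text()

# :nth-child counts element siblings only, so whitespace between tags is ignored
second_price = soup.select_one(".product:nth-child(2) .price").get_text()

# Child combinator: .name elements that are direct children of a .product
names = [el.get_text() for el in soup.select(".product > .name")]

# No <nav> in this markup: select() returns an empty list, select_one() returns None
nav_links = soup.select("nav > ul > li > a[href]")

print(by_id, second_price, names, nav_links)
```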

Extracting Data

from bs4 import BeautifulSoup
import requests

response = requests.get("https://quotes.toscrape.com/", timeout=15)
response.raise_for_status()  # fail early instead of parsing an error page
soup = BeautifulSoup(response.text, "lxml")

quotes = []
for q in soup.select(".quote"):
    quotes.append({
        "text": q.select_one(".text").get_text(strip=True),
        "author": q.select_one(".author").get_text(strip=True),
        "tags": [tag.text for tag in q.select(".tag")],
    })

for quote in quotes[:3]:
    print(f'"{quote["text"][:60]}..." - {quote["author"]}')
    print(f'  Tags: {", ".join(quote["tags"])}')
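Separating parsing from fetching makes the extraction logic testable without network access. The sketch below wraps the same selectors in a function and runs it on a hand-written sample; parse_quotes and the sample markup are hypothetical, and the markup only imitates the class names used above rather than reproducing the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the .quote/.text/.author/.tag class names
SAMPLE = """
<div class="quote">
  <span class="text">Sample quote one.</span>
  <span>by <small class="author">Jane Doe</small></span>
  <div class="tags"><a class="tag">life</a> <a class="tag">humor</a></div>
</div>
<div class="quote">
  <span class="text">Quote with no author element.</span>
</div>
"""


def parse_quotes(markup: str) -> list[dict]:
    """Return one dict per complete .quote block in the given HTML."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for q in soup.select(".quote"):
        text_el = q.select_one(".text")
        author_el = q.select_one(".author")
        if text_el is None or author_el is None:
            continue  # skip incomplete blocks instead of raising AttributeError
        results.append({
            "text": text_el.get_text(strip=True),
            "author": author_el.get_text(strip=True),
            "tags": [tag.get_text(strip=True) for tag in q.select(".tag")],
        })
    return results


quotes = parse_quotes(SAMPLE)
print(quotes)
```

With this split, the live version reduces to calling parse_quotes(response.text) after the request succeeds.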

Common Methods Reference

Method               Purpose                      Example
find()               First matching element       soup.find("a", href=True)
find_all()           All matching elements        soup.find_all("p")
select_one()         First CSS match              soup.select_one(".price")
select()             All CSS matches              soup.select("table tr")
.text / .get_text()  Extract text content         el.get_text(strip=True)
.attrs               Get all attributes           el.attrs -> {"href": "..."}
["attr"]             Get specific attribute       el["href"]
.get("attr")         Safe attribute access        el.get("href", "")
.parent              Parent element               el.parent
.children            Direct children (iterator)   list(el.children)
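A few of the attribute and navigation entries are easier to see side by side. This is a small sketch on a made-up menu snippet; the markup is illustrative only.

```python
from bs4 import BeautifulSoup, Tag

html = (
    '<ul id="menu">'
    '<li><a href="/docs" title="Docs">Docs</a></li>'
    '<li><a>No link</a></li>'
    "</ul>"
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

print(first.attrs)            # {'href': '/docs', 'title': 'Docs'}
print(first["href"])          # /docs
print(second.get("href", "")) # empty string; second["href"] would raise KeyError
print(first.parent.name)      # li

# .children is an iterator and can include text nodes as well as tags,
# so filter on Tag when only elements are wanted
items = [child for child in soup.ul.children if isinstance(child, Tag)]
print(len(items))             # 2
```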

Handling Missing Elements

Always guard against elements that might not exist:

# Bad: crashes if element not found
title = soup.find("h2").text  # AttributeError if no h2

# Good: safe access
el = soup.find("h2")
title = el.text if el else "No title"

# Compact: getattr() supplies the default when select_one() returns None
title = getattr(soup.select_one("h2"), "text", "No title")
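When the same guard is needed in many places, it can be folded into a helper. The function below is a sketch; text_or_default is a hypothetical name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def text_or_default(root, selector: str, default: str = "") -> str:
    """Return the stripped text of the first CSS match, or default if none."""
    el = root.select_one(selector)
    return el.get_text(strip=True) if el is not None else default


soup = BeautifulSoup("<div><h1> Title </h1></div>", "html.parser")

print(text_or_default(soup, "h1"))              # Title
print(text_or_default(soup, "h2", "No title"))  # No title
```

Because root can be any tag as well as the whole soup, the helper also works inside loops over product or quote blocks.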

BeautifulSoup parses only the HTML it is given and does not execute JavaScript. When a page builds its content client-side, fetch the fully rendered HTML first, for example with ScrapingAnt, and then pass it to BeautifulSoup.

Next Steps

  • Learn CSS selectors vs XPath for targeting elements
  • Parse HTML tables into pandas DataFrames
  • Handle malformed and broken HTML