Extracting Structured Data from Unstructured HTML
Techniques for pulling structured records from messy, inconsistent HTML pages. Handle missing elements, variable layouts, and embedded metadata.
Real-world HTML is messy. Elements are missing, class names are inconsistent, and data is scattered across the page. Here are battle-tested techniques for extracting clean, structured records.
Defensive Extraction Pattern
Always assume elements might not exist:
from bs4 import BeautifulSoup
html = """
<div class="listing">
<h3>ScraperAPI</h3>
<span class="price">$49.99</span>
<p class="desc">Proxy rotation and CAPTCHA solving</p>
</div>
<div class="listing">
<h3>ScrapingAnt</h3>
<!-- price is missing! -->
<p class="desc">Headless browser API</p>
</div>
<div class="listing">
<h3>Bright Data</h3>
<span class="price">Contact Sales</span>
<!-- description missing! -->
</div>
"""
soup = BeautifulSoup(html, "lxml")
def extract_listing(el):
    """Safely extract data from a listing element."""
    return {
        "name": getattr(el.select_one("h3"), "text", "").strip(),
        "price": getattr(el.select_one(".price"), "text", "N/A").strip(),
        "description": getattr(el.select_one(".desc"), "text", "").strip(),
    }

listings = [extract_listing(el) for el in soup.select(".listing")]
for item in listings:
    print(f"{item['name']}: {item['price']} - {item['description'][:40]}")
ScraperAPI: $49.99 - Proxy rotation and CAPTCHA solving
ScrapingAnt: N/A - Headless browser API
Bright Data: Contact Sales -
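The extracted price is still free text: "$49.99", "Contact Sales", or the "N/A" default. The sketch below shows one way to normalize it. The helper name parse_price, and the rule that text without a number becomes None, are assumptions made for illustration and are not part of the example above.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical helper: turn a scraped price string into a Decimal, or None
# when the text holds no number (e.g. "Contact Sales" or the "N/A" default).
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")

def parse_price(text: str) -> Optional[Decimal]:
    match = _NUMBER.search(text)
    if match is None:
        return None
    # Assumes "," is a thousands separator, as in "1,299.00"
    return Decimal(match.group().replace(",", ""))

for raw in ["$49.99", "N/A", "Contact Sales", "$1,299.00"]:
    print(f"{raw!r} -> {parse_price(raw)}")
```

Returning None rather than 0 keeps "no price listed" distinguishable from a genuinely free item.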
Extracting JSON-LD Structured Data
Many sites embed Schema.org structured data as JSON-LD inside <script type="application/ld+json"> tags. When it is present, it is usually the cleanest source:
from bs4 import BeautifulSoup
import json
import requests
response = requests.get("https://www.example.com/product", timeout=15)
soup = BeautifulSoup(response.text, "lxml")
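# Find JSON-LD scripts
for script in soup.select('script[type="application/ld+json"]'):
    try:
        data = json.loads(script.string or "")
    except json.JSONDecodeError:
        continue  # skip empty or malformed blocks
    # Only a single top-level object is handled here
    if isinstance(data, dict) and data.get("@type") == "Product":
        offers = data.get("offers") or {}
        if isinstance(offers, list):  # "offers" may be one Offer or a list
            offers = offers[0] if offers else {}
        print(f"Name: {data.get('name', 'N/A')}")
        print(f"Price: {offers.get('price', 'N/A')} {offers.get('priceCurrency', '')}")
        print(f"Rating: {data.get('aggregateRating', {}).get('ratingValue', 'N/A')}")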
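JSON-LD payloads are not always a single object: the top level can be a list, nodes can be nested under "@graph", and "@type" can be a list of strings. The sketch below walks those shapes. The helper names iter_jsonld_nodes and has_type, and the sample payload, are hypothetical.

```python
import json
from typing import Any, Iterator

def iter_jsonld_nodes(data: Any) -> Iterator[dict]:
    """Yield every JSON object found at the top level, in a list, or under @graph."""
    if isinstance(data, list):
        for item in data:
            yield from iter_jsonld_nodes(item)
    elif isinstance(data, dict):
        yield data
        yield from iter_jsonld_nodes(data.get("@graph", []))

def has_type(node: dict, wanted: str) -> bool:
    """@type may be a single string or a list of strings."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return wanted in types

# Hypothetical payload, standing in for the text of one <script> tag
raw = """
{"@context": "https://schema.org",
 "@graph": [
   {"@type": "WebPage", "name": "Example page"},
   {"@type": ["Product", "Thing"], "name": "Example product",
    "offers": {"@type": "Offer", "price": "49.99", "priceCurrency": "USD"}}
 ]}
"""

products = [n for n in iter_jsonld_nodes(json.loads(raw)) if has_type(n, "Product")]
for product in products:
    print(product["name"], product["offers"]["price"])
```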
Extracting from Meta Tags
Meta tags often carry the page description and Open Graph (og:*) properties such as the title and image:
from bs4 import BeautifulSoup
import requests
response = requests.get("https://quotes.toscrape.com/", timeout=15)
soup = BeautifulSoup(response.text, "lxml")
# Open Graph and standard meta tags
meta_data = {}
for meta in soup.select("meta[property], meta[name]"):
    key = meta.get("property") or meta.get("name")
    value = meta.get("content", "")
    if key and value:
        meta_data[key] = value

for key, value in meta_data.items():
    print(f"{key}: {value}")
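A flat dictionary mixes standard, Open Graph, and other prefixed keys. One possible follow-up step, sketched here, groups them by prefix so that all "og:" values can be read together. The function group_by_prefix and the sample values are hypothetical; the sample only mimics the shape of meta_data.

```python
from collections import defaultdict

def group_by_prefix(meta_data: dict) -> dict:
    """Group keys like "og:title" under their prefix; unprefixed keys go under ""."""
    grouped = defaultdict(dict)
    for key, value in meta_data.items():
        prefix, sep, rest = key.partition(":")
        if sep:
            grouped[prefix][rest] = value
        else:
            grouped[""][key] = value
    return dict(grouped)

# Hypothetical values, in the shape the loop above produces
sample = {
    "description": "Example page",
    "og:title": "Example title",
    "og:type": "website",
    "twitter:card": "summary",
}
grouped = group_by_prefix(sample)
print(grouped["og"])
```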
Pattern: Table-Like Data Without Tables
Sometimes data is displayed in div-based layouts rather than actual tables:
from bs4 import BeautifulSoup
html = """
<div class="specs">
<div class="spec-row">
<span class="label">Requests/mo</span>
<span class="value">100,000</span>
</div>
<div class="spec-row">
<span class="label">Concurrent</span>
<span class="value">50</span>
</div>
<div class="spec-row">
<span class="label">Support</span>
<span class="value">24/7 Email</span>
</div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
specs = {}
for row in soup.select(".spec-row"):
    label = row.select_one(".label")
    value = row.select_one(".value")
    if label is None or value is None:
        continue  # skip rows that are missing either half
    specs[label.text.strip()] = value.text.strip()
print(specs)
# {'Requests/mo': '100,000', 'Concurrent': '50', 'Support': '24/7 Email'}
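Every value in specs is a string, including the numeric ones. The sketch below converts the values that are plain integers and leaves the rest untouched. The helper coerce_value is hypothetical, and it assumes commas are only ever thousands separators.

```python
def coerce_value(text: str):
    """Return an int for digit strings such as "100,000" or "50", else the text unchanged."""
    digits = text.replace(",", "")
    # Assumes "," is only used as a thousands separator
    return int(digits) if digits.isdecimal() else text

specs = {"Requests/mo": "100,000", "Concurrent": "50", "Support": "24/7 Email"}
typed = {label: coerce_value(value) for label, value in specs.items()}
print(typed)
```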
Combining Multiple Sources
The most robust approach tries each source in order of reliability: JSON-LD first, then meta tags, then the visible HTML elements:
import json

def extract_product(soup):
    """Extract product data from multiple sources on the page."""
    product = {}
    # 1. Try JSON-LD first (most reliable)
    for script in soup.select('script[type="application/ld+json"]'):
        try:
            ld = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # empty or malformed block
        if isinstance(ld, dict) and ld.get("@type") == "Product":
            product["name"] = ld.get("name")
            offers = ld.get("offers")
            product["price"] = offers.get("price") if isinstance(offers, dict) else None
            return product
    # 2. Fall back to meta tags
    og_title = soup.select_one('meta[property="og:title"]')
    product["name"] = og_title.get("content") if og_title is not None else None
    # 3. Fall back to HTML elements
    if not product.get("name"):
        product["name"] = getattr(soup.select_one("h1"), "text", "Unknown").strip()
    product["price"] = getattr(soup.select_one(".price"), "text", "N/A").strip()
    return product
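extract_product returns as soon as it finds a JSON-LD Product, even when that block has no price. An alternative is to fill each field from the first source that has a non-empty value. The sketch below assumes each source has already been parsed into a plain dict; merge_sources and the sample dicts are hypothetical.

```python
def merge_sources(*sources: dict) -> dict:
    """Merge dicts in priority order; earlier sources win, empty values are skipped."""
    merged = {}
    for source in sources:
        for field, value in source.items():
            if value in (None, ""):
                continue
            merged.setdefault(field, value)
    return merged

# Hypothetical per-source results, highest priority first
from_jsonld = {"name": "Example product", "price": None}
from_meta = {"name": "Example product | Shop", "description": "A sample item"}
from_html = {"name": "Example product", "price": "$49.99"}

product = merge_sources(from_jsonld, from_meta, from_html)
print(product)
```

Here the name comes from JSON-LD, the description from the meta tags, and the price from the HTML, because the JSON-LD price was empty.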
For JavaScript-heavy pages where structured data is rendered dynamically, ScrapingAnt can return the fully rendered HTML for parsing.
Next Steps
- Parse HTML tables into DataFrames
- Handle malformed and broken HTML
- Extract emails and phone numbers from web pages