Extracting Structured Data from Unstructured HTML - Data Parsing

Techniques for pulling structured records from messy, inconsistent HTML pages. Handle missing elements, variable layouts, and embedded metadata.

Real-world HTML is messy: elements go missing, class names vary from page to page, and data is scattered across the markup. The techniques below extract clean, structured records despite that.

Defensive Extraction Pattern

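Always assume an element might not exist. select_one() returns None when nothing matches, so read attributes through getattr() with a default instead of calling .text directly: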

from bs4 import BeautifulSoup

html = """
<div class="listing">
  <h3>ScraperAPI</h3>
  <span class="price">$49.99</span>
  <p class="desc">Proxy rotation and CAPTCHA solving</p>
</div>
<div class="listing">
  <h3>ScrapingAnt</h3>
  <!-- price is missing! -->
  <p class="desc">Headless browser API</p>
</div>
<div class="listing">
  <h3>Bright Data</h3>
  <span class="price">Contact Sales</span>
  <!-- description missing! -->
</div>
"""

soup = BeautifulSoup(html, "lxml")

def extract_listing(el):
    """Safely extract data from a listing element."""
    return {
        "name": getattr(el.select_one("h3"), "text", "").strip(),
        "price": getattr(el.select_one(".price"), "text", "N/A").strip(),
        "description": getattr(el.select_one(".desc"), "text", "").strip(),
    }

listings = [extract_listing(el) for el in soup.select(".listing")]
for item in listings:
    print(f"{item['name']}: {item['price']} - {item['description'][:40]}")

Output:

ScraperAPI: $49.99 - Proxy rotation and CAPTCHA solving
ScrapingAnt: N/A - Headless browser API
Bright Data: Contact Sales -
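
Once the text is extracted safely, the next step is usually turning it into typed values. The sketch below is one possible way to do that, not part of the original pattern: safe_text and parse_price are hypothetical helper names, the regular expression assumes prices written with a dot as the decimal separator, and html.parser is used only so the example needs nothing beyond bs4.

```python
import re
from bs4 import BeautifulSoup


def safe_text(parent, selector, default=""):
    """Return stripped text of the first match, or default if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else default


def parse_price(raw):
    """Return the price as a float, or None when the text holds no number."""
    # Assumption: digits with optional thousands commas and a dot decimal part
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    return float(match.group().replace(",", "")) if match else None


html = """
<div class="listing"><h3>ScraperAPI</h3><span class="price">$49.99</span></div>
<div class="listing"><h3>ScrapingAnt</h3></div>
<div class="listing"><h3>Bright Data</h3><span class="price">Contact Sales</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
records = [
    {
        "name": safe_text(el, "h3"),
        # A missing element and non-numeric text both end up as None
        "price": parse_price(safe_text(el, ".price")),
    }
    for el in soup.select(".listing")
]
print(records)
```

Returning None for both a missing price and "Contact Sales" keeps the column numeric; if the distinction matters downstream, store the raw text alongside the parsed value.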

Extracting JSON-LD Structured Data

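Many sites embed Schema.org structured data as JSON-LD inside <script type="application/ld+json"> tags. When it is present, it is usually the cleanest source because it is already machine-readable: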

from bs4 import BeautifulSoup
import json
import requests

response = requests.get("https://www.example.com/product", timeout=15)
soup = BeautifulSoup(response.text, "lxml")

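# Find JSON-LD scripts
for script in soup.select('script[type="application/ld+json"]'):
    try:
        # script.string is None for an empty tag, so fall back to ""
        data = json.loads(script.string or "")
    except json.JSONDecodeError:
        continue  # skip empty or malformed blocks
    # A block may hold a single object or a list of objects
    items = data if isinstance(data, list) else [data]
    for item in items:
        if not isinstance(item, dict) or item.get("@type") != "Product":
            continue
        offers = item.get("offers") or {}
        if isinstance(offers, list):  # several offers: use the first
            offers = offers[0]
        print(f"Name: {item.get('name', 'N/A')}")
        print(f"Price: {offers.get('price', 'N/A')} {offers.get('priceCurrency', '')}")
        print(f"Rating: {item.get('aggregateRating', {}).get('ratingValue', 'N/A')}")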
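
JSON-LD in the wild is not always a single flat object: some pages wrap their entities in an "@graph" array, and "@type" may be a list. The following is a minimal sketch of a flattening helper under those assumptions. The names iter_jsonld_items and is_type are hypothetical, and the markup is invented sample data.

```python
import json
from bs4 import BeautifulSoup


def iter_jsonld_items(soup):
    """Yield every JSON-LD object on the page, flattening lists and @graph."""
    for script in soup.select('script[type="application/ld+json"]'):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # skip empty or malformed blocks
        stack = [data]
        while stack:
            node = stack.pop()
            if isinstance(node, list):
                stack.extend(node)
            elif isinstance(node, dict):
                yield node
                graph = node.get("@graph")
                if graph:
                    stack.append(graph)


def is_type(item, wanted):
    """True if the item's @type equals wanted or lists it among several types."""
    declared = item.get("@type")
    if isinstance(declared, list):
        return wanted in declared
    return declared == wanted


# Invented sample markup: one @graph block and one malformed block
html = """
<script type="application/ld+json">
{"@context": "https://schema.org",
 "@graph": [
   {"@type": "WebPage", "name": "Example page"},
   {"@type": "Product", "name": "Example Widget",
    "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}}
 ]}
</script>
<script type="application/ld+json">{ not valid json </script>
"""

soup = BeautifulSoup(html, "html.parser")
products = [item for item in iter_jsonld_items(soup) if is_type(item, "Product")]
for product in products:
    offers = product.get("offers") or {}
    print(product.get("name"), offers.get("price"), offers.get("priceCurrency"))
```

The helper only descends into top-level lists and "@graph"; objects nested inside other properties, such as the Offer here, are reached through their parent.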

Extracting from Meta Tags

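Open Graph tags (property="og:...") and standard meta tags (name="...") often carry the page title, description, and preview image in a predictable form: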

from bs4 import BeautifulSoup
import requests

response = requests.get("https://quotes.toscrape.com/", timeout=15)
soup = BeautifulSoup(response.text, "lxml")

# Open Graph and standard meta tags
meta_data = {}
for meta in soup.select("meta[property], meta[name]"):
    key = meta.get("property") or meta.get("name")
    value = meta.get("content", "")
    if key and value:
        meta_data[key] = value

for key, value in meta_data.items():
    print(f"{key}: {value}")
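
A flat dictionary mixes Open Graph, Twitter, and generic tags together. As a rough sketch, the keys can be grouped by prefix so that a caller can prefer one family over another. collect_meta is a hypothetical helper name, and the markup below is invented sample data rather than the content of any real page.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the last tag deliberately has no content attribute
html = """
<head>
  <title>Example Widget | Example Store</title>
  <meta name="description" content="A widget used for illustration.">
  <meta property="og:title" content="Example Widget">
  <meta property="og:image" content="https://www.example.com/widget.png">
  <meta name="twitter:card" content="summary">
  <meta name="viewport">
</head>
"""


def collect_meta(soup):
    """Group meta tags into Open Graph, Twitter, and all other keys."""
    groups = {"og": {}, "twitter": {}, "other": {}}
    for meta in soup.select("meta[property], meta[name]"):
        key = meta.get("property") or meta.get("name")
        value = (meta.get("content") or "").strip()
        if not key or not value:
            continue  # ignore tags without a usable key or content
        prefix, sep, rest = key.partition(":")
        if sep and prefix in ("og", "twitter"):
            groups[prefix][rest] = value
        else:
            groups["other"][key] = value
    return groups


soup = BeautifulSoup(html, "html.parser")
meta = collect_meta(soup)

# Prefer og:title, then fall back to the <title> element
title = meta["og"].get("title") or soup.title.get_text(strip=True)
print(title)
print(meta)
```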

Pattern: Table-Like Data Without Tables

Sometimes data is displayed in div-based layouts rather than actual tables:

from bs4 import BeautifulSoup

html = """
<div class="specs">
  <div class="spec-row">
    <span class="label">Requests/mo</span>
    <span class="value">100,000</span>
  </div>
  <div class="spec-row">
    <span class="label">Concurrent</span>
    <span class="value">50</span>
  </div>
  <div class="spec-row">
    <span class="label">Support</span>
    <span class="value">24/7 Email</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "lxml")

specs = {}
for row in soup.select(".spec-row"):
    label = row.select_one(".label")
    value = row.select_one(".value")
    if label is None or value is None:
        continue  # skip rows missing either half of the pair
    specs[label.text.strip()] = value.text.strip()

print(specs)
# {'Requests/mo': '100,000', 'Concurrent': '50', 'Support': '24/7 Email'}
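
The same label/value idea shows up in definition lists, where a term and its value are siblings rather than children of a shared row. The sketch below assumes a <dl> layout with hypothetical sample data (the "Region" row is invented) and treats a term with no following <dd> as missing.

```python
from bs4 import BeautifulSoup

# Invented sample markup: "Support" has no <dd> value
html = """
<dl class="specs">
  <dt>Requests/mo</dt><dd>100,000</dd>
  <dt>Concurrent</dt><dd>50</dd>
  <dt>Support</dt>
  <dt>Region</dt><dd>EU</dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")

specs = {}
for term in soup.select("dl.specs dt"):
    # Look at the next <dt> or <dd> sibling, whichever comes first
    sibling = term.find_next_sibling(["dt", "dd"])
    # Only a <dd> that appears before the next <dt> belongs to this term
    if sibling is not None and sibling.name == "dd":
        value = sibling.get_text(strip=True)
    else:
        value = None
    specs[term.get_text(strip=True)] = value

print(specs)
```

Checking the sibling's tag name prevents a term without a value from silently borrowing the value of the term after it.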

Combining Multiple Sources

A robust extractor tries the available sources in order of reliability: JSON-LD first, then meta tags, then visible HTML elements:

import json


def extract_product(soup):
    """Extract product data from multiple sources on the page."""
    product = {}

    # 1. Try JSON-LD first (most reliable)
    for script in soup.select('script[type="application/ld+json"]'):
        try:
            # script.string is None for an empty tag, so fall back to ""
            ld = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        if isinstance(ld, dict) and ld.get("@type") == "Product":
            offers = ld.get("offers")
            product["name"] = ld.get("name")
            # offers may be absent or a list; only read a single offer object
            product["price"] = offers.get("price") if isinstance(offers, dict) else None
            return product

    # 2. Fall back to meta tags
    og_title = soup.select_one('meta[property="og:title"]')
    product["name"] = og_title.get("content") if og_title is not None else None

    # 3. Fall back to HTML elements
    if not product.get("name"):
        product["name"] = getattr(soup.select_one("h1"), "text", "Unknown").strip()

    product["price"] = getattr(soup.select_one(".price"), "text", "N/A").strip()
    return product
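
The function above returns as soon as it finds a JSON-LD Product, even if that object lacks a price. One possible variation, sketched here under the assumption that a per-field fallback is wanted, picks the first non-empty value for each field across all sources. The helper names (first_value, jsonld_product, text_of, attr_of, extract_product_merged) are hypothetical, and the markup is invented sample data.

```python
import json
from bs4 import BeautifulSoup


def first_value(*candidates):
    """Return the first candidate that is not None or blank, else None."""
    for candidate in candidates:
        if candidate is not None and str(candidate).strip():
            return str(candidate).strip()
    return None


def jsonld_product(soup):
    """Return the first top-level JSON-LD Product object, or an empty dict."""
    for script in soup.select('script[type="application/ld+json"]'):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("@type") == "Product":
            return data
    return {}


def text_of(soup, selector):
    """Stripped text of the first match, or None."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def attr_of(soup, selector, attr):
    """Attribute value of the first match, or None."""
    node = soup.select_one(selector)
    return node.get(attr) if node is not None else None


def extract_product_merged(soup):
    """Fill each field from the most reliable source that has a value."""
    ld = jsonld_product(soup)
    offers = ld.get("offers")
    offers = offers if isinstance(offers, dict) else {}
    return {
        "name": first_value(
            ld.get("name"),
            attr_of(soup, 'meta[property="og:title"]', "content"),
            text_of(soup, "h1"),
        ),
        "price": first_value(offers.get("price"), text_of(soup, ".price")),
    }


# Invented sample: JSON-LD has a name but no offers, so price comes from HTML
html = """
<script type="application/ld+json">{"@type": "Product", "name": "Example Widget"}</script>
<h1>Widget page heading</h1>
<span class="price"> $19.99 </span>
"""

product = extract_product_merged(BeautifulSoup(html, "html.parser"))
print(product)
```

The trade-off is that fields in one record may come from different sources, so it can be worth recording which source supplied each value when consistency matters.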

For JavaScript-heavy pages where structured data is rendered dynamically, ScrapingAnt can return the fully rendered HTML for parsing.

Next Steps

Parse HTML tables into DataFrames
Handle malformed and broken HTML
Extract emails and phone numbers from web pages