Parsing HTML Tables into DataFrames - Data Parsing

Extract HTML tables from web pages and convert them into pandas DataFrames for analysis. Handle merged cells, multi-row headers, and nested tables.

HTML tables are one of the most common data structures on the web, financial data, sports stats, product comparisons, government records. pandas makes extracting them trivial.

The Easy Way: pandas.read_html()

import pandas as pd

# Read all tables from a URL
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
tables = pd.read_html(url)

print(f"Found {len(tables)} tables on the page")

# The main data table is usually the largest one
df = max(tables, key=len)
print(df.head())

Targeting a Specific Table

import pandas as pd

# Match by text in the table
tables = pd.read_html(
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    match="Paradigm",  # Find table containing this text
)
if tables:
    print(tables[0])

# Match by HTML attributes
tables = pd.read_html(
    "https://example.com",
    attrs={"id": "data-table"},
)

Custom Table Extraction with BeautifulSoup

When read_html does not handle a tricky table layout, extract manually:

from bs4 import BeautifulSoup
import pandas as pd
import requests

html = """
<table class="products">
  <thead>
    <tr>
      <th>Product</th><th>Price</th><th>Rating</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>ScraperAPI</td><td>$49.99</td><td>4.5/5</td></tr>
    <tr><td>ScrapingAnt</td><td>$29.00</td><td>4.2/5</td></tr>
    <tr><td>Bright Data</td><td>$99.00</td><td>4.7/5</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "lxml")
table = soup.select_one("table.products")

# Extract headers
headers = [th.text.strip() for th in table.select("thead th")]

# Extract rows
rows = []
for tr in table.select("tbody tr"):
    row = [td.text.strip() for td in tr.select("td")]
    rows.append(row)

df = pd.DataFrame(rows, columns=headers)
print(df)

      Product  Price Rating
0   ScraperAPI  $49.99  4.5/5
1  ScrapingAnt  $29.00  4.2/5
2   Bright Data  $99.00  4.7/5

Handling Tables with Links and Attributes

Sometimes you need the link, not just the text:

from bs4 import BeautifulSoup
import pandas as pd

html = """
<table>
  <tr><th>Company</th><th>Website</th></tr>
  <tr>
    <td><a href="https://scraperapi.com">ScraperAPI</a></td>
    <td>scraperapi.com</td>
  </tr>
  <tr>
    <td><a href="https://scrapingant.com">ScrapingAnt</a></td>
    <td>scrapingant.com</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "lxml")
rows = []
for tr in soup.select("table tr")[1:]:  # Skip header row
    cells = tr.select("td")
    link = cells[0].select_one("a")
    rows.append({
        "name": link.text.strip(),
        "url": link["href"],
        "domain": cells[1].text.strip(),
    })

df = pd.DataFrame(rows)
print(df)

Scraping Multiple Pages of Tables

import pandas as pd
import requests
import time

all_data = []
for page in range(1, 6):
    url = f"https://example.com/table-data?page={page}"
    try:
        tables = pd.read_html(url)
        if tables:
            all_data.append(tables[0])
            print(f"Page {page}: {len(tables[0])} rows")
    except ValueError:
        print(f"Page {page}: no tables found")
        break
    time.sleep(1)

if all_data:
    combined = pd.concat(all_data, ignore_index=True)
    combined.to_csv("table_data.csv", index=False)
    print(f"Total: {len(combined)} rows")

Common Issues

Problem	Solution
No tables found	Page uses divs styled as tables, extract manually
Merged cells (colspan/rowspan)	`pd.read_html` handles most cases; complex ones need manual parsing
JavaScript-rendered tables	Use ScrapingAnt to get rendered HTML first
Encoding issues	Pass `encoding="utf-8"` to `read_html`

Next Steps

Handle malformed HTML in scraped pages
Clean and validate table data with pandas
Export data to CSV, JSON, and Excel formats