Scraping Central is reader-supported. When you buy through links on our site, we may earn an affiliate commission.

Parsing HTML Tables into DataFrames

Extract HTML tables from web pages and convert them into pandas DataFrames for analysis. Handle merged cells, multi-row headers, and nested tables.

Data Parsing · #8beginner3 min read
Share:WhatsAppLinkedIn

HTML tables are one of the most common data structures on the web, financial data, sports stats, product comparisons, government records. pandas makes extracting them trivial.

The Easy Way: pandas.read_html()

import pandas as pd

# Read all tables from a URL
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
tables = pd.read_html(url)

print(f"Found {len(tables)} tables on the page")

# The main data table is usually the largest one
df = max(tables, key=len)
print(df.head())

Targeting a Specific Table

import pandas as pd

# Match by text in the table
tables = pd.read_html(
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    match="Paradigm",  # Find table containing this text
)
if tables:
    print(tables[0])

# Match by HTML attributes
tables = pd.read_html(
    "https://example.com",
    attrs={"id": "data-table"},
)

Custom Table Extraction with BeautifulSoup

When read_html does not handle a tricky table layout, extract manually:

from bs4 import BeautifulSoup
import pandas as pd
import requests

html = """
<table class="products">
  <thead>
    <tr>
      <th>Product</th><th>Price</th><th>Rating</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>ScraperAPI</td><td>$49.99</td><td>4.5/5</td></tr>
    <tr><td>ScrapingAnt</td><td>$29.00</td><td>4.2/5</td></tr>
    <tr><td>Bright Data</td><td>$99.00</td><td>4.7/5</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "lxml")
table = soup.select_one("table.products")

# Extract headers
headers = [th.text.strip() for th in table.select("thead th")]

# Extract rows
rows = []
for tr in table.select("tbody tr"):
    row = [td.text.strip() for td in tr.select("td")]
    rows.append(row)

df = pd.DataFrame(rows, columns=headers)
print(df)
      Product  Price Rating
0   ScraperAPI  $49.99  4.5/5
1  ScrapingAnt  $29.00  4.2/5
2   Bright Data  $99.00  4.7/5

Handling Tables with Links and Attributes

Sometimes you need the link, not just the text:

from bs4 import BeautifulSoup
import pandas as pd

html = """
<table>
  <tr><th>Company</th><th>Website</th></tr>
  <tr>
    <td><a href="https://scraperapi.com">ScraperAPI</a></td>
    <td>scraperapi.com</td>
  </tr>
  <tr>
    <td><a href="https://scrapingant.com">ScrapingAnt</a></td>
    <td>scrapingant.com</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "lxml")
rows = []
for tr in soup.select("table tr")[1:]:  # Skip header row
    cells = tr.select("td")
    link = cells[0].select_one("a")
    rows.append({
        "name": link.text.strip(),
        "url": link["href"],
        "domain": cells[1].text.strip(),
    })

df = pd.DataFrame(rows)
print(df)

Scraping Multiple Pages of Tables

import pandas as pd
import requests
import time

all_data = []
for page in range(1, 6):
    url = f"https://example.com/table-data?page={page}"
    try:
        tables = pd.read_html(url)
        if tables:
            all_data.append(tables[0])
            print(f"Page {page}: {len(tables[0])} rows")
    except ValueError:
        print(f"Page {page}: no tables found")
        break
    time.sleep(1)

if all_data:
    combined = pd.concat(all_data, ignore_index=True)
    combined.to_csv("table_data.csv", index=False)
    print(f"Total: {len(combined)} rows")

Common Issues

Problem Solution
No tables found Page uses divs styled as tables, extract manually
Merged cells (colspan/rowspan) pd.read_html handles most cases; complex ones need manual parsing
JavaScript-rendered tables Use ScrapingAnt to get rendered HTML first
Encoding issues Pass encoding="utf-8" to read_html

Next Steps

  • Handle malformed HTML in scraped pages
  • Clean and validate table data with pandas
  • Export data to CSV, JSON, and Excel formats