Parsing HTML Tables into DataFrames
Extract HTML tables from web pages and convert them into pandas DataFrames for analysis. Handle merged cells, multi-row headers, and nested tables.
Data Parsing · #8beginner3 min read
HTML tables are one of the most common data structures on the web, financial data, sports stats, product comparisons, government records. pandas makes extracting them trivial.
The Easy Way: pandas.read_html()
import pandas as pd
# Read all tables from a URL
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
tables = pd.read_html(url)
print(f"Found {len(tables)} tables on the page")
# The main data table is usually the largest one
df = max(tables, key=len)
print(df.head())
Targeting a Specific Table
import pandas as pd
# Match by text in the table
tables = pd.read_html(
"https://en.wikipedia.org/wiki/Python_(programming_language)",
match="Paradigm", # Find table containing this text
)
if tables:
print(tables[0])
# Match by HTML attributes
tables = pd.read_html(
"https://example.com",
attrs={"id": "data-table"},
)
Custom Table Extraction with BeautifulSoup
When read_html does not handle a tricky table layout, extract manually:
from bs4 import BeautifulSoup
import pandas as pd
import requests
html = """
<table class="products">
<thead>
<tr>
<th>Product</th><th>Price</th><th>Rating</th>
</tr>
</thead>
<tbody>
<tr><td>ScraperAPI</td><td>$49.99</td><td>4.5/5</td></tr>
<tr><td>ScrapingAnt</td><td>$29.00</td><td>4.2/5</td></tr>
<tr><td>Bright Data</td><td>$99.00</td><td>4.7/5</td></tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, "lxml")
table = soup.select_one("table.products")
# Extract headers
headers = [th.text.strip() for th in table.select("thead th")]
# Extract rows
rows = []
for tr in table.select("tbody tr"):
row = [td.text.strip() for td in tr.select("td")]
rows.append(row)
df = pd.DataFrame(rows, columns=headers)
print(df)
Product Price Rating
0 ScraperAPI $49.99 4.5/5
1 ScrapingAnt $29.00 4.2/5
2 Bright Data $99.00 4.7/5
Handling Tables with Links and Attributes
Sometimes you need the link, not just the text:
from bs4 import BeautifulSoup
import pandas as pd
html = """
<table>
<tr><th>Company</th><th>Website</th></tr>
<tr>
<td><a href="https://scraperapi.com">ScraperAPI</a></td>
<td>scraperapi.com</td>
</tr>
<tr>
<td><a href="https://scrapingant.com">ScrapingAnt</a></td>
<td>scrapingant.com</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "lxml")
rows = []
for tr in soup.select("table tr")[1:]: # Skip header row
cells = tr.select("td")
link = cells[0].select_one("a")
rows.append({
"name": link.text.strip(),
"url": link["href"],
"domain": cells[1].text.strip(),
})
df = pd.DataFrame(rows)
print(df)
Scraping Multiple Pages of Tables
import pandas as pd
import requests
import time
all_data = []
for page in range(1, 6):
url = f"https://example.com/table-data?page={page}"
try:
tables = pd.read_html(url)
if tables:
all_data.append(tables[0])
print(f"Page {page}: {len(tables[0])} rows")
except ValueError:
print(f"Page {page}: no tables found")
break
time.sleep(1)
if all_data:
combined = pd.concat(all_data, ignore_index=True)
combined.to_csv("table_data.csv", index=False)
print(f"Total: {len(combined)} rows")
Common Issues
| Problem | Solution |
|---|---|
| No tables found | Page uses divs styled as tables, extract manually |
| Merged cells (colspan/rowspan) | pd.read_html handles most cases; complex ones need manual parsing |
| JavaScript-rendered tables | Use ScrapingAnt to get rendered HTML first |
| Encoding issues | Pass encoding="utf-8" to read_html |
Next Steps
- Handle malformed HTML in scraped pages
- Clean and validate table data with pandas
- Export data to CSV, JSON, and Excel formats