Extracting Data from HTML Tables
Scrape HTML tables from websites using BeautifulSoup and pandas. Handle complex tables with rowspan, colspan, and nested elements.
HTML tables are one of the most common structures for organized data on the web: financial reports, sports statistics, comparison charts, and more. Python makes extracting this data straightforward.
Quick Method: pandas read_html
The fastest way to extract tables from a webpage is pandas.read_html(), which finds all <table> elements and converts them to DataFrames.
pip install pandas lxml
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
tables = pd.read_html(url)
print(f"Found {len(tables)} tables on the page")
# The main data table is usually the largest, so pick the one with the most rows
df = max(tables, key=len)
print(df.head(10))
# Save to CSV
df.to_csv("population_data.csv", index=False)
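When a page holds several tables, choosing one by position or size can be brittle. As a minimal sketch, assuming you know some text that appears only in the target table, the `match` parameter of `read_html()` keeps just the tables containing that text. The inline HTML and the names `sample_html` and `price_tables` are made up for illustration, and the string is wrapped in `StringIO` because recent pandas versions deprecate passing literal HTML directly.

```python
from io import StringIO

import pandas as pd

# Hypothetical page containing a layout table and a data table
sample_html = """
<table>
  <tr><td>Home</td><td>About</td></tr>
</table>
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget Pro</td><td>$29.99</td></tr>
  <tr><td>Gadget Plus</td><td>$49.99</td></tr>
</table>
"""

# match accepts a string or regex; only tables whose text matches are returned
price_tables = pd.read_html(StringIO(sample_html), match="Price")
df = price_tables[0]

print(f"Matched {len(price_tables)} table(s)")
print(df)
```

The same `match` argument works when the first argument is a URL, so it can replace index-based selection in the example above.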
Manual Extraction with BeautifulSoup
For more control over the extraction process, parse the table manually.
from bs4 import BeautifulSoup
# For a live page, fetch the HTML first, for example with
# requests.get(url, timeout=10), and pass response.text to BeautifulSoup.
# This example parses an inline table so it runs without network access.
html_with_table = """
<table class="data-table">
<thead>
<tr>
<th>Product</th>
<th>Price</th>
<th>Stock</th>
</tr>
</thead>
<tbody>
<tr>
<td>Widget Pro</td>
<td>$29.99</td>
<td>In Stock</td>
</tr>
<tr>
<td>Gadget Plus</td>
<td>$49.99</td>
<td>Out of Stock</td>
</tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html_with_table, "html.parser")
table = soup.select_one("table.data-table")
# Extract headers
headers = [th.get_text(strip=True) for th in table.select("thead th")]
# Extract rows
rows = []
for tr in table.select("tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    rows.append(dict(zip(headers, cells)))
for row in rows:
    print(row)
# {'Product': 'Widget Pro', 'Price': '$29.99', 'Stock': 'In Stock'}
# {'Product': 'Gadget Plus', 'Price': '$49.99', 'Stock': 'Out of Stock'}
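Once the rows are a list of dicts, the standard library's `csv.DictWriter` can save them. This is a small sketch, not part of the original walkthrough: the helper name `rows_to_csv` is hypothetical, and the sample rows simply repeat the output shown above. It writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io

# Sample rows in the same shape the manual extraction produces
rows = [
    {"Product": "Widget Pro", "Price": "$29.99", "Stock": "In Stock"},
    {"Product": "Gadget Plus", "Price": "$49.99", "Stock": "Out of Stock"},
]


def rows_to_csv(rows, fileobj):
    """Write a list of dicts to an open text file object as CSV."""
    if not rows:
        return
    # Column order follows the keys of the first row
    writer = csv.DictWriter(fileobj, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# In-memory demo; for a real file use
# open("products.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```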
A Reusable Table Scraper
import requests
from bs4 import BeautifulSoup
def scrape_tables(url, table_selector="table"):
    """Extract all tables from a URL into lists of dicts."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.text, "html.parser")
    results = []
    for table in soup.select(table_selector):
        all_rows = table.select("tr")
        if not all_rows:
            continue
        # Treat the first row as the header row, whether it uses <th> or <td>
        headers = [c.get_text(strip=True) for c in all_rows[0].select("th, td")]
        rows = []
        for tr in all_rows[1:]:
            # Include <th> so row-header cells are not dropped
            cells = [c.get_text(strip=True) for c in tr.select("th, td")]
            # Skip rows whose cell count does not match the header row
            if cells and len(cells) == len(headers):
                rows.append(dict(zip(headers, cells)))
        results.append({"headers": headers, "rows": rows})
    return results
# Usage
tables = scrape_tables("https://example.com/data")
for i, table in enumerate(tables):
    print(f"Table {i}: {len(table['rows'])} rows, columns: {table['headers']}")
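The scraper above skips any row whose cell count differs from the header count, which is exactly what happens when a table uses rowspan or colspan. One way to handle that with BeautifulSoup is to expand spanning cells into a rectangular grid first. The sketch below is an illustration under simplifying assumptions: span attributes are plain positive integers, `rowspan="0"` is not supported, and the function name `table_to_grid` and the sample table are hypothetical.

```python
from bs4 import BeautifulSoup


def table_to_grid(table):
    """Expand rowspan/colspan so every row has one value per column."""
    grid = []
    carry = {}  # column index -> (text, rows still to fill) for active rowspans
    for tr in table.find_all("tr"):
        cells = tr.find_all(["td", "th"], recursive=False)
        row = []
        col = 0
        i = 0
        while i < len(cells) or col in carry:
            if col in carry:
                # This column is occupied by a cell spanning down from above
                text, remaining = carry.pop(col)
                row.append(text)
                if remaining > 1:
                    carry[col] = (text, remaining - 1)
                col += 1
                continue
            cell = cells[i]
            i += 1
            text = cell.get_text(strip=True)
            rowspan = int(cell.get("rowspan", 1))
            colspan = int(cell.get("colspan", 1))
            # Repeat the value across every column the cell covers
            for _ in range(colspan):
                row.append(text)
                if rowspan > 1:
                    carry[col] = (text, rowspan - 1)
                col += 1
        grid.append(row)
    return grid


sample = """
<table>
  <tr><th>Region</th><th>Product</th><th>Price</th></tr>
  <tr><td rowspan="2">North</td><td>Widget Pro</td><td>$29.99</td></tr>
  <tr><td>Gadget Plus</td><td>$49.99</td></tr>
  <tr><td colspan="2">Total</td><td>$79.98</td></tr>
</table>
"""
grid = table_to_grid(BeautifulSoup(sample, "html.parser").select_one("table"))
for row in grid:
    print(row)
```

Repeating the spanned value in every covered cell keeps each row the same length as the header, so `dict(zip(headers, cells))` works again. Whether repetition or an empty placeholder is the better fill depends on the data.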
Handling Tables with Links and Attributes
Tables often contain links, images, or data attributes you want to capture alongside the text.
from bs4 import BeautifulSoup
html = """
<table>
<tr>
<td><a href="/product/1">Widget Pro</a></td>
<td data-sort="29.99">$29.99</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
for tr in soup.select("tr"):
    link_tag = tr.select_one("a")
    price_tag = tr.select_one("td[data-sort]")
    name = link_tag.get_text() if link_tag else ""
    url = link_tag["href"] if link_tag else ""
    price = price_tag["data-sort"] if price_tag else ""
    print(f"{name} ({url}): ${price}")
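The values captured this way are still raw strings: the href is relative and the price is text. A short follow-up sketch, assuming the page was fetched from a hypothetical `base_url`, resolves the link with `urllib.parse.urljoin` and converts the `data-sort` value to a float. The helper name `normalize_row` is made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical address of the page the table came from
base_url = "https://example.com/catalog/index.html"


def normalize_row(name, href, price_text, base_url):
    """Resolve a relative link and convert the price string to a number."""
    return {
        "name": name,
        # urljoin leaves absolute URLs untouched and resolves relative ones
        "url": urljoin(base_url, href),
        "price": float(price_text),
    }


row = normalize_row("Widget Pro", "/product/1", "29.99", base_url)
print(row)
```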
Tips
- pandas.read_html() is the fastest way to get table data, but by default it extracts only text, not links or attributes (the extract_links parameter, available since pandas 1.5, can also return link targets).
- For complex tables with rowspan or colspan, pandas.read_html() handles spanning cells automatically.
- When scraping tables from sites that block bots, use ScraperAPI or ScrapingAnt to fetch the pages reliably.
- Always check if the site offers a CSV/API download before scraping tables.
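On the first tip: newer pandas releases (1.5 and later) accept an `extract_links` argument that returns link targets along with the text. The sketch below assumes such a version is installed and uses a made-up inline table. With `extract_links="body"`, each body cell comes back as a `(text, href)` tuple, with `None` as the href when the cell has no link.

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <thead><tr><th>Product</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td><a href="/product/1">Widget Pro</a></td><td>$29.99</td></tr>
  </tbody>
</table>
"""

# Body cells become (text, href) tuples; header cells stay plain strings
df = pd.read_html(StringIO(html), extract_links="body")[0]

# Split the tuples into separate text and URL columns
df["url"] = df["Product"].map(lambda cell: cell[1])
df["Product"] = df["Product"].map(lambda cell: cell[0])
df["Price"] = df["Price"].map(lambda cell: cell[0])
print(df)
```

Other attributes, such as `data-sort`, are still not exposed this way, so the BeautifulSoup approach remains the option for those.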
Next Steps
- Learn to scrape and download images and files
- Store extracted table data in a database for analysis