Extracting Data from HTML Tables - Python Scraping

Scrape HTML tables from websites using BeautifulSoup and pandas. Handle complex tables with rowspan, colspan, and nested elements.

HTML tables are one of the most common structures for organized data on the web: financial reports, sports statistics, comparison charts, and more. Python makes extracting this data straightforward.

Quick Method: pandas read_html

The fastest way to extract tables from a webpage is pandas.read_html(), which finds every <table> element on the page and returns a list of DataFrames, one per table.

pip install pandas lxml

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
tables = pd.read_html(url)

print(f"Found {len(tables)} tables on the page")

# The main data table is usually the largest, so pick the one with the most rows
df = max(tables, key=len)
print(df.head(10))

# Save to CSV
df.to_csv("population_data.csv", index=False)
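When a page holds many tables, picking one by position or size can be fragile. The sketch below shows two filters that read_html() accepts, match (a regex the table text must contain) and attrs (attributes on the <table> tag). The inline HTML and the table ids are made up for illustration, and the default parser is assumed to be lxml as installed above.

```python
from io import StringIO

import pandas as pd

# Hypothetical page content with two tables
html = """
<table id="prices">
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget Pro</td><td>$29.99</td></tr>
  <tr><td>Gadget Plus</td><td>$49.99</td></tr>
</table>
<table id="stock">
  <tr><th>Product</th><th>Stock</th></tr>
  <tr><td>Widget Pro</td><td>In Stock</td></tr>
</table>
"""

# Wrap literal HTML in StringIO; recent pandas versions deprecate raw strings.
# match= keeps only tables whose text matches the regex.
prices = pd.read_html(StringIO(html), match="Price")[0]

# attrs= filters on attributes of the <table> tag itself.
stock = pd.read_html(StringIO(html), attrs={"id": "stock"})[0]

print(prices)
print(stock)
```

Both calls still return a list, so the [0] index picks the first table that passed the filter.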

Manual Extraction with BeautifulSoup

For more control over the extraction process, parse the table manually with BeautifulSoup.

pip install requests beautifulsoup4

from bs4 import BeautifulSoup

# Example: a small inline table, so the snippet runs without a network request
html_with_table = """
<table class="data-table">
    <thead>
        <tr>
            <th>Product</th>
            <th>Price</th>
            <th>Stock</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Widget Pro</td>
            <td>$29.99</td>
            <td>In Stock</td>
        </tr>
        <tr>
            <td>Gadget Plus</td>
            <td>$49.99</td>
            <td>Out of Stock</td>
        </tr>
    </tbody>
</table>
"""

soup = BeautifulSoup(html_with_table, "html.parser")
table = soup.select_one("table.data-table")

# Extract headers
headers = [th.get_text(strip=True) for th in table.select("thead th")]

# Extract rows
rows = []
for tr in table.select("tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    rows.append(dict(zip(headers, cells)))

for row in rows:
    print(row)
# {'Product': 'Widget Pro', 'Price': '$29.99', 'Stock': 'In Stock'}
# {'Product': 'Gadget Plus', 'Price': '$49.99', 'Stock': 'Out of Stock'}
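Extracted cells are always strings, so a common follow-up is to convert them to real types and write them out. The sketch below assumes the rows list printed above and uses only the standard library. The parse_price helper and the "In Stock" comparison are illustrative assumptions about this sample data, not a general-purpose cleaner.

```python
import csv
import io

# Rows as produced by the manual extraction example
rows = [
    {"Product": "Widget Pro", "Price": "$29.99", "Stock": "In Stock"},
    {"Product": "Gadget Plus", "Price": "$49.99", "Stock": "Out of Stock"},
]


def parse_price(text):
    """Turn a string like '$1,299.00' into a float (hypothetical helper)."""
    return float(text.replace("$", "").replace(",", ""))


# Convert the price to a number and the stock label to a boolean
cleaned = [
    {**row, "Price": parse_price(row["Price"]), "Stock": row["Stock"] == "In Stock"}
    for row in rows
]

# Write to an in-memory buffer; a real file works the same way
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(cleaned[0].keys()))
writer.writeheader()
writer.writerows(cleaned)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, replace the buffer with open("products.csv", "w", newline="") so the csv module controls line endings.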

A Reusable Table Scraper

import requests
from bs4 import BeautifulSoup


def scrape_tables(url, table_selector="table"):
    """Extract all tables from a URL into lists of dicts."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.text, "html.parser")
    results = []

    for table in soup.select(table_selector):
        all_rows = table.select("tr")

        # Treat the first row as the header row, whether it uses <th> or <td>.
        # Collecting every <th> in the table would also pick up row-header cells.
        headers = []
        if all_rows:
            headers = [cell.get_text(strip=True) for cell in all_rows[0].select("th, td")]

        rows = []
        for tr in all_rows[1:]:
            # Include <th> so row-header cells in the body are not dropped
            cells = [cell.get_text(strip=True) for cell in tr.select("th, td")]
            # Skip rows that do not line up with the headers (e.g. spanning cells)
            if cells and len(cells) == len(headers):
                rows.append(dict(zip(headers, cells)))

        results.append({"headers": headers, "rows": rows})

    return results


# Usage
tables = scrape_tables("https://example.com/data")
for i, table in enumerate(tables):
    print(f"Table {i}: {len(table['rows'])} rows, columns: {table['headers']}")

Handling Tables with Links and Attributes

Tables often contain links, images, or data attributes you want to capture alongside the text.

from bs4 import BeautifulSoup

html = """
<table>
    <tr>
        <td><a href="/product/1">Widget Pro</a></td>
        <td data-sort="29.99">$29.99</td>
    </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

for tr in soup.select("tr"):
    link_tag = tr.select_one("a")
    price_tag = tr.select_one("td[data-sort]")

    name = link_tag.get_text() if link_tag else ""
    url = link_tag["href"] if link_tag else ""
    price = price_tag["data-sort"] if price_tag else ""

    print(f"{name} ({url}): ${price}")
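Links inside tables are often relative, like /product/1 above, so they usually need to be resolved against the page they came from before they can be fetched. A minimal sketch using the standard library's urljoin follows. The page_url value and the extra hrefs are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the table was scraped from
page_url = "https://example.com/catalog/index.html"

# href values as they might appear in table cells
hrefs = ["/product/1", "item/2", "https://cdn.example.com/file.pdf"]

# urljoin resolves root-relative and page-relative paths
# and leaves absolute URLs unchanged
absolute = [urljoin(page_url, href) for href in hrefs]

for href, full in zip(hrefs, absolute):
    print(f"{href} -> {full}")
```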

Tips

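pandas.read_html() is the fastest way to get table data, but by default it extracts only cell text. pandas 1.5 and later can also return link targets through the extract_links parameter; other attributes still require manual parsing.
For complex tables with rowspan or colspan, pandas.read_html() handles spanning cells automatically by repeating the spanned cell's value in every position it covers.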
When scraping tables from sites that block bots, a proxy-based service such as ScraperAPI or ScrapingAnt can fetch the pages more reliably.
Always check whether the site offers a CSV or API download before scraping tables.
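For the rowspan and colspan tip above, the same expansion can be done by hand when you parse with BeautifulSoup. The sketch below repeats a spanned cell's text in every position it covers, which mirrors the pandas behavior. The expand_table function and the sample table are illustrative assumptions; the code expects a well-formed table and does not handle nested tables or overlapping spans.

```python
from bs4 import BeautifulSoup

# Hypothetical table with one rowspan and one colspan
html = """
<table>
  <tr><th>Region</th><th>Quarter</th><th>Sales</th></tr>
  <tr><td rowspan="2">North</td><td>Q1</td><td>100</td></tr>
  <tr><td>Q2</td><td>120</td></tr>
  <tr><td colspan="2">Total</td><td>220</td></tr>
</table>
"""


def expand_table(table):
    """Return a rectangular list of rows, repeating text for spanned cells."""
    grid = []
    pending = {}  # column index -> (rows still to fill, text)

    for tr in table.find_all("tr"):
        cells = tr.find_all(["td", "th"], recursive=False)
        row, col, i = [], 0, 0

        while i < len(cells) or col in pending:
            if col in pending:
                # This column is covered by a rowspan from an earlier row
                remaining, text = pending[col]
                row.append(text)
                if remaining > 1:
                    pending[col] = (remaining - 1, text)
                else:
                    del pending[col]
                col += 1
                continue

            cell = cells[i]
            i += 1
            text = cell.get_text(strip=True)
            rowspan = int(cell.get("rowspan", 1))
            colspan = int(cell.get("colspan", 1))

            # Repeat the text across every column the cell spans
            for _ in range(colspan):
                row.append(text)
                if rowspan > 1:
                    pending[col] = (rowspan - 1, text)
                col += 1

        grid.append(row)

    return grid


soup = BeautifulSoup(html, "html.parser")
grid = expand_table(soup.find("table"))
for row in grid:
    print(row)
```

Every row in the result has the same length, so the first row can be zipped with the rest as headers, as in the earlier examples.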

Next Steps

Learn to scrape and download images and files
Store extracted table data in a database for analysis