Scraping Tables: From HTML to Structured Data
Tables are everywhere in scraping. Headers, rows, cells, rowspan/colspan, nested tables. Master the patterns and turn HTML grids into clean dicts and DataFrames.
What you’ll learn
- Extract header row and cell values into list-of-dicts.
- Handle `colspan` and `rowspan` correctly.
- Use pandas' `read_html` shortcut when the markup is clean.
- Recognise when a 'table' isn't actually a `<table>`.
Tables are the densest format on the web for structured data: pricing grids, stat lines, schedules, comparison matrices. If you can scrape a table reliably, you can scrape almost anything tabular. This lesson is the systematic pattern.
The basic table structure
<table>
  <thead>
    <tr>
      <th>Name</th><th>Price</th><th>In stock</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>Yellow mug</td><td>$14.99</td><td>Yes</td></tr>
    <tr><td>Black mug</td><td>$12.99</td><td>No</td></tr>
  </tbody>
</table>
<thead> for headers, <tbody> for body rows, <th> for header cells, <td> for body cells. Many pages omit the <thead>/<tbody> wrappers; browsers (and browser-grade parsers like html5lib) insert a missing <tbody> automatically, but a missing <thead> is never synthesized. Account for both shapes in your selectors.
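One wrapper-agnostic approach, sketched below with made-up sample markup: treat whichever row contains <th> cells as the header, and every other row as data. This works whether or not the wrappers are present.

```python
from bs4 import BeautifulSoup

# Minimal sketch: markup without <thead>/<tbody> wrappers (a common case).
html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Yellow mug</td><td>$14.99</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Any row containing a <th> is treated as the header row;
# everything else is a body row.
all_rows = table.find_all("tr")
header_row = next(tr for tr in all_rows if tr.find("th"))
body_rows = [tr for tr in all_rows if tr is not header_row]

headers = [th.get_text(strip=True) for th in header_row.find_all("th")]
print(headers)  # ['Name', 'Price']
```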
The canonical extraction
import requests
from bs4 import BeautifulSoup

r = requests.get("https://practice.scrapingcentral.com/challenges/static/tables/simple")
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("table")

# Headers
headers = [th.get_text(strip=True) for th in table.select("thead th")]

# Rows
rows = []
for tr in table.select("tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append(dict(zip(headers, cells)))

print(rows)
# [{"Name": "Yellow mug", "Price": "$14.99", "In stock": "Yes"}, ...]
This is the pattern. Variations of it scrape 80% of HTML tables you'll meet.
When pandas can do it for you
If the table is clean and you have pandas installed, read_html is a one-liner:
import pandas as pd
dfs = pd.read_html("https://practice.scrapingcentral.com/challenges/static/tables/simple")
df = dfs[0] # read_html returns a list of all <table>s on the page
print(df.head())
pandas parses every table on the page and returns a list of DataFrames. Use [0] to grab the first, or filter by match= if multiple tables exist:
dfs = pd.read_html(url, match="Price") # only tables containing the word "Price"
Trade-off: read_html is opinionated. It struggles with: tables nested inside other tables (returns each separately), rowspan/colspan (sometimes mishandled), and rows with mixed <td>/<th> cells. For clean simple tables it's perfect; for tricky ones, drop back to manual extraction.
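When you want to test read_html against markup you already have (or inspect how it handles a tricky table), it also accepts file-like objects, so you can feed it a string without any network call. A small sketch with made-up markup:

```python
import io
import pandas as pd

html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Mug</td><td>$14.99</td></tr>
</table>
"""

# read_html accepts file-like objects, so wrap the string in StringIO.
# Rows of <th> cells become the DataFrame's column names.
dfs = pd.read_html(io.StringIO(html))
df = dfs[0]
print(df.columns.tolist())  # ['Name', 'Price']
```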
Handling colspan
<tr>
  <th>Q1</th><th colspan="2">Q2</th><th>Q3</th>
</tr>
<tr>
  <td>10</td><td>5</td><td>15</td><td>20</td>
</tr>
The header row has 3 <th> elements but the body row has 4 <td>s. Aligning them naively misaligns the data. Expand spans before zipping:
def expand_headers(tr):
    headers = []
    for th in tr.find_all("th"):
        text = th.get_text(strip=True)
        span = int(th.get("colspan", 1))
        headers.extend([text] * span)
    return headers
Now headers and body cells have matching length.
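Wired up against the markup above (redefining expand_headers so the snippet is self-contained):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Q1</th><th colspan="2">Q2</th><th>Q3</th></tr>
  <tr><td>10</td><td>5</td><td>15</td><td>20</td></tr>
</table>
"""

def expand_headers(tr):
    # Repeat each header once per column it spans.
    headers = []
    for th in tr.find_all("th"):
        text = th.get_text(strip=True)
        span = int(th.get("colspan", 1))
        headers.extend([text] * span)
    return headers

soup = BeautifulSoup(html, "html.parser")
trs = soup.find_all("tr")
headers = expand_headers(trs[0])  # ['Q1', 'Q2', 'Q2', 'Q3']
cells = [td.get_text(strip=True) for td in trs[1].find_all("td")]
print(dict(zip(headers, cells)))
```

Note that duplicate header names collapse in a dict, so for spanned columns you may prefer zipped tuples or suffixed names like "Q2_1", "Q2_2".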
Handling rowspan
<tr><td rowspan="2">Acme</td><td>Widget</td><td>$5</td></tr>
<tr><td>Gadget</td><td>$8</td></tr>
Second row has only 2 cells visually, but the cell "Acme" "carries down" from row 1. To make the data rectangular:
def parse_table_with_rowspan(table):
    rows = []
    pending = {}  # col_idx → (value, remaining_rows)
    for tr in table.select("tbody tr"):
        row_cells = []
        cell_iter = iter(tr.find_all(["td", "th"]))
        col = 0
        while True:
            # Carry down rowspan values from previous rows
            if col in pending:
                value, remaining = pending[col]
                row_cells.append(value)
                if remaining - 1 == 0:
                    del pending[col]
                else:
                    pending[col] = (value, remaining - 1)
                col += 1
                continue
            try:
                td = next(cell_iter)
            except StopIteration:
                break
            text = td.get_text(strip=True)
            colspan = int(td.get("colspan", 1))
            rowspan = int(td.get("rowspan", 1))
            for _ in range(colspan):
                row_cells.append(text)
                if rowspan > 1:
                    pending[col] = (text, rowspan - 1)
                col += 1
        rows.append(row_cells)
    return rows
Annoying but bulletproof. For sites you scrape regularly, it's worth saving this in your toolkit.
Nested tables
Some legacy pages use tables for layout, putting real data tables inside layout tables:
<table><!-- layout -->
<tr><td>
<table><!-- data --></table>
</td></tr>
</table>
soup.find_all("table") finds both. To get only the inner data table, filter by content (e.g. keep tables whose header contains "Price"), select nested tables directly with the CSS selector "table table", or keep only "leaf" tables that contain no further <table> inside them.
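Both filters can be sketched in a few lines (sample markup is made up):

```python
from bs4 import BeautifulSoup

html = """
<table><!-- layout -->
  <tr><td>
    <table><tr><th>Price</th></tr><tr><td>$5</td></tr></table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "table table" matches only tables nested inside another table --
# i.e. the inner data tables, skipping the outer layout wrapper.
inner = soup.select("table table")
print(len(inner))  # 1

# Alternatively, keep "leaf" tables that contain no further tables.
leaf_tables = [t for t in soup.find_all("table") if not t.find("table")]
```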
When a "table" isn't a <table>
Modern grid CSS lets developers make table-shaped layouts without <table> at all:
<div class="grid">
<div class="row"><div class="cell">Name</div><div class="cell">Price</div></div>
<div class="row"><div class="cell">Mug</div><div class="cell">$14</div></div>
</div>
pd.read_html won't find anything. Scrape it as a list of rows of cells with regular CSS selectors:
rows = []
for row in soup.select(".grid .row"):
cells = [c.get_text(strip=True) for c in row.select(".cell")]
rows.append(cells)
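If the first grid row plays the role of a header row, the same dict(zip(...)) pattern from real tables applies; a self-contained sketch with inline sample markup:

```python
from bs4 import BeautifulSoup

html = """
<div class="grid">
  <div class="row"><div class="cell">Name</div><div class="cell">Price</div></div>
  <div class="row"><div class="cell">Mug</div><div class="cell">$14</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = [
    [c.get_text(strip=True) for c in row.select(".cell")]
    for row in soup.select(".grid .row")
]

# First row as headers, the rest as data -- same pattern as a <table>.
headers, *body = rows
records = [dict(zip(headers, cells)) for cells in body]
print(records)  # [{'Name': 'Mug', 'Price': '$14'}]
```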
Always check the DOM first: if a "table-like" layout doesn't use <table>, none of the table-specific shortcuts work.
In PHP: DomCrawler
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler($html);
$headers = $crawler->filter('table thead th')->each(fn ($c) => trim($c->text()));
$rows = $crawler->filter('table tbody tr')->each(function ($tr) use ($headers) {
    $cells = $tr->filter('td')->each(fn ($c) => trim($c->text()));
    return array_combine($headers, $cells);
});

print_r($rows);
array_combine is PHP's dict(zip(...)). A length mismatch throws (in PHP 8+), so guard with count($headers) === count($cells) if rows are sometimes ragged.
Cleaning cell values
Cells often need post-processing:
def clean(text):
    return text.replace("\xa0", " ").replace("\u200b", "").strip()

def parse_price(text):
    return float(text.replace("$", "").replace(",", ""))

\xa0 is a non-breaking space (a common HTML artifact); \u200b is a zero-width space; both make string comparisons fail mysteriously.
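Chained together on a made-up messy cell value:

```python
# Hypothetical messy cell: currency symbol, thousands separator,
# a zero-width space, and a trailing non-breaking space.
raw = "$1,\u200b299.99\xa0"

cleaned = raw.replace("\xa0", " ").replace("\u200b", "").strip()
price = float(cleaned.replace("$", "").replace(",", ""))
print(price)  # 1299.99
```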
Saving to a DataFrame
For analysis or CSV export, push the list-of-dicts into pandas:
import pandas as pd
df = pd.DataFrame(rows)
df.to_csv("products.csv", index=False)
df.to_json("products.json", orient="records", indent=2)
From here, Lesson 1.31 (data cleaning with pandas) takes over.
Hands-on lab
Scrape /challenges/static/tables/simple with the manual BeautifulSoup approach, then with pd.read_html. Confirm both produce the same data. Then try /challenges/static/tables/nested: read_html will return multiple DataFrames, and you'll have to pick the right one or fall back to manual extraction.