Scraping Tables: From HTML to Structured Data
Tables are everywhere in scraping. Headers, rows, cells, rowspan/colspan, nested tables. Master the patterns and turn HTML grids into clean dicts and DataFrames.
What you’ll learn
- Extract header row and cell values into list-of-dicts.
- Handle `colspan` and `rowspan` correctly.
- Use pandas' `read_html` shortcut when the markup is clean.
- Recognise when a 'table' isn't actually a `<table>`.
Tables are the densest format on the web for structured data: pricing grids, stat lines, schedules, comparison matrices. If you can scrape a table reliably, you can scrape almost anything tabular. This lesson is the systematic pattern.
The basic table structure
<table>
  <thead>
    <tr>
      <th>Name</th><th>Price</th><th>In stock</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>Yellow mug</td><td>$14.99</td><td>Yes</td></tr>
    <tr><td>Black mug</td><td>$12.99</td><td>No</td></tr>
  </tbody>
</table>
<thead> for headers, <tbody> for body rows, <th> for header cells, <td> for body cells. Many pages omit the <thead>/<tbody> wrappers; browsers (and browser-grade parsers like html5lib) insert a missing <tbody> automatically, but a missing <thead> is never synthesized. Account for both shapes in your selectors.
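One wrapper-agnostic approach, sketched below with made-up sample markup: treat whichever row contains <th> cells as the header, and every other row as data. This works whether or not the wrappers are present.

```python
from bs4 import BeautifulSoup

# Minimal sketch: markup without <thead>/<tbody> wrappers (a common case).
html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Yellow mug</td><td>$14.99</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Any row containing a <th> is treated as the header row;
# everything else is a body row.
all_rows = table.find_all("tr")
header_row = next(tr for tr in all_rows if tr.find("th"))
body_rows = [tr for tr in all_rows if tr is not header_row]

headers = [th.get_text(strip=True) for th in header_row.find_all("th")]
print(headers)  # ['Name', 'Price']
```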
The canonical extraction
import requests
from bs4 import BeautifulSoup

r = requests.get("https://practice.scrapingcentral.com/challenges/static/tables/simple")
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("table")

# Headers
headers = [th.get_text(strip=True) for th in table.select("thead th")]

# Rows
rows = []
for tr in table.select("tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append(dict(zip(headers, cells)))

print(rows)
# [{"Name": "Yellow mug", "Price": "$14.99", "In stock": "Yes"}, ...]
This is the pattern. Variations of it scrape 80% of HTML tables you'll meet.
When pandas can do it for you
If the table is clean and you have pandas installed, read_html is a one-liner:
import pandas as pd
dfs = pd.read_html("https://practice.scrapingcentral.com/challenges/static/tables/simple")
df = dfs[0] # read_html returns a list of all <table>s on the page
print(df.head())
pandas parses every table on the page and returns a list of DataFrames. Use [0] to grab the first, or filter by match= if multiple tables exist:
dfs = pd.read_html(url, match="Price") # only tables containing the word "Price"
Trade-off: read_html is opinionated. It struggles with: tables nested inside other tables (returns each separately), rowspan/colspan (sometimes mishandled), and rows with mixed <td>/<th> cells. For clean simple tables it's perfect; for tricky ones, drop back to manual extraction.
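When you want to test read_html against markup you already have (or inspect how it handles a tricky table), it also accepts file-like objects, so you can feed it a string without any network call. A small sketch with made-up markup:

```python
import io
import pandas as pd

html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Mug</td><td>$14.99</td></tr>
</table>
"""

# read_html accepts file-like objects, so wrap the string in StringIO.
# Rows of <th> cells become the DataFrame's column names.
dfs = pd.read_html(io.StringIO(html))
df = dfs[0]
print(df.columns.tolist())  # ['Name', 'Price']
```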
Handling colspan
<tr>
  <th>Q1</th><th colspan="2">Q2</th><th>Q3</th>
</tr>
<tr>
  <td>10</td><td>5</td><td>15</td><td>20</td>
</tr>
The header row has 3 <th> elements but the body row has 4 <td>s. Aligning them naively misaligns the data. Expand spans before zipping:
def expand_headers(tr):
    headers = []
    for th in tr.find_all("th"):
        text = th.get_text(strip=True)
        span = int(th.get("colspan", 1))
        headers.extend([text] * span)
    return headers
Now headers and body cells have matching length.
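Wired up against the markup above (redefining expand_headers so the snippet is self-contained):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Q1</th><th colspan="2">Q2</th><th>Q3</th></tr>
  <tr><td>10</td><td>5</td><td>15</td><td>20</td></tr>
</table>
"""

def expand_headers(tr):
    # Repeat each header once per column it spans.
    headers = []
    for th in tr.find_all("th"):
        text = th.get_text(strip=True)
        span = int(th.get("colspan", 1))
        headers.extend([text] * span)
    return headers

soup = BeautifulSoup(html, "html.parser")
trs = soup.find_all("tr")
headers = expand_headers(trs[0])  # ['Q1', 'Q2', 'Q2', 'Q3']
cells = [td.get_text(strip=True) for td in trs[1].find_all("td")]
print(dict(zip(headers, cells)))
```

Note that duplicate header names collapse in a dict, so for spanned columns you may prefer zipped tuples or suffixed names like "Q2_1", "Q2_2".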
Handling rowspan
<tr><td rowspan="2">Acme</td><td>Widget</td><td>$5</td></tr>
<tr><td>Gadget</td><td>$8</td></tr>
Second row has only 2 cells visually, but the cell "Acme" "carries down" from row 1. To make the data rectangular:
def parse_table_with_rowspan(table):
    rows = []
    pending = {}  # col_idx → (value, remaining_rows)
    for tr in table.select("tbody tr"):
        row_cells = []
        cell_iter = iter(tr.find_all(["td", "th"]))
        col = 0
        while True:
            # Carry down rowspan values from previous rows
            if col in pending:
                value, remaining = pending[col]
                row_cells.append(value)
                if remaining - 1 == 0:
                    del pending[col]
                else:
                    pending[col] = (value, remaining - 1)
                col += 1
                continue
            try:
                td = next(cell_iter)
            except StopIteration:
                break
            text = td.get_text(strip=True)
            colspan = int(td.get("colspan", 1))
            rowspan = int(td.get("rowspan", 1))
            for _ in range(colspan):
                row_cells.append(text)
                if rowspan > 1:
                    pending[col] = (text, rowspan - 1)
                col += 1
        rows.append(row_cells)
    return rows
Annoying but bulletproof. For sites you scrape regularly, it's worth saving this in your toolkit.
Nested tables
Some legacy pages use tables for layout, putting real data tables inside layout tables:
<table><!-- layout -->
<tr><td>
<table><!-- data --></table>
</td></tr>
</table>
soup.find_all("table") finds both. To get only the inner data table, filter by content (e.g. keep tables whose header contains "Price"), select nested tables directly with the CSS selector "table table", or keep only "leaf" tables that contain no further <table> inside them.
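Both filters can be sketched in a few lines (sample markup is made up):

```python
from bs4 import BeautifulSoup

html = """
<table><!-- layout -->
  <tr><td>
    <table><tr><th>Price</th></tr><tr><td>$5</td></tr></table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "table table" matches only tables nested inside another table --
# i.e. the inner data tables, skipping the outer layout wrapper.
inner = soup.select("table table")
print(len(inner))  # 1

# Alternatively, keep "leaf" tables that contain no further tables.
leaf_tables = [t for t in soup.find_all("table") if not t.find("table")]
```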
When a "table" isn't a <table>
Modern grid CSS lets developers make table-shaped layouts without <table> at all:
<div class="grid">
<div class="row"><div class="cell">Name</div><div class="cell">Price</div></div>
<div class="row"><div class="cell">Mug</div><div class="cell">$14</div></div>
</div>
pd.read_html won't find anything. Scrape it as a list of rows of cells with regular CSS selectors:
rows = []
for row in soup.select(".grid .row"):
cells = [c.get_text(strip=True) for c in row.select(".cell")]
rows.append(cells)
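If the first grid row plays the role of a header row, the same dict(zip(...)) pattern from real tables applies; a self-contained sketch with inline sample markup:

```python
from bs4 import BeautifulSoup

html = """
<div class="grid">
  <div class="row"><div class="cell">Name</div><div class="cell">Price</div></div>
  <div class="row"><div class="cell">Mug</div><div class="cell">$14</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = [
    [c.get_text(strip=True) for c in row.select(".cell")]
    for row in soup.select(".grid .row")
]

# First row as headers, the rest as data -- same pattern as a <table>.
headers, *body = rows
records = [dict(zip(headers, cells)) for cells in body]
print(records)  # [{'Name': 'Mug', 'Price': '$14'}]
```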
Always check the DOM first: if a "table-like" layout doesn't use <table>, none of the table-specific shortcuts work.
In PHP: DomCrawler
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler($html);
$headers = $crawler->filter('table thead th')->each(fn ($c) => trim($c->text()));
$rows = $crawler->filter('table tbody tr')->each(function ($tr) use ($headers) {
    $cells = $tr->filter('td')->each(fn ($c) => trim($c->text()));
    return array_combine($headers, $cells);
});

print_r($rows);
array_combine is PHP's dict(zip(...)). A length mismatch throws (in PHP 8+), so guard with count($headers) === count($cells) if rows are sometimes ragged.
Cleaning cell values
Cells often need post-processing:
def clean(text):
    return text.replace("\xa0", " ").replace("\u200b", "").strip()

def parse_price(text):
    return float(text.replace("$", "").replace(",", ""))

\xa0 is a non-breaking space (a common HTML artifact); \u200b is a zero-width space; both make string comparisons fail mysteriously.
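Chained together on a made-up messy cell value:

```python
# Hypothetical messy cell: currency symbol, thousands separator,
# a zero-width space, and a trailing non-breaking space.
raw = "$1,\u200b299.99\xa0"

cleaned = raw.replace("\xa0", " ").replace("\u200b", "").strip()
price = float(cleaned.replace("$", "").replace(",", ""))
print(price)  # 1299.99
```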
Saving to a DataFrame
For analysis or CSV export, push the list-of-dicts into pandas:
import pandas as pd
df = pd.DataFrame(rows)
df.to_csv("products.csv", index=False)
df.to_json("products.json", orient="records", indent=2)
From here, Lesson 1.31 (data cleaning with pandas) takes over.
Hands-on lab
Scrape /challenges/static/tables/simple with the manual BeautifulSoup approach, then with pd.read_html. Confirm both produce the same data. Then try /challenges/static/tables/nested: read_html will return multiple DataFrames, and you'll have to pick the right one or fall back to manual extraction.