
Lesson 1.31 · Intermediate · 4 min read

Data Cleaning with pandas (Python)

Scraped data is dirty. Use pandas to type-coerce, normalize, dedupe, and reshape it into something usable downstream: the canonical post-scrape pipeline.

What you’ll learn

  • Load scraped JSON/JSONL into a DataFrame.
  • Type-coerce prices, dates, and counts.
  • Strip whitespace, normalize case, and remove duplicates.
  • Export cleaned data to CSV, JSON, or Parquet.

Raw scraped data has whitespace, currency symbols, non-breaking spaces, inconsistent capitalization, duplicates, and a hundred other paper cuts. pandas is the canonical fix. By the end of this lesson, you'll have a cleaning pipeline you can drop on top of any scrape.

Install

pip install pandas pyarrow

pyarrow enables Parquet export, covered at the end.

Load scraped data

import pandas as pd

# JSON file with a single array
df = pd.read_json("products.json")

# JSONL (one object per line)
df = pd.read_json("products.jsonl", lines=True)

# CSV
df = pd.read_csv("products.csv")

# From SQLite
import sqlite3
con = sqlite3.connect("scrape.db")
df = pd.read_sql("SELECT * FROM products", con)

Inspect:

print(df.shape)  # rows, cols
print(df.dtypes)  # column types, most will be 'object' (string) initially
print(df.head())
print(df.describe())  # summary stats for numeric columns
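
A quick quality check worth running at this stage: value_counts on a categorical column (assuming your scrape has one, like the category column used throughout this lesson) surfaces inconsistent capitalization and junk values immediately.

# Spot 'Electronics' vs 'electronics' vs ' electronics' at a glance
print(df["category"].value_counts(dropna=False))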

Type coercion: prices

Scraped prices look like '$14.99', '€12,50', '1,299.00 USD'. To do math, convert to float:

import re

def parse_price(s):
    if pd.isna(s):
        return None
    s = re.sub(r"[^\d.,]", "", str(s)).strip()  # drop $, €, 'USD', 'EUR', etc.
    if re.fullmatch(r"\d+,\d{2}", s):
        s = s.replace(",", ".")  # European decimal comma: '12,50' -> '12.50'
    else:
        s = s.replace(",", "")   # thousands separator: '1,299.00' -> '1299.00'
    try:
        return float(s)
    except ValueError:
        return None

df["price"] = df["price"].apply(parse_price)

After this, df["price"].mean(), .sort_values("price"), and arithmetic all work.
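
A quick sanity check, assuming the column parsed cleanly:

print(df["price"].mean())               # arithmetic works now
print(df.sort_values("price").head(3))  # three cheapest items
print(df["price"].isna().sum())         # how many prices failed to parse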

Type coercion: dates

df["scraped_at"] = pd.to_datetime(df["scraped_at"], errors="coerce")
df["published"]  = pd.to_datetime(df["published"], format="%Y-%m-%d", errors="coerce")

errors="coerce" turns unparseable values into NaT (Not-a-Time) rather than raising. Combine with df["scraped_at"].isna().sum() to count failures.

For tricky formats, specify format= explicitly. The default parser is fast but sometimes too clever.
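
When coerce produces more NaT than expected, it helps to look at the raw values that failed. A sketch, assuming the JSONL source from earlier:

raw = pd.read_json("products.jsonl", lines=True)["published"]
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print(raw[parsed.isna() & raw.notna()].unique()[:10])  # the offending strings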

Whitespace and case

# Strip surrounding whitespace from every string column
df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)

# Replace non-breaking spaces (a common HTML artifact)
df["name"] = df["name"].str.replace("\xa0", " ", regex=False)

# Normalize case
df["category"] = df["category"].str.lower()

.str.* methods work only on string (object) columns: calling them on a numeric column raises an AttributeError, and non-string values inside an object column come back as NaN. That's why the snippet above checks dtype first.

Missing values

# Count missing per column
print(df.isna().sum())

# Drop rows with any missing values
df = df.dropna()

# Fill with defaults
df["price"] = df["price"].fillna(0)
df["category"] = df["category"].fillna("uncategorized")

# Drop columns that are entirely empty
df = df.dropna(axis=1, how="all")

Don't drop NaNs blindly; sometimes the missingness IS the data. Always inspect first.
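
For example, before filling or dropping, check whether the missing values cluster somewhere meaningful (assuming the price and category columns from earlier):

# If price is missing mostly for one category, that's signal, not noise
print(df[df["price"].isna()]["category"].value_counts())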

Deduplication

# Drop exact duplicates
df = df.drop_duplicates()

# Drop based on a subset of columns
df = df.drop_duplicates(subset=["source_url"])

# Keep the most recent (after sorting)
df = df.sort_values("scraped_at").drop_duplicates(subset=["source_url"], keep="last")

For scrapers that re-run over time, the third pattern keeps the latest version of each item.
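
A small addition worth making in practice: log how many rows the dedup removed, so a sudden spike shows up in your scraper's output. A sketch:

before = len(df)
df = df.sort_values("scraped_at").drop_duplicates(subset=["source_url"], keep="last")
print(f"dedup removed {before - len(df)} of {before} rows")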

Normalize / explode nested fields

Scraped data often has list-valued columns (tags, categories):

# If tags is a list per row
df["tags_str"] = df["tags"].apply(lambda lst: ",".join(lst) if isinstance(lst, list) else lst)

# Or explode into multiple rows (one per tag)
exploded = df.explode("tags")
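
One thing explode makes easy, assuming a list-valued tags column as above, is frequency counting across all rows:

# Top 10 tags across the whole scrape
print(df.explode("tags")["tags"].value_counts().head(10))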

Lesson 1.33 covers deduplication in depth; lesson 1.29 covers output shape choices.

Add derived columns

df["price_with_tax"] = df["price"] * 1.10
df["domain"] = df["source_url"].str.extract(r"https?://([^/]+)/")
df["scraped_date"] = df["scraped_at"].dt.date
df["is_expensive"] = df["price"] > 50

These are vectorized operations, much faster than .apply with a Python-level function. pandas is built for this.
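
If you want to see the gap on your own data, a rough timing sketch (absolute numbers vary with data size and hardware):

import timeit

vectorized = timeit.timeit(lambda: df["price"] * 1.10, number=1000)
looped = timeit.timeit(lambda: df["price"].apply(lambda p: p * 1.10), number=1000)
print(f"vectorized: {vectorized:.3f}s  apply: {looped:.3f}s")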

Filter

expensive = df[df["price"] > 50]
recent = df[df["scraped_at"] > "2025-01-01"]
combined = df[(df["price"] > 50) & (df["category"] == "electronics")]

The & (and |, ~) is element-wise. Use parens around each condition.

Group and aggregate

# Average price per category
print(df.groupby("category")["price"].mean())

# Multi-column aggregate
print(df.groupby("category").agg(
  count=("source_url", "count"),
  avg_price=("price", "mean"),
  max_price=("price", "max"),
))

This is where pandas earns its keep. The same query in raw Python takes a pile of manual bookkeeping, as the sketch below shows.
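
For comparison, a raw-Python version of just the per-category average (assuming the same price and category fields):

from collections import defaultdict

sums, counts = defaultdict(float), defaultdict(int)
for _, row in df.iterrows():
    if pd.notna(row["price"]):
        sums[row["category"]] += row["price"]
        counts[row["category"]] += 1
avg_price = {cat: sums[cat] / counts[cat] for cat in sums}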

Validate after cleaning

assert df["price"].dtype == "float64"
assert df["price"].min() > 0
assert df.shape[0] > 0
assert not df["source_url"].duplicated().any()

Inline assertions catch regressions in your scraper or cleaning code.
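
In a long-running project, it can help to collect all failures in one pass instead of stopping at the first assert. A sketch of a hypothetical validate helper, assuming the same columns:

def validate(df):
    checks = {
        "price is numeric": df["price"].dtype == "float64",
        "no negative prices": (df["price"].dropna() >= 0).all(),
        "has rows": len(df) > 0,
        "source_url unique": not df["source_url"].duplicated().any(),
    }
    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        raise ValueError(f"validation failed: {failed}")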

Export

# CSV
df.to_csv("clean.csv", index=False, encoding="utf-8")

# JSON
df.to_json("clean.json", orient="records", indent=2, force_ascii=False)

# JSONL
df.to_json("clean.jsonl", orient="records", lines=True, force_ascii=False)

# Parquet, columnar, compressed, fast for analytics
df.to_parquet("clean.parquet")

# SQLite
import sqlite3
con = sqlite3.connect("clean.db")
df.to_sql("products", con, if_exists="replace", index=False)

Parquet is the go-to for any data that will be re-analyzed: commonly 5-20x smaller than CSV and 10-100x faster to read, though the exact ratios depend on your data.
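
The easiest way to verify is on your own data; a quick size comparison after both exports above:

import os

for path in ["clean.csv", "clean.parquet"]:
    print(path, os.path.getsize(path), "bytes")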

A reusable cleaning pipeline

def clean(df):
    # Whitespace and non-breaking spaces in every string column
    str_cols = df.select_dtypes("object").columns
    df[str_cols] = df[str_cols].apply(
        lambda c: c.str.strip().str.replace("\xa0", " ", regex=False)
    )

    # Parse prices
    if "price" in df.columns:
        df["price"] = df["price"].apply(parse_price)

    # Parse dates
    for col in ["scraped_at", "published", "updated_at"]:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")

    # Dedup: keep the most recent scrape of each URL
    if "source_url" in df.columns and "scraped_at" in df.columns:
        df = df.sort_values("scraped_at").drop_duplicates(subset=["source_url"], keep="last")

    # Drop fully-empty rows
    df = df.dropna(how="all")
    return df.reset_index(drop=True)

Drop this on top of any scraper output. Customize per project, but the shape stays constant.
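
End to end, combining the load and export steps from earlier:

df = pd.read_json("products.jsonl", lines=True)
df = clean(df)
df.to_parquet("clean.parquet")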

Hands-on lab

Scrape /api/products/1/reviews (a JSON API). Load the response into a DataFrame. Parse the rating column to int, the date column to datetime, the author column to a cleaned string. Compute average rating per author. Export the cleaned data to both CSV and Parquet; compare the file sizes.

Practice this lesson on Catalog108, our first-party scraping sandbox. Lab target: /api/products/1/reviews
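
A starter sketch for the lab; the base URL is a placeholder for your sandbox instance, and the column names follow the lab description:

import requests

BASE_URL = "https://catalog108.example"  # placeholder; use your lab's base URL
resp = requests.get(f"{BASE_URL}/api/products/1/reviews", timeout=10)
resp.raise_for_status()
df = pd.DataFrame(resp.json())
print(df.dtypes)  # likely all 'object'; coerce rating, date, and author from here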
