Data Cleaning with pandas (Python)
Scraped data is dirty. Use pandas to type-coerce, normalize, dedupe, and reshape it into something usable downstream: the canonical post-scrape pipeline.
What you’ll learn
- Load scraped JSON/JSONL into a DataFrame.
- Type-coerce prices, dates, and counts.
- Strip whitespace, normalize case, and remove duplicates.
- Export cleaned data to CSV, JSON, or Parquet.
Raw scraped data has whitespace, currency symbols, non-breaking spaces, inconsistent capitalization, duplicates, and a hundred other paper cuts. pandas is the canonical fix. By the end of this lesson, you'll have a cleaning pipeline you can drop on top of any scrape.
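To make those paper cuts concrete, here is a hypothetical raw record of the kind a scraper emits (field names and values are illustrative, not from a real scrape):

raw = {
    "name": "  Wireless\xa0Mouse  ",   # stray whitespace plus a non-breaking space
    "price": "$1,299.00 USD",          # currency symbol, thousands separator, currency code
    "category": "Electronics ",        # trailing space, inconsistent capitalization
    "scraped_at": "2025/01/03 14:02",  # non-ISO date string
}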
Install
pip install pandas pyarrow
pyarrow enables Parquet export, covered at the end.
Load scraped data
import pandas as pd
# JSON file with a single array
df = pd.read_json("products.json")
# JSONL (one object per line)
df = pd.read_json("products.jsonl", lines=True)
# CSV
df = pd.read_csv("products.csv")
# From SQLite
import sqlite3
con = sqlite3.connect("scrape.db")
df = pd.read_sql("SELECT * FROM products", con)
Inspect:
print(df.shape) # rows, cols
print(df.dtypes) # column types, most will be 'object' (string) initially
print(df.head())
print(df.describe()) # summary stats for numeric columns
Type coercion: prices
Scraped prices look like '$14.99', '€12,50', '1,299.00 USD'. To do math, convert to float:
import re

def parse_price(s):
    if pd.isna(s):
        return None
    # Strip currency symbols and codes
    s = re.sub(r"[$€]|USD|EUR", "", str(s)).strip()
    # European style "12,50" / "1.299,00": comma is the decimal separator
    if re.fullmatch(r"\d{1,3}(\.\d{3})*,\d{1,2}", s):
        s = s.replace(".", "").replace(",", ".")
    else:
        s = s.replace(",", "")  # plain thousands separator, as in "1,299.00"
    try:
        return float(s)
    except ValueError:
        return None
df["price"] = df["price"].apply(parse_price)
After this, df["price"].mean(), df.sort_values("price"), and arithmetic all work.
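For example, once the column is numeric (column names as above):

print(df["price"].mean())                          # average price
print(df["price"].describe())                      # min, max, quartiles
print(df.nlargest(5, "price")[["name", "price"]])  # five most expensive items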
Type coercion: dates
df["scraped_at"] = pd.to_datetime(df["scraped_at"], errors="coerce")
df["published"] = pd.to_datetime(df["published"], format="%Y-%m-%d", errors="coerce")
errors="coerce" turns unparseable values into NaT (Not-a-Time) rather than raising. Combine with df["scraped_at"].isna().sum() to count failures.
For tricky formats, specify format= explicitly. The default parser is fast but sometimes too clever.
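If you want to see which raw strings the parser rejected, a variant of the snippet above parses into a temporary Series first so the originals are still around to inspect (column names as used here):

# Parse into a temporary Series before overwriting the column
parsed = pd.to_datetime(df["published"], format="%Y-%m-%d", errors="coerce")
bad = df.loc[parsed.isna() & df["published"].notna(), "published"]
print(bad.unique()[:10])  # a sample of the strings that failed to parse
df["published"] = parsed  # only overwrite once you've looked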
Whitespace and case
# Strip surrounding whitespace from every string column
df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)
# Replace non-breaking spaces (a common HTML artifact)
df["name"] = df["name"].str.replace("\xa0", " ", regex=False)
# Normalize case
df["category"] = df["category"].str.lower()
.str.* methods only work on string (object) columns: calling them on a numeric column raises an AttributeError, and non-string values inside an object column (lists, for example) come back as NaN.
Missing values
# Count missing per column
print(df.isna().sum())
# Drop rows with any missing values
df = df.dropna()
# Fill with defaults
df["price"] = df["price"].fillna(0)
df["category"] = df["category"].fillna("uncategorized")
# Drop columns that are entirely empty
df = df.dropna(axis=1, how="all")
Don't drop NaNs blindly; sometimes the missingness IS the data. Always inspect first.
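For example, before dropping anything, look at how much is missing and at the rows you would lose:

print(df.isna().mean().sort_values(ascending=False))  # fraction missing per column, worst first
print(df[df["price"].isna()].head())                  # the rows a dropna on price would remove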
Deduplication
# Drop exact duplicates
df = df.drop_duplicates()
# Drop based on a subset of columns
df = df.drop_duplicates(subset=["source_url"])
# Keep the most recent (after sorting)
df = df.sort_values("scraped_at").drop_duplicates(subset=["source_url"], keep="last")
For scrapers that re-run over time, the third pattern keeps the latest version of each item.
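A quick sanity check around that pattern (same columns as above):

before = len(df)
df = df.sort_values("scraped_at").drop_duplicates(subset=["source_url"], keep="last")
print(f"dropped {before - len(df)} older duplicate rows")
assert df["source_url"].is_unique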
Normalize / explode nested fields
Scraped data often has list-valued columns (tags, categories):
# If tags is a list per row
df["tags_str"] = df["tags"].apply(lambda lst: ",".join(lst) if isinstance(lst, list) else lst)
# Or explode into multiple rows (one per tag)
exploded = df.explode("tags")
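Exploding makes per-tag analysis straightforward; for example, tag frequencies become a one-liner:

print(exploded["tags"].value_counts().head(10))  # most common tags across all rows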
Lesson 1.33 covers deduplication in depth; lesson 1.29 covers output shape choices.
Add derived columns
df["price_with_tax"] = df["price"] * 1.10
df["domain"] = df["source_url"].str.extract(r"https?://([^/]+)/")
df["scraped_date"] = df["scraped_at"].dt.date
df["is_expensive"] = df["price"] > 50
These are vectorized operations, much faster than calling .apply or looping in Python; pandas is built for this.
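As a rough illustration of the difference, both lines below build the same column; the first is the vectorized form, the second calls a Python function once per row:

df["price_with_tax"] = df["price"] * 1.10                     # vectorized: one operation on the whole column
df["price_with_tax"] = df["price"].apply(lambda p: p * 1.10)  # row-by-row: avoid for simple arithmetic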
Filter
expensive = df[df["price"] > 50]
recent = df[df["scraped_at"] > "2025-01-01"]
combined = df[(df["price"] > 50) & (df["category"] == "electronics")]
The & (and |, ~) is element-wise. Use parens around each condition.
Group and aggregate
# Average price per category
print(df.groupby("category")["price"].mean())
# Multi-column aggregate
print(df.groupby("category").agg(
count=("source_url", "count"),
avg_price=("price", "mean"),
max_price=("price", "max"),
))
This is where pandas earns its keep; the same query in raw Python takes a loop and a pile of dict bookkeeping.
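For comparison, here is a sketch of the same "average price per category" query without pandas, assuming rows is the list of dicts your scraper produced (the name rows is illustrative):

from collections import defaultdict

totals, counts = defaultdict(float), defaultdict(int)
for row in rows:  # rows: list of dicts straight from the scraper (assumed)
    price = row.get("price")
    if price is None:
        continue
    category = row.get("category") or "uncategorized"
    totals[category] += price
    counts[category] += 1

for category in sorted(totals):
    print(category, totals[category] / counts[category])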
Validate after cleaning
assert df["price"].dtype == "float64"
assert df["price"].min() > 0
assert df.shape[0] > 0
assert not df["source_url"].duplicated().any()
Inline assertions catch regressions in your scraper or cleaning code.
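If you prefer a report of every problem rather than stopping at the first failed assert, the same checks can be collected into a list; a minimal sketch using the columns from this lesson:

def validate(df):
    # Return a list of problems instead of raising on the first one
    problems = []
    if df.empty:
        problems.append("no rows")
    if df["price"].dtype != "float64":
        problems.append(f"price dtype is {df['price'].dtype}")
    if (df["price"] <= 0).any():
        problems.append("non-positive prices present")
    if df["source_url"].duplicated().any():
        problems.append("duplicate source_url values")
    return problems

issues = validate(df)
if issues:
    print("validation failed:", issues)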
Export
# CSV
df.to_csv("clean.csv", index=False, encoding="utf-8")
# JSON
df.to_json("clean.json", orient="records", indent=2, force_ascii=False)
# JSONL
df.to_json("clean.jsonl", orient="records", lines=True, force_ascii=False)
# Parquet, columnar, compressed, fast for analytics
df.to_parquet("clean.parquet")
# SQLite
import sqlite3
con = sqlite3.connect("clean.db")
df.to_sql("products", con, if_exists="replace", index=False)
Parquet is the go-to for any data that will be re-analyzed: typically 5-20x smaller than CSV and 10-100x faster to read.
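A quick way to check those numbers on your own data:

import os

df.to_csv("clean.csv", index=False)
df.to_parquet("clean.parquet")
for path in ("clean.csv", "clean.parquet"):
    print(path, os.path.getsize(path), "bytes")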
A reusable cleaning pipeline
def clean(df):
    df = df.copy()
    # Whitespace and non-breaking spaces, touching only actual string values
    str_cols = df.select_dtypes("object").columns
    df[str_cols] = df[str_cols].apply(
        lambda col: col.map(lambda v: v.replace("\xa0", " ").strip() if isinstance(v, str) else v)
    )
    # Parse prices
    if "price" in df.columns:
        df["price"] = df["price"].apply(parse_price)
    # Parse dates
    for col in ["scraped_at", "published", "updated_at"]:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")
    # Dedup: keep the newest row per source_url
    if "source_url" in df.columns and "scraped_at" in df.columns:
        df = df.sort_values("scraped_at").drop_duplicates(subset=["source_url"], keep="last")
    # Drop fully-empty rows
    df = df.dropna(how="all")
    return df.reset_index(drop=True)
Drop this on top of any scraper output. Customize per project, but the shape stays constant.
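Typical usage end to end (file names are illustrative):

df = pd.read_json("products.jsonl", lines=True)  # or read_csv / read_sql
df = clean(df)
print(df.shape)
print(df.dtypes)
df.to_parquet("clean.parquet")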
Hands-on lab
Scrape /api/products/1/reviews (a JSON API). Load the response into a DataFrame. Parse the rating column to int, the date column to datetime, and the author column to a cleaned string. Compute the average rating per author. Export the cleaned data to both CSV and Parquet; compare the file sizes.
Practice this lesson on Catalog108, our first-party scraping sandbox.
Open lab target → /api/products/1/reviews
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you'll see the explanation right after.