Scraping Central is reader-supported. When you buy through links on our site, we may earn an affiliate commission.

Data Cleaning After Scraping (pandas)

Clean, transform, and prepare scraped data for analysis using pandas. Handle missing values, duplicates, type conversions, and text normalization.

Data Parsing · #6intermediate3 min read
Share:WhatsAppLinkedIn

Raw scraped data is messy, inconsistent formatting, missing values, duplicates, and wrong types. pandas is the tool of choice for cleaning it into analysis-ready datasets.

Setup

pip install pandas

A Typical Messy Scraped Dataset

import pandas as pd

# Simulating raw scraped data
data = [
    {"name": "  ScraperAPI ", "price": "$49.99/mo", "rating": "4.5", "reviews": "1,234"},
    {"name": "ScrapingAnt", "price": "$29/mo", "rating": "4.2", "reviews": "890"},
    {"name": "scraperapi", "price": "$49.99/mo", "rating": "4.5", "reviews": "1234"},  # duplicate
    {"name": "Bright Data", "price": "Contact Sales", "rating": None, "reviews": ""},
    {"name": "Oxylabs", "price": "$99.00/mo", "rating": "4.7", "reviews": "2,100"},
    {"name": None, "price": "$19.99/mo", "rating": "3.8", "reviews": "456"},
]

df = pd.DataFrame(data)
print(df)

Step 1: Remove Duplicates

# Normalize names for comparison, then deduplicate
df["name_clean"] = df["name"].str.strip().str.lower()
df = df.drop_duplicates(subset="name_clean", keep="first")
df = df.drop(columns=["name_clean"])
print(f"After dedup: {len(df)} rows")

Step 2: Handle Missing Values

# Drop rows with no name
df = df.dropna(subset=["name"])

# Fill missing ratings with median
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
df["rating"] = df["rating"].fillna(df["rating"].median())

# Replace empty strings with NaN
df = df.replace("", pd.NA)

Step 3: Clean and Convert Types

import re

def parse_price(price_str):
    """Extract numeric price from strings like '$49.99/mo'."""
    if pd.isna(price_str):
        return None
    match = re.search(r'[\d.]+', str(price_str))
    return float(match.group()) if match else None

def parse_reviews(review_str):
    """Convert '1,234' to integer 1234."""
    if pd.isna(review_str):
        return 0
    cleaned = str(review_str).replace(",", "")
    return int(cleaned) if cleaned.isdigit() else 0

df["price_usd"] = df["price"].apply(parse_price)
df["review_count"] = df["reviews"].apply(parse_reviews)
df["name"] = df["name"].str.strip()

print(df[["name", "price_usd", "rating", "review_count"]])
         name  price_usd  rating  review_count
0   ScraperAPI      49.99     4.5          1234
1  ScrapingAnt      29.00     4.2           890
3   Bright Data       NaN     4.4             0
4      Oxylabs      99.00     4.7          2100

Step 4: Standardize Text

# Title case names
df["name"] = df["name"].str.title()

# Remove special characters
df["name"] = df["name"].str.replace(r'[^\w\s]', '', regex=True)

# Strip extra whitespace
df["name"] = df["name"].str.replace(r'\s+', ' ', regex=True).str.strip()

Step 5: Validate and Export

# Validation checks
assert df["name"].notna().all(), "Missing names found"
assert (df["rating"].between(0, 5) | df["rating"].isna()).all(), "Invalid ratings"

# Export clean data
df.to_csv("clean_products.csv", index=False)
df.to_json("clean_products.json", orient="records", indent=2)
print(f"Exported {len(df)} clean records")

Common Cleaning Operations

Problem pandas Solution
Leading/trailing spaces df["col"].str.strip()
Inconsistent case df["col"].str.lower()
Duplicate rows df.drop_duplicates()
Missing values df.fillna() or df.dropna()
Wrong data type pd.to_numeric(df["col"], errors="coerce")
Comma in numbers df["col"].str.replace(",", "")
HTML entities html.unescape(text)

Next Steps

  • Extract structured data from unstructured HTML
  • Deduplicate scraped data at scale
  • Normalize and validate data for production use