Data Cleaning After Scraping (pandas) - Data Parsing

Clean, transform, and prepare scraped data for analysis using pandas. Handle missing values, duplicates, type conversions, and text normalization.

Raw scraped data is messy, inconsistent formatting, missing values, duplicates, and wrong types. pandas is the tool of choice for cleaning it into analysis-ready datasets.

Setup

pip install pandas

A Typical Messy Scraped Dataset

import pandas as pd

# Simulating raw scraped data
data = [
    {"name": "  ScraperAPI ", "price": "$49.99/mo", "rating": "4.5", "reviews": "1,234"},
    {"name": "ScrapingAnt", "price": "$29/mo", "rating": "4.2", "reviews": "890"},
    {"name": "scraperapi", "price": "$49.99/mo", "rating": "4.5", "reviews": "1234"},  # duplicate
    {"name": "Bright Data", "price": "Contact Sales", "rating": None, "reviews": ""},
    {"name": "Oxylabs", "price": "$99.00/mo", "rating": "4.7", "reviews": "2,100"},
    {"name": None, "price": "$19.99/mo", "rating": "3.8", "reviews": "456"},
]

df = pd.DataFrame(data)
print(df)

Step 1: Remove Duplicates

# Normalize names for comparison, then deduplicate
df["name_clean"] = df["name"].str.strip().str.lower()
df = df.drop_duplicates(subset="name_clean", keep="first")
df = df.drop(columns=["name_clean"])
print(f"After dedup: {len(df)} rows")

Step 2: Handle Missing Values

# Drop rows with no name
df = df.dropna(subset=["name"])

# Fill missing ratings with median
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
df["rating"] = df["rating"].fillna(df["rating"].median())

# Replace empty strings with NaN
df = df.replace("", pd.NA)

Step 3: Clean and Convert Types

import re

def parse_price(price_str):
    """Extract numeric price from strings like '$49.99/mo'."""
    if pd.isna(price_str):
        return None
    match = re.search(r'[\d.]+', str(price_str))
    return float(match.group()) if match else None

def parse_reviews(review_str):
    """Convert '1,234' to integer 1234."""
    if pd.isna(review_str):
        return 0
    cleaned = str(review_str).replace(",", "")
    return int(cleaned) if cleaned.isdigit() else 0

df["price_usd"] = df["price"].apply(parse_price)
df["review_count"] = df["reviews"].apply(parse_reviews)
df["name"] = df["name"].str.strip()

print(df[["name", "price_usd", "rating", "review_count"]])

         name  price_usd  rating  review_count
0   ScraperAPI      49.99     4.5          1234
1  ScrapingAnt      29.00     4.2           890
3   Bright Data       NaN     4.4             0
4      Oxylabs      99.00     4.7          2100

Step 4: Standardize Text

# Title case names
df["name"] = df["name"].str.title()

# Remove special characters
df["name"] = df["name"].str.replace(r'[^\w\s]', '', regex=True)

# Strip extra whitespace
df["name"] = df["name"].str.replace(r'\s+', ' ', regex=True).str.strip()

Step 5: Validate and Export

# Validation checks
assert df["name"].notna().all(), "Missing names found"
assert (df["rating"].between(0, 5) | df["rating"].isna()).all(), "Invalid ratings"

# Export clean data
df.to_csv("clean_products.csv", index=False)
df.to_json("clean_products.json", orient="records", indent=2)
print(f"Exported {len(df)} clean records")

Common Cleaning Operations

Problem	pandas Solution
Leading/trailing spaces	`df["col"].str.strip()`
Inconsistent case	`df["col"].str.lower()`
Duplicate rows	`df.drop_duplicates()`
Missing values	`df.fillna()` or `df.dropna()`
Wrong data type	`pd.to_numeric(df["col"], errors="coerce")`
Comma in numbers	`df["col"].str.replace(",", "")`
HTML entities	`html.unescape(text)`

Next Steps

Extract structured data from unstructured HTML
Deduplicate scraped data at scale
Normalize and validate data for production use