Data Cleaning After Scraping (pandas)
Clean, transform, and prepare scraped data for analysis using pandas. Handle missing values, duplicates, type conversions, and text normalization.
Data Parsing · #6intermediate3 min read
Raw scraped data is messy, inconsistent formatting, missing values, duplicates, and wrong types. pandas is the tool of choice for cleaning it into analysis-ready datasets.
Setup
pip install pandas
A Typical Messy Scraped Dataset
import pandas as pd
# Simulating raw scraped data
data = [
{"name": " ScraperAPI ", "price": "$49.99/mo", "rating": "4.5", "reviews": "1,234"},
{"name": "ScrapingAnt", "price": "$29/mo", "rating": "4.2", "reviews": "890"},
{"name": "scraperapi", "price": "$49.99/mo", "rating": "4.5", "reviews": "1234"}, # duplicate
{"name": "Bright Data", "price": "Contact Sales", "rating": None, "reviews": ""},
{"name": "Oxylabs", "price": "$99.00/mo", "rating": "4.7", "reviews": "2,100"},
{"name": None, "price": "$19.99/mo", "rating": "3.8", "reviews": "456"},
]
df = pd.DataFrame(data)
print(df)
Step 1: Remove Duplicates
# Normalize names for comparison, then deduplicate
df["name_clean"] = df["name"].str.strip().str.lower()
df = df.drop_duplicates(subset="name_clean", keep="first")
df = df.drop(columns=["name_clean"])
print(f"After dedup: {len(df)} rows")
Step 2: Handle Missing Values
# Drop rows with no name
df = df.dropna(subset=["name"])
# Fill missing ratings with median
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
df["rating"] = df["rating"].fillna(df["rating"].median())
# Replace empty strings with NaN
df = df.replace("", pd.NA)
Step 3: Clean and Convert Types
import re
def parse_price(price_str):
"""Extract numeric price from strings like '$49.99/mo'."""
if pd.isna(price_str):
return None
match = re.search(r'[\d.]+', str(price_str))
return float(match.group()) if match else None
def parse_reviews(review_str):
"""Convert '1,234' to integer 1234."""
if pd.isna(review_str):
return 0
cleaned = str(review_str).replace(",", "")
return int(cleaned) if cleaned.isdigit() else 0
df["price_usd"] = df["price"].apply(parse_price)
df["review_count"] = df["reviews"].apply(parse_reviews)
df["name"] = df["name"].str.strip()
print(df[["name", "price_usd", "rating", "review_count"]])
name price_usd rating review_count
0 ScraperAPI 49.99 4.5 1234
1 ScrapingAnt 29.00 4.2 890
3 Bright Data NaN 4.4 0
4 Oxylabs 99.00 4.7 2100
Step 4: Standardize Text
# Title case names
df["name"] = df["name"].str.title()
# Remove special characters
df["name"] = df["name"].str.replace(r'[^\w\s]', '', regex=True)
# Strip extra whitespace
df["name"] = df["name"].str.replace(r'\s+', ' ', regex=True).str.strip()
Step 5: Validate and Export
# Validation checks
assert df["name"].notna().all(), "Missing names found"
assert (df["rating"].between(0, 5) | df["rating"].isna()).all(), "Invalid ratings"
# Export clean data
df.to_csv("clean_products.csv", index=False)
df.to_json("clean_products.json", orient="records", indent=2)
print(f"Exported {len(df)} clean records")
Common Cleaning Operations
| Problem | pandas Solution |
|---|---|
| Leading/trailing spaces | df["col"].str.strip() |
| Inconsistent case | df["col"].str.lower() |
| Duplicate rows | df.drop_duplicates() |
| Missing values | df.fillna() or df.dropna() |
| Wrong data type | pd.to_numeric(df["col"], errors="coerce") |
| Comma in numbers | df["col"].str.replace(",", "") |
| HTML entities | html.unescape(text) |
Next Steps
- Extract structured data from unstructured HTML
- Deduplicate scraped data at scale
- Normalize and validate data for production use