Deduplication Strategies
Scrapers produce duplicates: re-runs, paginated overlap, multiple URLs for the same item, near-identical rows with whitespace differences. Strategies from exact-match to fuzzy.
What you’ll learn
- Deduplicate on exact key (URL, ID), the bread-and-butter case.
- Hash content to dedupe rows that are byte-identical.
- Use fingerprinting for near-duplicates (whitespace, ordering, light variations).
- Apply fuzzy matching for cross-source merging.
Every scraper produces duplicates eventually: same item across re-runs, same product on multiple URLs, similar-but-not-identical rows due to whitespace, the same news article on different sites. Choosing the right dedup strategy depends on how strict "same" needs to be.
Four levels of dedup, from strict to fuzzy
| Level | Matches | When to use |
|---|---|---|
| Exact key | Same URL/ID | Re-scraping the same source |
| Content hash | Byte-identical rows | Catching repeat captures |
| Fingerprint | Same after normalization | Whitespace, ordering, trivial differences |
| Fuzzy match | "Probably the same" by similarity | Cross-source merging |
Pick the loosest level that still satisfies "I don't want two of this in my data."
Level 1: exact key
The bread-and-butter case: the scrape lists products by URL, so the URL is the natural key.
Python with a set:
seen_urls = set()
clean = []
for row in rows:
    if row["source_url"] not in seen_urls:
        seen_urls.add(row["source_url"])
        clean.append(row)
SQL (the right answer when the data is in SQLite):
CREATE UNIQUE INDEX idx_url ON products(source_url);
-- now INSERT OR REPLACE handles duplicates at write time
PHP:
$seen = [];
$clean = [];
foreach ($rows as $row) {
    if (!isset($seen[$row['source_url']])) {
        $seen[$row['source_url']] = true;
        $clean[] = $row;
    }
}
When the data already lives in SQLite with a UNIQUE constraint, dedup is automatic at insert time, no separate dedup pass needed.
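A minimal write-time sketch with Python's sqlite3 module (the table and column names here are assumptions; INSERT OR IGNORE keeps the first copy, while INSERT OR REPLACE would keep the latest):

import sqlite3

con = sqlite3.connect("scrape.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS products (
        source_url TEXT PRIMARY KEY,   -- PRIMARY KEY implies UNIQUE
        name TEXT,
        price REAL
    )
""")
for row in rows:
    # Rows whose source_url already exists are silently skipped at insert time
    con.execute(
        "INSERT OR IGNORE INTO products (source_url, name, price) VALUES (?, ?, ?)",
        (row["source_url"], row["name"], row["price"]),
    )
con.commit()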
Level 2: content hash
Two rows have different URLs but identical content. Hash the content:
import hashlib, json

def content_hash(row):
    # Drop keys that aren't part of the content
    keys = sorted(k for k in row.keys() if k not in {"id", "scraped_at", "source_url"})
    canonical = json.dumps({k: row[k] for k in keys}, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode()).hexdigest()

seen = set()
clean = []
for row in rows:
    h = content_hash(row)
    if h not in seen:
        seen.add(h)
        clean.append(row)
Sort keys before serializing so that {"a": 1, "b": 2} and {"b": 2, "a": 1} hash the same. Use sha256 over the content for stable identity; sha1 and md5 also work, but sha256 is the modern default.
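A quick sanity check with two made-up rows: same content, different key order and different URLs, identical hash.

a = {"name": "Yellow mug", "price": 9.99, "source_url": "https://site-a.example/p/1"}
b = {"price": 9.99, "source_url": "https://site-b.example/p/7", "name": "Yellow mug"}
assert content_hash(a) == content_hash(b)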
Level 3: fingerprinting
Sometimes the content is "the same" but differs in whitespace, capitalization, or trailing punctuation. Normalize before hashing:
import re, unicodedata

def fingerprint(s):
    s = unicodedata.normalize("NFKC", s)  # unify Unicode forms (full-width chars, ligatures, etc.)
    s = s.lower()
    s = re.sub(r"\s+", " ", s).strip()    # collapse whitespace
    s = re.sub(r"[^\w\s]", "", s)         # strip punctuation
    return s
Now ' Yellow MUG! ' and 'Yellow mug' both fingerprint to 'yellow mug'. Build the dedup key from the fingerprints of the discriminating fields:
def row_fingerprint(row):
    return (fingerprint(row["name"]), fingerprint(row["category"]))
Use this as the dedup key.
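A minimal dedup pass using that tuple as the key, same shape as the exact-key loop above:

seen = set()
clean = []
for row in rows:
    fp = row_fingerprint(row)
    if fp not in seen:
        seen.add(fp)
        clean.append(row)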
This catches the most common "soft duplicate" cases: re-scraped pages where the site changed whitespace but not data, or two URLs that lead to the same content with different formatting.
Level 4: fuzzy matching
For cross-source merging ("this product on site A is the same product on site B"), exact and fingerprint matching are too strict. Use string-similarity metrics:
from rapidfuzz import fuzz

ratio = fuzz.ratio("Yellow ceramic mug 12oz", "Yellow Ceramic Mug, 12 oz")
print(ratio)  # ~87.5 (out of 100)

# Token-based, more tolerant of word reordering
# (rapidfuzz scorers are case-sensitive by default, so lowercase first)
print(fuzz.token_sort_ratio("mug ceramic yellow", "yellow ceramic mug"))  # 100.0
rapidfuzz is a fast, C++-backed library (pip install rapidfuzz). Set a threshold (typically 85-95) and treat pairs that score above it as duplicates.
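One way to apply that threshold is rapidfuzz's process.extractOne helper, sketched here for matching one site's product names against another's (site_a_rows, site_b_rows, and the 90 cutoff are assumptions to adapt):

from rapidfuzz import fuzz, process, utils

site_a_names = [row["name"] for row in site_a_rows]

for row in site_b_rows:
    match = process.extractOne(
        row["name"],
        site_a_names,
        scorer=fuzz.token_sort_ratio,
        processor=utils.default_process,  # lowercase and strip punctuation before scoring
        score_cutoff=90,                  # only accept strong matches
    )
    if match is not None:
        name, score, index = match
        print(f"{row['name']!r} likely matches {name!r} (score {score:.0f})")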
PHP equivalent:
// similar_text() is built into PHP; no Composer dependency needed
similar_text("Yellow ceramic mug", "Yellow Ceramic Mug, 12oz", $pct);
echo $pct;  // similarity as a percentage
For better PHP fuzzy matching, look at wikimedia/levenshtein-php or use PHP's built-in levenshtein() for short strings.
Blocking: keeping fuzzy matching fast
Fuzzy matching every pair is O(n²) and gets slow fast. Use blocking to compare only within candidate groups:
from collections import defaultdict

def block_key(row):
    # First letter of brand + first 3 chars of product code
    return f"{row['brand'][0].lower()}-{row['code'][:3]}"

blocks = defaultdict(list)
for row in rows:
    blocks[block_key(row)].append(row)

# Only fuzzy-compare within each block
for key, group in blocks.items():
    for i in range(len(group)):
        for j in range(i + 1, len(group)):
            if fuzz.token_sort_ratio(group[i]["name"], group[j]["name"]) > 90:
                # Likely duplicates
                ...
This reduces the comparison count from O(n²) to O(n × avg_block_size). For 100,000 rows, all-pairs comparison is roughly 5 billion pairs; with blocks averaging 50 rows it drops to about 2.5 million.
Choose the right key for your data
| Data | Likely key |
|---|---|
| Product pages on one site | URL |
| Same product across sites | brand + model + size fingerprint |
| News articles | URL OR title fingerprint OR DOI |
| Job postings | title + company + location fingerprint |
| People records | name fingerprint + email OR phone |
| Real estate | address fingerprint OR lat/lon proximity |
There's no universal answer. Spend a few minutes thinking about what makes two records "the same" for YOUR project.
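As one illustration, a sketch of a composite key for the job-postings row in the table above, reusing fingerprint() from Level 3 (the field names are placeholders):

def job_key(row):
    # Two postings count as "the same" when title, company, and location all match after normalization
    return (
        fingerprint(row["title"]),
        fingerprint(row["company"]),
        fingerprint(row["location"]),
    )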
Keep the dedup decision auditable
When merging, don't silently drop one duplicate. Track it:
clean = {}
duplicates = []
for row in rows:
    key = fingerprint(row["name"])
    if key in clean:
        duplicates.append({"kept": clean[key]["id"], "dropped": row["id"]})
    else:
        clean[key] = row

# clean.values() is the deduped data
# duplicates is the audit log
If your downstream analysis finds something weird, the audit log lets you trace why.
Incremental dedup against a database
For ongoing scrapes, dedupe at write time:
import sqlite3

con = sqlite3.connect("scrape.db")
cur = con.cursor()
cur.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_fp ON products(fingerprint)")

for row in new_rows:
    fp = fingerprint(row["name"])
    cur.execute("""
        INSERT INTO products (fingerprint, source_url, name, price, scraped_at)
        VALUES (?, ?, ?, ?, CURRENT_TIMESTAMP)
        ON CONFLICT(fingerprint) DO UPDATE SET
            scraped_at = CURRENT_TIMESTAMP,
            price = excluded.price
    """, (fp, row["source_url"], row["name"], row["price"]))

con.commit()
The UNIQUE INDEX + UPSERT pattern handles ongoing dedup with no separate cleaning pass. This scales to millions of rows.
Hands-on lab
Scrape /blog (multiple pages). Likely duplicates: links to the same post via different /blog/page/N and /blog/tag/... and /blog/author/... routes. Use a content fingerprint (title + first paragraph) to dedupe across these routes. Print the count before and after dedup. Inspect a few duplicates to confirm they really were the same article from different listing pages.
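One possible shape for that content fingerprint, reusing fingerprint() from Level 3 (the dict keys are placeholders for whatever your scraper extracts):

def post_fingerprint(post):
    # Same article if the normalized title and first paragraph both match
    return (fingerprint(post["title"]), fingerprint(post["first_paragraph"]))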
Practice this lesson on Catalog108, our first-party scraping sandbox.
Open lab target → /blog

Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.