Deduplication Strategies
Scrapers produce duplicates: re-runs, paginated overlap, multiple URLs for the same item, near-identical rows with whitespace differences. Strategies from exact-match to fuzzy.
What you’ll learn
- Deduplicate on exact key (URL, ID), the bread-and-butter case.
- Hash content to dedupe rows that are byte-identical.
- Use fingerprinting for near-duplicates (whitespace, ordering, light variations).
- Apply fuzzy matching for cross-source merging.
Every scraper produces duplicates eventually: same item across re-runs, same product on multiple URLs, similar-but-not-identical rows due to whitespace, the same news article on different sites. Choosing the right dedup strategy depends on how strict "same" needs to be.
Four levels of dedup, from strict to fuzzy
| Level | Matches | When to use |
|---|---|---|
| Exact key | Same URL/ID | Re-scraping the same source |
| Content hash | Byte-identical rows | Catching repeat captures |
| Fingerprint | Same after normalization | Whitespace, ordering, trivial differences |
| Fuzzy match | "Probably the same" by similarity | Cross-source merging |
Pick the loosest level that still satisfies "I don't want two of this in my data."
Level 1: exact key
The bread-and-butter case: the scrape lists products by URL, so the URL is the natural key.
Python with a set:
seen_urls = set()
clean = []
for row in rows:
    if row["source_url"] not in seen_urls:
        seen_urls.add(row["source_url"])
        clean.append(row)
SQL (the right answer when the data is in SQLite):
CREATE UNIQUE INDEX idx_url ON products(source_url);
-- now INSERT OR REPLACE handles duplicates at write time
PHP:
$seen = [];
$clean = [];
foreach ($rows as $row) {
    if (!isset($seen[$row['source_url']])) {
        $seen[$row['source_url']] = true;
        $clean[] = $row;
    }
}
When the data already lives in SQLite with a UNIQUE constraint, dedup is automatic at insert time, no separate dedup pass needed.
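A minimal write-time sketch with Python's sqlite3 module (the table and column names here are assumptions; INSERT OR IGNORE keeps the first copy, while INSERT OR REPLACE would keep the latest):

import sqlite3

con = sqlite3.connect("scrape.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS products (
        source_url TEXT PRIMARY KEY,   -- PRIMARY KEY implies UNIQUE
        name TEXT,
        price REAL
    )
""")
for row in rows:
    # Rows whose source_url already exists are silently skipped at insert time
    con.execute(
        "INSERT OR IGNORE INTO products (source_url, name, price) VALUES (?, ?, ?)",
        (row["source_url"], row["name"], row["price"]),
    )
con.commit()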
Level 2: content hash
Two rows have different URLs but identical content. Hash the content:
import hashlib, json

def content_hash(row):
    # Drop keys that aren't part of the content
    keys = sorted(k for k in row.keys() if k not in {"id", "scraped_at", "source_url"})
    canonical = json.dumps({k: row[k] for k in keys}, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode()).hexdigest()

seen = set()
clean = []
for row in rows:
    h = content_hash(row)
    if h not in seen:
        seen.add(h)
        clean.append(row)
Sort keys before serializing so that {"a": 1, "b": 2} and {"b": 2, "a": 1} hash the same. Use sha256 over the content for stable identity; sha1 and md5 also work, but sha256 is the modern default.
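A quick sanity check with two made-up rows: same content, different key order and different URLs, identical hash.

a = {"name": "Yellow mug", "price": 9.99, "source_url": "https://site-a.example/p/1"}
b = {"price": 9.99, "source_url": "https://site-b.example/p/7", "name": "Yellow mug"}
assert content_hash(a) == content_hash(b)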
Level 3: fingerprinting
Sometimes the content is "the same" but differs in whitespace, capitalization, or trailing punctuation. Normalize before hashing:
import re, unicodedata

def fingerprint(s):
    s = unicodedata.normalize("NFKC", s)  # unify Unicode forms (full-width chars, ligatures, etc.)
    s = s.lower()
    s = re.sub(r"\s+", " ", s).strip()    # collapse whitespace
    s = re.sub(r"[^\w\s]", "", s)         # strip punctuation
    return s
Now ' Yellow MUG! ' and 'Yellow mug' both fingerprint to 'yellow mug'. Build the dedup key from the fingerprints of the discriminating fields:
def row_fingerprint(row):
    return (fingerprint(row["name"]), fingerprint(row["category"]))
Use this as the dedup key.
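A minimal dedup pass using that tuple as the key, same shape as the exact-key loop above:

seen = set()
clean = []
for row in rows:
    fp = row_fingerprint(row)
    if fp not in seen:
        seen.add(fp)
        clean.append(row)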
This catches the most common "soft duplicate" cases: re-scraped pages where the site changed whitespace but not data, or two URLs that lead to the same content with different formatting.
Level 4: fuzzy matching
For cross-source merging ("this product on site A is the same product on site B"), exact and fingerprint matching are too strict. Use string-similarity metrics:
from rapidfuzz import fuzz

ratio = fuzz.ratio("Yellow ceramic mug 12oz", "Yellow Ceramic Mug, 12 oz")
print(ratio)  # ~87.5 (out of 100)

# Token-based, more tolerant of word reordering
# (rapidfuzz scorers are case-sensitive by default, so lowercase first)
print(fuzz.token_sort_ratio("mug ceramic yellow", "yellow ceramic mug"))  # 100.0
rapidfuzz is a fast, C++-backed library (pip install rapidfuzz). Set a threshold (typically 85-95) and treat pairs that score above it as duplicates.
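One way to apply that threshold is rapidfuzz's process.extractOne helper, sketched here for matching one site's product names against another's (site_a_rows, site_b_rows, and the 90 cutoff are assumptions to adapt):

from rapidfuzz import fuzz, process, utils

site_a_names = [row["name"] for row in site_a_rows]

for row in site_b_rows:
    match = process.extractOne(
        row["name"],
        site_a_names,
        scorer=fuzz.token_sort_ratio,
        processor=utils.default_process,  # lowercase and strip punctuation before scoring
        score_cutoff=90,                  # only accept strong matches
    )
    if match is not None:
        name, score, index = match
        print(f"{row['name']!r} likely matches {name!r} (score {score:.0f})")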
PHP equivalent:
// similar_text() is built into PHP; no Composer dependency needed
similar_text("Yellow ceramic mug", "Yellow Ceramic Mug, 12oz", $pct);
echo $pct;  // similarity as a percentage
For better PHP fuzzy matching, look at wikimedia/levenshtein-php or use PHP's built-in levenshtein() for short strings.
Blocking: keeping fuzzy matching fast
Fuzzy matching every pair is O(n²) and gets slow fast. Use blocking to compare only within candidate groups:
from collections import defaultdict

def block_key(row):
    # First letter of brand + first 3 chars of product code
    return f"{row['brand'][0].lower()}-{row['code'][:3]}"

blocks = defaultdict(list)
for row in rows:
    blocks[block_key(row)].append(row)

# Only fuzzy-compare within each block
for key, group in blocks.items():
    for i in range(len(group)):
        for j in range(i + 1, len(group)):
            if fuzz.token_sort_ratio(group[i]["name"], group[j]["name"]) > 90:
                # Likely duplicates
                ...
This reduces the comparison count from O(n²) to O(n × avg_block_size). For 100,000 rows, all-pairs comparison is roughly 5 billion pairs; with blocks averaging 50 rows it drops to about 2.5 million.
Choose the right key for your data
| Data | Likely key |
|---|---|
| Product pages on one site | URL |
| Same product across sites | brand + model + size fingerprint |
| News articles | URL OR title fingerprint OR DOI |
| Job postings | title + company + location fingerprint |
| People records | name fingerprint + email OR phone |
| Real estate | address fingerprint OR lat/lon proximity |
There's no universal answer. Spend a few minutes thinking about what makes two records "the same" for YOUR project.
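As one illustration, a sketch of a composite key for the job-postings row in the table above, reusing fingerprint() from Level 3 (the field names are placeholders):

def job_key(row):
    # Two postings count as "the same" when title, company, and location all match after normalization
    return (
        fingerprint(row["title"]),
        fingerprint(row["company"]),
        fingerprint(row["location"]),
    )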
Keep the dedup decision auditable
When merging, don't silently drop one duplicate. Track it:
clean = {}
duplicates = []
for row in rows:
    key = fingerprint(row["name"])
    if key in clean:
        duplicates.append({"kept": clean[key]["id"], "dropped": row["id"]})
    else:
        clean[key] = row

# clean.values() is the deduped data
# duplicates is the audit log
If your downstream analysis finds something weird, the audit log lets you trace why.
Incremental dedup against a database
For ongoing scrapes, dedupe at write time:
import sqlite3

con = sqlite3.connect("scrape.db")
cur = con.cursor()
cur.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_fp ON products(fingerprint)")

for row in new_rows:
    fp = fingerprint(row["name"])
    cur.execute("""
        INSERT INTO products (fingerprint, source_url, name, price, scraped_at)
        VALUES (?, ?, ?, ?, CURRENT_TIMESTAMP)
        ON CONFLICT(fingerprint) DO UPDATE SET
            scraped_at = CURRENT_TIMESTAMP,
            price = excluded.price
    """, (fp, row["source_url"], row["name"], row["price"]))

con.commit()
The UNIQUE INDEX + UPSERT pattern handles ongoing dedup with no separate cleaning pass. This scales to millions of rows.
Hands-on lab
Scrape /blog (multiple pages). Likely duplicates: links to the same post via different /blog/page/N and /blog/tag/... and /blog/author/... routes. Use a content fingerprint (title + first paragraph) to dedupe across these routes. Print the count before and after dedup. Inspect a few duplicates to confirm they really were the same article from different listing pages.
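One possible shape for that content fingerprint, reusing fingerprint() from Level 3 (the dict keys are placeholders for whatever your scraper extracts):

def post_fingerprint(post):
    # Same article if the normalized title and first paragraph both match
    return (fingerprint(post["title"]), fingerprint(post["first_paragraph"]))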
Practice this lesson on Catalog108, our first-party scraping sandbox.
Open lab target → /blog

Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.