
Deduplication at Scale (Bloom Filters, Content Hashing)

Once you're scraping millions of URLs, naive sets blow up your RAM. Bloom filters, content hashes, and the right approximate-or-exact tradeoff.

What you’ll learn

  • Decide between exact and probabilistic deduplication.
  • Implement a Bloom filter for URL seen-tracking.
  • Use content hashing to detect unchanged pages.

A crawler's first dedup is "have I seen this URL?" The naive answer is a Python set or a Redis SET, and it works fine up to a few million URLs. Past that, memory and storage costs make smarter structures worth it.

This lesson covers two complementary dedup layers: URL-level (don't re-fetch the same URL) and content-level (don't re-store the same payload).

URL dedup: exact vs approximate

Approach                       Memory                          False positives       False negatives
Python set of URLs             ~100 bytes per URL              None                  None
Redis SET / SADD               ~80 bytes per URL               None                  None
Redis SET of SHA-1 hashes      40 bytes per URL (hex digest)   None                  None
Bloom filter                   ~1.2 bytes per URL at 1% FP     ~1% (configurable)    None
Cuckoo filter                  ~1.5 bytes per URL              ~3% (configurable)    None

For most scrapers, exact dedup works. Bloom filters pay off when you have 100M+ URLs and RAM matters more than a tiny rate of wrongly skipped URLs.
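
Exact dedup with hashed members is a one-liner in Redis. A minimal sketch of the "Redis SET of SHA-1 hashes" row from the table above (the key name seen_urls_exact is an assumption):

import hashlib
import redis

r = redis.Redis()

def seen_before(url: str) -> bool:
    # SADD returns 1 if the member was newly added, 0 if it was already present.
    digest = hashlib.sha1(url.encode()).hexdigest()  # 40-char hex, ~40 bytes per URL
    return r.sadd("seen_urls_exact", digest) == 0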

Bloom filters, the intuition

A Bloom filter is an array of bits plus k hash functions. To "add" an item, hash it k ways and set those bits. To "check," hash it the same way and check all k bits are set.

  • All bits set → item might be in the set (could be a coincidence: false positive).
  • Any bit unset → item definitely is not in the set (no false negatives).

For URL deduplication, a false positive means the filter says "seen" for a URL you actually haven't visited, so you skip it; at a 1% rate that's usually harmless. A false negative would only cost a redundant re-fetch, but Bloom filters never produce them.
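
To make the mechanics concrete, here is a toy from-scratch sketch. Deriving the k positions from two halves of a single MD5 digest (double hashing) is just one common trick; use a real library in production.

import hashlib

class ToyBloom:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions from one digest: h1 + i*h2 (double hashing).
        digest = hashlib.md5(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

A False answer from the check is definitive; True only means "probably".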

Sizing

For n items and false-positive rate p:

  • Bits needed: m = -n * ln(p) / (ln(2)^2)
  • Optimal hash functions: k = (m/n) * ln(2)

100M URLs at 1% FP rate: ~120MB of bits and 7 hash functions. The same in a Python set: ~10GB.
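
A quick way to sanity-check those numbers with nothing but the standard library:

import math

def bloom_size(n: int, p: float) -> tuple[int, int]:
    """Return (bits, hash_count) for n items at false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k

bits, hashes = bloom_size(100_000_000, 0.01)
print(bits / 8 / 1e6, "MB,", hashes, "hash functions")  # ~120 MB, 7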

Python with pybloom_live

from pybloom_live import ScalableBloomFilter

bf = ScalableBloomFilter(
    initial_capacity=10_000_000,
    error_rate=0.01,
    mode=ScalableBloomFilter.LARGE_SET_GROWTH,
)

def should_fetch(url: str) -> bool:
    if url in bf:      # probably seen already (small chance of a false positive)
        return False
    bf.add(url)
    return True

ScalableBloomFilter grows automatically as you add items, handy when you don't know the upper bound.

Redis-backed Bloom

For distributed scrapers, all workers must share the dedup state. RedisBloom (a Redis module) gives you network-shared Bloom filters:

import redis

r = redis.Redis()
# Reserve once at startup; BF.RESERVE errors if the filter already exists.
r.execute_command("BF.RESERVE", "seen_urls", 0.01, 100_000_000)

def should_fetch(url: str) -> bool:
    added = r.execute_command("BF.ADD", "seen_urls", url)
    return added == 1  # 1 means new, 0 means (probably) seen

BF.ADD is atomic; multiple workers can hit it concurrently without races.

Content hashing, detect unchanged pages

URL dedup says "I've seen this URL." Content hashing says "the page hasn't changed since the last fetch." It's useful when re-crawling existing pages: if the content is identical, skip parsing.

import hashlib

def content_hash(html: str) -> str:
    # Normalize before hashing; how aggressive to be (collapse whitespace,
    # lowercase tags, drop volatile fragments) depends on the noise level.
    canonical = " ".join(html.split())
    return hashlib.sha256(canonical.encode()).hexdigest()

Store (url, content_hash, last_seen) in Postgres. On next fetch:

SELECT content_hash FROM page_snapshots WHERE url = $1;

If the new hash matches the stored one, skip parsing and storage. Update last_seen only.

Beware noise: timestamps in the HTML, CSRF tokens, and ad slots change on every request. Normalize by stripping known-volatile fragments before hashing, or hash only the content section after extraction.
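
A minimal sketch of the stripping approach, assuming the volatile bits are <script> blocks, HTML comments, and hidden CSRF inputs (the patterns are illustrative, not exhaustive):

import hashlib
import re

# Illustrative patterns; extend with whatever is volatile on your target pages.
VOLATILE_PATTERNS = [
    re.compile(r"<script\b[^>]*>.*?</script>", re.DOTALL | re.IGNORECASE),
    re.compile(r"<!--.*?-->", re.DOTALL),
    re.compile(r'<input[^>]*name="csrf[^"]*"[^>]*>', re.IGNORECASE),
]

def stable_content_hash(html: str) -> str:
    for pattern in VOLATILE_PATTERNS:
        html = pattern.sub("", html)
    canonical = " ".join(html.split())
    return hashlib.sha256(canonical.encode()).hexdigest()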

Fuzzy content hashing

For "this page is mostly unchanged," exact SHA isn't enough. Options:

  • SimHash: produces a fingerprint where Hamming distance correlates with similarity. Two pages with 95% the same text → simhashes differ in only a few bits.
  • MinHash + LSH: for near-duplicate detection across a corpus (find products with the same description on different sites).
  • Diff fingerprinting: extract the structured data, hash the canonical JSON.

A common pragmatic pattern: extract structured fields → canonical JSON → SHA256 → compare. Ignore the surrounding HTML noise entirely.
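
A sketch of that pattern, assuming the structured fields have already been extracted into a dict:

import hashlib
import json

def record_fingerprint(record: dict) -> str:
    # Canonical JSON: sorted keys, no extra whitespace, so the same data
    # always serializes to identical bytes regardless of dict ordering.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode()).hexdigest()

a = record_fingerprint({"name": "Widget", "price": "19.99"})
b = record_fingerprint({"price": "19.99", "name": "Widget"})
assert a == b  # key order doesn't matter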

PHP implementations

PHP has equivalents:

// Content fingerprint via hash(); md5 is fine here since this isn't security-sensitive
// (PHP >= 8.1 also offers murmur3* for fast non-cryptographic hashing).
$hash = hash('md5', $canonical);
// Bloom: composer require ottosmops/bloomy or ekreative/bloom-filter
$bf->add($url);
$seen = $bf->has($url);

For Redis-backed Bloom in PHP, use the Redis client's executeRaw(['BF.ADD'...]).

Combining layers

Production pattern:

  1. URL Bloom filter: a cheap "have I queued this?" check at enqueue time.
  2. Redis set of URL hashes: exact dedup for hot, recent URLs.
  3. Content hash in Postgres: per URL, detects when the page itself changed.

Layered: cheap probabilistic at the wide end, exact and persistent at the narrow end.
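
A sketch of how the layers compose, reusing should_fetch and content_hash from above; fetch, parse, get_stored_hash, and save_snapshot are hypothetical helpers you'd supply:

def process(url, fetch, parse, get_stored_hash, save_snapshot):
    # Layers 1-2: Bloom filter and/or exact Redis set at enqueue time.
    if not should_fetch(url):
        return  # probably seen before; skip cheaply

    html = fetch(url)

    # Layer 3: content hash; skip parsing and storage when the page is unchanged.
    new_hash = content_hash(html)
    if new_hash == get_stored_hash(url):
        save_snapshot(url, new_hash, parsed=None)  # just bump last_seen
        return

    save_snapshot(url, new_hash, parsed=parse(html))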

False-positive tuning

Bloom filter FP rate is a knob. An error rate of 0.1% (only about 1.5x the memory of 1%, since bits scale with -ln(p)) feels "exact" in practice. For URL dedup, 1% is often fine: accidentally skipping 1% of URLs is rarely catastrophic. For dedup that gates downstream billing or alerts, prefer exact stores.

What to try

Generate 1M random URLs. Time:

  1. Adding/checking with a Python set.
  2. Same with a Bloom filter at 1% FP rate.

Compare RAM (use tracemalloc). The Bloom filter should be ~100x smaller while still catching every duplicate; the cost is falsely flagging roughly 1% of new URLs as already seen. Then add a content-hash check against the Catalog108 /products pages; a second fetch should skip parsing entirely when the page hasn't changed.
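
A starting point for the benchmark, assuming pybloom_live from earlier is installed (tracemalloc only tracks Python-level allocations, so treat the numbers as rough):

import random
import string
import time
import tracemalloc

from pybloom_live import ScalableBloomFilter

urls = [
    "https://example.com/p/" + "".join(random.choices(string.ascii_lowercase, k=12))
    for _ in range(1_000_000)
]

def bench(add, contains):
    tracemalloc.start()
    start = time.perf_counter()
    for url in urls:
        if not contains(url):
            add(url)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return f"{elapsed:.1f}s, peak {peak / 1e6:.1f} MB"

s = set()
print("set:  ", bench(s.add, s.__contains__))

bf = ScalableBloomFilter(initial_capacity=1_000_000, error_rate=0.01)
print("bloom:", bench(bf.add, bf.__contains__))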
