
Deduplication at Scale (Bloom Filters, Content Hashing)

Once you're scraping millions of URLs, naive sets blow up your RAM. Bloom filters, content hashes, and the right approximate-or-exact tradeoff.

What you’ll learn

  • Decide between exact and probabilistic deduplication.
  • Implement a Bloom filter for URL seen-tracking.
  • Use content hashing to detect unchanged pages.

A crawler's first dedup is "have I seen this URL?" The naive answer is a Python set or a Redis SET, and it works fine up to a few million URLs. Past that, memory and storage costs make smarter structures worth it.

This lesson covers two complementary dedup layers: URL-level (don't re-fetch the same URL) and content-level (don't re-store the same payload).

URL dedup: exact vs approximate

Approach                       Memory                          False positives       False negatives
Python set of URLs             ~100 bytes per URL              None                  None
Redis SET / SADD               ~80 bytes per URL               None                  None
Redis SET of SHA-1 hashes      40 bytes per URL (hex digest)   None                  None
Bloom filter                   ~1.2 bytes per URL at 1% FP     ~1% (configurable)    None
Cuckoo filter                  ~1.5 bytes per URL              ~3% (configurable)    None

For most scrapers, exact dedup works. Bloom filters pay off when you have 100M+ URLs and RAM matters more than a tiny rate of wrongly skipped URLs.
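
Exact dedup with hashed members is a one-liner in Redis. A minimal sketch of the "Redis SET of SHA-1 hashes" row from the table above (the key name seen_urls_exact is an assumption):

import hashlib
import redis

r = redis.Redis()

def seen_before(url: str) -> bool:
    # SADD returns 1 if the member was newly added, 0 if it was already present.
    digest = hashlib.sha1(url.encode()).hexdigest()  # 40-char hex, ~40 bytes per URL
    return r.sadd("seen_urls_exact", digest) == 0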

Bloom filters, the intuition

A Bloom filter is an array of bits plus k hash functions. To "add" an item, hash it k ways and set those bits. To "check," hash it the same way and check all k bits are set.

  • All bits set → item might be in the set (could be a coincidence: false positive).
  • Any bit unset → item definitely is not in the set (no false negatives).

For URL deduplication, a false positive means the filter says "seen" for a URL you actually haven't visited, so you skip it; at a 1% rate that's usually harmless. A false negative would only cost a redundant re-fetch, but Bloom filters never produce them.
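
To make the mechanics concrete, here is a toy from-scratch sketch. Deriving the k positions from two halves of a single MD5 digest (double hashing) is just one common trick; use a real library in production.

import hashlib

class ToyBloom:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions from one digest: h1 + i*h2 (double hashing).
        digest = hashlib.md5(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

A False answer from the check is definitive; True only means "probably".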

Sizing

For n items and false-positive rate p:

  • Bits needed: m = -n * ln(p) / (ln(2)^2)
  • Optimal hash functions: k = (m/n) * ln(2)

100M URLs at 1% FP rate: ~120MB of bits and 7 hash functions. The same in a Python set: ~10GB.
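
A quick way to sanity-check those numbers with nothing but the standard library:

import math

def bloom_size(n: int, p: float) -> tuple[int, int]:
    """Return (bits, hash_count) for n items at false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k

bits, hashes = bloom_size(100_000_000, 0.01)
print(bits / 8 / 1e6, "MB,", hashes, "hash functions")  # ~120 MB, 7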

Python with pybloom_live

from pybloom_live import ScalableBloomFilter

bf = ScalableBloomFilter(
    initial_capacity=10_000_000,
    error_rate=0.01,
    mode=ScalableBloomFilter.LARGE_SET_GROWTH,
)

def should_fetch(url: str) -> bool:
    if url in bf:      # probably seen already (small chance of a false positive)
        return False
    bf.add(url)
    return True

ScalableBloomFilter grows automatically as you add items, handy when you don't know the upper bound.

Redis-backed Bloom

For distributed scrapers, all workers must share the dedup state. RedisBloom (a Redis module) gives you network-shared Bloom filters:

import redis

r = redis.Redis()
# Reserve once at startup; BF.RESERVE errors if the filter already exists.
r.execute_command("BF.RESERVE", "seen_urls", 0.01, 100_000_000)

def should_fetch(url: str) -> bool:
    added = r.execute_command("BF.ADD", "seen_urls", url)
    return added == 1  # 1 means new, 0 means (probably) seen

BF.ADD is atomic; multiple workers can hit it concurrently without races.

Content hashing, detect unchanged pages

URL dedup says "I've seen this URL." Content hashing says "the page hasn't changed since the last fetch." It's useful when re-crawling existing pages: if the content is identical, skip parsing.

import hashlib

def content_hash(html: str) -> str:
    # Normalize before hashing; how aggressive to be (collapse whitespace,
    # lowercase tags, drop volatile fragments) depends on the noise level.
    canonical = " ".join(html.split())
    return hashlib.sha256(canonical.encode()).hexdigest()

Store (url, content_hash, last_seen) in Postgres. On next fetch:

SELECT content_hash FROM page_snapshots WHERE url = $1;

If the new hash matches the stored one, skip parsing and storage. Update last_seen only.

Beware noise: timestamps in the HTML, CSRF tokens, and ad slots change on every request. Normalize by stripping known-volatile fragments before hashing, or hash only the content section after extraction.
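
A minimal sketch of the stripping approach, assuming the volatile bits are <script> blocks, HTML comments, and hidden CSRF inputs (the patterns are illustrative, not exhaustive):

import hashlib
import re

# Illustrative patterns; extend with whatever is volatile on your target pages.
VOLATILE_PATTERNS = [
    re.compile(r"<script\b[^>]*>.*?</script>", re.DOTALL | re.IGNORECASE),
    re.compile(r"<!--.*?-->", re.DOTALL),
    re.compile(r'<input[^>]*name="csrf[^"]*"[^>]*>', re.IGNORECASE),
]

def stable_content_hash(html: str) -> str:
    for pattern in VOLATILE_PATTERNS:
        html = pattern.sub("", html)
    canonical = " ".join(html.split())
    return hashlib.sha256(canonical.encode()).hexdigest()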

Fuzzy content hashing

For "this page is mostly unchanged," exact SHA isn't enough. Options:

  • SimHash: produces a fingerprint where Hamming distance correlates with similarity. Two pages with 95% the same text → simhashes differ in only a few bits.
  • MinHash + LSH: for near-duplicate detection across a corpus (find products with the same description on different sites).
  • Diff fingerprinting: extract the structured data, hash the canonical JSON.

A common pragmatic pattern: extract structured fields → canonical JSON → SHA256 → compare. Ignore the surrounding HTML noise entirely.
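
A sketch of that pattern, assuming the structured fields have already been extracted into a dict:

import hashlib
import json

def record_fingerprint(record: dict) -> str:
    # Canonical JSON: sorted keys, no extra whitespace, so the same data
    # always serializes to identical bytes regardless of dict ordering.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode()).hexdigest()

a = record_fingerprint({"name": "Widget", "price": "19.99"})
b = record_fingerprint({"price": "19.99", "name": "Widget"})
assert a == b  # key order doesn't matter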

PHP implementations

PHP has equivalents:

// Content fingerprint via hash(); md5 is fine here since this isn't security-sensitive
// (PHP >= 8.1 also offers murmur3* for fast non-cryptographic hashing).
$hash = hash('md5', $canonical);
// Bloom: composer require ottosmops/bloomy or ekreative/bloom-filter
$bf->add($url);
$seen = $bf->has($url);

For Redis-backed Bloom in PHP, use the Redis client's executeRaw(['BF.ADD'...]).

Combining layers

Production pattern:

  1. URL Bloom filter: a cheap "have I queued this?" check at enqueue time.
  2. Redis set of URL hashes: exact dedup for hot, recent URLs.
  3. Content hash in Postgres: per URL, detects when the page itself changed.

Layered: cheap probabilistic at the wide end, exact and persistent at the narrow end.
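
A sketch of how the layers compose, reusing should_fetch and content_hash from above; fetch, parse, get_stored_hash, and save_snapshot are hypothetical helpers you'd supply:

def process(url, fetch, parse, get_stored_hash, save_snapshot):
    # Layers 1-2: Bloom filter and/or exact Redis set at enqueue time.
    if not should_fetch(url):
        return  # probably seen before; skip cheaply

    html = fetch(url)

    # Layer 3: content hash; skip parsing and storage when the page is unchanged.
    new_hash = content_hash(html)
    if new_hash == get_stored_hash(url):
        save_snapshot(url, new_hash, parsed=None)  # just bump last_seen
        return

    save_snapshot(url, new_hash, parsed=parse(html))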

False-positive tuning

Bloom filter FP rate is a knob. An error rate of 0.1% (only about 1.5x the memory of 1%, since bits scale with -ln(p)) feels "exact" in practice. For URL dedup, 1% is often fine: accidentally skipping 1% of URLs is rarely catastrophic. For dedup that gates downstream billing or alerts, prefer exact stores.

What to try

Generate 1M random URLs. Time:

  1. Adding/checking with a Python set.
  2. Same with a Bloom filter at 1% FP rate.

Compare RAM (use tracemalloc). The Bloom filter should be ~100x smaller while still catching every duplicate; the cost is falsely flagging roughly 1% of new URLs as already seen. Then add a content-hash check against the Catalog108 /products pages; a second fetch should skip parsing entirely when the page hasn't changed.
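
A starting point for the benchmark, assuming pybloom_live from earlier is installed (tracemalloc only tracks Python-level allocations, so treat the numbers as rough):

import random
import string
import time
import tracemalloc

from pybloom_live import ScalableBloomFilter

urls = [
    "https://example.com/p/" + "".join(random.choices(string.ascii_lowercase, k=12))
    for _ in range(1_000_000)
]

def bench(add, contains):
    tracemalloc.start()
    start = time.perf_counter()
    for url in urls:
        if not contains(url):
            add(url)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return f"{elapsed:.1f}s, peak {peak / 1e6:.1f} MB"

s = set()
print("set:  ", bench(s.add, s.__contains__))

bf = ScalableBloomFilter(initial_capacity=1_000_000, error_rate=0.01)
print("bloom:", bench(bf.add, bf.__contains__))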
