Big-O in Practice: Why Your Scraper Slows Down, CS Fundamentals for Scrapers

Time and space complexity, the way it actually bites scrapers. Plus a one-paragraph tour of computability and P vs NP, which is mostly interview trivia.

CS2 gave you containers. CS3 gave you traversals. Big-O is the lens that tells you whether your combination of them will finish in 10 seconds or 10 weeks.

The whole field of "complexity analysis" fills semesters. For scraping you need maybe two hours of mental model and one habit (reading the cost of operations off your code as you write it).

The five complexities you'll meet

Class	Name	Doubled `n` ⇒ time	Where in scraping
O(1)	Constant	Same	Dict/set lookup, list append, queue enqueue/dequeue
O(log n)	Logarithmic	+1 step	Binary search, heap push/pop
O(n)	Linear	2x	Iterating a list, parsing one HTML page
O(n log n)	Linearithmic	~2.1x	Sorting (list.sort, sorted, heap-based scheduling)
O(n²)	Quadratic	4x	Nested loop over the same collection, or O(n) operation done n times (e.g. `if x in list_` in a loop)

Memorise this table. 90% of the complexity reasoning you'll do in scraping is just figuring out which row a chunk of code lands on.

The big slow-down: O(n²) hiding in a one-liner

Here's the bug that catches every junior scraper exactly once:

visited = []                  # ← list, not set
for url in urls:
    if url not in visited:    # ← O(len(visited)) per check
        visited.append(url)
        fetch(url)

The url not in visited is O(n), and you do it once per URL, so total work is O(n²). For 1M URLs that's 1 trillion operations. Even at 100M ops/sec, that's ~3 hours of pure list-scanning before you've fetched a single page.

The fix:

visited = set()
for url in urls:
    if url not in visited:    # ← O(1) per check
        visited.add(url)
        fetch(url)

Same code shape. Total work is O(n). For 1M URLs, that's 1M operations, milliseconds.

This is the single most important CS lesson for scrapers. The data structure choice is the algorithm's complexity.

The "starts fast, ends slow" pattern

A scraper runs at 100 URLs/second when it starts. After a few hours, it's at 5/second. Why?

Likely culprit: an O(n) operation in the hot loop where n grows.

Examples:

Storing all results in a list and de-duping by result not in results every iteration. As results grows, dedup takes longer.
Writing each result to a CSV by re-opening the file (CSV append is O(1), but if you're rewriting the whole file each time, that's O(n)).
Logging the URL frontier on each iteration. Logging a million-element queue is O(n) per log line.

Symptom: throughput visibly degrades with n. Fix: identify the O(n) operation and replace it (sets for membership, file handles kept open for append, sampled logs).

Reading cost off your code

You won't run a profiler for every change. With practice you'll read complexity off snippets at a glance. Three habits:

Habit 1: name the structure, name the operation, name the cost

seen = set()             # set: O(1) add/lookup
if url in seen: ...      # set lookup → O(1)

urls = []                # list: O(1) append, O(n) lookup
if url in urls: ...      # list lookup → O(n)

Every container has a complexity profile. Once you know the seven from CS2, this becomes automatic.

Habit 2: count the nested loops

for product in products:           # n iterations
    for other in products:         # n iterations, total = n²
        if similar(product, other):
            ...

Two nested loops over the same collection is O(n²). For 10k products that's 100M comparisons, fine. For 1M products, 1 trillion, dead.

The fix is usually to swap the inner loop for a dict lookup:

by_key = {p["key"]: p for p in products}    # O(n) build
for product in products:                    # n iterations
    other = by_key.get(product["match_key"]) # O(1) lookup, total = O(n)

Habit 3: spot the "n inside a loop" trap

for url in urls:
    visited.append(url)              # O(1), fine
    if visited[0] == url: ...        # O(1), fine

for url in urls:
    visited.insert(0, url)           # O(n) per insert → O(n²) total

The fix is deque (O(1) append-left) or simply appending and reading the list in reverse.

Space complexity, the other axis

Big-O isn't just time. Space complexity is how much memory the algorithm uses. Scrapers usually have generous time budgets and tight memory budgets, so this often matters more.

The killer pattern:

all_pages = []
for url in urls:
    all_pages.append(fetch(url))     # store every page in memory

For 1M pages at 100 KB each, that's 100 GB. Your laptop dies.

The fix is to stream:

for url in urls:
    page = fetch(url)
    extract_and_save(page)           # write to disk/DB, drop the page

Same time complexity. Space drops from O(n) to O(1).

The Bloom filter (CS2) is a space-complexity trick: O(n) hash set vs O(1)-per-element bit array, at the cost of false positives. For 100M URLs, the difference is "20 GB" vs "120 MB".

A scraper-sized cheat sheet

Pattern	Time	Space
Linear scan of n items	O(n)	O(1)
Build a dict from n items	O(n)	O(n)
Sort n items	O(n log n)	O(n)
BFS/DFS a graph with V nodes and E edges	O(V + E)	O(V)
Binary search a sorted array	O(log n)	O(1)
Dedup with set	O(n) total	O(n)
Dedup with Bloom filter	O(n) total	O(n bits), ~10 bytes per item
Naive nested-loop comparison	O(n²)	O(1)

Memorise this and you'll predict bottlenecks before you write the profiler.

When O(n²) is fine

Not every O(n²) is a problem. If n is bounded and small (say, the products on a single page, n ≤ 50), then n² is 2,500 operations, instant. Don't optimise prematurely.

The trap is when n is the whole crawl, not a single page. Crawls grow; per-page logic doesn't. The rule:

Per-page work: O(anything reasonable) is fine. The page is small.
Per-crawl work: must be O(n) or O(n log n). O(n²) crashes at scale.

Computability and P vs NP, the interview trivia

For completeness, since CS courses spend weeks here:

Computability asks "what's solvable, in principle, by any algorithm at all?" Some problems (the halting problem, Rice's theorem) are undecidable: no algorithm can solve them, ever. None of this affects scraping. You won't be asked to halt-check an arbitrary program. Move on.

P vs NP asks "is every problem we can quickly verify, also one we can quickly solve?" Open question for 50 years. NP-hard problems (travelling salesman, SAT) take exponential time on every known algorithm. None of these are in your daily scraping work. The one place they show up: optimisation problems hiding in scheduling. "Schedule N scrapes across M proxies to minimise total time" is a flavor of bin-packing, which is NP-hard. In practice you use a greedy heuristic and accept it's not optimal.

In an interview, "what's P vs NP" is a culture-fit question, not a technical one. Be able to say "P is what we can solve quickly, NP is what we can verify quickly, whether they're equal is open" and you've passed. Don't invest more than that.

Where to practice

Write a one-page Python file with three implementations of "dedup 1M URLs": one with a list, one with a set, one with a Bloom filter (use pybloom-live). Time each. The list version should take minutes; the set version, milliseconds; the Bloom filter, similar to the set but with a fraction of the memory.
For one of your existing scrapers, pick the hottest loop and write down the complexity of every line. Is anything O(n) inside an O(n) loop? Fix it.
LeetCode "Top Interview 150" filter, the easy and medium problems. Each one is a complexity puzzle in disguise. 50 problems gets you fluent.

Next: CS5 wraps the sub-path with the actual study plan: which DSA course, how many LeetCode problems, what to skip.

Big-O in Practice: Why Your Scraper Slows Down

What you’ll learn