Big-O in Practice: Why Your Scraper Slows Down
Time and space complexity, the way it actually bites scrapers. Plus a one-paragraph tour of computability and P vs NP, which is mostly interview trivia.
What you’ll learn
- Read O(1), O(log n), O(n), O(n log n), O(n²) off a snippet at a glance.
- Predict which operation is the bottleneck before running a profiler.
- Explain why a scraper can run faster early and slower later (and what to do about it).
- Sound coherent if computability comes up in an interview, then move on.
CS2 gave you containers. CS3 gave you traversals. Big-O is the lens that tells you whether your combination of them will finish in 10 seconds or 10 weeks.
The whole field of "complexity analysis" fills semesters. For scraping you need maybe two hours of mental model and one habit (reading the cost of operations off your code as you write it).
The five complexities you'll meet
| Class | Name | Doubled n ⇒ time |
Where in scraping |
|---|---|---|---|
| O(1) | Constant | Same | Dict/set lookup, list append, queue enqueue/dequeue |
| O(log n) | Logarithmic | +1 step | Binary search, heap push/pop |
| O(n) | Linear | 2x | Iterating a list, parsing one HTML page |
| O(n log n) | Linearithmic | ~2.1x | Sorting (list.sort, sorted, heap-based scheduling) |
| O(n²) | Quadratic | 4x | Nested loop over the same collection, or O(n) operation done n times (e.g. if x in list_ in a loop) |
Memorise this table. 90% of the complexity reasoning you'll do in scraping is just figuring out which row a chunk of code lands on.
The big slow-down: O(n²) hiding in a one-liner
Here's the bug that catches every junior scraper exactly once:
visited = [] # ← list, not set
for url in urls:
if url not in visited: # ← O(len(visited)) per check
visited.append(url)
fetch(url)
The url not in visited is O(n), and you do it once per URL, so total work is O(n²). For 1M URLs that's 1 trillion operations. Even at 100M ops/sec, that's ~3 hours of pure list-scanning before you've fetched a single page.
The fix:
visited = set()
for url in urls:
if url not in visited: # ← O(1) per check
visited.add(url)
fetch(url)
Same code shape. Total work is O(n). For 1M URLs, that's 1M operations, milliseconds.
This is the single most important CS lesson for scrapers. The data structure choice is the algorithm's complexity.
The "starts fast, ends slow" pattern
A scraper runs at 100 URLs/second when it starts. After a few hours, it's at 5/second. Why?
Likely culprit: an O(n) operation in the hot loop where n grows.
Examples:
- Storing all results in a list and de-duping by
result not in resultsevery iteration. As results grows, dedup takes longer. - Writing each result to a CSV by re-opening the file (CSV append is O(1), but if you're rewriting the whole file each time, that's O(n)).
- Logging the URL frontier on each iteration. Logging a million-element queue is O(n) per log line.
Symptom: throughput visibly degrades with n. Fix: identify the O(n) operation and replace it (sets for membership, file handles kept open for append, sampled logs).
Reading cost off your code
You won't run a profiler for every change. With practice you'll read complexity off snippets at a glance. Three habits:
Habit 1: name the structure, name the operation, name the cost
seen = set() # set: O(1) add/lookup
if url in seen: ... # set lookup → O(1)
urls = [] # list: O(1) append, O(n) lookup
if url in urls: ... # list lookup → O(n)
Every container has a complexity profile. Once you know the seven from CS2, this becomes automatic.
Habit 2: count the nested loops
for product in products: # n iterations
for other in products: # n iterations, total = n²
if similar(product, other):
...
Two nested loops over the same collection is O(n²). For 10k products that's 100M comparisons, fine. For 1M products, 1 trillion, dead.
The fix is usually to swap the inner loop for a dict lookup:
by_key = {p["key"]: p for p in products} # O(n) build
for product in products: # n iterations
other = by_key.get(product["match_key"]) # O(1) lookup, total = O(n)
Habit 3: spot the "n inside a loop" trap
for url in urls:
visited.append(url) # O(1), fine
if visited[0] == url: ... # O(1), fine
for url in urls:
visited.insert(0, url) # O(n) per insert → O(n²) total
The fix is deque (O(1) append-left) or simply appending and reading the list in reverse.
Space complexity, the other axis
Big-O isn't just time. Space complexity is how much memory the algorithm uses. Scrapers usually have generous time budgets and tight memory budgets, so this often matters more.
The killer pattern:
all_pages = []
for url in urls:
all_pages.append(fetch(url)) # store every page in memory
For 1M pages at 100 KB each, that's 100 GB. Your laptop dies.
The fix is to stream:
for url in urls:
page = fetch(url)
extract_and_save(page) # write to disk/DB, drop the page
Same time complexity. Space drops from O(n) to O(1).
The Bloom filter (CS2) is a space-complexity trick: O(n) hash set vs O(1)-per-element bit array, at the cost of false positives. For 100M URLs, the difference is "20 GB" vs "120 MB".
A scraper-sized cheat sheet
| Pattern | Time | Space |
|---|---|---|
| Linear scan of n items | O(n) | O(1) |
| Build a dict from n items | O(n) | O(n) |
| Sort n items | O(n log n) | O(n) |
| BFS/DFS a graph with V nodes and E edges | O(V + E) | O(V) |
| Binary search a sorted array | O(log n) | O(1) |
| Dedup with set | O(n) total | O(n) |
| Dedup with Bloom filter | O(n) total | O(n bits), ~10 bytes per item |
| Naive nested-loop comparison | O(n²) | O(1) |
Memorise this and you'll predict bottlenecks before you write the profiler.
When O(n²) is fine
Not every O(n²) is a problem. If n is bounded and small (say, the products on a single page, n ≤ 50), then n² is 2,500 operations, instant. Don't optimise prematurely.
The trap is when n is the whole crawl, not a single page. Crawls grow; per-page logic doesn't. The rule:
- Per-page work: O(anything reasonable) is fine. The page is small.
- Per-crawl work: must be O(n) or O(n log n). O(n²) crashes at scale.
Computability and P vs NP, the interview trivia
For completeness, since CS courses spend weeks here:
Computability asks "what's solvable, in principle, by any algorithm at all?" Some problems (the halting problem, Rice's theorem) are undecidable: no algorithm can solve them, ever. None of this affects scraping. You won't be asked to halt-check an arbitrary program. Move on.
P vs NP asks "is every problem we can quickly verify, also one we can quickly solve?" Open question for 50 years. NP-hard problems (travelling salesman, SAT) take exponential time on every known algorithm. None of these are in your daily scraping work. The one place they show up: optimisation problems hiding in scheduling. "Schedule N scrapes across M proxies to minimise total time" is a flavor of bin-packing, which is NP-hard. In practice you use a greedy heuristic and accept it's not optimal.
In an interview, "what's P vs NP" is a culture-fit question, not a technical one. Be able to say "P is what we can solve quickly, NP is what we can verify quickly, whether they're equal is open" and you've passed. Don't invest more than that.
Where to practice
- Write a one-page Python file with three implementations of "dedup 1M URLs": one with a list, one with a set, one with a Bloom filter (use
pybloom-live). Time each. The list version should take minutes; the set version, milliseconds; the Bloom filter, similar to the set but with a fraction of the memory. - For one of your existing scrapers, pick the hottest loop and write down the complexity of every line. Is anything O(n) inside an O(n) loop? Fix it.
- LeetCode "Top Interview 150" filter, the easy and medium problems. Each one is a complexity puzzle in disguise. 50 problems gets you fluent.
Next: CS5 wraps the sub-path with the actual study plan: which DSA course, how many LeetCode problems, what to skip.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.