Data Structures for Scrapers
The seven data structures you'll actually reach for in scraping: arrays, hash sets/maps, queues, stacks, heaps, trees, graphs. Each one framed around a real scraping problem.
What you’ll learn
- Pick the right structure for dedup, the URL frontier, the crawl graph, and priority scheduling.
- Read O(1) vs O(n) operation costs off each structure's API at a glance.
- Recognise the trade-off between hash sets and Bloom filters at dedup scale.
- Sketch out a small crawler's internal data layout from memory.
Seven structures. Each one solves a concrete scraping problem. We'll walk them in order from "you use this every day" to "you use this once in a while but it changes the design when you do."
1. Arrays / lists, the default container
Pythonists call them list, JS calls them Array. In memory: contiguous slots, indexed from 0.
| Operation | Cost |
|---|---|
Index into position i |
O(1) |
| Append to end | O(1) amortised |
| Insert/delete at front or middle | O(n) |
Check membership (x in lst) |
O(n) |
| Length | O(1) |
Use lists for ordered collections you'll iterate over. Don't use them for membership checks. The single most common scraper bug is if url in visited_list over a list that grew to 100k+ entries.
scraped_titles = [] # fine, you iterate it once at the end
visited_urls = set() # not a list — see below
2. Hash sets and hash maps, the workhorses
A set (or dict) is a hash table. It trades memory for O(1) average lookup.
| Operation | list |
set/dict |
|---|---|---|
| Add | O(1) | O(1) |
| Membership / lookup | O(n) | O(1) average |
| Delete | O(n) | O(1) average |
Where they show up in a scraper:
visited = set() # URLs we've already fetched
url_to_html = {} # URL → cached response body
brand_count = {} # tally of how many products per brand
seen_skus = set() # dedup key for products
Rule of thumb: if you ever ask "have I seen this before?", you want a set or dict, never a list.
Hashable keys
Hash structures need their keys to be hashable. In Python that means immutable: strings, ints, tuples of those, frozensets. Lists and dicts are not hashable. So if you dedup products on (brand, sku), use a tuple, not a list:
seen.add((product["brand"], product["sku"])) # works
seen.add([product["brand"], product["sku"]]) # TypeError
When a hash set is too big: Bloom filters
A set of 100M URL hashes in Python eats ~20 GB. A Bloom filter that handles the same 100M with a 1% false-positive rate costs ~120 MB. The trade-off: Bloom filters say "definitely not seen" or "maybe seen", they never say "definitely seen". For dedup that's usually fine: a false positive means you skip a URL you haven't actually visited, you lose one page, not the crawl.
Production sub-path covers Bloom filters in detail. For now: know they exist and that they're the right answer when a set blows up your memory.
3. Queues, the URL frontier
A queue is FIFO: first in, first out. You enqueue at the back, dequeue from the front.
enqueue ──► [a][b][c][d] ──► dequeue
A crawler's "URL frontier" (the list of URLs waiting to be fetched) is almost always a queue. New URLs found on a page get enqueued at the back; the next URL to fetch is dequeued from the front. This naturally gives you breadth-first crawling, see CS3.
Python options:
from collections import deque
frontier = deque()
frontier.append(url) # enqueue, O(1)
next_url = frontier.popleft() # dequeue, O(1)
Don't use list.pop(0), that's O(n) because every other element has to shift down.
For multi-threaded scrapers, use queue.Queue (thread-safe) or asyncio.Queue (async-safe). Same idea, different concurrency guarantees.
4. Stacks, depth-first crawling
A stack is LIFO: last in, first out.
push ──► [d][c][b][a] ──► pop (from top)
In Python, a list with .append() and .pop() is already a stack:
stack = []
stack.append(url) # push, O(1)
next_url = stack.pop() # pop from end, O(1)
When you'd reach for a stack in scraping:
- Depth-first crawl of a site (you want to go deep into one category before backing out).
- Iterative DOM walking (replace a recursive walk with an explicit stack to avoid recursion depth limits).
- Backtracking through navigation when you hit a dead end.
90% of scrapers use a queue (breadth-first). When you specifically want depth-first behaviour, swap the queue for a stack and nothing else changes. That's the elegance of these structures.
5. Priority queues / heaps, scheduling
A heap is a tree-shaped structure where the smallest (or largest) element is always at the root. Pop the root in O(log n) and the next-smallest bubbles up.
Why a scraper cares: not all URLs are equal. You might want to fetch:
- The shallowest URLs first (priority = depth).
- The most-linked-to URLs first (priority = inbound count).
- The least-recently-crawled URLs first (priority = age).
A priority queue lets you encode any of these. Python:
import heapq
frontier = []
heapq.heappush(frontier, (priority, url))
priority, url = heapq.heappop(frontier)
Scrapy uses a priority-based frontier under the hood. If you ever need scheduling beyond pure FIFO/LIFO, you want a heap.
| Operation | Cost |
|---|---|
| Push | O(log n) |
| Pop min/max | O(log n) |
| Peek min/max | O(1) |
6. Trees, parsed HTML
You already use trees every time you parse HTML. The DOM is a tree. BeautifulSoup and lxml give you a node-and-children structure you walk with .find(), .find_all(), .children, .parent.
The two tree concepts you actually need:
- Traversal: depth-first (recursive descent) vs breadth-first (level by level). CS3 covers this.
- Trees are graphs without cycles. Anything you can do on a tree, you can also do on a graph (the crawl graph, next).
You will not implement a binary search tree from scratch in scraping work. You will walk DOM trees constantly. Knowing the vocabulary (root, child, descendant, leaf) makes selector and XPath docs read smoother.
7. Graphs, the crawl structure
A website is a directed graph: pages are nodes, links are edges. A crawler is a graph traversal algorithm.
You almost never need to "implement" a graph in scraping, the structure is implicit in the pages and links you discover. But three concepts matter:
- Adjacency: "what does this node point to?" For a scraper, that's "what links does this page contain?"
- Cycles: sites link back to themselves, A → B → A is a cycle. Without a
visitedset, you crawl forever. - Connected components: a site might have sub-graphs you can't reach from the homepage (orphaned pages). Sitemap.xml is how you find them.
Two ways to store a graph if you ever do need to materialise one (for analytics, link-graph dumps):
# Adjacency list, the usual choice
graph = {
"/": ["/products", "/about"],
"/products": ["/products/1", "/products/2"],
"/products/1": ["/"], # cycle!
}
# Adjacency matrix, dense graphs only, rare in scraping
Adjacency lists are O(V + E) memory, matrices are O(V²). For sparse web graphs (most sites), use a list.
A small crawler's data layout
Putting it together, here's the data layout of a minimal breadth-first crawler:
from collections import deque
frontier = deque(["https://example.com/"]) # queue
visited = set() # hash set, O(1) dedup
results = [] # list, just for output
while frontier:
url = frontier.popleft()
if url in visited:
continue
visited.add(url)
html = fetch(url)
results.append(extract(html))
for link in links_in(html):
if link not in visited:
frontier.append(link)
Four data structures (deque, set, list, and the implicit graph). Every production crawler is a more elaborate version of this. Once you see the shape, Scrapy's internals are no longer mysterious.
Where to practice
- Read the Python docs on
collections.dequeandheapq. Both are stdlib, both are fast, both are what production scrapers actually use. - Take the crawler above and convert it from BFS to DFS (swap
dequeforlist,popleftforpop). The behaviour change is dramatic; the code change is one line. That's the point of these structures. - Open Scrapy's
scrapy/core/scheduler.pyand look at the priority queue. You should recognise the pieces.
Next: CS3 covers how a crawler actually walks these structures, BFS, DFS, and the searches you'll reach for inside scraped data.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.