Algorithms & Traversal: BFS, DFS, and the Searches You'll Reach For
BFS and DFS are how a crawler walks a site. Plus binary search, sorting, recursion, and the one piece of DP you'll meet in scraping work.
What you’ll learn
- Write BFS and DFS over a website's link graph and explain when to use each.
- Pick binary search over linear search and know what 'sorted' means in scraping context.
- Recognise when recursion is easier than iteration, and when it's about to blow the stack.
- Spot the one place dynamic programming shows up in scraping (diffing two HTML versions).
CS2 gave you the containers. This lesson is how you walk them. The headline insight: a crawler is a graph traversal algorithm wearing scraper clothes. Once you see that, the rest follows.
BFS: breadth-first, the default crawl shape
Breadth-first search explores a graph one "level" at a time: visit all of node A's neighbours before going deeper into any of them.
A
/ | \
B C D ← BFS visits in this order:
/ \ | A, B, C, D, E, F, G, H
E F G
|
H
For a crawler, BFS means: from the homepage, fetch every page it links to before following any of those pages further. You discover the site's wide structure first, then drill down.
The data structure that makes BFS work is a queue (CS2):
from collections import deque
def crawl_bfs(seed_url):
frontier = deque([seed_url])
visited = set()
while frontier:
url = frontier.popleft() # ← queue: dequeue from front
if url in visited:
continue
visited.add(url)
for link in fetch_and_extract_links(url):
if link not in visited:
frontier.append(link) # ← enqueue at back
Why this is the right default:
- Even site coverage. You don't get lost in one deep section while ignoring others.
- Polite by accident. You hit each section's first level before recursing, which spreads requests across the site rather than hammering one branch.
- Predictable depth. If you stop the crawl early, you have everything up to depth N, not a random tendril.
DFS: depth-first, when you want to go deep first
Depth-first search goes as far as possible down one branch before backtracking.
A
/ | \
B C D ← DFS visits in this order:
/ \ | A, B, E, F, C, D, G, H
E F G
|
H
The data structure is a stack (or recursion's implicit stack):
def crawl_dfs(seed_url):
frontier = [seed_url]
visited = set()
while frontier:
url = frontier.pop() # ← stack: pop from end
if url in visited:
continue
visited.add(url)
for link in fetch_and_extract_links(url):
if link not in visited:
frontier.append(link)
The only difference from BFS is popleft vs pop. Same lines, different shape of crawl.
When to pick DFS over BFS:
- You want one full category before another. Scraping a category tree and you want all of "Electronics" done before "Books" starts.
- You're walking the DOM, not the link graph. DOM traversal is naturally recursive; DFS is the right fit.
- Memory is tight on a wide site. BFS's queue holds the entire frontier (which can grow huge on link-rich pages); DFS's stack only holds the current path.
90% of crawlers are BFS. Knowing DFS is the other tool in the toolkit makes the choice deliberate.
Recursion: when, and when it'll bite
Recursion is a function that calls itself with a smaller problem. The classic example is DOM walking:
def find_all_links(node):
links = []
if node.name == "a" and node.get("href"):
links.append(node["href"])
for child in node.children:
if hasattr(child, "children"):
links.extend(find_all_links(child)) # ← recursive call
return links
It's elegant. It's also dangerous: every recursive call adds a frame to Python's call stack, which defaults to 1000 frames. A deeply nested DOM, or a deep crawl from a recursive crawl(url) function, hits RecursionError and crashes.
Two fixes:
- Convert to iteration with an explicit stack (the DFS pattern above). Same algorithm, no recursion depth limit.
- Raise the limit:
sys.setrecursionlimit(10000). Quick fix but kicks the problem down the road and risks an actual stack overflow.
Rule of thumb: for crawls, always iterate with an explicit stack/queue. For DOM walks of well-formed HTML, recursion is fine.
Binary search: the moment "sorted" pays off
You scraped 1M products, dumped them into a CSV, and now you need to look up specific SKUs. Linear scan is O(n), 1M checks per lookup. Sort once, binary-search forever: O(log n), ~20 checks per lookup.
from bisect import bisect_left
products.sort(key=lambda p: p["sku"])
skus = [p["sku"] for p in products]
def find_by_sku(target):
i = bisect_left(skus, target)
if i < len(skus) and skus[i] == target:
return products[i]
return None
Two scraping situations where this matters:
- Lookups against a sorted snapshot. You diff yesterday's scrape vs today's: sort both by ID, scan in parallel, O(n). Without sort + binary, you're at O(n²) for the diff.
- Range queries. "All products priced between $50 and $100" over a sorted-by-price list is O(log n + k) where k is the matches. Useful when you're slicing a large scraped catalogue.
If you don't need range queries and lookups dominate, just use a dict (CS2): O(1) is better than O(log n). Binary search wins specifically when (a) you need ordering for other reasons, or (b) data lives in a sorted file you can't load all of into memory.
Sorting: O(n log n), don't roll your own
Every language ships with a good sort. Python's list.sort() is Timsort, O(n log n), stable. Don't write quicksort by hand for scraper work.
Things you should know about sort:
- Stable means equal elements keep their relative order. Useful when you sort by secondary key then primary, the secondary order survives.
- Custom keys with
key=lambda p: p["price"]. Cheaper thancmp=...because the key is computed once per element. - Reverse with
reverse=True. No need to sort then reverse.
# Sort products by category, then by price within category.
products.sort(key=lambda p: p["price"]) # secondary key first
products.sort(key=lambda p: p["category"]) # primary key second
# Stable sort preserves price order within each category.
Dynamic programming: the one place it shows up
DP is "solve a problem by combining solutions to overlapping sub-problems." 95% of scraping work never uses it.
The 5% where it shows up: diffing two versions of the same scraped page. "How has this product description changed since yesterday?" boils down to edit distance (Levenshtein) or longest-common-subsequence, both classic DP problems. Python's difflib already implements them, you'd call the library, not write the algorithm.
But if you ever interview for a data-engineering role, expect one DP question (usually edit distance, climbing stairs, or coin change). Solve a dozen on LeetCode and move on. DP is interview surface area, not daily scraping work.
Cycle detection: the one algorithm you must internalise
The single most common scraper bug after "naive list dedup" is the infinite crawl loop. Page A links to B, B links back to A, your crawler bounces forever.
The fix is one line: a visited: set (CS2) that you check before enqueueing.
if link not in visited:
frontier.append(link)
This is graph cycle detection in its simplest form. If you ever skip it, every crawl over a real site will infinite-loop. The set is mandatory, not optional.
For more complex cycle detection (e.g. in linked-list interview problems), the standard trick is Floyd's tortoise-and-hare: two pointers, one moves one step at a time, the other two; if they meet, there's a cycle. Won't come up in scraping work but does in interviews.
A worked example: crawl + extract + dedup
Putting BFS, hash sets, and basic algorithms together:
from collections import deque
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
def crawl(seed, max_pages=1000):
frontier = deque([seed])
visited = set()
seen_products = set()
products = []
while frontier and len(visited) < max_pages:
url = frontier.popleft()
if url in visited:
continue
visited.add(url)
soup = BeautifulSoup(requests.get(url).text, "lxml")
# Extract products, dedup by SKU.
for p in soup.select(".product"):
sku = p["data-sku"]
if sku not in seen_products:
seen_products.add(sku)
products.append({"sku": sku, "title": p.text})
# Enqueue same-host links.
for a in soup.find_all("a", href=True):
link = urljoin(url, a["href"])
if urlparse(link).netloc == urlparse(seed).netloc:
if link not in visited:
frontier.append(link)
return products
Three sets (visited, seen_products, implicit in the dedup pattern), one queue, one list. BFS, O(1) dedup. This is the canonical scraper skeleton.
Where to practice
- Take the BFS crawler above and convert it to DFS. Run both on a small site (
practice.scrapingcentral.com). Compare the order of URLs visited. The visit-order difference is the algorithm. - LeetCode: "Number of Islands" (DFS), "Binary Tree Level Order Traversal" (BFS), "Course Schedule" (cycle detection). These three problems cover 80% of what graph questions look like.
- Python's stdlib
bisectandheapqmodules each have ~20 lines of example code in the docs. Read both. You'll use them in real work, especiallyheapqonce your URL frontier needs priorities.
Next: CS4 is the Big-O lens. Same code, but now you can predict whether it'll finish in 10 seconds or 10 weeks.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.