BeautifulSoup: find, find_all, select
The three workhorse selection methods of BeautifulSoup, when to use each, and the small idioms that separate beginner from comfortable.
What you’ll learn
- Use `find` and `find_all` with tag, attribute, and class filters.
- Use `select` and `select_one` with CSS selectors.
- Mix methods cleanly when one is more concise than the other.
- Avoid the common `NoneType has no attribute` trap.
You've used these methods in passing. This lesson is the rigorous tour: every option, every gotcha, every idiom. After this, parsing decisions become muscle memory.
The three methods
| Method | Returns | Selector style |
|---|---|---|
| `find` | First match (or `None`) | Tag name + attribute kwargs |
| `find_all` | List of all matches (possibly empty) | Tag name + attribute kwargs |
| `select` | List of all matches (possibly empty) | CSS selector string |
| `select_one` | First match (or `None`) | CSS selector string |
find and find_all are BeautifulSoup's native API. select and select_one were added later, once CSS selector support became standard. Most working code uses both: find for simple cases, select when CSS is cleaner.
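Side by side, here is a minimal runnable sketch of all four calls against an invented two-product snippet (the markup and class names are illustrative, not the lab site's):

```python
from bs4 import BeautifulSoup

html = """
<div class="product"><h2>Kettle</h2><span class="price">$29</span></div>
<div class="product"><h2>Toaster</h2><span class="price">$45</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

soup.find("h2")                    # first <h2> Tag, or None if there were none
soup.find_all("h2")                # list of every <h2> Tag (possibly empty)
soup.select_one("div.product h2")  # first CSS-selector match, or None
soup.select("div.product h2")      # list of every CSS-selector match
```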
find and find_all, tag name + filters
soup.find("h1") # first <h1>
soup.find("div", class_="product-card") # first div with class product-card
soup.find("a", id="nav-home") # first <a> with id="nav-home"
soup.find("input", attrs={"name": "csrf_token"}) # arbitrary attrs
soup.find_all("p") # every <p>
soup.find_all("p", limit=3) # first 3 <p>s
soup.find_all(["h1", "h2", "h3"]) # multiple tags
soup.find_all("div", class_="card", string="In stock") # text content filter
Note class_ (with a trailing underscore): class is a Python keyword, so BeautifulSoup accepts class_ as a special case. Only class_ gets that treatment. When the attribute name has a hyphen (data-id, aria-label) or is itself a reserved word (a label's for), it can't be a Python keyword argument, so use the attrs={} form:
soup.find("div", attrs={"data-id": "42"})
soup.find("button", attrs={"aria-label": "Close"})
Class matching has a quirk
HTML classes are space-separated; a single-word class_ value matches if it equals ANY one of them:
# HTML: <div class="card featured large">
soup.find("div", class_="card") # matches
soup.find("div", class_="featured") # matches
soup.find("div", class_="card featured") # matches if both present (any order)
For exact-string class match, use a function or pass class_ with a regex/list:
import re
soup.find_all("div", class_=re.compile(r"^card$"))
select and select_one, CSS selectors
If you know CSS, you already know this:
soup.select("article.product-card h2") # descendant
soup.select("article.product-card > h2") # direct child only
soup.select("a[href^='/products/']") # attribute starts-with
soup.select("a[href$='.pdf']") # attribute ends-with
soup.select("a[href*='kitchen']") # attribute contains
soup.select("li:nth-of-type(2)") # nth match
soup.select("div.card:not(.disabled)") # exclusion
soup.select("p ~ a") # general sibling
soup.select("h2 + p") # adjacent sibling
For anything non-trivial, select is shorter than the equivalent find_all chain.
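A few of these in action against an invented product card (markup and paths are illustrative only):

```python
from bs4 import BeautifulSoup

html = """
<article class="product-card">
  <h2>Stand Mixer</h2>
  <a href="/products/mixers/stand">Details</a>
  <a href="/files/manual.pdf">Manual (PDF)</a>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

[h2.get_text() for h2 in soup.select("article.product-card > h2")]  # ['Stand Mixer']
[a["href"] for a in soup.select("a[href^='/products/']")]           # ['/products/mixers/stand']
[a["href"] for a in soup.select("a[href$='.pdf']")]                 # ['/files/manual.pdf']
```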
Element methods after selection
el.name # tag name as string ("div")
el.get_text(strip=True) # all descendant text, with whitespace cleaned
el.get_text(" ", strip=True) # join descendant text with single space
el.string # only if there's exactly ONE text child; else None
el["class"] # list of classes (because HTML class is multi-valued)
el["href"] # string for single-valued attrs
el.get("href") # safe version, returns None if missing
el.attrs # full dict of attributes
el["href"] raises KeyError if missing. el.get("href") returns None. Always prefer .get() unless you know the attribute exists.
.string vs .get_text(), pick the right one
from bs4 import BeautifulSoup
p = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser").p

p.string        # None, p has multiple children (text + <b>)
p.get_text()    # "Hello world"
p.b.string      # "world", b has exactly one text child
p.b.get_text()  # "world"
.string is finicky. For 95% of scraping, just use .get_text(strip=True).
Combining select and find cleanly
A common pattern: select cards, then find inside each card:
for card in soup.select("article.product-card"):
    name = card.find("h2").get_text(strip=True)
    price = card.find(class_="price").get_text(strip=True)
    link = card.find("a")["href"]
card.find(...) is scoped to that card's subtree, exactly as you'd want.
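The same pattern as a self-contained sketch that collects the results into a list (the product markup is invented):

```python
from bs4 import BeautifulSoup

html = """
<article class="product-card">
  <h2>Chef's Knife</h2><span class="price">$89</span><a href="/products/knife">View</a>
</article>
<article class="product-card">
  <h2>Cutting Board</h2><span class="price">$25</span><a href="/products/board">View</a>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

products = []
for card in soup.select("article.product-card"):
    products.append({
        "name": card.find("h2").get_text(strip=True),
        "price": card.find(class_="price").get_text(strip=True),
        "link": card.find("a")["href"],
    })
# [{'name': "Chef's Knife", 'price': '$89', 'link': '/products/knife'}, ...]
```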
The NoneType trap
The most common BeautifulSoup error:
title = soup.find("h1").get_text()
# AttributeError: 'NoneType' object has no attribute 'get_text'
find returned None because no <h1> exists, and then you called a method on it. Three defensive patterns:
# 1. Check first
h1 = soup.find("h1")
title = h1.get_text(strip=True) if h1 else None
# 2. Walrus operator (Python 3.8+)
title = h1.get_text(strip=True) if (h1 := soup.find("h1")) else None
# 3. Helper
def safe_text(el):
    return el.get_text(strip=True) if el else None
title = safe_text(soup.find("h1"))
For production scrapers, the helper approach scales best.
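If you go the helper route, here is one way it might grow; safe_attr, the default parameter, and the sample markup are illustrative extensions, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>No heading or nav link here</p>", "html.parser")

def safe_text(el, default=None):
    # Text of a found tag, or a default when find()/select_one() returned None.
    return el.get_text(strip=True) if el else default

def safe_attr(el, attr, default=None):
    # Attribute of a found tag, or a default when the tag or attribute is missing.
    return el.get(attr, default) if el else default

safe_text(soup.find("h1"), default="(no title)")  # '(no title)'
safe_attr(soup.find("a"), "href")                 # None
```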
Searching by text content
soup.find("a", string="Next page") # exact string match
soup.find_all("a", string=re.compile(r"page")) # regex
soup.find(string=re.compile(r"\$\d+")) # find any text node matching
Useful when the structure is messy but you know the visible label.
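A runnable version of the idea, pulling a value back out via its visible label (the snippet and amounts are invented):

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="/p/2">Next page</a> <span>Total: $1,234</span>',
    "html.parser",
)

soup.find("a", string="Next page")["href"]  # '/p/2'
soup.find(string=re.compile(r"\$[\d,]+"))   # 'Total: $1,234' (a NavigableString)
```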
Filtering by callable
The most powerful form, pass any function returning bool:
def is_external_link(tag):
    return (
        tag.name == "a"
        and tag.get("href", "").startswith("http")
        and "scrapingcentral.com" not in tag.get("href", "")
    )
external = soup.find_all(is_external_link)
Almost any custom matching logic fits in a callable. Use it when CSS gets convoluted.
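Here is that predicate in action against a tiny invented page (the scrapingcentral.com domain comes from the example above):

```python
from bs4 import BeautifulSoup

def is_external_link(tag):
    # Same predicate as above: an <a> whose href points off-site.
    return (
        tag.name == "a"
        and tag.get("href", "").startswith("http")
        and "scrapingcentral.com" not in tag.get("href", "")
    )

soup = BeautifulSoup(
    '<a href="/about">About</a>'
    '<a href="https://scrapingcentral.com/docs">Docs</a>'
    '<a href="https://example.org">Partner</a>',
    "html.parser",
)

[a["href"] for a in soup.find_all(is_external_link)]  # ['https://example.org']
```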
find_parent, find_next_sibling, find_previous
Navigate the tree from a known anchor:
price_label = soup.find(string="Price")
# Find the value next to it
price_value = price_label.find_next("span") # next <span> in document order
price_row = price_label.find_parent("tr") # enclosing table row
This pattern, "find the label, then walk to the value", is endlessly useful on layouts where the data has no class hook.
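A self-contained sketch of that label-to-value walk, using an invented spec table:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>SKU</th><td><span>KN-100</span></td></tr>
  <tr><th>Price</th><td><span>$89.00</span></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

price_label = soup.find(string="Price")                        # the text node inside the <th>
price_label.find_next("span").get_text()                       # '$89.00'
price_label.find_parent("tr").find("td").get_text(strip=True)  # '$89.00'
```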
Performance tip
For a soup with thousands of nodes, find_all with strict filters is faster than broad select followed by Python-level filtering. If perf matters, prefer the most specific selector you can write.
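As a small illustration, both of these return the same links, but the second pushes the filtering into the matcher (the links are invented; the payoff only shows on large documents):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="/a.pdf">A</a><a href="/b.html">B</a><a href="/c.pdf">C</a>',
    "html.parser",
)

# Broad query, then filtering in Python
broad = [a for a in soup.find_all("a") if a.get("href", "").endswith(".pdf")]

# The same filter pushed into the selector, so less work happens in Python
narrow = soup.select("a[href$='.pdf']")

assert [a["href"] for a in broad] == [a["href"] for a in narrow] == ["/a.pdf", "/c.pdf"]
```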
Hands-on lab
Visit /challenges/static/lists/cards. Use select to find every card, then for each card use find to extract the title, subtitle, and any visible badge. Try doing the same task with find_all instead of select to feel the difference. Confirm both approaches yield the same data.
Practice this lesson on Catalog108, our first-party scraping sandbox; the lab target is /challenges/static/lists/cards.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.