CSS Selectors, Complete Reference
Every CSS selector you need to know, organised by what you'll actually use them for in scrapers.
What you’ll learn
- Write selectors that target elements by tag, class, id, attribute, and combinations of all four.
- Use combinators (descendant, child, sibling) correctly.
- Use position pseudo-classes (`:nth-child`, `:first-of-type`, `:last-child`) to pick specific items in a list.
- Write resilient selectors that survive minor markup changes.
CSS selectors are the universal language of "find me this element on the page." Every scraping library supports them (BeautifulSoup's `select()`, lxml's `cssselect`, Symfony DomCrawler's `filter()`, Playwright's locators). Learn them once, use them everywhere.
The five basics
| Selector | Matches | Example |
|---|---|---|
| `tag` | All elements of that tag | `a` (every link) |
| `.class` | Elements with that class | `.product` |
| `#id` | The element with that id | `#main-banner` |
| `[attr]` | Elements that have the attribute | `[data-id]` |
| `[attr="value"]` | Elements with the attribute set to that value | `[data-id="42"]` |
Combine them with no spacing between the parts:

```css
a.btn.primary        /* an <a> that has both .btn and .primary */
input[type="email"]  /* an <input> with type=email */
div#root.dark-theme  /* the <div> that has id="root" AND class .dark-theme */
```
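A quick check of the five basics with BeautifulSoup; the markup here is invented for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div id="root" class="dark-theme">
  <a class="btn primary" href="/buy">Buy</a>
  <a class="btn" href="/info">Info</a>
  <input type="email" name="user_email">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("a")))                    # every link -> 2
print(len(soup.select("a.btn.primary")))        # both classes -> 1
print(len(soup.select('input[type="email"]')))  # attribute match -> 1
print(len(soup.select("div#root.dark-theme")))  # id AND class -> 1
```

Each combined selector narrows the match: `a.btn.primary` requires both classes, so the plain `.btn` link is excluded.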
The combinators
| Combinator | Symbol | Meaning |
|---|---|---|
| Descendant | (space) | Any descendant, at any depth |
| Child | `>` | Direct child only |
| Adjacent sibling | `+` | The immediately-following sibling |
| General sibling | `~` | Any following sibling |
Examples on a typical product page:
```css
article.product .price    /* .price anywhere under article.product */
article.product > .price  /* .price ONLY if it's a direct child */
h2 + p                    /* the <p> immediately after an <h2> */
h2 ~ p                    /* any <p> after an <h2> at the same level */
```
The descendant combinator (space) is the most common; child (>) is what you reach for when nested duplicates trip you up.
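Here is a minimal sketch of all four combinators against invented markup, again with BeautifulSoup:

```python
from bs4 import BeautifulSoup

html = """
<article class="product">
  <div class="body"><p class="price">9.99</p></div>
  <p class="price">19.99</p>
</article>
<h2>Title</h2>
<p>first after h2</p>
<p>second after h2</p>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("article.product .price")))    # any depth -> 2
print(len(soup.select("article.product > .price")))  # direct child only -> 1
print(len(soup.select("h2 + p")))                    # immediately following -> 1
print(len(soup.select("h2 ~ p")))                    # any following sibling -> 2
```

The nested `.price` inside `div.body` is exactly the kind of duplicate that the child combinator lets you skip.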
Attribute selectors with operators
Beyond plain [attr="value"], the operators give you partial matching:
| Selector | Matches |
|---|---|
| `[attr^="x"]` | Attribute starts with "x" |
| `[attr$="x"]` | Attribute ends with "x" |
| `[attr*="x"]` | Attribute contains "x" |
| `[attr~="x"]` | Attribute is a space-separated list containing "x" (mostly for class) |
| `[attr\|="x"]` | Attribute equals "x" or starts with "x-" (mostly for lang) |
Genuinely useful:
```css
a[href^="https://"]          /* external links */
a[href$=".pdf"]              /* PDF downloads */
img[src*="cdn.example.com"]  /* images on a specific CDN */
input[name="csrf_token"]     /* the CSRF input, exact match */
```
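The same four operators, exercised in BeautifulSoup against markup invented for the example:

```python
from bs4 import BeautifulSoup

html = """
<a href="https://other.example/page">external</a>
<a href="/local/report.pdf">report</a>
<img src="https://cdn.example.com/img/1.jpg">
<input name="csrf_token" value="abc">
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select('a[href^="https://"]')[0].text)           # starts-with -> external
print(soup.select('a[href$=".pdf"]')[0].text)               # ends-with -> report
print(len(soup.select('img[src*="cdn.example.com"]')))      # contains -> 1
print(soup.select('input[name="csrf_token"]')[0]["value"])  # exact match -> abc
```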
Pseudo-classes for position
The position pseudo-classes are what let you say "the third product" or "every other row":
| Pseudo-class | Matches |
|---|---|
| `:first-child` | First child of its parent |
| `:last-child` | Last child of its parent |
| `:nth-child(n)` | The nth child (1-indexed) |
| `:nth-child(2n)` | Every even-positioned child |
| `:nth-child(2n+1)` | Every odd-positioned child |
| `:nth-last-child(n)` | The nth child counting from the end |
| `:first-of-type` | First of that tag among siblings |
| `:last-of-type` | Last of that tag among siblings |
| `:nth-of-type(n)` | The nth of that tag (usually what you want) |
The `:nth-child` vs `:nth-of-type` distinction trips everyone up. `:nth-child(1)` means "the first child of the parent, if it happens to be this tag." `:nth-of-type(1)` means "the first sibling that IS this tag." When the parent has mixed children, you almost always want `:nth-of-type`.
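The distinction in two lines, using BeautifulSoup and an invented mixed-children parent:

```python
from bs4 import BeautifulSoup

html = """
<div>
  <h2>Heading</h2>
  <p>first paragraph</p>
  <p>second paragraph</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# :nth-child(1) = "first child of the parent, if it's a <p>" -- but child 1 is the <h2>
print(soup.select("div p:nth-child(1)"))            # -> []
# :nth-of-type(1) = "first <p> among the siblings", regardless of what precedes it
print(soup.select("div p:nth-of-type(1)")[0].text)  # -> first paragraph
```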
Pseudo-classes that aren't position
| Pseudo-class | Matches | Scraping use |
|---|---|---|
| `:not(selector)` | Elements NOT matching the inner selector | Exclude promoted items: `.product:not(.sponsored)` |
| `:has(selector)` | Elements that contain a matching descendant | `tr:has(td.in-stock)` (newer; supported in Playwright, lxml 5+) |
| `:contains("text")` | Elements containing text (jQuery extension; cf. BeautifulSoup `string=`, Playwright `text=`) | Brittle, use sparingly |
| `:empty` | Elements with no children | Identify placeholder rows |
`:not()` is the workhorse. `:contains()` is not standard CSS, and different libraries spell it differently:

```python
# BeautifulSoup (soupsieve's non-standard pseudo-class)
soup.select('h2:-soup-contains("Free shipping")')

# Playwright
page.locator("h2", has_text="Free shipping")

# lxml.cssselect doesn't support :contains at all; use XPath instead
```
Treat text-based matching as a last resort; prefer class, id, or data-*.
Putting it together: a real selector
Suppose you want to extract the price of every non-promoted product on a listing, but only those in stock:
```css
article.product:not(.sponsored):has(.in-stock-badge) .price
```
Read it left to right:
- `article.product`, every product card
- `:not(.sponsored)`, except the promoted ones
- `:has(.in-stock-badge)`, that contain an in-stock badge
- `.price`, a descendant `.price`
One line, ~40 characters, replaces 15 lines of imperative DOM-walking.
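A sketch of that selector running end to end in BeautifulSoup; the product cards are invented, and `:has()` assumes a reasonably recent soupsieve:

```python
from bs4 import BeautifulSoup

html = """
<article class="product"><span class="in-stock-badge"></span><p class="price">10</p></article>
<article class="product sponsored"><span class="in-stock-badge"></span><p class="price">20</p></article>
<article class="product"><p class="price">30</p></article>
"""
soup = BeautifulSoup(html, "html.parser")

sel = "article.product:not(.sponsored):has(.in-stock-badge) .price"
prices = [p.text for p in soup.select(sel)]
print(prices)  # -> ['10']
```

The sponsored card (20) and the out-of-stock card (30) are both filtered in a single pass; no loop, no conditionals.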
Writing resilient selectors
A selector that breaks the first time the site redesigns is technical debt. Two rules:
- Prefer semantic anchors. `article.product`, `[data-product-id]`, and `h1` are stable. `div.css-1xyz9` (auto-generated by a CSS-in-JS framework) is not; that class name changes on every deploy.
- Anchor short, not long. `.product .price` survives more changes than `body > div.layout > main > section.products > div.row > article.product > div.body > p.price`. Each extra layer is a brittleness point.
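A small demonstration of the second rule, with before/after markup invented to simulate a redesign that adds one wrapper `div`:

```python
from bs4 import BeautifulSoup

before = ('<main><section class="products">'
          '<article class="product"><p class="price">9</p></article>'
          '</section></main>')
# After a redesign, an extra wrapper appears around the cards:
after = ('<main><section class="products"><div class="grid">'
         '<article class="product"><p class="price">9</p></article>'
         '</div></section></main>')

short_sel = ".product .price"
long_sel = "main > section.products > article.product > p.price"

for html in (before, after):
    soup = BeautifulSoup(html, "html.parser")
    print(len(soup.select(short_sel)), len(soup.select(long_sel)))
# before: 1 1 -- both selectors work
# after:  1 0 -- the long chain breaks at the new wrapper; the short anchor survives
```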
The Catalog108 challenge pages deliberately use a mix of semantic and non-semantic markup so you can practise picking durable anchors.
Hands-on lab
Open practice.scrapingcentral.com/challenges/static/lists/cards and write a single CSS selector that grabs every card title. Then write a selector that picks only the cards marked "featured", and another that excludes them. Run all three with BeautifulSoup's `select()` and verify the counts. Then come back when you've read the XPath lesson and try the same in XPath; comparing the two against the same markup is the fastest way to internalize both.