CSS Selectors, Complete Reference
Every CSS selector you need to know, organised by what you'll actually use them for in scrapers.
What you’ll learn
- Write selectors that target elements by tag, class, id, attribute, and combinations of all four.
- Use combinators (descendant, child, sibling) correctly.
- Use position pseudo-classes (`:nth-child`, `:first-of-type`, `:last-child`) to pick specific items in a list.
- Write resilient selectors that survive minor markup changes.
CSS selectors are the universal language of "find me this element on the page." Every scraping library supports them (BeautifulSoup's `select()`, lxml's `cssselect`, Symfony DomCrawler's `filter()`, Playwright's locators). Learn them once, use them everywhere.
The five basics
| Selector | Matches | Example |
|---|---|---|
| `tag` | All elements of that tag | `a` (every link) |
| `.class` | Elements with that class | `.product` |
| `#id` | The element with that id | `#main-banner` |
| `[attr]` | Elements that have the attribute | `[data-id]` |
| `[attr="value"]` | Elements with the attribute set to that value | `[data-id="42"]` |
Combine them with no spacing between the parts:

```css
a.btn.primary        /* an <a> that has both .btn and .primary */
input[type="email"]  /* an <input> with type=email */
div#root.dark-theme  /* the <div> that has id="root" AND class .dark-theme */
```
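A quick check of the five basics with BeautifulSoup; the markup here is invented for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div id="root" class="dark-theme">
  <a class="btn primary" href="/buy">Buy</a>
  <a class="btn" href="/info">Info</a>
  <input type="email" name="user_email">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("a")))                    # every link -> 2
print(len(soup.select("a.btn.primary")))        # both classes -> 1
print(len(soup.select('input[type="email"]')))  # attribute match -> 1
print(len(soup.select("div#root.dark-theme")))  # id AND class -> 1
```

Each combined selector narrows the match: `a.btn.primary` requires both classes, so the plain `.btn` link is excluded.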
The combinators
| Combinator | Symbol | Meaning |
|---|---|---|
| Descendant | (space) | Any descendant, at any depth |
| Child | `>` | Direct child only |
| Adjacent sibling | `+` | The immediately-following sibling |
| General sibling | `~` | Any following sibling |
Examples on a typical product page:
```css
article.product .price    /* .price anywhere under article.product */
article.product > .price  /* .price ONLY if it's a direct child */
h2 + p                    /* the <p> immediately after an <h2> */
h2 ~ p                    /* any <p> after an <h2> at the same level */
```
The descendant combinator (space) is the most common; child (>) is what you reach for when nested duplicates trip you up.
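Here is a minimal sketch of all four combinators against invented markup, again with BeautifulSoup:

```python
from bs4 import BeautifulSoup

html = """
<article class="product">
  <div class="body"><p class="price">9.99</p></div>
  <p class="price">19.99</p>
</article>
<h2>Title</h2>
<p>first after h2</p>
<p>second after h2</p>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("article.product .price")))    # any depth -> 2
print(len(soup.select("article.product > .price")))  # direct child only -> 1
print(len(soup.select("h2 + p")))                    # immediately following -> 1
print(len(soup.select("h2 ~ p")))                    # any following sibling -> 2
```

The nested `.price` inside `div.body` is exactly the kind of duplicate that the child combinator lets you skip.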
Attribute selectors with operators
Beyond plain [attr="value"], the operators give you partial matching:
| Selector | Matches |
|---|---|
| `[attr^="x"]` | Attribute starts with "x" |
| `[attr$="x"]` | Attribute ends with "x" |
| `[attr*="x"]` | Attribute contains "x" |
| `[attr~="x"]` | Attribute is a space-separated list containing "x" (mostly for class) |
| `[attr\|="x"]` | Attribute equals "x" or starts with "x-" (mostly for lang) |
Genuinely useful:
```css
a[href^="https://"]          /* external links */
a[href$=".pdf"]              /* PDF downloads */
img[src*="cdn.example.com"]  /* images on a specific CDN */
input[name="csrf_token"]     /* the CSRF input, exact match */
```
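The same four operators, exercised in BeautifulSoup against markup invented for the example:

```python
from bs4 import BeautifulSoup

html = """
<a href="https://other.example/page">external</a>
<a href="/local/report.pdf">report</a>
<img src="https://cdn.example.com/img/1.jpg">
<input name="csrf_token" value="abc">
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select('a[href^="https://"]')[0].text)           # starts-with -> external
print(soup.select('a[href$=".pdf"]')[0].text)               # ends-with -> report
print(len(soup.select('img[src*="cdn.example.com"]')))      # contains -> 1
print(soup.select('input[name="csrf_token"]')[0]["value"])  # exact match -> abc
```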
Pseudo-classes for position
The position pseudo-classes are what let you say "the third product" or "every other row":
| Pseudo-class | Matches |
|---|---|
| `:first-child` | First child of its parent |
| `:last-child` | Last child of its parent |
| `:nth-child(n)` | The nth child (1-indexed) |
| `:nth-child(2n)` | Every even-positioned child |
| `:nth-child(2n+1)` | Every odd-positioned child |
| `:nth-last-child(n)` | The nth child counting from the end |
| `:first-of-type` | First of that tag among siblings |
| `:last-of-type` | Last of that tag among siblings |
| `:nth-of-type(n)` | The nth of that tag (usually what you want) |
The `:nth-child` vs `:nth-of-type` distinction trips everyone up. `:nth-child(1)` means "the first child of the parent, if it happens to be this tag." `:nth-of-type(1)` means "the first sibling that IS this tag." When the parent has mixed children, you almost always want `:nth-of-type`.
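The distinction in two lines, using BeautifulSoup and an invented mixed-children parent:

```python
from bs4 import BeautifulSoup

html = """
<div>
  <h2>Heading</h2>
  <p>first paragraph</p>
  <p>second paragraph</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# :nth-child(1) = "first child of the parent, if it's a <p>" -- but child 1 is the <h2>
print(soup.select("div p:nth-child(1)"))            # -> []
# :nth-of-type(1) = "first <p> among the siblings", regardless of what precedes it
print(soup.select("div p:nth-of-type(1)")[0].text)  # -> first paragraph
```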
Pseudo-classes that aren't position
| Pseudo-class | Matches | Scraping use |
|---|---|---|
| `:not(selector)` | Elements NOT matching the inner selector | Exclude promoted items: `.product:not(.sponsored)` |
| `:has(selector)` | Elements that contain a matching descendant | `tr:has(td.in-stock)` (newer; supported in Playwright, lxml 5+) |
| `:contains("text")` | Elements containing text (jQuery extension; cf. BeautifulSoup `string=`, Playwright `text=`) | Brittle, use sparingly |
| `:empty` | Elements with no children | Identify placeholder rows |
`:not()` is the workhorse. `:contains()` is not standard CSS, and different libraries spell it differently:

```python
# BeautifulSoup (soupsieve's non-standard pseudo-class)
soup.select('h2:-soup-contains("Free shipping")')

# Playwright
page.locator("h2", has_text="Free shipping")

# lxml.cssselect doesn't support :contains at all; use XPath instead
```
Treat text-based matching as a last resort; prefer class, id, or data-*.
Putting it together: a real selector
Suppose you want to extract the price of every non-promoted product on a listing, but only those in stock:
```css
article.product:not(.sponsored):has(.in-stock-badge) .price
```
Read it left to right:
- `article.product`, every product card
- `:not(.sponsored)`, except the promoted ones
- `:has(.in-stock-badge)`, that contain an in-stock badge
- `.price`, a descendant `.price`
One line, ~40 characters, replaces 15 lines of imperative DOM-walking.
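A sketch of that selector running end to end in BeautifulSoup; the product cards are invented, and `:has()` assumes a reasonably recent soupsieve:

```python
from bs4 import BeautifulSoup

html = """
<article class="product"><span class="in-stock-badge"></span><p class="price">10</p></article>
<article class="product sponsored"><span class="in-stock-badge"></span><p class="price">20</p></article>
<article class="product"><p class="price">30</p></article>
"""
soup = BeautifulSoup(html, "html.parser")

sel = "article.product:not(.sponsored):has(.in-stock-badge) .price"
prices = [p.text for p in soup.select(sel)]
print(prices)  # -> ['10']
```

The sponsored card (20) and the out-of-stock card (30) are both filtered in a single pass; no loop, no conditionals.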
Writing resilient selectors
A selector that breaks the first time the site redesigns is technical debt. Two rules:
- Prefer semantic anchors. `article.product`, `[data-product-id]`, and `h1` are stable. `div.css-1xyz9` (auto-generated by a CSS-in-JS framework) is not; that class name changes on every deploy.
- Anchor short, not long. `.product .price` survives more changes than `body > div.layout > main > section.products > div.row > article.product > div.body > p.price`. Each extra layer is a brittleness point.
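A small demonstration of the second rule, with before/after markup invented to simulate a redesign that adds one wrapper `div`:

```python
from bs4 import BeautifulSoup

before = ('<main><section class="products">'
          '<article class="product"><p class="price">9</p></article>'
          '</section></main>')
# After a redesign, an extra wrapper appears around the cards:
after = ('<main><section class="products"><div class="grid">'
         '<article class="product"><p class="price">9</p></article>'
         '</div></section></main>')

short_sel = ".product .price"
long_sel = "main > section.products > article.product > p.price"

for html in (before, after):
    soup = BeautifulSoup(html, "html.parser")
    print(len(soup.select(short_sel)), len(soup.select(long_sel)))
# before: 1 1 -- both selectors work
# after:  1 0 -- the long chain breaks at the new wrapper; the short anchor survives
```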
The Catalog108 challenge pages deliberately use a mix of semantic and non-semantic markup so you can practise picking durable anchors.
Hands-on lab
Open practice.scrapingcentral.com/challenges/static/lists/cards and write a single CSS selector that grabs every card title. Then write a selector that picks only the cards marked "featured", and another that excludes them. Run all three with BeautifulSoup's `select()` and verify the counts. Then come back when you've read the XPath lesson and try the same in XPath; comparing the two against the same markup is the fastest way to internalize both.