Locator Strategies: CSS, XPath, Role, Text, Test-ID
Choosing the right selector type is the single biggest factor in scraper stability. A clear hierarchy of which to prefer, when, and why.
What you’ll learn
- Rank selector strategies from most to least stable for production scrapers.
- Write equivalent locators in CSS, XPath, role-based, and text-based forms.
- Use chained scoping (`locator > locator`) instead of fragile compound selectors.
- Spot the three brittle patterns that break on every site redesign.
Most scraper bugs eventually trace back to a selector that broke. This lesson lays out the hierarchy you should internalise so that fewer of yours do.
The stability hierarchy
Ordered from most stable to least:
- `data-testid` attributes: engineers add these explicitly so tests can target them. They rarely change.
- ARIA roles + accessible names: anchored in semantic HTML, they survive cosmetic redesigns.
- Visible text content: humans see and react to text, so it changes less aggressively than CSS classes.
- Stable structural CSS: `<main>`, `<article>`, semantic landmarks.
- Class-name CSS: fine when classes are meaningful (`.product-card`), bad when they're generated (`.css-3a7e1b`).
- XPath with positional logic: `div[2]/div[1]` style, breaks on any DOM reshuffle.
- Coordinates / pixel positions: never use unless you have no choice.
Use the highest level that works. Drop a step only when you must.
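One way to apply the "highest level that works" rule in code is to list candidate locators from most to least stable and take the first one that resolves to exactly one element. The sketch below is a hypothetical helper (`pick_most_stable` is not part of Playwright); it assumes only that each candidate exposes a `count()` method, as Playwright's sync `Locator` does.

```python
from typing import Any, Optional, Sequence, Tuple


def pick_most_stable(
    candidates: Sequence[Tuple[str, Any]],
) -> Optional[Tuple[str, Any]]:
    """Return the first (label, locator) pair that matches exactly one element.

    `candidates` must already be ordered from most to least stable.
    Each locator only needs a `count()` method.
    """
    for label, locator in candidates:
        if locator.count() == 1:
            return label, locator
    return None  # nothing resolved uniquely; the caller decides how to fail


# Hypothetical usage with a Playwright sync `page` object:
# choice = pick_most_stable([
#     ("test-id", page.get_by_test_id("add-to-cart")),
#     ("role", page.get_by_role("button", name="Add to cart")),
#     ("css", page.locator("button.add-to-cart")),
# ])
```

Logging the returned label is a cheap way to notice when a scraper has silently fallen back to a less stable strategy.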
CSS selectors
CSS is Playwright's default selector engine, and most structural scraping selectors should be written in it:
page.locator(".product-card") # class
page.locator("#main-nav") # id
page.locator("a[href*='/products/']") # attribute contains
page.locator("article > h2") # direct child
page.locator("[data-status='in-stock']") # custom attribute
page.locator(".product-card:has(.sale-badge)") # :has() pseudo-class
CSS is fast and well-supported. The :has() pseudo-class is a recent addition that handles "parent matching child" cleanly, a job that used to require XPath.
XPath: when to use it
XPath is more powerful than CSS in two cases:
# 1. Match by visible text content
page.locator("xpath=//button[normalize-space()='Add to cart']")
# 2. Walk up the tree (`ancestor::`)
page.locator("xpath=//span[text()='Price']/ancestor::tr[1]")
Playwright treats a selector as XPath when it starts with `//` or `..`, or when it carries an explicit `xpath=` prefix. For most jobs that once required XPath, CSS or Playwright's selector extensions now have equivalents:
| Goal | XPath | CSS equivalent |
|---|---|---|
| Match text | `//button[text()='X']` | `text="X"` (Playwright extension) |
| Has descendant | `//div[.//span]` | `div:has(span)` |
| Direct child | `//div/h1` | `div > h1` |
Use XPath only when you need true tree-walking. For everything else, prefer CSS or text matchers.
Role-based locators
Role selectors target the semantic structure of the page, what the element means, not how it looks:
page.get_by_role("button", name="Add to cart")
page.get_by_role("link", name="See all products")
page.get_by_role("heading", name="Featured")
page.get_by_role("textbox", name="Search")
page.get_by_role("checkbox", name="Accept terms")
page.get_by_role("listitem")
The name argument matches the accessible name, usually the visible label, sometimes an aria-label. Roles are stable across redesigns because they're tied to the element's purpose, not its CSS.
Roles are also forgiving: a <button> and a <div role="button"> both match get_by_role("button").
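The reason both elements match is that a role can come from either an explicit `role` attribute or from the tag's implicit ARIA role. The sketch below illustrates that lookup for a handful of tags; `effective_role` and its mapping are illustrative assumptions covering a small subset of the real rules, not Playwright's implementation.

```python
from typing import Mapping, Optional

# Deliberately incomplete subset of HTML's implicit ARIA roles.
_IMPLICIT_ROLES = {
    "button": "button",
    "li": "listitem",
    "h1": "heading",
    "h2": "heading",
    "h3": "heading",
    "h4": "heading",
    "h5": "heading",
    "h6": "heading",
}


def effective_role(tag: str, attrs: Mapping[str, str]) -> Optional[str]:
    """Approximate the role a role-based locator would see for a simple element."""
    explicit = attrs.get("role")
    if explicit:
        return explicit  # an explicit role attribute overrides the tag
    tag = tag.lower()
    if tag == "a":
        # An anchor only counts as a link when it has an href.
        return "link" if "href" in attrs else None
    if tag == "input":
        input_type = attrs.get("type", "text")
        return {"checkbox": "checkbox", "text": "textbox"}.get(input_type)
    return _IMPLICIT_ROLES.get(tag)


# <button> and <div role="button"> resolve to the same role:
same = effective_role("button", {}) == effective_role("div", {"role": "button"})
```

A plain `<div>` with no `role` attribute yields `None` here, which is why unlabelled click targets built from bare `<div>` elements cannot be reached by role.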
Text-based locators
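import re
page.get_by_text("Add to cart") # default: case-insensitive substring match
page.get_by_text("Add to", exact=False) # same as the default, spelled out
page.get_by_text("Add to cart", exact=True) # whole string, case-sensitive
page.get_by_text(re.compile(r"^Add to ")) # regex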
get_by_text matches against visible (rendered) text. Useful for clicking elements whose only stable identifier is the words inside them.
Caveats:
- It matches the smallest element that contains the text. If your "Add to cart" string is in a `<span>` inside a `<button>`, you'll get the span rather than the button; a click still reaches the button, but attribute reads target the span.
- For non-button elements, prefer `get_by_role` if available: text appears in more places than you think.
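To make the matching modes concrete, here is a pure-Python approximation of how a text query is compared against an element's text: whitespace is normalised first, the default is a case-insensitive substring test, `exact=True` demands the whole string with matching case, and a compiled regex is searched as-is. `text_matches` is a hypothetical model written for illustration, not Playwright's code.

```python
import re
from typing import Pattern, Union


def _normalize(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.split())


def text_matches(
    element_text: str,
    query: Union[str, Pattern[str]],
    exact: bool = False,
) -> bool:
    """Approximate the text-matching rules described above."""
    haystack = _normalize(element_text)
    if isinstance(query, re.Pattern):
        return query.search(haystack) is not None
    needle = _normalize(query)
    if exact:
        return haystack == needle  # whole string, case-sensitive
    return needle.lower() in haystack.lower()  # substring, case-insensitive


# The same button text checked under each mode:
label = "  Add to   cart "
default_hit = text_matches(label, "add to")
exact_hit = text_matches(label, "Add to cart", exact=True)
exact_miss = text_matches(label, "add to cart", exact=True)
regex_hit = text_matches(label, re.compile(r"^Add to "))
```

The loose default explains why an unqualified text locator often matches more elements than expected; tightening with `exact=True` or an anchored regex narrows it.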
Test-ID locators
<button data-testid="submit-order">Order now</button>
page.get_by_test_id("submit-order")
When a site has data-testid attributes, use them. They're added precisely so testing tools (like yours) can target elements stably. The attribute survives most code changes because engineers know "tests reference this string."
Configure the attribute name if the site uses something else (data-test, data-cy):
playwright.selectors.set_test_id_attribute("data-test")
Real example: same element, five ways
The "Add to cart" button on /products/1-white-wooden-vase:
# 1. Test-ID (best if present)
page.get_by_test_id("add-to-cart")
# 2. Role + accessible name (recommended default)
page.get_by_role("button", name="Add to cart")
# 3. Text
page.get_by_text("Add to cart")
# 4. CSS by class
page.locator("button.add-to-cart")
# 5. XPath
page.locator("xpath=//button[normalize-space()='Add to cart']")
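If the site redesigns its CSS, #4 breaks. If the wording changes to "Add to bag", #3 and #5 break, and so does #2, because the accessible name follows the visible label. Only #1 survives both kinds of change; #2 is the next most durable because it ignores class names and the DOM structure around the button.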
Three brittle patterns to never write
- `nth-child` / `nth-of-type` based on position.
page.locator("ul > li:nth-child(3) > a") # breaks if a row is added
- Generated class names.
page.locator(".sc-bdVaJa.kxjJDU") # breaks every CSS rebuild
These are CSS-in-JS hashes. They change every deployment.
- Long XPath positional chains.
page.locator("xpath=/html/body/div/div[2]/div[1]/div/a") # breaks on any structural change
Auto-generated by "Copy XPath" in DevTools. Always rewrite before committing.
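These three patterns are regular enough to flag mechanically before a selector is committed. The following is a rough lint sketch: `brittle_reasons` and its regexes are assumptions chosen to match the examples above, so expect both false positives and misses (for instance, it only recognises hashed classes with a `css-` or `sc-` prefix).

```python
import re
from typing import List

# Heuristic patterns; tune them for the sites you actually scrape.
_POSITIONAL_CSS = re.compile(r":nth-(?:child|of-type)\(")
_GENERATED_CLASS = re.compile(r"\.(?:css|sc)-[A-Za-z0-9]{5,}")
_POSITIONAL_XPATH_STEP = re.compile(r"/\w+\[\d+\]")


def brittle_reasons(selector: str) -> List[str]:
    """Return the brittle patterns found in a selector (empty list if none)."""
    reasons = []
    if _POSITIONAL_CSS.search(selector):
        reasons.append("positional nth-child/nth-of-type")
    if _GENERATED_CLASS.search(selector):
        reasons.append("generated class name")
    # Two or more indexed steps such as /div[2]/div[1] suggest a copied XPath.
    if len(_POSITIONAL_XPATH_STEP.findall(selector)) >= 2:
        reasons.append("positional XPath chain")
    return reasons


# Check the three examples from this section plus one healthy selector.
for candidate in (
    "ul > li:nth-child(3) > a",
    ".sc-bdVaJa.kxjJDU",
    "xpath=/html/body/div/div[2]/div[1]/div/a",
    ".product-card",
):
    print(candidate, "->", brittle_reasons(candidate) or "ok")
```

A check like this fits naturally into a pre-commit hook or a unit test over the module that holds your selector strings.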
Scoping with chained locators
Instead of one long CSS selector, chain locators:
# Bad: one fragile compound selector
page.locator("table tbody tr:has-text('Yellow Mug') td:nth-child(3) button")
# Good: layered, each step has meaning
row = page.locator("tr").filter(has_text="Yellow Mug")
row.locator("button", has_text="Delete").click()
Chained locators document intent and isolate failure: if the second step breaks, you know exactly which scope didn't resolve.
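The failure-isolation benefit can be made explicit by checking each scope in order and reporting the first one that matches nothing. This is a minimal sketch: `first_unresolved` is a hypothetical helper, and it assumes only that each step exposes `count()`, as Playwright's sync `Locator` does.

```python
from typing import Any, Optional, Sequence, Tuple


def first_unresolved(steps: Sequence[Tuple[str, Any]]) -> Optional[str]:
    """Return the description of the first step that matches no element.

    `steps` runs from the outermost scope to the innermost.
    Returns None when every step resolves to at least one element.
    """
    for description, locator in steps:
        if locator.count() == 0:
            return description
    return None


# Hypothetical usage with a Playwright sync `page` object:
# row = page.locator("tr").filter(has_text="Yellow Mug")
# button = row.locator("button", has_text="Delete")
# broken = first_unresolved([
#     ("row containing 'Yellow Mug'", row),
#     ("Delete button inside that row", button),
# ])
# if broken:
#     raise RuntimeError(f"Selector step failed: {broken}")
```

With a single compound selector the only signal is "no match"; with named steps the error message points at the scope that needs fixing.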
Hands-on lab
Open /products/1-white-wooden-vase. Write five different locators for the "Add to cart" button, one each of: test-id, role, text, CSS, XPath. Call .count() on each locator to verify that it resolves to exactly one element. Note which selectors are most readable: a readable selector is a favour to your future self.
Practice this lesson on Catalog108, our first-party scraping sandbox.