Playwright Install + First Script (Python)
Install Playwright, drive a real browser, screenshot a page, and extract text: the minimum viable browser-automation pipeline.
What you’ll learn
- Install `playwright` and its bundled Chromium/Firefox/WebKit binaries.
- Write a script that launches a browser, opens a page, and extracts content.
- Distinguish sync vs async API in Python, and choose between them.
- Run the same script in headed mode for debugging and headless mode for production.
Playwright is the modern standard for browser automation: faster than Selenium, more reliable than Puppeteer's Python ports, and maintained by a team at Microsoft. This lesson gets you running.
Install
Playwright ships in two halves: the Python library and the browser binaries.
pip install playwright
playwright install chromium
The first command installs the playwright package. The second downloads a known-good Chromium build into ~/.cache/ms-playwright/. You can also install Firefox and WebKit:
playwright install firefox webkit
Or all three at once with playwright install (no argument). Most production scrapers stick to Chromium; it's the fastest and best-tested of the three. Firefox and WebKit are useful for cross-browser bug repros, not daily scraping.
Verify the install
python -c "from playwright.sync_api import sync_playwright; print('ok')"
If that prints ok, you're done. If you see an import error, you missed the pip install. If you see "executable doesn't exist", you missed playwright install chromium.
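If you want that check as a script rather than a one-liner, here is a small sketch that classifies which half of the install is missing and answers with the command that fixes it. The `install_status` helper is a name of my own, not part of Playwright:

```python
import importlib.util


def install_status() -> str:
    """Return 'ok', or the command that fixes the missing half."""
    # Half one: the Python package itself.
    if importlib.util.find_spec("playwright") is None:
        return "pip install playwright"
    # Half two: the downloaded browser binary.
    try:
        from playwright.sync_api import sync_playwright
        with sync_playwright() as p:
            p.chromium.launch(headless=True).close()
        return "ok"
    except Exception:
        return "playwright install chromium"


if __name__ == "__main__":
    print(install_status())
```

Running it prints `ok` on a healthy install, or tells you which of the two install commands to re-run.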
Your first scraper
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://practice.scrapingcentral.com/")
    print(page.title())
    print(page.locator("h1").first.inner_text())
    browser.close()
Run it. You should see the page title and the <h1> text. That is a working Playwright scraper. Everything else in this sub-path is a variation on these eight lines.
What each line does
- `with sync_playwright() as p:` starts the Playwright supervisor process. The `with` block guarantees clean shutdown.
- `p.chromium.launch(headless=True)` spawns a Chromium instance. `headless=False` opens a real visible window (debugging).
- `browser.new_page()` creates a fresh page (tab) inside the default browser context.
- `page.goto(url)` navigates and waits for the page's `load` event by default. Returns a `Response` object.
- `page.locator("h1").first.inner_text()` queries the DOM, picks the first match, and returns its text.
- `browser.close()` terminates the browser process.
Compared to requests, the new ideas are: launch a process, open a page, query with locators, close cleanly. That is the whole API surface at the top level.
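One practical use of the `Response` object that `goto` returns is failing fast on HTTP errors before parsing anything. A sketch, with a hypothetical `fetch_title` helper; the Playwright import is deferred into the function so the file can be imported on machines without Playwright installed:

```python
def is_http_error(status: int) -> bool:
    """4xx and 5xx responses count as errors worth aborting on."""
    return status >= 400


def fetch_title(url: str) -> str:
    """Navigate to url and return its title, raising on an HTTP error."""
    # Deferred import: lets the pure helpers above be used anywhere.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        resp = page.goto(url)
        # goto returns None only for same-document navigations (e.g. #anchors)
        if resp is not None and is_http_error(resp.status):
            browser.close()
            raise RuntimeError(f"HTTP {resp.status} for {url}")
        title = page.title()
        browser.close()
        return title


if __name__ == "__main__":
    print(fetch_title("https://practice.scrapingcentral.com/"))
```

Checking `resp.status` up front saves you from happily scraping a 404 page's "not found" markup.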
Headed vs headless
browser = p.chromium.launch(headless=False, slow_mo=500)
headless=False opens a visible browser window. slow_mo=500 delays every action by 500ms so you can see what the scraper is doing. Both are debugging aids; turn them off for production.
A common pattern is to read these from environment variables:
import os
headless = os.environ.get("HEADLESS", "1") == "1"
slow_mo = int(os.environ.get("SLOW_MO", "0"))
browser = p.chromium.launch(headless=headless, slow_mo=slow_mo)
Then HEADLESS=0 SLOW_MO=300 python scrape.py flips into debug mode without code changes.
Sync vs async, which to use
Playwright Python has two APIs:
# Sync
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    ...

# Async
from playwright.async_api import async_playwright

async with async_playwright() as p:
    ...
Use sync for:
- Scripts you run from the command line.
- Code inside a Jupyter notebook (sometimes, depends on the kernel).
- Anywhere that doesn't already have an event loop.
Use async for:
- Scrapers that drive multiple pages concurrently inside one Python process (Lesson 2.26).
- Integration with async frameworks (FastAPI, aiohttp, Scrapy with asyncio reactor).
- Anywhere you need to interleave Playwright calls with other async I/O.
For learning purposes, start with sync. It's strictly simpler. You can swap to async later when concurrency demands it.
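To make the contrast concrete, here is the first script rewritten against the async API as a sketch. Every Playwright call gains an `await`, but the shape is otherwise identical; the import is deferred inside `main` so the file parses without Playwright installed:

```python
import asyncio


async def main() -> None:
    # Async twin of the sync script above: same calls, each awaited.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://practice.scrapingcentral.com/")
        print(await page.title())
        print(await page.locator("h1").first.inner_text())
        await browser.close()


if __name__ == "__main__":
    asyncio.run(main())
```

Note that `page.locator(...)` itself stays synchronous (it just builds a query); only the calls that talk to the browser are awaited.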
Pulling more data
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://practice.scrapingcentral.com/")

    # Every link on the page
    for a in page.locator("a[href]").all():
        text = a.inner_text().strip()
        href = a.get_attribute("href")
        print(f"{text!r:30} → {href}")

    # Take a screenshot
    page.screenshot(path="home.png", full_page=True)

    browser.close()
Three new things:
- `page.locator(...).all()` returns a list of all matching elements you can iterate.
- `get_attribute("href")` reads an attribute (vs `inner_text()` for the rendered text).
- `page.screenshot(...)` saves a PNG. `full_page=True` captures the whole scrollable area, not just the viewport.
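One wrinkle worth knowing: `get_attribute("href")` returns the raw attribute value, so relative links come back relative. A small standard-library sketch (the sample hrefs are made up) resolves them against the page URL before you fetch them:

```python
from urllib.parse import urljoin


def absolutize(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve raw href values against the page they were scraped from."""
    return [urljoin(base_url, h) for h in hrefs]


# Hypothetical hrefs pulled from a listing page:
links = absolutize(
    "https://practice.scrapingcentral.com/products/",
    ["/login", "item?id=3", "https://example.com/x"],
)
print(links)
# → ['https://practice.scrapingcentral.com/login',
#    'https://practice.scrapingcentral.com/products/item?id=3',
#    'https://example.com/x']
```

`urljoin` handles all three cases correctly: root-relative paths, page-relative paths, and already-absolute URLs, which pass through unchanged.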
When goto returns
page.goto(url) waits for the load event by default: the browser has fired DOMContentLoaded and most resources have finished loading. You can change it:
page.goto(url, wait_until="domcontentloaded") # earliest: HTML parsed
page.goto(url, wait_until="load") # default: most resources loaded
page.goto(url, wait_until="networkidle") # latest: 500ms with no network activity
page.goto(url, wait_until="commit") # earliest possible: response received
networkidle is tempting but unreliable on sites with long-poll connections, analytics beacons, or live-update streams; those connections never go idle. Prefer domcontentloaded plus an explicit wait for the element you actually need. Lesson 2.9 covers the full waiting strategy.
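That combination, sketched as a helper of my own (`grab_heading`; the timeout value is arbitrary, and the Playwright import is deferred so the file parses without the package):

```python
def grab_heading(url: str, timeout_ms: int = 10_000) -> str:
    """domcontentloaded navigation + an explicit wait for one element."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Return as soon as the HTML is parsed...
        page.goto(url, wait_until="domcontentloaded")
        # ...then wait only for the one element this scraper needs.
        h1 = page.locator("h1").first
        h1.wait_for(state="visible", timeout=timeout_ms)
        text = h1.inner_text()
        browser.close()
        return text
```

This is robust on chatty pages because it never cares whether the network goes idle, only whether the `<h1>` is actually rendered.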
Cleanup matters
The with block ensures the browser closes even if your code throws. Without it:
p = sync_playwright().start()
browser = p.chromium.launch()
# ... if anything below raises, browser stays alive ...
browser.close()
p.stop()
You will leak Chromium processes. They will eat your RAM. Use the context manager.
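If you genuinely can't use the context manager (say, the browser's lifetime spans several functions), a try/finally sketch gives the same guarantee. The `scrape_title` name is hypothetical, and the import is deferred into the function:

```python
def scrape_title(url: str) -> str:
    """Manual start/stop with the same crash-safety as the with block."""
    from playwright.sync_api import sync_playwright

    p = sync_playwright().start()
    browser = None
    try:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        return page.title()
    finally:
        # Runs even if launch/goto raises, so no Chromium process is leaked.
        if browser is not None:
            browser.close()
        p.stop()
```

The `browser = None` sentinel matters: if `launch` itself raises, the finally block still runs, and it must not call `close()` on a browser that never existed.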
Hands-on lab
Install Playwright and run the eight-line script against https://practice.scrapingcentral.com/. Confirm you get the page title and an h1. Then switch to headless=False, slow_mo=400, re-run, and watch the browser actually do the work. That visual feedback is invaluable while you're learning.