Scraping with Playwright in Python - Browser Automation

Playwright's Python bindings offer both synchronous and asynchronous APIs, making it flexible for simple scripts and production scraping pipelines alike. This guide covers practical patterns for extracting data, handling pagination, and exporting results.

Sync vs Async API

Playwright for Python provides two APIs. The sync API is simpler for scripts and small tasks. The async API runs on asyncio, so a single process can drive several pages concurrently, which suits I/O-bound jobs that scrape many URLs.

Sync API:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com")
    print(page.title())
    browser.close()

Async API:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://quotes.toscrape.com")
        print(await page.title())
        await browser.close()

asyncio.run(main())
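
The main payoff of the async API is concurrency, and it usually helps to cap how many pages are in flight at once. The following is a minimal sketch of that idea using asyncio.Semaphore. The names gather_limited and fake_fetch are hypothetical and not part of Playwright; fake_fetch stands in for real page work (opening a page, calling goto, reading the title) so the pattern can be run without a browser.

```python
import asyncio
from typing import Awaitable, Callable


async def gather_limited(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 3,
) -> list[str]:
    """Run fetch(url) for every URL, with at most max_concurrency running at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url: str) -> str:
        async with semaphore:  # wait here if the limit is already reached
            return await fetch(url)

    # asyncio.gather returns results in the same order as the inputs
    return await asyncio.gather(*(worker(url) for url in urls))


async def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for Playwright calls such as
    # page = await browser.new_page(); await page.goto(url); await page.title()
    await asyncio.sleep(0.01)
    return f"title for {url}"


if __name__ == "__main__":
    urls = [f"https://quotes.toscrape.com/page/{n}/" for n in range(1, 6)]
    for line in asyncio.run(gather_limited(urls, fake_fetch)):
        print(line)
```

In a real scraper, the fetch argument would be a coroutine that opens its own page from a shared browser and closes it when done. A small limit keeps memory use predictable, since every open page costs browser resources.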

Complete Scraping Pipeline

Here is a full example that scrapes multiple pages and exports data to CSV:

import csv
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
from playwright.sync_api import sync_playwright

def scrape_quotes():
    all_quotes = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Block images for faster loading
        page.route("**/*.{png,jpg,jpeg,gif,svg}", lambda route: route.abort())

        page_num = 1
        while True:
            url = f"https://quotes.toscrape.com/js/page/{page_num}/"
            page.goto(url)

            # Wait for quotes to render; a timeout means the page has none
            try:
                page.wait_for_selector(".quote", timeout=5000)
            except PlaywrightTimeoutError:
                break  # No more pages

            quotes = page.query_selector_all(".quote")
            if not quotes:
                break

            for quote in quotes:
                text = quote.query_selector(".text").inner_text()
                author = quote.query_selector(".author").inner_text()
                tags = [
                    tag.inner_text()
                    for tag in quote.query_selector_all(".tag")
                ]
                all_quotes.append({
                    "text": text,
                    "author": author,
                    "tags": ", ".join(tags),
                    "page": page_num
                })

            page_num += 1

        browser.close()

    return all_quotes

# Run the scraper
quotes = scrape_quotes()

# Export to CSV
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "author", "tags", "page"])
    writer.writeheader()
    writer.writerows(quotes)

print(f"Scraped {len(quotes)} quotes across multiple pages")
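
The CSV export above flattens tags into one comma-separated string. If a downstream consumer needs tags as a list, JSON Lines is one alternative. The sketch below is an illustration only: write_jsonl is a hypothetical helper name, and the sample row is made up rather than scraped.

```python
import io
import json
from typing import Iterable, TextIO


def write_jsonl(rows: Iterable[dict], out: TextIO) -> int:
    """Write one JSON object per line and return the number of rows written."""
    count = 0
    for row in rows:
        # ensure_ascii=False keeps curly quotes and accented names readable
        out.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Made-up sample row with tags kept as a list
    sample = [
        {"text": "A sample quote.", "author": "Jane Doe", "tags": ["a", "b"], "page": 1},
    ]
    buffer = io.StringIO()
    write_jsonl(sample, buffer)
    print(buffer.getvalue(), end="")
```

To write to disk, pass a file opened with encoding="utf-8" instead of the StringIO buffer. Taking a file-like object rather than a path keeps the helper easy to test.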

Extracting Structured Data with evaluate

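Use page.evaluate to run JavaScript inside the page and return JSON-serializable data in a single call, instead of making one round trip per element. The snippet assumes page is already open on a page of rendered quotes: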

data = page.evaluate("""
    () => {
        return Array.from(document.querySelectorAll('.quote')).map(el => ({
            text: el.querySelector('.text').innerText,
            author: el.querySelector('.author').innerText,
            tags: Array.from(el.querySelectorAll('.tag')).map(t => t.innerText)
        }));
    }
""")

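The value returned by page.evaluate arrives in Python as plain lists and dicts, so it can help to normalize it into typed records before export. The sketch below assumes records shaped like the snippet above (text, author, tags); the Quote class and to_quotes function are hypothetical names, and stripping the surrounding curly quotes is an optional cleanup choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    text: str
    author: str
    tags: tuple[str, ...]


def to_quotes(records: list[dict]) -> list[Quote]:
    """Convert raw dicts (as returned by page.evaluate) into Quote records."""
    quotes = []
    for record in records:
        quotes.append(
            Quote(
                # Trim whitespace, then any surrounding curly quotation marks
                text=record["text"].strip().strip("“”"),
                author=record["author"].strip(),
                tags=tuple(tag.strip() for tag in record.get("tags", [])),
            )
        )
    return quotes


if __name__ == "__main__":
    # Made-up record in the shape the evaluate snippet returns
    raw = [{"text": " “Hi.” ", "author": " A ", "tags": [" x "]}]
    print(to_quotes(raw))
```

A missing text or author key raises KeyError here, which surfaces selector problems early instead of silently writing empty rows.
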
Error Handling and Retries

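Navigation and selector waits can fail on slow or flaky pages. Retry a few times with exponential backoff before giving up:

import time

from playwright.sync_api import Error as PlaywrightError

def scrape_with_retry(page, url, max_retries=3):
    """Return the rendered HTML for url, or None if every attempt fails."""
    for attempt in range(max_retries):
        try:
            page.goto(url, timeout=30000)
            # ".content" is a placeholder; use a selector your target page renders
            page.wait_for_selector(".content", timeout=10000)
            return page.content()
        except PlaywrightError as e:  # Also covers Playwright's TimeoutError
            print(f"Attempt {attempt + 1} failed for {url}: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, ...
    return None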
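
When many workers retry on the same schedule, their retries can arrive at the target site in bursts. A common variation on the fixed 2 ** attempt delay is to cap it and add random jitter. The sketch below shows one such schedule ("full jitter"); backoff_delay and its parameters are hypothetical names chosen for illustration, not part of Playwright.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0, rng=random) -> float:
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


if __name__ == "__main__":
    # Seeded generator so repeated runs print the same schedule
    seeded = random.Random(0)
    for attempt in range(5):
        print(f"attempt {attempt}: sleep {backoff_delay(attempt, rng=seeded):.2f}s")
```

To use it in a retry loop like the one above, the time.sleep(2 ** attempt) call could become time.sleep(backoff_delay(attempt)).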

Using Playwright with BeautifulSoup

Combine Playwright's rendering with BeautifulSoup's parsing:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")
    page.wait_for_selector(".quote")

    # Get rendered HTML and parse with BeautifulSoup
    html = page.content()
    soup = BeautifulSoup(html, "html.parser")

    for quote in soup.select(".quote"):
        print(quote.select_one(".text").get_text())

    browser.close()

When to Use a Managed Service

For large-scale Python scraping pipelines, managing Playwright browsers on your servers means handling browser crashes, memory leaks, and scaling challenges. ScraperAPI offers a Python SDK that renders JavaScript pages via a simple API call. ScrapingAnt provides similar functionality with a Python client library.

Next Steps

Set up Selenium Grid for distributed scraping
Learn anti-detection techniques
Compare Playwright vs Selenium vs Puppeteer