Scraping with Playwright in Python
A comprehensive guide to web scraping with Playwright in Python, covering sync and async APIs, data extraction patterns, and exporting results.
Playwright's Python bindings offer both synchronous and asynchronous APIs, making it flexible for simple scripts and production scraping pipelines alike. This guide covers practical patterns for extracting data, handling pagination, and exporting results.
Sync vs Async API
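Playwright for Python provides two APIs that expose the same methods. The sync API is simpler for scripts and small tasks. The async API suits workloads that drive several pages concurrently from one asyncio event loop, or code that already runs inside asyncio.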
Sync API:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com")
    print(page.title())
    browser.close()
Async API:
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://quotes.toscrape.com")
        print(await page.title())
        await browser.close()

asyncio.run(main())
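The async API pays off mainly when several pages are loaded at once. The following is a minimal sketch of that idea, not a definitive implementation: `run_limited` and `fetch_title` are hypothetical helper names, the limit of 3 concurrent pages is an arbitrary assumption, and the URLs are only illustrative. Playwright is imported inside `main()` so the helper itself stays usable without it.

```python
import asyncio


async def run_limited(worker, items, limit=3):
    """Run worker(item) for every item, with at most `limit` running at once.

    Results come back in the same order as `items`.
    """
    sem = asyncio.Semaphore(limit)

    async def guarded(item):
        async with sem:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def fetch_title(context, url):
    """Open a fresh page in the given browser context and return its title."""
    page = await context.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        # Always close the page so tabs do not pile up on errors
        await page.close()


async def main():
    # Imported here so the helpers above do not require Playwright
    from playwright.async_api import async_playwright

    # Illustrative URLs (assumption): the first three listing pages
    urls = [f"https://quotes.toscrape.com/page/{n}/" for n in range(1, 4)]
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        titles = await run_limited(lambda url: fetch_title(context, url), urls)
        await browser.close()
    return titles
```

To try it, call `asyncio.run(main())`. The semaphore caps how many tabs are open at the same time, which keeps memory use predictable when the URL list grows.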
Complete Scraping Pipeline
Here is a full example that scrapes multiple pages and exports data to CSV:
import csv
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError

def scrape_quotes():
    all_quotes = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Block images for faster loading
        page.route("**/*.{png,jpg,jpeg,gif,svg}", lambda route: route.abort())
        page_num = 1
        while True:
            url = f"https://quotes.toscrape.com/js/page/{page_num}/"
            page.goto(url)
            # Wait for quotes to render
            try:
                page.wait_for_selector(".quote", timeout=5000)
            except PlaywrightTimeoutError:
                break  # No quotes rendered: we are past the last page
            quotes = page.query_selector_all(".quote")
            if not quotes:
                break
            for quote in quotes:
                text = quote.query_selector(".text").inner_text()
                author = quote.query_selector(".author").inner_text()
                tags = [
                    tag.inner_text()
                    for tag in quote.query_selector_all(".tag")
                ]
                all_quotes.append({
                    "text": text,
                    "author": author,
                    "tags": ", ".join(tags),
                    "page": page_num
                })
            page_num += 1
        browser.close()
    return all_quotes

# Run the scraper
quotes = scrape_quotes()

# Export to CSV
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "author", "tags", "page"])
    writer.writeheader()
    writer.writerows(quotes)

print(f"Scraped {len(quotes)} quotes across multiple pages")
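The CSV export above flattens each quote's tags into one comma-joined string. If the tags should stay a list, JSON Lines (one JSON object per line) is a common alternative. This is a small sketch under that assumption; `write_jsonl` and `read_jsonl` are hypothetical helper names, and the record shape simply mirrors the dictionaries built by the scraper.

```python
import json


def write_jsonl(records, path):
    """Write one JSON object per line and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps curly quotes and accents readable
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path):
    """Read the file back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A call such as `write_jsonl(quotes, "quotes.jsonl")` would work with the scraper's output as long as each record holds only JSON-serializable values.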
Extracting Structured Data with evaluate
Use page.evaluate to run JavaScript and return structured data directly:
data = page.evaluate("""
    () => {
        return Array.from(document.querySelectorAll('.quote')).map(el => ({
            text: el.querySelector('.text').innerText,
            author: el.querySelector('.author').innerText,
            tags: Array.from(el.querySelectorAll('.tag')).map(t => t.innerText)
        }));
    }
""")
Error Handling and Retries
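Navigation and selector waits can fail transiently (timeouts, dropped connections), so production scrapers should retry with a growing delay between attempts. The helper below returns None once every attempt has failed: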
import time

def scrape_with_retry(page, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            page.goto(url, timeout=30000)
            page.wait_for_selector(".content", timeout=10000)
            return page.content()
        except Exception as e:
            print(f"Attempt {attempt + 1} failed for {url}: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
    return None
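A possible variation on the retry helper, sketched here under a few assumptions: the backoff schedule is computed separately and capped, the sleep function is injectable so the logic can be tested without waiting, and the final failure raises instead of returning None. The names `backoff_delays` and `fetch_html_with_retry`, the 30-second cap, and the `selector` parameter are all hypothetical.

```python
import time


def backoff_delays(max_retries, base=1.0, cap=30.0):
    """Delays between attempts: base * 2**n, capped at `cap`.

    There is one fewer delay than attempts, since nothing follows the last try.
    """
    return [min(cap, base * 2 ** n) for n in range(max_retries - 1)]


def fetch_html_with_retry(page, url, selector, max_retries=3, sleep=time.sleep):
    """Load url on a Playwright sync page and return its HTML, retrying on failure."""
    delays = backoff_delays(max_retries)
    last_error = None
    for attempt in range(max_retries):
        try:
            page.goto(url, timeout=30000)
            page.wait_for_selector(selector, timeout=10000)
            return page.content()
        except Exception as exc:  # could be narrowed to Playwright's own error type
            last_error = exc
            if attempt < len(delays):
                sleep(delays[attempt])
    # Surface the last underlying error instead of silently returning None
    raise RuntimeError(
        f"Giving up on {url} after {max_retries} attempts"
    ) from last_error
```

Raising on final failure makes it harder to process a missing page by accident; callers that prefer the None convention can catch `RuntimeError` instead.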
Using Playwright with BeautifulSoup
Combine Playwright's rendering with BeautifulSoup's parsing:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")
    page.wait_for_selector(".quote")
    # Get rendered HTML and parse with BeautifulSoup
    html = page.content()
    soup = BeautifulSoup(html, "html.parser")
    for quote in soup.select(".quote"):
        print(quote.select_one(".text").get_text())
    browser.close()
When to Use a Managed Service
For large-scale Python scraping pipelines, managing Playwright browsers on your servers means handling browser crashes, memory leaks, and scaling challenges. ScraperAPI offers a Python SDK that renders JavaScript pages via a simple API call. ScrapingAnt provides similar functionality with a Python client library.
Next Steps
- Set up Selenium Grid for distributed scraping
- Learn anti-detection techniques
- Compare Playwright vs Selenium vs Puppeteer