

scrapy-playwright: Hybrid Scrapy + Browser

Add a real browser to Scrapy for the pages that need JavaScript, without throwing away the framework. The bridge between Sub-Path 3 and production-scale scraping.

What you’ll learn

  • Install and configure scrapy-playwright in an existing Scrapy project.
  • Mark per-request which URLs need a browser and which don't.
  • Use Playwright page methods (wait_for_selector, click) from inside a Scrapy callback.

Some pages can't be parsed without running JavaScript. You've seen this in Sub-Path 3: SPAs, lazy-loaded content, modals. The naive answer is "use Playwright everywhere," but browsers are slow and expensive. The pragmatic answer: use Scrapy for everything that works statically, and only spin up a browser for the URLs that need it.

scrapy-playwright is the bridge.

Install and configure

pip install scrapy-playwright
playwright install chromium

In settings.py:

DOWNLOAD_HANDLERS = {
  "http":  "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
  "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
  "headless": True,
}
PLAYWRIGHT_MAX_CONTEXTS = 8  # max concurrent browser contexts
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30_000

That's the minimum. The asyncio reactor is required because Playwright is async.

Per-request opt-in

The key insight: not every request needs a browser. Use meta["playwright"] to mark only the ones that do.

import scrapy

class HybridSpider(scrapy.Spider):
    name = "hybrid"
    start_urls = ["https://practice.scrapingcentral.com/products"]

    def parse(self, response):
        # Listing page works fine statically. No browser needed.
        for href in response.css(".product-card a::attr(href)").getall():
            yield response.follow(href, self.parse_product)

    def parse_product(self, response):
        # Some product pages have JS-loaded reviews. Use Playwright for those.
        if "spa-pure" in response.url:
            yield response.follow(
                response.url,
                self.parse_spa,
                meta={"playwright": True, "playwright_include_page": True},
                dont_filter=True,
            )
        else:
            yield {"url": response.url, "title": response.css("h1::text").get()}

    async def parse_spa(self, response):
        page = response.meta["playwright_page"]
        await page.wait_for_selector(".product-loaded")
        yield {
            "url": response.url,
            "title": await page.locator("h1").inner_text(),
            "price": await page.locator(".price").inner_text(),
        }
        await page.close()

A few important details:

  • meta["playwright"]=True tells the download handler to use the browser.
  • meta["playwright_include_page"]=True keeps the page object alive after the response, so you can interact with it in the callback.
  • The callback becomes async def, so you can await Playwright methods directly.
  • Always close the page yourself: await page.close() at the end of the callback. Otherwise contexts leak and you'll eventually OOM.
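Closing the page in the callback covers the happy path, but a failed request never reaches the callback. A common pattern is an errback that releases the page on failure; the sketch below keeps it import-free and shows the spider wiring in comments. The function name is mine; the playwright_page meta key is scrapy-playwright's.

```python
# Sketch: release the Playwright page when a browser-backed request fails.
# Without this, requests with playwright_include_page=True that error out
# leave their context open.

async def close_page_on_error(failure):
    """Errback: close the Playwright page attached to the failed request."""
    page = failure.request.meta.get("playwright_page")
    if page is not None:
        await page.close()

# Wiring inside a spider (as a method, add `self`):
#   yield scrapy.Request(
#       url,
#       meta={"playwright": True, "playwright_include_page": True},
#       callback=self.parse_spa,
#       errback=self.close_page_on_error,
#   )
```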

PageMethods for declarative actions

If your callback only needs "wait for selector, then return HTML" you can declare actions inline without an async callback:

from scrapy_playwright.page import PageMethod

yield scrapy.Request(
  url="https://practice.scrapingcentral.com/challenges/dynamic/spa-pure",
  meta={
  "playwright": True,
  "playwright_page_methods": [
  PageMethod("wait_for_selector", ".product-card"),
  PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
  PageMethod("wait_for_selector", ".product-card:nth-child(20)"),
  ],
  },
  callback=self.parse_loaded,
)

PageMethods run before your callback receives the response. The callback then sees the fully rendered HTML, just like a normal Scrapy callback; no async def required.
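Conceptually, each PageMethod is just the name of a Playwright page method plus the arguments to call it with, awaited in order before the response is built. A rough, simplified re-imagining of that loop (the real implementation lives in scrapy_playwright's download handler and handles more cases):

```python
import asyncio

class PageMethod:
    """Name of a page method plus the arguments to call it with."""
    def __init__(self, method, *args, **kwargs):
        self.method = method
        self.args = args
        self.kwargs = kwargs

async def apply_page_methods(page, page_methods):
    """Look up each declared method on the page object and await it, in order."""
    for pm in page_methods:
        result = getattr(page, pm.method)(*pm.args, **pm.kwargs)
        if asyncio.iscoroutine(result):
            await result
```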

Costs and limits

Browser-backed requests cost roughly:

  • 200–500ms per page even with a warm context (cold starts are slower)
  • ~150 MB RAM per context
  • 1 CPU core per ~5 concurrent browsers (depends on page complexity)

A pure-Scrapy spider can do thousands of req/min on a small server. A Playwright spider does 60–300 req/min depending on hardware. Save browser-mode for pages that genuinely need it.

The numbers are why hybrid mode wins: scrape 95% of URLs statically at full speed, swap to a browser for the 5% that need it.
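A back-of-the-envelope check makes the gap concrete. The rates below are assumed midpoints of the ranges above (2,000 static req/min, 120 browser req/min), not measurements:

```python
# Crawl time for a mixed workload: how many minutes to fetch `total` URLs
# when `browser_share` of them need a browser. Rates are illustrative.

def crawl_minutes(total, browser_share, static_rpm, browser_rpm):
    static = total * (1 - browser_share)
    browser = total * browser_share
    return static / static_rpm + browser / browser_rpm

hybrid = crawl_minutes(10_000, 0.05, 2_000, 120)       # ~8.9 minutes
all_browser = crawl_minutes(10_000, 1.0, 2_000, 120)   # ~83 minutes
```

Even with only 5% of URLs in browser mode, those 500 requests account for almost half the total crawl time, which is exactly why pushing the browser share down matters.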

Contexts for isolation

Each meta["playwright_context"] value gets its own Chromium browser context (= profile = cookie jar = isolated localStorage):

yield scrapy.Request(
    url=...,
    meta={
        "playwright": True,
        "playwright_context": "user_42",
    },
)

Different playwright_context values are different "users" from the server's perspective. Use this when running multiple logged-in identities, just like Scrapy's cookiejar.

Configure context defaults globally:

PLAYWRIGHT_CONTEXTS = {
    "default": {
        "viewport": {"width": 1280, "height": 800},
        "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 ...",
        "locale": "en-US",
    }
}

When to skip scrapy-playwright

  • The whole site needs JS: just use Playwright directly. Scrapy's machinery overhead isn't worth it when every page is browser-mode.
  • You're doing form login + 5 clicks: write a Playwright script, save state, then come back to Scrapy with the cookies.
  • You need full network interception with response mocking: pure Playwright is more flexible.
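For the login handoff, the Playwright side is a script that ends with context.storage_state(path="state.json"); the Scrapy side then loads those cookies. A minimal sketch of the loading half (the helper name and file path are illustrative):

```python
import json

def cookies_from_storage_state(path):
    """Turn Playwright's storage_state JSON into a name -> value cookie dict."""
    with open(path) as f:
        state = json.load(f)
    # storage_state stores cookies as a list of {"name", "value", "domain", ...}
    return {c["name"]: c["value"] for c in state.get("cookies", [])}

# In the spider:
#   yield scrapy.Request(url, cookies=cookies_from_storage_state("state.json"))
```

Note that a flat dict drops the domain/path scoping; if you need it, pass Scrapy's list-of-dicts cookie form instead.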

Hands-on lab

Against /challenges/dynamic/spa-pure:

  1. Confirm the page is JS-rendered by running curl: the .product-card elements aren't in the raw HTML.
  2. Write a Scrapy spider with scrapy-playwright that uses PageMethod("wait_for_selector", ".product-card") to wait for the SPA to hydrate, then extracts product data using normal CSS selectors on response.
  3. Run with CONCURRENT_REQUESTS = 4 and PLAYWRIGHT_MAX_CONTEXTS = 4. Note throughput vs a pure-static spider.

The exercise is to feel the latency cost first-hand. It's why "use Scrapy first, browser only when needed" is the right default.

