

scrapy-playwright: Hybrid Scrapy + Browser

Add a real browser to Scrapy for the pages that need JavaScript, without throwing away the framework. The bridge between Sub-Path 3 and production-scale scraping.

What you’ll learn

  • Install and configure scrapy-playwright in an existing Scrapy project.
  • Mark per-request which URLs need a browser and which don't.
  • Use Playwright page methods (wait_for_selector, click) from inside a Scrapy callback.

Some pages can't be parsed without running JavaScript. You've seen this in Sub-Path 3: SPAs, lazy-loaded content, modals. The naive answer is "use Playwright everywhere," but browsers are slow and expensive. The pragmatic answer: use Scrapy for everything that works statically, and only spin up a browser for the URLs that need it.

scrapy-playwright is the bridge.

Install and configure

pip install scrapy-playwright
playwright install chromium

In settings.py:

DOWNLOAD_HANDLERS = {
  "http":  "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
  "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
  "headless": True,
}
PLAYWRIGHT_MAX_CONTEXTS = 8  # max concurrent browser contexts
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30_000

That's the minimum. The asyncio reactor is required because Playwright is async.

Per-request opt-in

The key insight: not every request needs a browser. Use meta["playwright"] to mark only the ones that do.

import scrapy

class HybridSpider(scrapy.Spider):
    name = "hybrid"
    start_urls = ["https://practice.scrapingcentral.com/products"]

    def parse(self, response):
        # Listing page works fine statically. No browser needed.
        for href in response.css(".product-card a::attr(href)").getall():
            yield response.follow(href, self.parse_product)

    def parse_product(self, response):
        # Some product pages have JS-loaded reviews. Use Playwright for those.
        if "spa-pure" in response.url:
            yield response.follow(
                response.url,
                self.parse_spa,
                meta={"playwright": True, "playwright_include_page": True},
                dont_filter=True,
            )
        else:
            yield {"url": response.url, "title": response.css("h1::text").get()}

    async def parse_spa(self, response):
        page = response.meta["playwright_page"]
        await page.wait_for_selector(".product-loaded")
        yield {
            "url": response.url,
            "title": await page.locator("h1").inner_text(),
            "price": await page.locator(".price").inner_text(),
        }
        await page.close()

A few important details:

  • meta["playwright"]=True tells the download handler to use the browser.
  • meta["playwright_include_page"]=True keeps the page object alive after the response, so you can interact with it in the callback.
  • The callback becomes async def, so you can await Playwright methods directly.
  • Always close the page yourself: await page.close() at the end of the callback. Otherwise contexts leak and you'll eventually OOM.
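Closing the page in the callback covers the happy path, but a failed request never reaches the callback. A common pattern is an errback that releases the page on failure; the sketch below keeps it import-free and shows the spider wiring in comments. The function name is mine; the playwright_page meta key is scrapy-playwright's.

```python
# Sketch: release the Playwright page when a browser-backed request fails.
# Without this, requests with playwright_include_page=True that error out
# leave their context open.

async def close_page_on_error(failure):
    """Errback: close the Playwright page attached to the failed request."""
    page = failure.request.meta.get("playwright_page")
    if page is not None:
        await page.close()

# Wiring inside a spider (as a method, add `self`):
#   yield scrapy.Request(
#       url,
#       meta={"playwright": True, "playwright_include_page": True},
#       callback=self.parse_spa,
#       errback=self.close_page_on_error,
#   )
```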

PageMethods for declarative actions

If your callback only needs "wait for selector, then return HTML" you can declare actions inline without an async callback:

from scrapy_playwright.page import PageMethod

yield scrapy.Request(
  url="https://practice.scrapingcentral.com/challenges/dynamic/spa-pure",
  meta={
  "playwright": True,
  "playwright_page_methods": [
  PageMethod("wait_for_selector", ".product-card"),
  PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
  PageMethod("wait_for_selector", ".product-card:nth-child(20)"),
  ],
  },
  callback=self.parse_loaded,
)

PageMethods run before your callback receives the response. The callback then sees the fully rendered HTML, just like a normal Scrapy callback; no async def required.
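Conceptually, each PageMethod is just the name of a Playwright page method plus the arguments to call it with, awaited in order before the response is built. A rough, simplified re-imagining of that loop (the real implementation lives in scrapy_playwright's download handler and handles more cases):

```python
import asyncio

class PageMethod:
    """Name of a page method plus the arguments to call it with."""
    def __init__(self, method, *args, **kwargs):
        self.method = method
        self.args = args
        self.kwargs = kwargs

async def apply_page_methods(page, page_methods):
    """Look up each declared method on the page object and await it, in order."""
    for pm in page_methods:
        result = getattr(page, pm.method)(*pm.args, **pm.kwargs)
        if asyncio.iscoroutine(result):
            await result
```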

Costs and limits

Browser-backed requests cost roughly:

  • 200–500ms per page even with a warm context (cold starts are slower)
  • ~150 MB RAM per context
  • 1 CPU core per ~5 concurrent browsers (depends on page complexity)

A pure-Scrapy spider can do thousands of req/min on a small server. A Playwright spider does 60–300 req/min depending on hardware. Save browser-mode for pages that genuinely need it.

The numbers are why hybrid mode wins: scrape 95% of URLs statically at full speed, swap to a browser for the 5% that need it.
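A back-of-the-envelope check makes the gap concrete. The rates below are assumed midpoints of the ranges above (2,000 static req/min, 120 browser req/min), not measurements:

```python
# Crawl time for a mixed workload: how many minutes to fetch `total` URLs
# when `browser_share` of them need a browser. Rates are illustrative.

def crawl_minutes(total, browser_share, static_rpm, browser_rpm):
    static = total * (1 - browser_share)
    browser = total * browser_share
    return static / static_rpm + browser / browser_rpm

hybrid = crawl_minutes(10_000, 0.05, 2_000, 120)       # ~8.9 minutes
all_browser = crawl_minutes(10_000, 1.0, 2_000, 120)   # ~83 minutes
```

Even with only 5% of URLs in browser mode, those 500 requests account for almost half the total crawl time, which is exactly why pushing the browser share down matters.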

Contexts for isolation

Each meta["playwright_context"] value gets its own Chromium browser context (= profile = cookie jar = isolated localStorage):

yield scrapy.Request(
    url=...,
    meta={
        "playwright": True,
        "playwright_context": "user_42",
    },
)

Different playwright_context values are different "users" from the server's perspective. Use this when running multiple logged-in identities, just like Scrapy's cookiejar.

Configure context defaults globally:

PLAYWRIGHT_CONTEXTS = {
    "default": {
        "viewport": {"width": 1280, "height": 800},
        "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 ...",
        "locale": "en-US",
    }
}

When to skip scrapy-playwright

  • The whole site needs JS: just use Playwright directly. Scrapy's machinery overhead isn't worth it when every page is browser-mode.
  • You're doing form login + 5 clicks: write a Playwright script, save state, then come back to Scrapy with the cookies.
  • You need full network interception with response mocking: pure Playwright is more flexible.
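For the login handoff, the Playwright side is a script that ends with context.storage_state(path="state.json"); the Scrapy side then loads those cookies. A minimal sketch of the loading half (the helper name and file path are illustrative):

```python
import json

def cookies_from_storage_state(path):
    """Turn Playwright's storage_state JSON into a name -> value cookie dict."""
    with open(path) as f:
        state = json.load(f)
    # storage_state stores cookies as a list of {"name", "value", "domain", ...}
    return {c["name"]: c["value"] for c in state.get("cookies", [])}

# In the spider:
#   yield scrapy.Request(url, cookies=cookies_from_storage_state("state.json"))
```

Note that a flat dict drops the domain/path scoping; if you need it, pass Scrapy's list-of-dicts cookie form instead.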

Hands-on lab

Against /challenges/dynamic/spa-pure:

  1. Confirm the page is JS-rendered by running curl: the .product-card elements aren't in the raw HTML.
  2. Write a Scrapy spider with scrapy-playwright that uses PageMethod("wait_for_selector", ".product-card") to wait for the SPA to hydrate, then extracts product data using normal CSS selectors on response.
  3. Run with CONCURRENT_REQUESTS = 4 and PLAYWRIGHT_MAX_CONTEXTS = 4. Note throughput vs a pure-static spider.

The exercise is to feel the latency cost first-hand. It's why "use Scrapy first, browser only when needed" is the right default.

