Capturing XHR / Fetch Calls the Page Makes
The defining browser-automation pattern: drive the page just enough to discover the underlying API, then bypass the browser entirely. This is how production scrapers get fast.
What you’ll learn
- Capture every XHR/Fetch request the page makes with Playwright's `request`/`response` events.
- Filter to the calls that return the data you actually want.
- Extract request headers, auth tokens, and payload shapes for replay.
- Promote the captured call into a pure HTTP scraper.
This is the single most valuable browser-automation pattern: use the browser to discover the underlying API, then drop the browser and scrape with requests. You pay the browser cost once during reconnaissance; production runs at HTTP speed. Most experienced scrapers spend their first hour on any new target doing exactly this.
The mindset
Modern sites are usually three pieces stacked:
- A skeleton HTML page.
- JavaScript that fetches data.
- The data, as JSON, from an internal API.
Your scraper wants piece 3. Driving the browser to render piece 1 + run piece 2 is the slow path. Hitting piece 3 directly is the fast path. The browser is the discovery tool.
Playwright's network events
Three events expose every network call:
```python
page.on("request", lambda r: print("→", r.method, r.url))
page.on("response", lambda r: print("←", r.status, r.url))
page.on("requestfinished", lambda r: print("✓", r.url))
```
`request` fires when a request is issued. `response` fires when headers come back. `requestfinished` fires when the body has fully arrived. For scraping, `response` is usually what you want: it carries the status and lets you read the body.
Filtering and collecting
You almost always want a subset. Filter on the URL pattern:
```python
captured = []

def on_response(response):
    if "/api/" in response.url and response.status == 200:
        try:
            captured.append({
                "url": response.url,
                "method": response.request.method,
                "headers": response.request.headers,
                "data": response.json(),
            })
        except Exception:
            pass  # not JSON, ignore

page.on("response", on_response)
page.goto("https://practice.scrapingcentral.com/locations")
page.wait_for_load_state("networkidle")  # let all XHRs fire

for c in captured:
    print(c["method"], c["url"], "→", len(c["data"]), "items")
```
After the page loads, `captured` contains every JSON response from `/api/`. Inspect them; one will be the data you want.
networkidle is dangerous on streaming-heavy pages (Lesson 2.9), but for "fire once and finish" XHR-driven pages, it's the right primitive.
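The `/api/` check in the handler can be pulled out into a plain predicate, which keeps the filtering logic testable without launching a browser. A sketch (the function name and the content-type check are my additions, not part of the collector above):

```python
def is_api_json(url, status, content_type):
    """Heuristic: keep successful JSON responses from internal API paths."""
    return (
        "/api/" in url
        and status == 200
        and "application/json" in content_type
    )
```

Inside `on_response` you would call `is_api_json(response.url, response.status, response.headers.get("content-type", ""))` before appending, replacing the bare URL check.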
The expect_response shortcut
For a single, known endpoint:
```python
with page.expect_response("**/api/locations*") as resp_info:
    page.goto("https://practice.scrapingcentral.com/locations")

response = resp_info.value
data = response.json()
print(len(data["locations"]), "locations")
```
expect_response returns the first response matching the pattern. The with block executes your trigger (goto); the response is available after exit.
Use expect_response when you already know the URL pattern. Use the on("response"...) collector when you're still in discovery mode and want to see all the calls.
Extracting auth tokens
Modern APIs often require auth headers: `Authorization: Bearer ...`, `X-CSRF-Token: ...`, `X-Requested-With: XMLHttpRequest`. Capture them from a real call:
```python
def grab_auth(response):
    headers = response.request.headers
    if "authorization" in headers:
        print("Auth:", headers["authorization"])
    if "x-csrf-token" in headers:
        print("CSRF:", headers["x-csrf-token"])

page.on("response", grab_auth)
page.goto(url)
```
Now you have the tokens to reproduce the call with requests. Some tokens are session-scoped (rotated per page load) and some are user-scoped (tied to a login). Either way, capture them once via the browser and feed them to your HTTP scraper.
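A small helper (the name and the header whitelist are mine) that copies only the auth-relevant headers from a captured browser request into the header set you'll replay over plain HTTP:

```python
def merge_auth(base_headers, captured_headers):
    """Overlay the auth-bearing headers from a captured browser request
    onto the base header set used for HTTP replay."""
    wanted = {"authorization", "x-csrf-token", "x-requested-with"}
    merged = dict(base_headers)
    merged.update(
        {k: v for k, v in captured_headers.items() if k.lower() in wanted}
    )
    return merged
```

Copying only the whitelisted headers avoids dragging browser-managed ones (`host`, `content-length`, `cookie`-adjacent pseudo-headers) into your HTTP client, where they can cause rejections.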
Promoting to HTTP
The promotion workflow:
- Drive the page with Playwright; capture the responses.
- Pick the response containing your data.
- Build a `requests` call replicating the request URL, method, headers, and (if POST) body.
- Verify the bare `requests` call returns the same data.
- Delete the browser code.
Pseudo-code:
```python
import requests

# Captured during reconnaissance:
url = "https://practice.scrapingcentral.com/api/locations?bbox=-122.5,37.7,-122.3,37.8"
headers = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "application/json",
    "X-Requested-With": "XMLHttpRequest",
}

r = requests.get(url, headers=headers)
r.raise_for_status()
data = r.json()
```
If data matches what you captured, you're done. Browser retired. Production scraper runs at HTTP speed.
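A quick sanity check before deleting the browser code: compare the promoted response against the capture. Normalising both sides through JSON makes key order irrelevant (helper name is mine):

```python
import json

def same_payload(captured, replayed):
    """True when the bare-HTTP replay returned the same structure
    as the browser capture, ignoring dict key order."""
    return json.dumps(captured, sort_keys=True) == json.dumps(replayed, sort_keys=True)
```

Note that list order still matters here, which is usually what you want: a reordered result set can indicate the server treated the two requests differently.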
When the API isn't reproducible
Three reasons the captured call won't replay:
- Anti-replay token. A short-lived nonce in the request body or header that the server invalidates after one use. Walk back to the call that produced the nonce; if reproducible, replay both in sequence. Sub-Path 4 covers reverse-engineering token generation.
- Server-side TLS/JS fingerprinting. The server checks your client fingerprint (TLS handshake, browser JS state) and rejects raw `requests` even with the right headers. Sub-Path 5.
- Signed URLs with browser-specific keys. The URL itself was signed by JS that ran in the page. You can either reverse the signing logic or keep using the browser for that one call.
In each case, you may still be able to promote part of the scrape: drive a browser for the auth handshake, then use HTTP for everything else.
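One hybrid sketch in that spirit: let the browser complete the login, then hand its cookies to `requests`. Playwright's `context.cookies()` returns a list of dicts; the helper below (name mine, usage lines illustrative) flattens it into the mapping `requests` expects:

```python
def cookies_for_requests(playwright_cookies):
    """Flatten Playwright's context.cookies() output (a list of dicts
    with 'name'/'value' keys, among others) into a name→value mapping."""
    return {c["name"]: c["value"] for c in playwright_cookies}

# After the browser has done the auth handshake:
# jar = cookies_for_requests(context.cookies())
# requests.get("https://practice.scrapingcentral.com/api/locations", cookies=jar)
```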
Catalog108 walkthrough: /locations
/locations renders a map of physical store locations. It fetches the location list via XHR:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    with page.expect_response("**/api/locations*") as ri:
        page.goto("https://practice.scrapingcentral.com/locations")

    data = ri.value.json()
    for loc in data["locations"][:3]:
        print(loc["name"], loc["lat"], loc["lng"])

    browser.close()
```
Then promote:
```python
import requests

data = requests.get("https://practice.scrapingcentral.com/api/locations").json()
for loc in data["locations"][:3]:
    print(loc["name"], loc["lat"], loc["lng"])
```
Same data, ~30× faster.
Recording for later inspection
Save the full network log as HAR:
```python
context = browser.new_context(record_har_path="trace.har")
page = context.new_page()
page.goto(url)
context.close()  # the HAR file is written on close
```
The `.har` file contains every request and response; load it in DevTools' Network tab or any HAR viewer. Useful when you can't reproduce a bug live and need to inspect what the browser saw.
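HAR is plain JSON (entries live under `log.entries`, each with a `request` and `response` object), so you can also grep it programmatically. A minimal summariser (helper name mine):

```python
import json

def summarize_har(har_text):
    """Return (method, url, status) for every entry in a HAR capture."""
    entries = json.loads(har_text)["log"]["entries"]
    return [
        (e["request"]["method"], e["request"]["url"], e["response"]["status"])
        for e in entries
    ]
```

Run it over `trace.har` and filter for `"/api/"` to get the same discovery list as the live `on("response")` collector, but offline.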
Hands-on lab
Open /locations. Use Playwright to capture every XHR, print the URLs and statuses. Find the call that returns the location list. Extract the request headers. Then write a pure requests script that reproduces the call without a browser. Time both. The HTTP version should run in 100-200ms vs 1-2 seconds for the browser version. That speedup is what this lesson is for.
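For the timing comparison, a tiny harness is enough; `scrape_with_requests` below is a stand-in for whichever scraper function you wrote:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result

# elapsed, data = timed(scrape_with_requests)  # vs. timed(scrape_with_browser)
```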