
Lesson 2.23 · Intermediate · 5 min read

Capturing XHR / Fetch Calls the Page Makes

The defining browser-automation pattern: drive the page just enough to discover the underlying API, then bypass the browser entirely. This is how production scrapers get fast.

What you’ll learn

  • Capture every XHR/Fetch request the page makes with Playwright's `request`/`response` events.
  • Filter to the calls that return the data you actually want.
  • Extract request headers, auth tokens, and payload shapes for replay.
  • Promote the captured call into a pure HTTP scraper.

This is the single most valuable browser-automation pattern: use the browser to discover the underlying API, then drop the browser and scrape with requests. You pay the browser cost once during reconnaissance; production runs at HTTP speed. Most experienced scrapers spend their first hour on any new target doing exactly this.

The mindset

Modern sites are usually three pieces stacked:

  1. A skeleton HTML page.
  2. JavaScript that fetches data.
  3. The data, as JSON, from an internal API.

Your scraper wants piece 3. Driving the browser to render piece 1 + run piece 2 is the slow path. Hitting piece 3 directly is the fast path. The browser is the discovery tool.

Playwright's network events

Three events expose every network call:

page.on("request", lambda r: print("→", r.method, r.url))
page.on("response", lambda r: print("←", r.status, r.url))
page.on("requestfinished", lambda r: print("✓", r.url))

request fires when a request is issued. response fires when headers come back. requestfinished fires when the body has fully arrived. For scraping, response is usually what you want: it carries the status and lets you read the body.

Filtering and collecting

You almost always want a subset. Filter on the URL pattern:

captured = []

def on_response(response):
  if "/api/" in response.url and response.status == 200:
    try:
      captured.append({
        "url": response.url,
        "method": response.request.method,
        "headers": response.request.headers,
        "data": response.json(),
      })
    except Exception:
      pass  # not JSON, ignore

page.on("response", on_response)

page.goto("https://practice.scrapingcentral.com/locations")
page.wait_for_load_state("networkidle")  # let all XHRs fire

for c in captured:
  print(c["method"], c["url"], "→", len(c["data"]), "items")

After the page loads, captured contains every JSON response from /api/. Inspect them; one will be the data you want.

networkidle is dangerous on streaming-heavy pages (Lesson 2.9), but for "fire once and finish" XHR-driven pages, it's the right primitive.
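If you'd rather filter by call type than URL substring, Playwright exposes the originating request's resource_type, which is "xhr" for XMLHttpRequest calls and "fetch" for fetch() calls. A minimal discovery-mode sketch:

def log_api_calls(response):
  # resource_type distinguishes data calls from documents, images, scripts.
  if response.request.resource_type in ("xhr", "fetch"):
    print(response.status, response.request.method, response.url)

page.on("response", log_api_calls)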

The expect_response shortcut

For a single, known endpoint:

with page.expect_response("**/api/locations*") as resp_info:
  page.goto("https://practice.scrapingcentral.com/locations")

response = resp_info.value
data = response.json()
print(len(data["locations"]), "locations")

expect_response returns the first response matching the pattern. The with block executes your trigger (goto); the response is available after exit.

Use expect_response when you already know the URL pattern. Use the on("response"...) collector when you're still in discovery mode and want to see all the calls.

Extracting auth tokens

Modern APIs require auth-related headers such as Authorization: Bearer ..., X-CSRF-Token: ..., or X-Requested-With: XMLHttpRequest. Capture them from a real call:

def grab_auth(response):
  headers = response.request.headers
  if "authorization" in headers:
    print("Auth:", headers["authorization"])
  if "x-csrf-token" in headers:
    print("CSRF:", headers["x-csrf-token"])

page.on("response", grab_auth)
page.goto(url)

Now you have the tokens to reproduce the call with requests. Some tokens are session-scoped (rotated per page load) and some are user-scoped (tied to a login). Either way, capture them once via the browser and feed them to your HTTP scraper.
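Once captured, the tokens slot straight into a requests session. A minimal sketch; the header values are placeholders for whatever grab_auth printed, and the endpoint is the one from this lesson's walkthrough:

import requests

session = requests.Session()
session.headers.update({
  "Authorization": "Bearer <token captured above>",  # placeholder
  "X-CSRF-Token": "<csrf token captured above>",     # placeholder
  "X-Requested-With": "XMLHttpRequest",
})

r = session.get("https://practice.scrapingcentral.com/api/locations")
r.raise_for_status()
print(len(r.json()["locations"]), "locations")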

Promoting to HTTP

The promotion workflow:

  1. Drive the page with Playwright; capture the responses.
  2. Pick the response containing your data.
  3. Build a requests call replicating the request URL, method, headers, and (if POST) body.
  4. Verify the bare requests call returns the same data.
  5. Delete the browser code.

Pseudo-code:

import requests

# Captured during reconnaissance:
url = "https://practice.scrapingcentral.com/api/locations?bbox=-122.5,37.7,-122.3,37.8"
headers = {
  "User-Agent": "Mozilla/5.0 ...",
  "Accept": "application/json",
  "X-Requested-With": "XMLHttpRequest",
}

r = requests.get(url, headers=headers)
r.raise_for_status()
data = r.json()

If data matches what you captured, you're done. Browser retired. Production scraper runs at HTTP speed.
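If the captured call was a POST, replicate its body as well; requests' json= keyword mirrors a JSON payload. A hedged sketch, where the search endpoint and payload shape are hypothetical, not part of the walkthrough above:

# Hypothetical POST replay: URL, headers, and body copied from the capture.
payload = {"bbox": [-122.5, 37.7, -122.3, 37.8], "limit": 100}
r = requests.post(
  "https://practice.scrapingcentral.com/api/locations/search",  # hypothetical
  headers=headers,
  json=payload,
)
r.raise_for_status()
data = r.json()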

When the API isn't reproducible

Three reasons the captured call won't replay:

  1. Anti-replay token. A short-lived nonce in the request body or header that the server invalidates after one use. Walk back to the call that produced the nonce; if reproducible, replay both in sequence. Sub-Path 4 covers reverse-engineering token generation.
  2. Server-side TLS/JS fingerprinting. The server checks your client fingerprint (TLS handshake, browser JS state) and rejects raw requests even with the right headers. Sub-Path 5.
  3. Signed URLs with browser-specific keys. The URL itself was signed by JS that ran in the page. You can either reverse the signing logic or keep using the browser for that one call.

In each case, you may still be able to promote part of the scrape: drive a browser for the auth handshake, then use HTTP for everything else.
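For case 1, the sequence can often still run over plain HTTP. A sketch under loud assumptions: the nonce endpoint, response shape, and header name are all hypothetical:

import requests

session = requests.Session()

# Step 1: replay the call that issues the nonce (hypothetical endpoint/shape).
nonce = session.get("https://example.com/api/nonce").json()["nonce"]

# Step 2: spend the nonce on the data call before the server invalidates it.
r = session.get("https://example.com/api/data", headers={"X-Nonce": nonce})
r.raise_for_status()
data = r.json()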

Catalog108 walkthrough: /locations

/locations renders a map of physical store locations. It fetches the location list via XHR:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
  browser = p.chromium.launch()
  page = browser.new_page()

  with page.expect_response("**/api/locations*") as ri:
    page.goto("https://practice.scrapingcentral.com/locations")

  data = ri.value.json()
  for loc in data["locations"][:3]:
    print(loc["name"], loc["lat"], loc["lng"])

  browser.close()

Then promote:

import requests
data = requests.get("https://practice.scrapingcentral.com/api/locations").json()
for loc in data["locations"][:3]:
  print(loc["name"], loc["lat"], loc["lng"])

Same data, ~30× faster.

Recording for later inspection

Save the full network log as HAR:

context = browser.new_context(record_har_path="trace.har")
page = context.new_page()
page.goto(url)
context.close()

The .har file has every request and response; load it in DevTools' Network tab or any HAR viewer. Useful when you can't reproduce a bug live and need to inspect what the browser saw.
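A HAR file is plain JSON, so you can also grep it programmatically. A minimal sketch that lists the recorded API calls:

import json

with open("trace.har") as f:
  har = json.load(f)

# Each HAR entry pairs one request with its response.
for entry in har["log"]["entries"]:
  req, resp = entry["request"], entry["response"]
  if "/api/" in req["url"]:
    print(resp["status"], req["method"], req["url"])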

Hands-on lab

Open /locations. Use Playwright to capture every XHR, print the URLs and statuses. Find the call that returns the location list. Extract the request headers. Then write a pure requests script that reproduces the call without a browser. Time both. The HTTP version should run in 100-200ms vs 1-2 seconds for the browser version. That speedup is what this lesson is for.
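One way to time the comparison; scrape_with_playwright and scrape_with_requests stand in for the two scripts you just wrote:

import time

def timed(label, fn):
  # Rough wall-clock timing; enough to see the order-of-magnitude gap.
  start = time.perf_counter()
  fn()
  print(f"{label}: {time.perf_counter() - start:.3f}s")

timed("browser", scrape_with_playwright)  # your Playwright version
timed("http", scrape_with_requests)       # your requests version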

