Capturing XHR / Fetch Calls the Page Makes
The defining browser-automation pattern: drive the page just enough to discover the underlying API, then bypass the browser entirely. This is how production scrapers get fast.
What you’ll learn
- Capture every XHR/Fetch request the page makes with Playwright's `request`/`response` events.
- Filter to the calls that return the data you actually want.
- Extract request headers, auth tokens, and payload shapes for replay.
- Promote the captured call into a pure HTTP scraper.
This is the single most valuable browser-automation pattern: use the browser to discover the underlying API, then drop the browser and scrape with requests. You pay the browser cost once during reconnaissance; production runs at HTTP speed. Most experienced scrapers spend their first hour on any new target doing exactly this.
The mindset
Modern sites are usually three pieces stacked:
- A skeleton HTML page.
- JavaScript that fetches data.
- The data, as JSON, from an internal API.
Your scraper wants piece 3. Driving the browser to render piece 1 + run piece 2 is the slow path. Hitting piece 3 directly is the fast path. The browser is the discovery tool.
Playwright's network events
Three events expose every network call:
```python
page.on("request", lambda r: print("→", r.method, r.url))
page.on("response", lambda r: print("←", r.status, r.url))
page.on("requestfinished", lambda r: print("✓", r.url))
```
`request` fires when a request is issued. `response` fires when headers come back. `requestfinished` fires when the body has fully arrived. For scraping, `response` is usually what you want: it carries the status and lets you read the body.
Filtering and collecting
You almost always want a subset. Filter on the URL pattern:
```python
captured = []

def on_response(response):
    if "/api/" in response.url and response.status == 200:
        try:
            captured.append({
                "url": response.url,
                "method": response.request.method,
                "headers": response.request.headers,
                "data": response.json(),
            })
        except Exception:
            pass  # not JSON, ignore

page.on("response", on_response)
page.goto("https://practice.scrapingcentral.com/locations")
page.wait_for_load_state("networkidle")  # let all XHRs fire

for c in captured:
    print(c["method"], c["url"], "→", len(c["data"]), "items")
```
After the page loads, `captured` contains every JSON response from `/api/`. Inspect them; one will be the data you want.
networkidle is dangerous on streaming-heavy pages (Lesson 2.9), but for "fire once and finish" XHR-driven pages, it's the right primitive.
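The `/api/` check in the handler can be pulled out into a plain predicate, which keeps the filtering logic testable without launching a browser. A sketch (the function name and the content-type check are my additions, not part of the collector above):

```python
def is_api_json(url, status, content_type):
    """Heuristic: keep successful JSON responses from internal API paths."""
    return (
        "/api/" in url
        and status == 200
        and "application/json" in content_type
    )
```

Inside `on_response` you would call `is_api_json(response.url, response.status, response.headers.get("content-type", ""))` before appending, replacing the bare URL check.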
The expect_response shortcut
For a single, known endpoint:
```python
with page.expect_response("**/api/locations*") as resp_info:
    page.goto("https://practice.scrapingcentral.com/locations")

response = resp_info.value
data = response.json()
print(len(data["locations"]), "locations")
```
expect_response returns the first response matching the pattern. The with block executes your trigger (goto); the response is available after exit.
Use expect_response when you already know the URL pattern. Use the on("response"...) collector when you're still in discovery mode and want to see all the calls.
Extracting auth tokens
Modern APIs often require auth headers: `Authorization: Bearer ...`, `X-CSRF-Token: ...`, `X-Requested-With: XMLHttpRequest`. Capture them from a real call:
```python
def grab_auth(response):
    headers = response.request.headers
    if "authorization" in headers:
        print("Auth:", headers["authorization"])
    if "x-csrf-token" in headers:
        print("CSRF:", headers["x-csrf-token"])

page.on("response", grab_auth)
page.goto(url)
```
Now you have the tokens to reproduce the call with requests. Some tokens are session-scoped (rotated per page load) and some are user-scoped (tied to a login). Either way, capture them once via the browser and feed them to your HTTP scraper.
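A small helper (the name and the header whitelist are mine) that copies only the auth-relevant headers from a captured browser request into the header set you'll replay over plain HTTP:

```python
def merge_auth(base_headers, captured_headers):
    """Overlay the auth-bearing headers from a captured browser request
    onto the base header set used for HTTP replay."""
    wanted = {"authorization", "x-csrf-token", "x-requested-with"}
    merged = dict(base_headers)
    merged.update(
        {k: v for k, v in captured_headers.items() if k.lower() in wanted}
    )
    return merged
```

Copying only the whitelisted headers avoids dragging browser-managed ones (`host`, `content-length`, `cookie`-adjacent pseudo-headers) into your HTTP client, where they can cause rejections.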
Promoting to HTTP
The promotion workflow:
- Drive the page with Playwright; capture the responses.
- Pick the response containing your data.
- Build a `requests` call replicating the request URL, method, headers, and (if POST) body.
- Verify the bare `requests` call returns the same data.
- Delete the browser code.
Pseudo-code:
```python
import requests

# Captured during reconnaissance:
url = "https://practice.scrapingcentral.com/api/locations?bbox=-122.5,37.7,-122.3,37.8"
headers = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "application/json",
    "X-Requested-With": "XMLHttpRequest",
}

r = requests.get(url, headers=headers)
r.raise_for_status()
data = r.json()
```
If data matches what you captured, you're done. Browser retired. Production scraper runs at HTTP speed.
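A quick sanity check before deleting the browser code: compare the promoted response against the capture. Normalising both sides through JSON makes key order irrelevant (helper name is mine):

```python
import json

def same_payload(captured, replayed):
    """True when the bare-HTTP replay returned the same structure
    as the browser capture, ignoring dict key order."""
    return json.dumps(captured, sort_keys=True) == json.dumps(replayed, sort_keys=True)
```

Note that list order still matters here, which is usually what you want: a reordered result set can indicate the server treated the two requests differently.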
When the API isn't reproducible
Three reasons the captured call won't replay:
- Anti-replay token. A short-lived nonce in the request body or header that the server invalidates after one use. Walk back to the call that produced the nonce; if reproducible, replay both in sequence. Sub-Path 4 covers reverse-engineering token generation.
- Server-side TLS/JS fingerprinting. The server checks your client fingerprint (TLS handshake, browser JS state) and rejects raw `requests` even with the right headers. Sub-Path 5.
- Signed URLs with browser-specific keys. The URL itself was signed by JS that ran in the page. You can either reverse the signing logic or keep using the browser for that one call.
In each case, you may still be able to promote part of the scrape: drive a browser for the auth handshake, then use HTTP for everything else.
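One hybrid sketch in that spirit: let the browser complete the login, then hand its cookies to `requests`. Playwright's `context.cookies()` returns a list of dicts; the helper below (name mine, usage lines illustrative) flattens it into the mapping `requests` expects:

```python
def cookies_for_requests(playwright_cookies):
    """Flatten Playwright's context.cookies() output (a list of dicts
    with 'name'/'value' keys, among others) into a name→value mapping."""
    return {c["name"]: c["value"] for c in playwright_cookies}

# After the browser has done the auth handshake:
# jar = cookies_for_requests(context.cookies())
# requests.get("https://practice.scrapingcentral.com/api/locations", cookies=jar)
```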
Catalog108 walkthrough: /locations
/locations renders a map of physical store locations. It fetches the location list via XHR:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    with page.expect_response("**/api/locations*") as ri:
        page.goto("https://practice.scrapingcentral.com/locations")

    data = ri.value.json()
    for loc in data["locations"][:3]:
        print(loc["name"], loc["lat"], loc["lng"])

    browser.close()
```
Then promote:
```python
import requests

data = requests.get("https://practice.scrapingcentral.com/api/locations").json()
for loc in data["locations"][:3]:
    print(loc["name"], loc["lat"], loc["lng"])
```
Same data, ~30× faster.
Recording for later inspection
Save the full network log as HAR:
```python
context = browser.new_context(record_har_path="trace.har")
page = context.new_page()
page.goto(url)
context.close()  # the HAR file is written on close
```
The `.har` file contains every request and response; load it in DevTools' Network tab or any HAR viewer. Useful when you can't reproduce a bug live and need to inspect what the browser saw.
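HAR is plain JSON (entries live under `log.entries`, each with a `request` and `response` object), so you can also grep it programmatically. A minimal summariser (helper name mine):

```python
import json

def summarize_har(har_text):
    """Return (method, url, status) for every entry in a HAR capture."""
    entries = json.loads(har_text)["log"]["entries"]
    return [
        (e["request"]["method"], e["request"]["url"], e["response"]["status"])
        for e in entries
    ]
```

Run it over `trace.har` and filter for `"/api/"` to get the same discovery list as the live `on("response")` collector, but offline.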
Hands-on lab
Open /locations. Use Playwright to capture every XHR, print the URLs and statuses. Find the call that returns the location list. Extract the request headers. Then write a pure requests script that reproduces the call without a browser. Time both. The HTTP version should run in 100-200ms vs 1-2 seconds for the browser version. That speedup is what this lesson is for.
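For the timing comparison, a tiny harness is enough; `scrape_with_requests` below is a stand-in for whichever scraper function you wrote:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result

# elapsed, data = timed(scrape_with_requests)  # vs. timed(scrape_with_browser)
```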