The Three Layers of Modern Web Data (HTML, XHR, Mobile)
Every site exposes data through up to three layers. Knowing which layer to target, and the trade-offs of each, is the senior scraper's first decision.
What you’ll learn
- Distinguish the HTML, XHR/API, and mobile-app data layers.
- List the trade-offs (stability, parseability, auth complexity) of each.
- Pick the right layer for a given target.
- Inspect Catalog108's three-layer surface area.
A site doesn't have one source of data; it has up to three, each a layer over the same underlying database. Picking the right layer is the senior-level decision that separates a scraper that ships in a day from one that takes a week and breaks on every redesign.
Layer 1: Rendered HTML
The page as the browser displays it. Whether server-rendered or client-rendered, the final markup is what the Elements panel shows.
- Discovery cost: zero. View source, write selectors, done.
- Stability: worst. Class names change, layouts shift, DOM moves every release.
- Auth complexity: low. Cookies and a session are usually all you need.
- Best for: static sites with no XHR layer, prototype scrapers, sites with very stable markup (Wikipedia, government portals), one-off academic projects.
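Layer-1 scraping can be sketched with nothing but the standard library. The class name `product-tile` below is a hypothetical example, and that's precisely the point: it's the kind of detail a redesign silently breaks.

```python
from html.parser import HTMLParser

# Minimal Layer-1 sketch using only the stdlib. The "product-tile" class
# is an assumed, illustrative selector -- exactly the brittle detail that
# changes when the site ships a redesign.
class ProductTileParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_tile = False
        self.products = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "div" and ("class", "product-tile") in attrs:
            self.in_tile = True

    def handle_data(self, data):
        if self.in_tile and data.strip():
            self.products.append(data.strip())
            self.in_tile = False

html = '<div class="product-tile">Widget A</div><div class="product-tile">Widget B</div>'
parser = ProductTileParser()
parser.feed(html)
print(parser.products)  # ['Widget A', 'Widget B']
```

In practice you'd reach for BeautifulSoup or lxml instead, but the failure mode is identical: the scraper is coupled to markup, not to data.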
Layer 2: XHR / API (the JSON layer)
The internal JSON API the site uses to populate itself. Visible in DevTools → Network → Fetch/XHR.
- Discovery cost: medium. You need to find the right endpoint, decode auth, replicate headers. The rest of this sub-path is dedicated to this layer.
- Stability: good. APIs change less often than markup; when they do, the breakage is typically additive (new optional fields) rather than destructive.
- Auth complexity: medium to high. JWT, OAuth, signed requests, and CSRF all live here.
- Best for: anything modern and JavaScript-heavy. SPAs, Next.js sites, dashboards, e-commerce, social feeds.
This is where the bulk of professional scraping happens.
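Replicating a discovered XHR call usually means rebuilding its query string and headers from what DevTools' "Copy as cURL" shows. A minimal sketch, assuming an illustrative endpoint and the two headers that commonly gate JSON responses:

```python
from urllib.parse import urlencode

# Sketch of replicating an XHR endpoint discovered in DevTools.
# The endpoint path, parameter names, and header set are assumptions
# modeled on a typical JSON API, not a documented contract.
BASE = "https://practice.scrapingcentral.com/api/products"

def build_request(page: int, per_page: int = 50):
    params = urlencode({"page": page, "per_page": per_page})
    headers = {
        "Accept": "application/json",            # many servers key the JSON layer on this
        "X-Requested-With": "XMLHttpRequest",    # common for XHR calls, not universal
    }
    return f"{BASE}?{params}", headers

url, headers = build_request(1)
print(url)
```

Constructing the request separately from sending it keeps the auth/header logic testable without hitting the network.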
Layer 3: Mobile-app API
The endpoints the mobile app talks to. Often the same backend as the web API, but with different conventions: usually simpler auth (long-lived tokens or signed requests), no CSRF, no browser-specific headers, and sometimes a fully different schema.
- Discovery cost: high. You need a proxy (mitmproxy, Charles, Proxyman), root/jailbreak or a specially configured emulator, sometimes SSL-pinning bypasses (Sub-Path 3, lessons 47–48).
- Stability: very good. Mobile apps update slowly; their APIs are kept stable for backward compatibility with old app versions.
- Auth complexity: variable. Sometimes simpler (one API key signed into a request), sometimes worse (certificate pinning).
- Best for: when the web API is locked down by an anti-bot service, when the mobile API exposes data the web doesn't, when you need long-lived tokens.
Famous examples: Instagram, Twitter/X, Reddit, Uber, and DoorDash all have private mobile APIs that scrapers prefer because the web equivalents are heavily fingerprinted.
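Mobile APIs often replace browser-style auth with request signing. A sketch of one hypothetical scheme, where the app signs `METHOD|PATH|TIMESTAMP` with a key extracted from the app package (the scheme, key, and header names here are all invented for illustration):

```python
import hashlib
import hmac

# Hypothetical mobile-style request signing: HMAC-SHA256 over a canonical
# string, keyed with a secret baked into the app. Every name here is an
# illustrative assumption, not a real scheme.
APP_KEY = b"demo-key-extracted-from-app"  # placeholder, not a real key

def sign(method: str, path: str, ts: int) -> str:
    msg = f"{method}|{path}|{ts}".encode()
    return hmac.new(APP_KEY, msg, hashlib.sha256).hexdigest()

ts = 1700000000  # fixed timestamp so the signature is reproducible
sig = sign("GET", "/api/products", ts)
headers = {"X-Timestamp": str(ts), "X-Signature": sig}
print(len(sig))  # 64 hex chars for SHA-256
```

This is why mobile auth can be *simpler* than the web layer: once you've recovered the key and the canonical string, there's no CSRF token or cookie dance to replay.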
Trade-off matrix
| Layer | Discovery | Stability | Auth | When to choose |
|---|---|---|---|---|
| HTML | Easy | Bad | Easy | Pure SSR sites, no XHR available |
| XHR / API | Medium | Good | Medium | Default for modern sites |
| Mobile API | Hard | Best | Variable | Web layer is locked down |
How to inspect each layer on Catalog108
Catalog108 exposes all three layers explicitly for practice.
Layer 1, HTML:
```shell
curl https://practice.scrapingcentral.com/products | head -100
```
You'll see the full SSR HTML with embedded <script id="__NEXT_DATA__"> and a hydrated React tree.
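When the SSR HTML embeds a `__NEXT_DATA__` script, you can often skip CSS selectors entirely and pull the JSON straight out of the markup. A sketch with stdlib `re` and `json`, using a stand-in for the real page:

```python
import json
import re

# Sketch: extracting the embedded Next.js state instead of parsing the DOM.
# The html string below is a minimal stand-in for the real SSR page; the
# "props.pageProps.products" path is an assumption about its shape.
html = ('<html><script id="__NEXT_DATA__" type="application/json">'
        '{"props":{"pageProps":{"products":[{"name":"Widget"}]}}}'
        '</script></html>')

m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.S)
data = json.loads(m.group(1))
print(data["props"]["pageProps"]["products"])  # [{'name': 'Widget'}]
```

This is a useful halfway house: Layer-1 transport, Layer-2 parseability.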
Layer 2, XHR:
Visit /products in a browser, open Network → Fetch/XHR, look for /api/products:
```shell
curl https://practice.scrapingcentral.com/api/products
```
Returns clean JSON. Same data, no markup. This is where you'll spend Sub-Path 3.
Layer 3, mobile-style:
Catalog108 doesn't ship a real mobile app, but it exposes endpoints that mimic mobile-style auth: long-lived bearer tokens, no cookie session, no CSRF:
```shell
curl -H "Authorization: Bearer <token>" \
  https://practice.scrapingcentral.com/api/products
```
The lessons on JWT, HMAC, and OAuth all use this mobile-shaped pattern.
The decision rule
When you sit down with a new target, the order is:
- Check XHR first. Reload with Network → Fetch/XHR open. See JSON? Done, go after that.
- If no XHR, check HTML. Pure SSR? Fine, use the static-scraping toolkit.
- If XHR is locked down (heavy auth, fingerprinting, CAPTCHAs): consider the mobile API. Set up mitmproxy on a test phone, capture, replicate.
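The decision order above fits in a tiny function. The booleans stand in for what you'd actually observe in DevTools or a proxy; the names are illustrative:

```python
# The three-step decision rule as code. Inputs are stand-ins for what you
# observe during recon: "did Network -> Fetch/XHR show JSON?" and "is that
# JSON layer gated by heavy auth / fingerprinting / CAPTCHAs?"
def pick_layer(has_xhr: bool, xhr_locked_down: bool) -> str:
    if has_xhr and not xhr_locked_down:
        return "xhr"     # JSON endpoint visible and reachable: target it
    if not has_xhr:
        return "html"    # pure SSR: fall back to the static-scraping toolkit
    return "mobile"      # XHR exists but is locked down: go after the mobile API

print(pick_layer(True, False))   # xhr
print(pick_layer(False, False))  # html
print(pick_layer(True, True))    # mobile
```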
You almost never start by writing CSS selectors against rendered HTML in 2026 unless the other two layers have been ruled out.
A worked example
Imagine you want to scrape a retailer's product catalog.
- Junior path: open Chrome, view source, find `<div class="product-tile">`, write a BeautifulSoup loop, deploy, watch it break next Tuesday when marketing changes the class names.
- Senior path: open Network → Fetch/XHR, find `/api/v2/products?store=123&page=1`, copy as cURL, replicate. The endpoint accepts a `?per_page=200` parameter that the site never uses. One request gets you 200 products. Twenty requests get you 4,000. The scraper runs for a year without touching a class name.
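The senior path's arithmetic can be sketched as a URL generator. The retailer domain and endpoint here are the hypothetical ones from the example above:

```python
from urllib.parse import urlencode

# Sketch of paging the hypothetical /api/v2/products endpoint at 200 items
# per request. "retailer.example" and the parameter names are assumptions
# carried over from the worked example, not a real API.
def page_urls(store: int, total: int, per_page: int = 200):
    pages = -(-total // per_page)  # ceiling division
    for page in range(1, pages + 1):
        qs = urlencode({"store": store, "page": page, "per_page": per_page})
        yield f"https://retailer.example/api/v2/products?{qs}"

urls = list(page_urls(store=123, total=4000))
print(len(urls))  # 20
```

Twenty URLs instead of hundreds of HTML pages, and not a CSS selector in sight.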
That's the value of layer-awareness.
Python snippet: feeling all three layers
```python
import requests

# Layer 1: HTML
html = requests.get("https://practice.scrapingcentral.com/products").text
print("HTML length:", len(html))

# Layer 2: JSON XHR
api = requests.get("https://practice.scrapingcentral.com/api/products").json()
print("Products:", len(api["products"]))

# Layer 3: authenticated, mobile-shaped
token = requests.post(
    "https://practice.scrapingcentral.com/api/auth/login",
    json={"email": "student@practice.scrapingcentral.com", "password": "practice123"},
).json()["access_token"]
me = requests.get(
    "https://practice.scrapingcentral.com/api/auth/me",
    headers={"Authorization": f"Bearer {token}"},
).json()
print("Authenticated as:", me)
```
The PHP equivalent uses Guzzle with the same three calls. You'll see all three layers in the next several lessons.
Hands-on lab
Open /products in your browser. Use View Source to inspect Layer 1 and Network → Fetch/XHR to inspect Layer 2. For Layer 3, hit /api/auth/login with the demo credentials and use the bearer token to call /api/auth/me. Note how much cleaner Layers 2 and 3 are than parsing the SSR HTML; this is the shift the rest of the sub-path teaches you to make instinctively.