
The Three Layers of Modern Web Data (HTML, XHR, Mobile)

Every site exposes data through up to three layers. Knowing which layer to target, and the trade-offs of each, is the senior scraper's first decision.

What you’ll learn

  • Distinguish the HTML, XHR/API, and mobile-app data layers.
  • List the trade-offs (stability, parseability, auth complexity) of each.
  • Pick the right layer for a given target.
  • Inspect Catalog108's three-layer surface area.

A site doesn't have one source of data; it has up to three, each a layer over the same underlying database. Picking the right layer is the senior-level decision that separates a scraper that ships in a day from one that takes a week and breaks on every redesign.

Layer 1: Rendered HTML

The web page a browser shows. Server-rendered or client-rendered, the final markup is what your browser's Elements panel shows.

  • Discovery cost: zero. View source, write selectors, done.
  • Stability: worst. Class names change, layouts shift, DOM moves every release.
  • Auth complexity: low. Cookies + a session is usually all you need.
  • Best for: static sites with no XHR layer, prototype scrapers, sites with very stable markup (Wikipedia, government portals), one-off academic projects.
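A minimal Layer-1 sketch using only the standard library: pull product names out of server-rendered markup with an HTMLParser subclass. The sample HTML and the "product-tile" class are hypothetical stand-ins for whatever the target actually serves.

```python
from html.parser import HTMLParser

# Toy server-rendered markup; a real page would be far messier.
SAMPLE_HTML = """
<div class="product-tile"><h2>Widget A</h2></div>
<div class="product-tile"><h2>Widget B</h2></div>
"""

class ProductParser(HTMLParser):
    """Collects the text content of every <h2> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.names.append(data.strip())

parser = ProductParser()
parser.feed(SAMPLE_HTML)
print(parser.names)  # ['Widget A', 'Widget B']
```

The fragility noted above lives in this code: rename the tag or restructure the tree and the parser silently returns nothing.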

Layer 2: XHR / API (the JSON layer)

The internal JSON API the site uses to populate itself. Visible in DevTools → Network → Fetch/XHR.

  • Discovery cost: medium. You need to find the right endpoint, decode auth, replicate headers. The rest of this sub-path is dedicated to this layer.
  • Stability: good. APIs change less often than markup; when they do, the breakage is typically additive (new optional fields) rather than destructive.
  • Auth complexity: medium to high. JWT, OAuth, signed requests, and CSRF all live here.
  • Best for: anything modern and JavaScript-heavy. SPAs, Next.js sites, dashboards, e-commerce, social feeds.

This is where 80% of professional scraping happens.
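Calling a Layer-2 endpoint usually means replicating the headers the page itself sends. A hedged sketch of building such a request with the standard library; the exact header set a target requires is discovered in DevTools, not guessed, and the `X-Requested-With` value here is just a common convention.

```python
import urllib.request

# Construct (but don't yet send) a request shaped like the page's own XHR.
req = urllib.request.Request(
    "https://practice.scrapingcentral.com/api/products",
    headers={
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",  # many backends gate on this
        "Referer": "https://practice.scrapingcentral.com/products",
    },
)

# To actually fetch:
# import json
# data = json.load(urllib.request.urlopen(req, timeout=10))

print(req.get_header("Accept"))  # application/json
```

Separating request construction from sending makes it easy to diff your headers against the browser's before you ever hit the server.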

Layer 3: Mobile-app API

The endpoints the mobile app talks to. Often the same backend as the web API, but with different conventions: usually simpler auth (long-lived tokens or signed requests), no CSRF, no browser-specific headers, and sometimes a fully different schema.

  • Discovery cost: high. You need a proxy (mitmproxy, Charles, Proxyman), root/jailbreak or a specially configured emulator, sometimes SSL-pinning bypasses (Sub-Path 3, lessons 47–48).
  • Stability: very good. Mobile apps update slowly; their APIs are kept stable for backward compatibility with old app versions.
  • Auth complexity: variable. Sometimes simpler (one API key signed into a request), sometimes worse (certificate pinning).
  • Best for: when the web API is locked down by an anti-bot service, when the mobile API exposes data the web doesn't, when you need long-lived tokens.

Famous examples: Instagram, Twitter/X, Reddit, Uber, and DoorDash all have private mobile APIs that scrapers prefer because the web equivalents are heavily fingerprinted.
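One common mobile-API pattern is the signed request: the app holds a baked-in secret and signs each request so the backend can verify it came from the app. A hedged sketch of the shape; the key, path, and message format here are all hypothetical and would be discovered per-target by proxying the app's traffic.

```python
import hmac
import hashlib

# Hypothetical signing key, of the kind extracted from a decompiled app.
APP_KEY = b"hypothetical-key-extracted-from-the-app"

def sign(path: str, ts: int) -> str:
    """HMAC-SHA256 over path and timestamp, hex-encoded."""
    msg = f"{path}:{ts}".encode()
    return hmac.new(APP_KEY, msg, hashlib.sha256).hexdigest()

ts = 1700000000
sig = sign("/api/v1/products", ts)
# The signature would typically travel in a header like X-Signature.
print(len(sig))  # 64-character hex SHA-256 digest
```

The exact message the app signs (path, body, timestamp, device ID, some salt) varies wildly; reverse-engineering it is the hard part, not the HMAC call.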

Trade-off matrix

Layer      | Discovery | Stability | Auth     | When to choose
HTML       | Easy      | Bad       | Easy     | Pure SSR sites, no XHR available
XHR / API  | Medium    | Good      | Medium   | Default for modern sites
Mobile API | Hard      | Best      | Variable | Web layer is locked down

How to inspect each layer on Catalog108

Catalog108 exposes all three layers explicitly for practice.

Layer 1, HTML:

curl https://practice.scrapingcentral.com/products | head -100

You'll see the full SSR HTML with embedded <script id="__NEXT_DATA__"> and a hydrated React tree.
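When a Next.js page embeds __NEXT_DATA__, Layer 1 already carries the Layer-2 payload, so you can skip selectors entirely and parse the JSON. A minimal extraction sketch against a toy HTML string; the real page's JSON shape will differ.

```python
import json
import re

# Toy stand-in for a server-rendered Next.js page.
html = (
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"products": [{"name": "Widget"}]}}'
    "</script>"
)

# Grab the script body and hand it straight to the JSON parser.
m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.S)
data = json.loads(m.group(1))
print(data["props"]["products"][0]["name"])  # Widget
```

This hybrid (fetch HTML, parse embedded JSON) gets you Layer-2 cleanliness with Layer-1 discovery cost.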

Layer 2, XHR:

Visit /products in a browser, open Network → Fetch/XHR, look for /api/products:

curl https://practice.scrapingcentral.com/api/products

Returns clean JSON. Same data, no markup. This is where you'll spend Sub-Path 3.

Layer 3, mobile-style:

Catalog108 doesn't ship a real mobile app, but it exposes endpoints that mimic mobile-style auth: long-lived bearer tokens, no cookie session, no CSRF:

curl -H "Authorization: Bearer <token>" \
  https://practice.scrapingcentral.com/api/products

The lessons on JWT, HMAC, and OAuth all use this mobile-shaped pattern.

The decision rule

When you sit down with a new target, the order is:

  1. Check XHR first. Reload with Network → Fetch/XHR open. See JSON? Done, go after that.
  2. If no XHR, check HTML. Pure SSR? Fine, use the static-scraping toolkit.
  3. If XHR is locked down (heavy auth, fingerprinting, CAPTCHAs): consider the mobile API. Set up mitmproxy on a test phone, capture, replicate.
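The triage above can be sketched as a small classifier over the response Content-Type. It's written as a pure function so it's easy to test; wiring it to a live request is a one-liner with urllib or requests, and the "unknown" branch (protobuf, msgpack, and other binary formats) is a heuristic hint, not a guarantee, that you're looking at a mobile-shaped API.

```python
def classify_layer(content_type: str) -> str:
    """Rough first-pass triage of an endpoint by its Content-Type."""
    ct = content_type.lower()
    if "json" in ct:
        return "xhr"      # Layer 2: go straight for the JSON
    if "html" in ct:
        return "html"     # Layer 1: static toolkit, or keep hunting for XHR
    return "unknown"      # binary formats often mean a mobile-style API

print(classify_layer("application/json; charset=utf-8"))  # xhr
print(classify_layer("text/html"))                        # html
print(classify_layer("application/x-protobuf"))           # unknown
```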

You almost never start by writing CSS selectors against rendered HTML in 2026 unless the other two layers have been ruled out.

A worked example

Imagine you want to scrape a retailer's product catalog.

  • Junior path: open Chrome, view source, find <div class="product-tile">, write a BeautifulSoup loop, deploy, watch it break next Tuesday when marketing changes the class names.
  • Senior path: open Network → Fetch/XHR, find /api/v2/products?store=123&page=1, copy as cURL, replicate. The endpoint accepts a ?per_page=200 parameter that the site never uses. One request gets you 200 products. Twenty requests get you 4,000. The scraper runs for a year without touching a class name.

That's the value of layer-awareness.
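The senior path's batching can be sketched in a few lines: page through the endpoint at the largest page size it accepts. The URL and parameter names come from the worked example above, not from a real API.

```python
from urllib.parse import urlencode

# Hypothetical endpoint from the worked example.
BASE = "https://example-retailer.com/api/v2/products"

def page_urls(total: int, per_page: int = 200, store: int = 123):
    """Yield one URL per page, using ceiling division for the page count."""
    pages = (total + per_page - 1) // per_page
    for page in range(1, pages + 1):
        yield f"{BASE}?{urlencode({'store': store, 'per_page': per_page, 'page': page})}"

urls = list(page_urls(total=4000))
print(len(urls))  # 20 requests for 4,000 products
```

Compare: at the site's default page size the same catalog would take far more requests, each one another chance to trip rate limits.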

Python snippet: feeling all three layers

import requests

# Layer 1: HTML
html = requests.get("https://practice.scrapingcentral.com/products").text
print("HTML length:", len(html))

# Layer 2: JSON XHR
api = requests.get("https://practice.scrapingcentral.com/api/products").json()
print("Products:", len(api["products"]))

# Layer 3: authenticated, mobile-shaped
token = requests.post(
  "https://practice.scrapingcentral.com/api/auth/login",
  json={"email": "student@practice.scrapingcentral.com", "password": "practice123"},
).json()["access_token"]
me = requests.get(
  "https://practice.scrapingcentral.com/api/auth/me",
  headers={"Authorization": f"Bearer {token}"},
).json()
print("Authenticated as:", me)

The PHP equivalent uses Guzzle with the same three calls. You'll see all three layers in the next several lessons.

Hands-on lab

Open /products in your browser. Use View Source to inspect Layer 1 and Network → Fetch/XHR to inspect Layer 2. For Layer 3, hit /api/auth/login with the demo credentials and use the bearer token to call /api/auth/me. Note how much cleaner Layers 2 and 3 are than parsing the SSR HTML; this is the shift the rest of the sub-path teaches you to make instinctively.
