Scrape the Data Source, Not the HTML
Most modern sites render from JSON. Hitting the API directly is faster, more reliable, and structurally closer to what the site itself sees.
What you’ll learn
- Explain why scraping rendered HTML is usually the wrong layer to attack.
- Spot when a site is API-backed by inspecting Network → Fetch/XHR.
- Translate a single XHR call into a working Python request against Catalog108.
- Recognise the structural advantages of JSON over parsed HTML.
If you've come through Sub-Path 2, you know how to send a request, parse HTML, and walk a DOM tree. It works. It also leaves a lot on the table.
Every modern website is a thin presentation layer over a JSON API. The HTML you see is generated client-side from JSON, or server-side from the same JSON your browser would have fetched directly. If you scrape the HTML, you're parsing the output of a transformation. If you scrape the API, you read the input, the same data the site reads.
This sub-path is about that shift.
What "API-backed" means in practice
Open practice.scrapingcentral.com/products in your browser. Right-click → Inspect → Network → Fetch/XHR. Reload. You'll see a call to /api/products returning JSON like:
{
"products": [
{"id": 1, "name": "White wooden vase", "price": 24.99, "category": "vases"},
{"id": 2, "name": "Ceramic blue mug", "price": 12.50, "category": "mugs"}
],
"pagination": {"page": 1, "total": 240, "per_page": 12}
}
That's the source of truth. The HTML grid you see is rendered from that JSON. Scraping the HTML grid requires CSS selectors, careful escaping, brittle XPath. Scraping the JSON requires one .json() call.
The five practical wins
- Structure is free. No more guessing which <div class="price-v2"> holds the price. The API returns {"price": 24.99} as a typed number.
- Pagination is explicit. APIs almost always return {"total": 240, "page": 1, "per_page": 12} or a next_cursor. You don't have to detect whether there's a next page; you read it (see the sketch after this list).
- Filters are query params. Want only mugs? ?category=mugs. The HTML site uses the same endpoint with the same parameter.
- It's 5–50x faster. No HTML rendering, no asset fetches, no CSS, no images. Just the data.
- It's stable. HTML markup changes with every redesign. APIs change far less often because they have versioned contracts.
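A minimal Python sketch of the pagination and filter wins against the same endpoint. The query-parameter names (category, page) and the pagination fields mirror the JSON shown above; treat them as assumptions until you confirm them in the Network tab.

import requests

BASE_URL = "https://practice.scrapingcentral.com/api/products"

def fetch_category(category):
    """Collect every product in one category by walking the pagination metadata."""
    products, page = [], 1
    while True:
        # Assumed parameter names; check the real XHR call before relying on them.
        r = requests.get(BASE_URL, params={"category": category, "page": page})
        r.raise_for_status()
        data = r.json()
        products.extend(data["products"])
        meta = data["pagination"]
        # The API states the totals outright; no "is there a next page?" guessing.
        if page * meta["per_page"] >= meta["total"]:
            return products
        page += 1

mugs = fetch_category("mugs")
print(len(mugs), "mugs")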
Why this isn't always obvious
Three reasons people default to HTML-scraping when they shouldn't:
- They don't open Network → Fetch/XHR. The API is right there, one tab away.
- The page is server-side rendered and the JSON isn't visible in the initial HTML. But if there's any interactivity (sorting, filtering, "load more"), the API still exists.
- Auth flows scare them off. JWT, OAuth, signed requests look intimidating. The rest of this sub-path solves all of them.
Your first API scrape
Catalog108 exposes a public REST API. Hit it directly:
import requests

r = requests.get("https://practice.scrapingcentral.com/api/products")
data = r.json()

for p in data["products"]:
    print(p["id"], p["name"], p["price"])
print(f"Total: {data['pagination']['total']}")
Eight lines. No BeautifulSoup, no soup.select('.price'), no edge cases for missing fields. PHP is just as clean:
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
$client = new Client(['base_uri' => 'https://practice.scrapingcentral.com']);
$res = $client->get('/api/products');
$data = json_decode($res->getBody()->getContents(), true);
foreach ($data['products'] as $p) {
    echo "{$p['id']} {$p['name']} {$p['price']}\n";
}
Compare that against the HTML-scraping version you'd have written in Sub-Path 2: fetch the page, parse with BeautifulSoup, loop over .product-card, extract .price with a regex to strip the "$", convert to float. Twenty lines, four failure modes, one redesign away from breaking.
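For contrast, a rough sketch of what that Sub-Path 2 version tends to look like. The .product-card, .name, and .price selectors are illustrative guesses, not the real markup:

import re

import requests
from bs4 import BeautifulSoup

html = requests.get("https://practice.scrapingcentral.com/products").text
soup = BeautifulSoup(html, "html.parser")

for card in soup.select(".product-card"):        # selector guessed from the markup
    name_el = card.select_one(".name")
    price_el = card.select_one(".price")
    if name_el is None or price_el is None:      # failure mode: missing fields
        continue
    match = re.search(r"[\d.]+", price_el.get_text())  # failure mode: "$24.99" -> 24.99
    if match is None:
        continue
    print(name_el.get_text(strip=True), float(match.group()))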
When you can't avoid HTML
API-first isn't always possible:
- Server-side rendered sites with no XHR. Old Rails, classic WordPress, static Jekyll. The data only exists as HTML. Sub-Path 2's tools apply.
- APIs locked behind heavy auth or fingerprinting. Sometimes the cost of reverse-engineering exceeds the cost of just rendering. Pragmatism wins.
- Data assembled from multiple endpoints client-side. Rare, but happens. The HTML might be the cleanest aggregate.
But for roughly 80% of modern sites (anything built with React, Vue, Angular, Next, or Nuxt), the API is the right target.
What this sub-path teaches
50 lessons covering:
- REST API discovery, headers, auth, pagination, rate limits.
- Building robust clients in Python (requests, httpx) and PHP (Guzzle, Symfony HttpClient).
- All major auth flows: cookies, JWT, OAuth 2, CSRF, HMAC, API keys hidden in JS.
- SERP-scraping APIs: what they are, when to use them, and how to compare providers.
- GraphQL, WebSockets, persisted queries.
- Reverse-engineering: reading minified JS, DevTools breakpoints, mitmproxy for mobile, TLS/HTTP/2 fingerprinting.
By the end you'll treat HTML as a fallback, not the default.
Hands-on lab
Open practice.scrapingcentral.com/products in your browser with the Network → Fetch/XHR tab open. Reload. Find the /api/products call. Right-click → Copy as cURL, paste it into your terminal, and confirm you see JSON. Then write the eight-line Python (or the PHP equivalent) that fetches it and prints the first 12 products. You've just done your first API scrape, and you're going to do everything in this sub-path the same way.
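If the copied cURL command carries browser headers, you can replay them from Python too. A minimal sketch; the header values below are placeholders, so copy the real ones from your own Copy as cURL output:

import requests

# Placeholder headers; substitute whatever your browser actually sent
# (visible in the Copy as cURL output).
headers = {
    "User-Agent": "Mozilla/5.0 (copied from your browser)",
    "Accept": "application/json",
}

r = requests.get("https://practice.scrapingcentral.com/api/products", headers=headers)
r.raise_for_status()

for p in r.json()["products"][:12]:
    print(p["id"], p["name"], p["price"])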