Scrape the Data Source, Not the HTML
Most modern sites render from JSON. Hitting the API directly is faster, more reliable, and structurally closer to what the site itself sees.
What you’ll learn
- Explain why scraping rendered HTML is usually the wrong layer to attack.
- Spot when a site is API-backed by inspecting Network → Fetch/XHR.
- Translate a single XHR call into a working Python request against Catalog108.
- Recognise the structural advantages of JSON over parsed HTML.
If you've come through Sub-Path 2, you know how to send a request, parse HTML, and walk a DOM tree. It works. It also leaves a lot on the table.
Every modern website is a thin presentation layer over a JSON API. The HTML you see is generated client-side from JSON, or server-side from the same JSON your browser would have fetched directly. If you scrape the HTML, you're parsing the output of a transformation. If you scrape the API, you read the input, the same data the site reads.
This sub-path is about that shift.
What "API-backed" means in practice
Open practice.scrapingcentral.com/products in your browser. Right-click → Inspect → Network → Fetch/XHR. Reload. You'll see a call to /api/products returning JSON like:
{
"products": [
{"id": 1, "name": "White wooden vase", "price": 24.99, "category": "vases"},
{"id": 2, "name": "Ceramic blue mug", "price": 12.50, "category": "mugs"}
],
"pagination": {"page": 1, "total": 240, "per_page": 12}
}
That's the source of truth. The HTML grid you see is rendered from that JSON. Scraping the HTML grid requires CSS selectors, careful escaping, brittle XPath. Scraping the JSON requires one .json() call.
The five practical wins
- Structure is free. No more guessing which <div class="price-v2"> holds the price. The API returns {"price": 24.99} as a typed number.
- Pagination is explicit. APIs almost always return {"total": 240, "page": 1, "per_page": 12} or a next_cursor. You don't have to detect whether there's a next page; you read it (see the sketch after this list).
- Filters are query params. Want only mugs? ?category=mugs. The HTML site uses the same endpoint with the same parameter.
- It's 5–50x faster. No HTML rendering, no asset fetches, no CSS, no images. Just the data.
- It's stable. HTML markup changes with every redesign. APIs change far less often because they have versioned contracts.
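A minimal Python sketch of the pagination and filter wins against the same endpoint. The query-parameter names (category, page) and the pagination fields mirror the JSON shown above; treat them as assumptions until you confirm them in the Network tab.

import requests

BASE_URL = "https://practice.scrapingcentral.com/api/products"

def fetch_category(category):
    """Collect every product in one category by walking the pagination metadata."""
    products, page = [], 1
    while True:
        # Assumed parameter names; check the real XHR call before relying on them.
        r = requests.get(BASE_URL, params={"category": category, "page": page})
        r.raise_for_status()
        data = r.json()
        products.extend(data["products"])
        meta = data["pagination"]
        # The API states the totals outright; no "is there a next page?" guessing.
        if page * meta["per_page"] >= meta["total"]:
            return products
        page += 1

mugs = fetch_category("mugs")
print(len(mugs), "mugs")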
Why this isn't always obvious
Three reasons people default to HTML-scraping when they shouldn't:
- They don't open Network → Fetch/XHR. The API is right there, one tab away.
- The page is server-side rendered and the JSON isn't visible in the initial HTML. But if there's any interactivity (sorting, filtering, "load more"), the API still exists.
- Auth flows scare them off. JWT, OAuth, signed requests look intimidating. The rest of this sub-path solves all of them.
Your first API scrape
Catalog108 exposes a public REST API. Hit it directly:
import requests

r = requests.get("https://practice.scrapingcentral.com/api/products")
data = r.json()

for p in data["products"]:
    print(p["id"], p["name"], p["price"])
print(f"Total: {data['pagination']['total']}")
Eight lines. No BeautifulSoup, no soup.select('.price'), no edge cases for missing fields. PHP is just as clean:
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
$client = new Client(['base_uri' => 'https://practice.scrapingcentral.com']);
$res = $client->get('/api/products');
$data = json_decode($res->getBody()->getContents(), true);
foreach ($data['products'] as $p) {
    echo "{$p['id']} {$p['name']} {$p['price']}\n";
}
Compare that against the HTML-scraping version you'd have written in Sub-Path 2: fetch the page, parse with BeautifulSoup, loop over .product-card, extract .price with a regex to strip the "$", convert to float. Twenty lines, four failure modes, one redesign away from breaking.
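For contrast, a rough sketch of what that Sub-Path 2 version tends to look like. The .product-card, .name, and .price selectors are illustrative guesses, not the real markup:

import re

import requests
from bs4 import BeautifulSoup

html = requests.get("https://practice.scrapingcentral.com/products").text
soup = BeautifulSoup(html, "html.parser")

for card in soup.select(".product-card"):        # selector guessed from the markup
    name_el = card.select_one(".name")
    price_el = card.select_one(".price")
    if name_el is None or price_el is None:      # failure mode: missing fields
        continue
    match = re.search(r"[\d.]+", price_el.get_text())  # failure mode: "$24.99" -> 24.99
    if match is None:
        continue
    print(name_el.get_text(strip=True), float(match.group()))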
When you can't avoid HTML
API-first isn't always possible:
- Server-side rendered sites with no XHR. Old Rails, classic WordPress, static Jekyll. The data only exists as HTML. Sub-Path 2's tools apply.
- APIs locked behind heavy auth or fingerprinting. Sometimes the cost of reverse-engineering exceeds the cost of just rendering. Pragmatism wins.
- Data assembled from multiple endpoints client-side. Rare, but happens. The HTML might be the cleanest aggregate.
But for roughly 80% of modern sites (anything built with React, Vue, Angular, Next, or Nuxt), the API is the right target.
What this sub-path teaches
50 lessons covering:
- REST API discovery, headers, auth, pagination, rate limits.
- Building robust clients in Python (requests, httpx) and PHP (Guzzle, Symfony HttpClient).
- All major auth flows: cookies, JWT, OAuth 2, CSRF, HMAC, API keys hidden in JS.
- SERP-scraping APIs: what they are, when to use them, and how to compare providers.
- GraphQL, WebSockets, persisted queries.
- Reverse-engineering: reading minified JS, DevTools breakpoints, mitmproxy for mobile, TLS/HTTP/2 fingerprinting.
By the end you'll treat HTML as a fallback, not the default.
Hands-on lab
Open practice.scrapingcentral.com/products in your browser with the Network → Fetch/XHR tab open. Reload. Find the /api/products call. Right-click → Copy as cURL, paste it into your terminal, and confirm you see JSON. Then write the eight-line Python (or the PHP equivalent) that fetches it and prints the first 12 products. You've just done your first API scrape, and you're going to do everything in this sub-path the same way.
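If the copied cURL command carries browser headers, you can replay them from Python too. A minimal sketch; the header values below are placeholders, so copy the real ones from your own Copy as cURL output:

import requests

# Placeholder headers; substitute whatever your browser actually sent
# (visible in the Copy as cURL output).
headers = {
    "User-Agent": "Mozilla/5.0 (copied from your browser)",
    "Accept": "application/json",
}

r = requests.get("https://practice.scrapingcentral.com/api/products", headers=headers)
r.raise_for_status()

for p in r.json()["products"][:12]:
    print(p["id"], p["name"], p["price"])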