Identifying the "Main" Data Endpoint
A page makes 30 requests. Three contain the data you want. Here's how to spot them in seconds, not minutes.
What you’ll learn
- Triage a page's network log by size, response shape, and initiator.
- Distinguish data endpoints from analytics, ads, fonts, and feature flags.
- Use the 'block-and-reload' trick to confirm load-bearing requests.
- Read endpoint URL patterns to guess the schema before opening Preview.
A modern page makes 30–80 network requests. Of those, two or three are the "main" data endpoints, the ones your scraper actually needs. The rest are analytics, ads, fonts, feature-flag probes, web-vitals beacons, and SDK bootstraps.
Triage is a skill. Here's how to do it fast.
The four signals
When scanning Fetch/XHR after a reload, look for:
- Size: data endpoints tend to be the largest JSON responses on the page. Sort the Size column descending and the answer is usually in the top three.
- URL shape: `/api/...`, `/v1/...`, `/graphql`, `/search?...`. Resource names like `/recommendations`, `/feed`, `/list`, `/products` are data; `/track`, `/beacon`, `/log`, `/metrics`, `/ping` are not.
- Response Content-Type: `application/json` is your friend. `text/plain`, `image/gif` (the classic tracking-pixel ploy), or anything tiny is noise.
- Initiator: data endpoints are usually initiated by the page's framework code (e.g. `RouteData.ts`, `ProductPage.tsx`). Analytics is initiated by third-party domains (`gtag.js`, `segment.io`).
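These four signals can be scored mechanically. Here's a minimal sketch, assuming you've saved the Network panel as a HAR file (right-click in the request list → "Save all as HAR with content"); the regexes, weights, and size threshold are illustrative, not canonical:

```python
import json
import re

# Illustrative patterns lifted from the signals above; tune per site.
NOISE = re.compile(r"/(track|beacon|collect|log|metrics|ping|heartbeat|flags|features|experiments)\b")
DATA = re.compile(r"/(api|v\d+|graphql|search)\b")

def triage(har_path):
    """Score every entry in a DevTools HAR export by the four signals."""
    with open(har_path) as f:
        entries = json.load(f)["log"]["entries"]
    scored = []
    for e in entries:
        url = e["request"]["url"]
        mime = e["response"]["content"].get("mimeType", "")
        size = e["response"]["content"].get("size", 0) or 0
        score = 0
        score += 2 if "json" in mime else -1   # Content-Type signal
        score += 1 if DATA.search(url) else 0  # URL-shape signal
        score -= 2 if NOISE.search(url) else 0
        score += 1 if size > 10_000 else 0     # Size signal
        scored.append((score, size, url))
    return sorted(scored, reverse=True)

# The top entries are your main-endpoint candidates.
for score, size, url in triage("catalog108.har")[:5]:
    print(f"{score:>3} {size:>8} {url}")
```

(The Initiator signal doesn't survive a standard HAR export cleanly; Chrome writes it to a non-standard `_initiator` field, so it's omitted here.)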
Triage in 30 seconds
Workflow:
- Reload with Fetch/XHR + Preserve log.
- Click the Size column header, sort descending.
- Glance at the top 5 entries. The data is almost always among them.
- For each, click Preview. Is the JSON structure obviously related to what's on the page? If yes, that's a main endpoint.
- Done. You now have one or two URLs to start with.
Try it on Catalog108: open /search?q=mug with DevTools, sort by Size. You'll see /api/products?search=mug (or similar) at or near the top with a 30+ KB JSON payload containing the product list. That's your main endpoint.
The "block it" confirmation
Not sure if a request is essential? Right-click → Block request URL → reload.
- Page still renders correctly → the request was non-essential.
- Page is missing data → it was load-bearing.
This is faster than reading the JSON and guessing. Three minutes of click-block-reload categorizes 30 endpoints decisively.
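If you'd rather script the check than click through it, here's a minimal sketch with Playwright; the page URL and the blocked pattern below are just this lesson's sandbox example:

```python
from playwright.sync_api import sync_playwright

def renders_without(page_url, blocked_pattern):
    """Load page_url with requests matching blocked_pattern aborted."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Equivalent of DevTools' "Block request URL".
        page.route(blocked_pattern, lambda route: route.abort())
        page.goto(page_url)
        page.wait_for_load_state("networkidle")
        html = page.content()
        browser.close()
        return html

html = renders_without("https://practice.scrapingcentral.com/search?q=mug",
                       "**/api/products*")
# If the products are gone from the rendered page, /api/products was load-bearing.
print("mug" in html)
```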
URL patterns that scream "main data"
After a few hundred scrapes you'll recognize patterns instantly:
| Pattern | What it usually is |
|---|---|
| `/api/v{N}/{resource}` | Versioned REST API; the data |
| `/api/{resource}?page=N` | Paginated list of resources |
| `/graphql` (POST) | GraphQL endpoint; check the `operationName` |
| `/search?q=...&page=...` | Search results, typically your main endpoint on a SERP-like page |
| `/_next/data/{build}/{page}.json` | Next.js page data, Next's own router-level data layer |
| `/api/feed`, `/api/timeline` | Social feeds |
| `/api/me`, `/api/user` | Current-user info; useful as an auth probe |
Patterns that are almost never the main data:
| Pattern | What it usually is |
|---|---|
| `/track`, `/beacon`, `/collect` | Analytics |
| `/log`, `/error`, `/sentry` | Error reporting |
| `/heartbeat`, `/ping` | Liveness pings |
| `/flags`, `/features`, `/experiments` | A/B test config |
| `/.well-known/...` | Standardized metadata, rarely useful |
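Both tables compress neatly into a first-pass classifier. A sketch, with the rules transcribed from above; noise rules run first, and every pattern will need per-site tuning:

```python
import re

# Order matters: noise first, then specific data shapes, then the catch-all.
RULES = [
    (re.compile(r"/(track|beacon|collect)\b"), "analytics"),
    (re.compile(r"/(log|error|sentry)\b"), "error reporting"),
    (re.compile(r"/(heartbeat|ping)\b"), "liveness ping"),
    (re.compile(r"/(flags|features|experiments)\b"), "A/B test config"),
    (re.compile(r"/\.well-known/"), "standardized metadata"),
    (re.compile(r"/graphql\b"), "GraphQL endpoint"),
    (re.compile(r"/_next/data/.+\.json"), "Next.js page data"),
    (re.compile(r"/search\?"), "search results"),
    (re.compile(r"/api/(me|user)\b"), "current-user info"),
    (re.compile(r"/api/(feed|timeline)\b"), "social feed"),
    (re.compile(r"/api(/v\d+)?/\w+"), "REST resource"),
]

def classify(url):
    for pattern, label in RULES:
        if pattern.search(url):
            return label
    return "unknown"

print(classify("https://example.com/api/v2/products?page=3"))  # REST resource
print(classify("https://example.com/collect?v=1"))             # analytics
```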
The headers-tell-the-story trick
A request's response headers reveal what kind of endpoint it is:
- `Content-Type: application/json` + `Cache-Control: no-store` → live data API.
- `Content-Type: application/json` + `Cache-Control: public, max-age=3600` → cacheable data API.
- `Content-Type: text/javascript` + small payload → likely a tag or pixel.
- `Content-Type: image/gif` + `Content-Length: 43` → a 1x1 tracking pixel, ignore.
- `X-Backend: ...`, `X-Cluster: ...` → internal headers indicating a real backend; these endpoints matter.
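You can run the same reading outside DevTools. A minimal probe with `requests`, applying a subset of the verdicts above to whatever URL you pass it (the sandbox endpoint is this lesson's example):

```python
import requests

def probe(url):
    """Fetch a URL and classify it by its response headers."""
    r = requests.get(url, timeout=10)
    ctype = r.headers.get("Content-Type", "")
    cache = r.headers.get("Cache-Control", "")
    length = r.headers.get("Content-Length", "")
    if "json" in ctype:
        verdict = "live data API" if "no-store" in cache else "data API (possibly cached)"
    elif ctype.startswith("image/gif") and length and int(length) < 100:
        verdict = "tracking pixel; ignore"
    else:
        verdict = "probably not the main data"
    print(f"{ctype!r:40} {cache!r:30} -> {verdict}")

probe("https://practice.scrapingcentral.com/api/products?search=mug")
```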
Watching the right initiator
A scraper-focused trick: open the request's Initiator tab and look at the bottom of the call stack. The framework code that initiated the request usually has a meaningful name:
- `recoilSelector_useFetchProducts`: clearly a Recoil selector pulling products.
- `useSWR (key=/api/products?page=1)`: an SWR hook keyed on the endpoint.
- `react-query: queryFn`: a TanStack Query fetch function.
If you see gtag.js:122 or googletagmanager.com at the top, it's analytics. Move on.
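If you want initiators programmatically rather than in the Initiator tab, Playwright can attach a raw CDP session (Chromium only) and listen for `Network.requestWillBeSent`, whose payload includes the initiator. A sketch; the field names follow the Chrome DevTools Protocol spec:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    cdp = page.context.new_cdp_session(page)
    cdp.send("Network.enable")

    def on_request(params):
        url = params["request"]["url"]
        # The initiator always has a "type"; script initiators often carry a "url".
        initiator = params["initiator"].get("url", params["initiator"]["type"])
        if "/api/" in url:
            print(f"{url}  <-  {initiator}")

    cdp.on("Network.requestWillBeSent", on_request)
    page.goto("https://practice.scrapingcentral.com/search?q=mug")
    page.wait_for_load_state("networkidle")
    browser.close()
```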
A real example: SERP-shaped Catalog108 page
Open practice.scrapingcentral.com/search?q=mug. You'll see several XHR calls:
- `/api/products?search=mug`: large JSON of matching products. Main data.
- `/api/locations?q=mug`: tiny JSON of related stores. Secondary.
- `/api/featured?search=mug`: featured/sponsored items. Secondary.
- (any analytics calls): ignore.
Two minutes of inspection gives you the full picture. Most scrapers would just write a `/api/products?search=mug` loop and be done. If you need the local pack data, add `/api/locations`. If you need shopping/featured, add `/api/featured`. That's three endpoints, hit in parallel (see the sketch after the PHP example below), replicating the entire SERP.
Python: discover-then-fetch idiom
```python
import requests

BASE = "https://practice.scrapingcentral.com"

def scrape_query(q):
    # The two main endpoints identified during triage.
    products = requests.get(f"{BASE}/api/products", params={"search": q}, timeout=10).json()
    locations = requests.get(f"{BASE}/api/locations", params={"q": q}, timeout=10).json()
    return {"products": products, "locations": locations}

print(scrape_query("mug"))
```
PHP version with Guzzle, same pattern:
```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client(['base_uri' => 'https://practice.scrapingcentral.com']);

function scrape($client, $q) {
    // Same two endpoints, decoded into associative arrays.
    $products = json_decode($client->get('/api/products', ['query' => ['search' => $q]])->getBody(), true);
    $locations = json_decode($client->get('/api/locations', ['query' => ['q' => $q]])->getBody(), true);
    return compact('products', 'locations');
}

print_r(scrape($client, 'mug'));
```
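The SERP example above suggested hitting the three endpoints in parallel; since they're independent, a thread pool is enough. A minimal Python sketch using the endpoints from this lesson's sandbox:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

BASE = "https://practice.scrapingcentral.com"
ENDPOINTS = {
    "products": ("/api/products", {"search": "mug"}),
    "locations": ("/api/locations", {"q": "mug"}),
    "featured": ("/api/featured", {"search": "mug"}),
}

def fetch(path, params):
    return requests.get(f"{BASE}{path}", params=params, timeout=10).json()

# Fire all three requests at once; collect results by name.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {name: pool.submit(fetch, path, params)
               for name, (path, params) in ENDPOINTS.items()}
    results = {name: fut.result() for name, fut in futures.items()}

print(list(results))
```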
The senior eye test
After you've done this a few hundred times, you'll glance at a Network log and immediately point at the right entry. The skill compresses to seconds:
- "Biggest JSON, framework initiator, URL starts with
/api/, that one."
Until then, lean on the four-signal framework, the block-and-reload trick, and the URL pattern table above.
Hands-on lab
Open practice.scrapingcentral.com/search?q=mug with DevTools → Network → Fetch/XHR. Inventory every JSON request. For each, decide: main, secondary, or noise. Confirm with block-and-reload. Then write a Python or PHP script that hits each main endpoint and prints the first three results. You should end with one or two URLs and one or two parsers; that's a complete scraper for that page.