
3.5 · Beginner · 5 min read

Identifying the "Main" Data Endpoint

A page makes 30 requests. Three contain the data you want. Here's how to spot them in seconds, not minutes.

What you’ll learn

  • Triage a page's network log by size, response shape, and initiator.
  • Distinguish data endpoints from analytics, ads, fonts, and feature flags.
  • Use the 'block-and-reload' trick to confirm load-bearing requests.
  • Read endpoint URL patterns to guess the schema before opening Preview.

A modern page makes 30–80 network requests. Of those, two or three are the "main" data endpoints, the ones your scraper actually needs. The rest are analytics, ads, fonts, feature-flag probes, web-vitals beacons, and SDK bootstraps.

Triage is a skill. Here's how to do it fast.

The four signals

When scanning Fetch/XHR after a reload, look for:

  1. Size: data endpoints tend to be the largest JSON responses on the page. Sort the Size column descending and the answer is usually in the top three.
  2. URL shape: /api/..., /v1/..., /graphql, /search?.... Paths like /recommendations, /feed, /list, /products are data; /track, /beacon, /log, /metrics, /ping are not.
  3. Response Content-Type: application/json is your friend. text/plain, image/gif (the classic tracking-pixel ploy), or anything tiny is noise.
  4. Initiator: data endpoints are usually initiated by the page's own framework code (e.g. RouteData.ts, ProductPage.tsx); analytics is initiated by third-party scripts (gtag.js, segment.io).
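The four signals can also be scored mechanically. A minimal sketch, assuming you export the Network log as a HAR file (right-click in the Network panel → "Save all as HAR"); the field names follow the HAR format, and _initiator is a Chrome-specific extension that may be absent in other browsers' exports:

```python
import json
import re

DATA_PATHS = re.compile(r"/(api|v\d+|graphql|search)\b")
NOISE_PATHS = re.compile(r"/(track|beacon|log|metrics|ping|collect)\b")

def score_entry(entry):
    """Score one HAR entry by the four signals; higher = more likely main data."""
    url = entry["request"]["url"]
    content = entry["response"]["content"]
    score = 0
    score += min(max(content.get("size", 0), 0) // 1024, 50)  # 1. size, capped at 50 pts
    if DATA_PATHS.search(url):                                # 2. URL shape
        score += 20
    if NOISE_PATHS.search(url):
        score -= 50
    if "json" in content.get("mimeType", ""):                 # 3. Content-Type
        score += 20
    initiator = entry.get("_initiator", {}).get("url", "")    # 4. initiator (Chrome HAR)
    if "gtag" in initiator or "googletagmanager" in initiator:
        score -= 50
    return score

def triage(har_path, top=5):
    """Return the top-N candidate main-data URLs from a HAR export."""
    with open(har_path) as f:
        entries = json.load(f)["log"]["entries"]
    ranked = sorted(entries, key=score_entry, reverse=True)
    return [e["request"]["url"] for e in ranked[:top]]
```

The exact weights are arbitrary; the point is that size dominates, JSON and API-shaped paths boost, and tracking paths or analytics initiators disqualify.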

Triage in 30 seconds

Workflow:

  1. Reload with Fetch/XHR + Preserve log.
  2. Click the Size column header, sort descending.
  3. Glance at the top 5 entries. The data is almost always among them.
  4. For each, click Preview. Is the JSON structure obviously related to what's on the page? If yes, that's a main endpoint.
  5. Done. You now have one or two URLs to start with.

Try it on Catalog108: open /search?q=mug with DevTools, sort by Size. You'll see /api/products?search=mug (or similar) at or near the top with a 30+ KB JSON payload containing the product list. That's your main endpoint.

The "block it" confirmation

Not sure if a request is essential? Right-click → Block request URL → reload.

  • Page still renders correctly → the request was non-essential.
  • Page is missing data → it was load-bearing.

This is faster than reading the JSON and guessing. Three minutes of click-block-reload categorizes 30 endpoints decisively.
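If you'd rather script the confirmation, the same block-and-reload can be done with Playwright's request interception. A sketch, assuming the /api/products pattern from this lesson; swap in whatever request you're testing:

```python
import re

BLOCK = re.compile(r"/api/products")  # the endpoint under test

def should_block(url: str) -> bool:
    """True if the URL matches the endpoint we want to block."""
    return bool(BLOCK.search(url))

def reload_with_block(target_url: str) -> None:
    """Reload the page with one endpoint blocked, then screenshot for comparison."""
    # Imported lazily so should_block stays testable without a browser installed.
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Abort matching requests; let everything else through.
        page.route("**/*", lambda route: route.abort()
                   if should_block(route.request.url) else route.continue_())
        page.goto(target_url)
        page.screenshot(path="blocked.png")  # compare against an unblocked run
        browser.close()
```

If the blocked-run screenshot is missing the data you care about, the request was load-bearing.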

URL patterns that scream "main data"

After a few hundred scrapes you'll recognize patterns instantly:

| Pattern | What it usually is |
| --- | --- |
| /api/v{N}/{resource} | Versioned REST API; the data |
| /api/{resource}?page=N | Paginated list of resources |
| /graphql (POST) | GraphQL endpoint; check the operationName |
| /search?q=...&page=... | Search results, typically your main endpoint on a SERP-like page |
| /_next/data/{build}/{page}.json | Next.js page data, Next's own router-level data layer |
| /api/feed, /api/timeline | Social feeds |
| /api/me, /api/user | Current-user info; useful as an auth probe |

Patterns that are almost never the main data:

| Pattern | What it usually is |
| --- | --- |
| /track, /beacon, /collect | Analytics |
| /log, /error, /sentry | Error reporting |
| /heartbeat, /ping | Liveness pings |
| /flags, /features, /experiments | A/B test config |
| /.well-known/... | Standardized metadata, rarely useful |
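The two tables condense into a small classifier. A rough sketch (the pattern lists mirror the tables above; real sites will need tweaks, and noise patterns deliberately win ties):

```python
import re

MAIN_DATA = [
    r"/api/v\d+/", r"/api/\w+", r"/graphql\b", r"/search\?",
    r"/_next/data/", r"/api/(feed|timeline|me|user)\b",
]
NOISE = [
    r"/(track|beacon|collect)\b", r"/(log|error|sentry)\b",
    r"/(heartbeat|ping)\b", r"/(flags|features|experiments)\b",
    r"/\.well-known/",
]

def classify(url: str) -> str:
    """Rough main-data vs noise call from the URL alone."""
    if any(re.search(p, url) for p in NOISE):
        return "noise"   # noise wins: /api/log is logging, not data
    if any(re.search(p, url) for p in MAIN_DATA):
        return "main"
    return "unknown"
```

Run it over every URL in a capture and you have a first-pass inventory before opening a single Preview tab.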

The headers-tell-the-story trick

A request's response headers reveal what kind of endpoint it is:

  • Content-Type: application/json + Cache-Control: no-store → live data API.
  • Content-Type: application/json + Cache-Control: public, max-age=3600 → cacheable data API.
  • Content-Type: text/javascript + small payload → likely a tag or pixel.
  • Content-Type: image/gif + Content-Length: 43 → a 1x1 tracking pixel, ignore.
  • X-Backend: ..., X-Cluster: ... → internal headers indicating a real backend; these endpoints matter.
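The header heuristics above translate directly to code. A sketch that takes a lowercase-keyed header dict (the thresholds for "tiny" are guesses, not standards):

```python
def read_headers(headers: dict) -> str:
    """Apply the header heuristics; `headers` is a lowercase-keyed dict."""
    ctype = headers.get("content-type", "")
    cache = headers.get("cache-control", "")
    length = int(headers.get("content-length", "0") or "0")
    if ctype.startswith("image/gif") and length < 100:
        return "tracking pixel"          # the classic 1x1
    if ctype.startswith("application/json"):
        if "no-store" in cache:
            return "live data API"
        if "max-age" in cache:
            return "cacheable data API"
        return "data API"
    if ctype.startswith("text/javascript") and length < 2048:
        return "tag/pixel script"
    return "unclear"
```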

Watching the right initiator

A scraper-focused trick: open the request's Initiator tab and look at the bottom of the call stack. The framework code that initiated the request usually has a meaningful name:

  • recoilSelector_useFetchProducts: clearly a Recoil selector pulling products.
  • useSWR (key=/api/products?page=1): an SWR hook keyed on the endpoint.
  • react-query: queryFn: a TanStack Query fetcher.

If you see gtag.js:122 or googletagmanager.com at the top, it's analytics. Move on.

A real example: SERP-shaped Catalog108 page

Open practice.scrapingcentral.com/search?q=mug. You'll see several XHR calls:

  • /api/products?search=mug: large JSON of matching products. Main data.
  • /api/locations?q=mug: tiny JSON of related stores. Secondary.
  • /api/featured?search=mug: featured/sponsored items. Secondary.
  • (any analytics calls): ignore.

Two minutes of inspection gives you the full picture. Most scrapers would just write a /api/products?search=mug loop and be done. If you need the local pack data, add /api/locations. If you need shopping/featured, add /api/featured. That's three endpoints, hit in parallel, replicating the entire SERP.

Python: discover-then-fetch idiom

import requests

BASE = "https://practice.scrapingcentral.com"

def scrape_query(q):
    """Fetch the two main endpoints identified during triage."""
    products = requests.get(f"{BASE}/api/products", params={"search": q}, timeout=10).json()
    locations = requests.get(f"{BASE}/api/locations", params={"q": q}, timeout=10).json()
    return {"products": products, "locations": locations}

print(scrape_query("mug"))
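Since the endpoints are independent, you can also hit them in parallel. A minimal sketch with concurrent.futures; the commented usage assumes the Catalog108 endpoints from this lesson:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_parallel(fetchers: dict) -> dict:
    """Run independent endpoint fetchers concurrently; returns {name: result}."""
    with ThreadPoolExecutor(max_workers=max(len(fetchers), 1)) as pool:
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        return {name: fut.result() for name, fut in futures.items()}

# Usage against the endpoints from the example:
# import requests
# BASE = "https://practice.scrapingcentral.com"
# results = fetch_parallel({
#     "products":  lambda: requests.get(f"{BASE}/api/products", params={"search": "mug"}, timeout=10).json(),
#     "locations": lambda: requests.get(f"{BASE}/api/locations", params={"q": "mug"}, timeout=10).json(),
#     "featured":  lambda: requests.get(f"{BASE}/api/featured", params={"search": "mug"}, timeout=10).json(),
# })
```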

The PHP version with Guzzle follows the same pattern:

use GuzzleHttp\Client;

$client = new Client([
    'base_uri' => 'https://practice.scrapingcentral.com',
    'timeout'  => 10,
]);

function scrape(Client $client, string $q): array {
    $products = json_decode($client->get('/api/products', ['query' => ['search' => $q]])->getBody(), true);
    $locations = json_decode($client->get('/api/locations', ['query' => ['q' => $q]])->getBody(), true);
    return compact('products', 'locations');
}

print_r(scrape($client, 'mug'));

The senior eye test

After you've done this a few hundred times, you'll glance at a Network log and immediately point at the right entry. The skill compresses to seconds:

  • "Biggest JSON, framework initiator, URL starts with /api/, that one."

Until then, lean on the four-signal framework, the block-and-reload trick, and the URL pattern table above.

Hands-on lab

Open practice.scrapingcentral.com/search?q=mug with DevTools → Network → Fetch/XHR. Inventory every JSON request. For each, decide: main / secondary / noise. Confirm with block-and-reload. Then write a Python or PHP script that hits each main endpoint and prints the first three results. You should end with one or two URLs and one or two parsers: that's a complete scraper for that page.


Quiz: check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.

Identifying the "Main" Data Endpoint1 / 8

When triaging a page's Network log to find the main data endpoint, what's the single fastest sort to apply?
