Identifying the "Main" Data Endpoint
A page makes 30 requests. Three contain the data you want. Here's how to spot them in seconds, not minutes.
What you’ll learn
- Triage a page's network log by size, response shape, and initiator.
- Distinguish data endpoints from analytics, ads, fonts, and feature flags.
- Use the 'block-and-reload' trick to confirm load-bearing requests.
- Read endpoint URL patterns to guess the schema before opening Preview.
A modern page makes 30–80 network requests. Of those, two or three are the "main" data endpoints, the ones your scraper actually needs. The rest are analytics, ads, fonts, feature-flag probes, web-vitals beacons, and SDK bootstraps.
Triage is a skill. Here's how to do it fast.
The four signals
When scanning Fetch/XHR after a reload, look for:
- Size: data endpoints tend to be the largest JSON responses on the page. Sort the Size column descending and the answer is usually in the top three.
- URL shape: `/api/...`, `/v1/...`, `/graphql`, `/search?...`. Resource names like `/recommendations`, `/feed`, `/list`, `/products` are data; `/track`, `/beacon`, `/log`, `/metrics`, `/ping` are not.
- Response Content-Type: `application/json` is your friend. `text/plain`, `image/gif` (the classic tracking-pixel ploy), or anything tiny is noise.
- Initiator: data endpoints are usually initiated by the page's framework code (e.g. `RouteData.ts`, `ProductPage.tsx`). Analytics is initiated by third-party domains (`gtag.js`, `segment.io`).
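These four signals can be scored mechanically. Here's a minimal sketch, assuming you've saved the Network panel as a HAR file (right-click in the request list → "Save all as HAR with content"); the regexes, weights, and size threshold are illustrative, not canonical:

```python
import json
import re

# Illustrative patterns lifted from the signals above; tune per site.
NOISE = re.compile(r"/(track|beacon|collect|log|metrics|ping|heartbeat|flags|features|experiments)\b")
DATA = re.compile(r"/(api|v\d+|graphql|search)\b")

def triage(har_path):
    """Score every entry in a DevTools HAR export by the four signals."""
    with open(har_path) as f:
        entries = json.load(f)["log"]["entries"]
    scored = []
    for e in entries:
        url = e["request"]["url"]
        mime = e["response"]["content"].get("mimeType", "")
        size = e["response"]["content"].get("size", 0) or 0
        score = 0
        score += 2 if "json" in mime else -1   # Content-Type signal
        score += 1 if DATA.search(url) else 0  # URL-shape signal
        score -= 2 if NOISE.search(url) else 0
        score += 1 if size > 10_000 else 0     # Size signal
        scored.append((score, size, url))
    return sorted(scored, reverse=True)

# The top entries are your main-endpoint candidates.
for score, size, url in triage("catalog108.har")[:5]:
    print(f"{score:>3} {size:>8} {url}")
```

(The Initiator signal doesn't survive a standard HAR export cleanly; Chrome writes it to a non-standard `_initiator` field, so it's omitted here.)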
Triage in 30 seconds
Workflow:
- Reload with Fetch/XHR + Preserve log.
- Click the Size column header, sort descending.
- Glance at the top 5 entries. The data is almost always among them.
- For each, click Preview. Is the JSON structure obviously related to what's on the page? If yes, that's a main endpoint.
- Done. You now have one or two URLs to start with.
Try it on Catalog108: open /search?q=mug with DevTools, sort by Size. You'll see /api/products?search=mug (or similar) at or near the top with a 30+ KB JSON payload containing the product list. That's your main endpoint.
The "block it" confirmation
Not sure if a request is essential? Right-click → Block request URL → reload.
- Page still renders correctly → the request was non-essential.
- Page is missing data → it was load-bearing.
This is faster than reading the JSON and guessing. Three minutes of click-block-reload categorizes 30 endpoints decisively.
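If you'd rather script the check than click through it, here's a minimal sketch with Playwright; the page URL and the blocked pattern below are just this lesson's sandbox example:

```python
from playwright.sync_api import sync_playwright

def renders_without(page_url, blocked_pattern):
    """Load page_url with requests matching blocked_pattern aborted."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Equivalent of DevTools' "Block request URL".
        page.route(blocked_pattern, lambda route: route.abort())
        page.goto(page_url)
        page.wait_for_load_state("networkidle")
        html = page.content()
        browser.close()
        return html

html = renders_without("https://practice.scrapingcentral.com/search?q=mug",
                       "**/api/products*")
# If the products are gone from the rendered page, /api/products was load-bearing.
print("mug" in html)
```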
URL patterns that scream "main data"
After a few hundred scrapes you'll recognize patterns instantly:
| Pattern | What it usually is |
|---|---|
| `/api/v{N}/{resource}` | Versioned REST API; the data |
| `/api/{resource}?page=N` | Paginated list of resources |
| `/graphql` (POST) | GraphQL endpoint; check the `operationName` |
| `/search?q=...&page=...` | Search results, typically your main endpoint on a SERP-like page |
| `/_next/data/{build}/{page}.json` | Next.js page data, Next's own router-level data layer |
| `/api/feed`, `/api/timeline` | Social feeds |
| `/api/me`, `/api/user` | Current-user info; useful as an auth probe |
Patterns that are almost never the main data:
| Pattern | What it usually is |
|---|---|
| `/track`, `/beacon`, `/collect` | Analytics |
| `/log`, `/error`, `/sentry` | Error reporting |
| `/heartbeat`, `/ping` | Liveness pings |
| `/flags`, `/features`, `/experiments` | A/B test config |
| `/.well-known/...` | Standardized metadata, rarely useful |
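Both tables compress neatly into a first-pass classifier. A sketch, with the rules transcribed from above; noise rules run first, and every pattern will need per-site tuning:

```python
import re

# Order matters: noise first, then specific data shapes, then the catch-all.
RULES = [
    (re.compile(r"/(track|beacon|collect)\b"), "analytics"),
    (re.compile(r"/(log|error|sentry)\b"), "error reporting"),
    (re.compile(r"/(heartbeat|ping)\b"), "liveness ping"),
    (re.compile(r"/(flags|features|experiments)\b"), "A/B test config"),
    (re.compile(r"/\.well-known/"), "standardized metadata"),
    (re.compile(r"/graphql\b"), "GraphQL endpoint"),
    (re.compile(r"/_next/data/.+\.json"), "Next.js page data"),
    (re.compile(r"/search\?"), "search results"),
    (re.compile(r"/api/(me|user)\b"), "current-user info"),
    (re.compile(r"/api/(feed|timeline)\b"), "social feed"),
    (re.compile(r"/api(/v\d+)?/\w+"), "REST resource"),
]

def classify(url):
    for pattern, label in RULES:
        if pattern.search(url):
            return label
    return "unknown"

print(classify("https://example.com/api/v2/products?page=3"))  # REST resource
print(classify("https://example.com/collect?v=1"))             # analytics
```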
The headers-tell-the-story trick
A request's response headers reveal what kind of endpoint it is:
- `Content-Type: application/json` + `Cache-Control: no-store` → live data API.
- `Content-Type: application/json` + `Cache-Control: public, max-age=3600` → cacheable data API.
- `Content-Type: text/javascript` + small payload → likely a tag or pixel.
- `Content-Type: image/gif` + `Content-Length: 43` → a 1x1 tracking pixel, ignore.
- `X-Backend: ...`, `X-Cluster: ...` → internal headers indicating a real backend; these endpoints matter.
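You can run the same reading outside DevTools. A minimal probe with `requests`, applying a subset of the verdicts above to whatever URL you pass it (the sandbox endpoint is this lesson's example):

```python
import requests

def probe(url):
    """Fetch a URL and classify it by its response headers."""
    r = requests.get(url, timeout=10)
    ctype = r.headers.get("Content-Type", "")
    cache = r.headers.get("Cache-Control", "")
    length = r.headers.get("Content-Length", "")
    if "json" in ctype:
        verdict = "live data API" if "no-store" in cache else "data API (possibly cached)"
    elif ctype.startswith("image/gif") and length and int(length) < 100:
        verdict = "tracking pixel; ignore"
    else:
        verdict = "probably not the main data"
    print(f"{ctype!r:40} {cache!r:30} -> {verdict}")

probe("https://practice.scrapingcentral.com/api/products?search=mug")
```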
Watching the right initiator
A scraper-focused trick: open the request's Initiator tab and look at the bottom of the call stack. The framework code that initiated the request usually has a meaningful name:
- `recoilSelector_useFetchProducts`: clearly a Recoil selector pulling products.
- `useSWR (key=/api/products?page=1)`: an SWR hook keyed on the endpoint.
- `react-query: queryFn`: a TanStack Query fetch function.
If you see gtag.js:122 or googletagmanager.com at the top, it's analytics. Move on.
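If you want initiators programmatically rather than in the Initiator tab, Playwright can attach a raw CDP session (Chromium only) and listen for `Network.requestWillBeSent`, whose payload includes the initiator. A sketch; the field names follow the Chrome DevTools Protocol spec:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    cdp = page.context.new_cdp_session(page)
    cdp.send("Network.enable")

    def on_request(params):
        url = params["request"]["url"]
        # The initiator always has a "type"; script initiators often carry a "url".
        initiator = params["initiator"].get("url", params["initiator"]["type"])
        if "/api/" in url:
            print(f"{url}  <-  {initiator}")

    cdp.on("Network.requestWillBeSent", on_request)
    page.goto("https://practice.scrapingcentral.com/search?q=mug")
    page.wait_for_load_state("networkidle")
    browser.close()
```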
A real example: SERP-shaped Catalog108 page
Open practice.scrapingcentral.com/search?q=mug. You'll see several XHR calls:
- `/api/products?search=mug`: large JSON of matching products. Main data.
- `/api/locations?q=mug`: tiny JSON of related stores. Secondary.
- `/api/featured?search=mug`: featured/sponsored items. Secondary.
- (any analytics calls): ignore.
Two minutes of inspection gives you the full picture. Most scrapers would just write a `/api/products?search=mug` loop and be done. If you need the local pack data, add `/api/locations`. If you need shopping/featured, add `/api/featured`. That's three endpoints, hit in parallel (see the sketch after the PHP example below), replicating the entire SERP.
Python: discover-then-fetch idiom
```python
import requests

BASE = "https://practice.scrapingcentral.com"

def scrape_query(q):
    # The two main endpoints identified during triage.
    products = requests.get(f"{BASE}/api/products", params={"search": q}, timeout=10).json()
    locations = requests.get(f"{BASE}/api/locations", params={"q": q}, timeout=10).json()
    return {"products": products, "locations": locations}

print(scrape_query("mug"))
```
PHP version with Guzzle, same pattern:
```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client(['base_uri' => 'https://practice.scrapingcentral.com']);

function scrape($client, $q) {
    // Same two endpoints, decoded into associative arrays.
    $products = json_decode($client->get('/api/products', ['query' => ['search' => $q]])->getBody(), true);
    $locations = json_decode($client->get('/api/locations', ['query' => ['q' => $q]])->getBody(), true);
    return compact('products', 'locations');
}

print_r(scrape($client, 'mug'));
```
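The SERP example above suggested hitting the three endpoints in parallel; since they're independent, a thread pool is enough. A minimal Python sketch using the endpoints from this lesson's sandbox:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

BASE = "https://practice.scrapingcentral.com"
ENDPOINTS = {
    "products": ("/api/products", {"search": "mug"}),
    "locations": ("/api/locations", {"q": "mug"}),
    "featured": ("/api/featured", {"search": "mug"}),
}

def fetch(path, params):
    return requests.get(f"{BASE}{path}", params=params, timeout=10).json()

# Fire all three requests at once; collect results by name.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {name: pool.submit(fetch, path, params)
               for name, (path, params) in ENDPOINTS.items()}
    results = {name: fut.result() for name, fut in futures.items()}

print(list(results))
```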
The senior eye test
After you've done this a few hundred times, you'll glance at a Network log and immediately point at the right entry. The skill compresses to seconds:
- "Biggest JSON, framework initiator, URL starts with
/api/, that one."
Until then, lean on the four-signal framework, the block-and-reload trick, and the URL pattern table above.
Hands-on lab
Open practice.scrapingcentral.com/search?q=mug with DevTools → Network → Fetch/XHR. Inventory every JSON request. For each, decide: main, secondary, or noise. Confirm with block-and-reload. Then write a Python or PHP script that hits each main endpoint and prints the first three results. You should end with one or two URLs and one or two parsers; that's a complete scraper for that page.