
3.41 · intermediate · 5 min read

GraphQL Scraping: Queries and Endpoints

GraphQL is a single POST endpoint, a typed schema, and a query language. It differs from REST in almost every way, and it is increasingly common.

What you’ll learn

  • Recognise a GraphQL endpoint in DevTools.
  • Use introspection to map the schema.
  • Write queries that pull only the fields you need.
  • Send GraphQL queries via Python and PHP.

GraphQL is one endpoint (usually /graphql), one HTTP method (POST), one content type (application/json), and a query language for requesting exactly the fields you want. It is increasingly common on modern APIs (Shopify, GitHub, Contentful, Hasura) and worth knowing.

Recognizing it in DevTools

Spot a GraphQL API by:

  • All XHR/POSTs go to a single URL like /graphql or /api/graphql.
  • Request body: {"query": "{ ... }", "variables": {...}, "operationName": "..."}.
  • Response body: {"data": {...}, "errors": [...]}.

If you see this shape, it's GraphQL.
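The checks above can be sketched as a small classifier for captured requests. The helper name and exact heuristic are mine, not a standard API:

```python
# Heuristic classifier for captured DevTools requests, based on the three
# signals above: POST method, a JSON body with a mandatory "query" key, and
# either a /graphql-ish URL or the optional "variables"/"operationName" keys.
def looks_like_graphql(method: str, url: str, body) -> bool:
    if method.upper() != "POST" or not isinstance(body, dict):
        return False
    return "query" in body and bool(
        "graphql" in url.lower() or {"variables", "operationName"} & body.keys()
    )
```

Feed it each XHR you capture; anything it flags is worth replaying by hand.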

Catalog108 endpoint

Catalog108 exposes:

  • /api/graphql, full GraphQL with introspection enabled.
  • /api/graphql/no-introspection, rejects introspection queries (covered later).
  • /api/graphql/persisted, sha256Hash-based persisted queries (lesson 3.42).
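A quick way to check which of these routes answers is to probe each with `{ __typename }`, the smallest legal GraphQL query: a real GraphQL endpoint replies with "data" or "errors", while a REST route or 404 does not. This is a sketch; the helper names are mine, and you should adjust paths to your target:

```python
import requests

# "{ __typename }" is the cheapest legal GraphQL query, making it a good
# liveness probe for candidate endpoints.
PROBE = {"query": "{ __typename }"}

def is_graphql_response(body: dict) -> bool:
    # GraphQL responses always carry "data" and/or "errors" at the top level.
    return "data" in body or "errors" in body

def probe_endpoints(base, paths):
    live = {}
    for path in paths:
        r = requests.post(base + path, json=PROBE, timeout=10)
        try:
            live[path] = is_graphql_response(r.json())
        except ValueError:  # non-JSON body means it's not GraphQL
            live[path] = False
    return live

# probe_endpoints("https://practice.scrapingcentral.com",
#                 ["/api/graphql", "/api/graphql/no-introspection"])
```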

Introspection, the magic skeleton key

GraphQL endpoints (when introspection is enabled) expose their entire schema to anyone who asks:

import requests

INTROSPECTION_QUERY = """
{
  __schema {
    types {
      name
      kind
      fields {
        name
        type {
          name
          kind
          ofType { name kind }
        }
      }
    }
  }
}
"""

r = requests.post(
    "https://practice.scrapingcentral.com/api/graphql",
    json={"query": INTROSPECTION_QUERY},
)
schema = r.json()["data"]["__schema"]
print([t["name"] for t in schema["types"] if not t["name"].startswith("__")])

You get back: every type, every field, every relationship. The full API surface, machine-readable. Tools like Insomnia, Postman, and Apollo Studio render this graphically.

If the endpoint disables introspection (lesson 3.42), you fall back to capturing real queries from the site.

Writing queries

query GetProducts {
  products(first: 12) {
    id
    name
    price
    category {
      name
    }
    reviews(first: 5) {
      author
      rating
      text
    }
  }
}

Fetches 12 products, each with selected fields and the first 5 reviews. Two key features:

  • You pick the fields. No over-fetching. No under-fetching.
  • Nested resources in one request. No N+1 problem, reviews come with the product.

Sending a query from Python

import requests

QUERY = """
query Products($first: Int!) {
  products(first: $first) {
    id
    name
    price
  }
}
"""

r = requests.post(
    "https://practice.scrapingcentral.com/api/graphql",
    json={
        "query": QUERY,
        "variables": {"first": 12},
        "operationName": "Products",
    },
)
data = r.json()
if "errors" in data:
    print("GraphQL errors:", data["errors"])
else:
    for p in data["data"]["products"]:
        print(p["id"], p["name"], p["price"])

Notes:

  • Always POST, even for operations that conceptually "read."
  • Use variables for parameters; it keeps queries reusable.
  • Set operationName when the document contains multiple operations; the server executes the named one.
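The operationName point can be sketched concretely: one document holding two named operations, with the payload selecting which one runs. The Categories query is a hypothetical example, not a confirmed Catalog108 field; check introspection for the real schema:

```python
# One GraphQL document, two named operations. "operationName" tells the
# server which one to execute; the other is ignored.
DOC = """
query Products($first: Int!) {
  products(first: $first) { id name price }
}
query Categories {
  categories { id name }
}
"""

def payload(operation_name, variables=None):
    """Build the standard GraphQL-over-HTTP request body."""
    return {
        "query": DOC,
        "operationName": operation_name,
        "variables": variables or {},
    }

# requests.post(URL, json=payload("Products", {"first": 12}))
```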

PHP version

use GuzzleHttp\Client;

$client = new Client();
$query = '
query Products($first: Int!) {
  products(first: $first) {
    id name price
  }
}';

$res = $client->post('https://practice.scrapingcentral.com/api/graphql', [
    'json' => [
        'query' => $query,
        'variables' => ['first' => 12],
        'operationName' => 'Products',
    ],
]);

$data = json_decode($res->getBody()->getContents(), true);
foreach ($data['data']['products'] as $p) {
    echo "{$p['id']} {$p['name']} {$p['price']}\n";
}

Pagination patterns

GraphQL APIs typically use one of:

  • Offset: products(offset: 0, limit: 12).
  • Cursor (Relay-style): products(first: 12, after: "cursor") { edges { cursor node { ... } } pageInfo { hasNextPage endCursor } }.

Relay-style is the standard for serious APIs (Shopify, GitHub):

query Page($cursor: String) {
  products(first: 50, after: $cursor) {
    edges {
      node { id name price }
      cursor
    }
    pageInfo {
      hasNextPage
      endCursor
    }
  }
}

Pagination loop:

def iter_products(url, query):
    # yield must live inside a function; this makes the loop a generator.
    cursor = None
    while True:
        r = requests.post(url, json={
            "query": query,
            "variables": {"cursor": cursor},
        })
        page = r.json()["data"]["products"]
        for edge in page["edges"]:
            yield edge["node"]
        if not page["pageInfo"]["hasNextPage"]:
            break
        cursor = page["pageInfo"]["endCursor"]

Error handling

GraphQL almost always returns 200, even on errors. Errors come in the response body:

{
  "data": {"products": null},
  "errors": [
    {"message": "Field 'foo' doesn't exist on type 'Product'", "path": ["products", "foo"]}
  ]
}

Your scraper should check both the HTTP status and the errors key:

def gql(query, variables=None):
    r = requests.post(URL, json={"query": query, "variables": variables or {}})
    r.raise_for_status()
    data = r.json()
    if data.get("errors"):
        raise RuntimeError(f"GraphQL: {data['errors']}")
    return data["data"]

Mutations

For writes:

mutation CreateReview($input: CreateReviewInput!) {
  createReview(input: $input) {
    id
    rating
  }
}

gql(MUTATION, variables={"input": {"productId": 1, "rating": 5, "text": "..."}})

Mutations are sent exactly like queries, over the same POST endpoint; the mutation keyword marks the operation as a write.

Subscriptions

For real-time:

subscription PriceChange {
  priceChanged { productId newPrice }
}

Run over WebSocket, not POST. Lesson 3.43 covers WebSocket scraping.

Auth on GraphQL

Same as REST:

  • Authorization: Bearer <token> header.
  • Cookies (with a Session).
  • API key in header or query string.

Catalog108's /api/graphql accepts the same JWT token issued by /api/auth/login.
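The Bearer-token option can be sketched with a shared Session, so every GraphQL call carries the JWT automatically. Obtaining the token from /api/auth/login is as described above; the exact response field holding it is target-specific:

```python
import requests

def bearer_session(token: str) -> requests.Session:
    """A Session that attaches the JWT to every request it sends."""
    s = requests.Session()
    s.headers["Authorization"] = f"Bearer {token}"
    return s

# s = bearer_session(jwt)  # jwt obtained from /api/auth/login
# r = s.post("https://practice.scrapingcentral.com/api/graphql",
#            json={"query": "{ products(first: 1) { id } }"})
```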

Why GraphQL is great for scrapers

  • One endpoint to learn. Find /graphql and you know the whole API.
  • Field selection. Pull only what you need; tiny responses.
  • Nested fetches. Get product + reviews + category in one call.
  • Introspection. Schema is machine-discoverable.
  • Stable schema. Breaking changes are usually versioned carefully.

Why GraphQL is annoying for scrapers

  • POST-only. Cache-unfriendly.
  • Complex queries can be slow. Some servers throttle by complexity score.
  • Persisted queries. Some APIs require precomputed query hashes (lesson 3.42).
  • Server-side query limits. Max depth, max nodes, etc.

Hands-on lab

Hit Catalog108's /challenges/api/graphql/playground. Issue an introspection query and dump the schema. Write a query that fetches products(first: 10) { id name price reviews(first:3) { rating } }. Confirm you get nested data in one call. Then write a paginated loop using cursor-based pagination. Compare it to a REST equivalent; the GraphQL version is structurally tighter.


Practice this lesson on Catalog108, our first-party scraping sandbox.

Open lab target → /challenges/api/graphql/playground

Quiz, check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.

Question 1 / 8

Which HTTP method does GraphQL use, even for read queries?
