GraphQL Scraping: Queries and Endpoints
GraphQL is a single POST endpoint, a typed schema, and a query language. It differs from REST in almost every respect, and it is increasingly common.
What you’ll learn
- Recognise a GraphQL endpoint in DevTools.
- Use introspection to map the schema.
- Write queries that pull only the fields you need.
- Send GraphQL queries via Python and PHP.
GraphQL is one endpoint (usually /graphql), one HTTP method (POST), one content type (application/json), and a query language for requesting exactly the fields you want. It is increasingly common on modern APIs (Shopify, GitHub, Contentful, Hasura) and worth knowing.
Recognizing it in DevTools
Spot a GraphQL API by:
- All XHR/POSTs go to a single URL like /graphql or /api/graphql.
- Request body: {"query": "{ ... }", "variables": {...}, "operationName": "..."}.
- Response body: {"data": {...}, "errors": [...]}.
If you see this shape, it's GraphQL.
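That shape is easy to check programmatically too. A minimal sketch of the same heuristic, applied to a request body copied out of DevTools:

```python
import json

def looks_like_graphql(request_body: str) -> bool:
    """Heuristic: does a captured POST body look like a GraphQL request?"""
    try:
        payload = json.loads(request_body)
    except (ValueError, TypeError):
        return False
    # GraphQL requests are JSON objects with a "query" key.
    return isinstance(payload, dict) and "query" in payload

print(looks_like_graphql('{"query": "{ products { id } }", "variables": {}}'))  # True
print(looks_like_graphql('{"page": 2, "limit": 12}'))                           # False
```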
Catalog108 endpoints
Catalog108 exposes:
- /api/graphql: full GraphQL with introspection enabled.
- /api/graphql/no-introspection: rejects introspection queries (covered later).
- /api/graphql/persisted: sha256Hash-based persisted queries (lesson 3.42).
Introspection, the magic skeleton key
GraphQL endpoints (when introspection is enabled) expose their entire schema to anyone who asks:
import requests

INTROSPECTION_QUERY = """
{
  __schema {
    types {
      name
      kind
      fields {
        name
        type {
          name
          kind
          ofType { name kind }
        }
      }
    }
  }
}
"""

r = requests.post(
    "https://practice.scrapingcentral.com/api/graphql",
    json={"query": INTROSPECTION_QUERY},
)
schema = r.json()["data"]["__schema"]
print([t["name"] for t in schema["types"] if not t["name"].startswith("__")])
You get back: every type, every field, every relationship. The full API surface, machine-readable. Tools like Insomnia, Postman, and Apollo Studio render this graphically.
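You can also walk the schema dict directly, for example to list the fields of a single type. A sketch on a trimmed, hypothetical payload (the Product type and its fields here are made up; a real schema comes from the introspection query above):

```python
# Trimmed example of what __schema looks like; real type and field names
# come from the endpoint's actual introspection response.
schema = {
    "types": [
        {"name": "Product", "kind": "OBJECT", "fields": [
            {"name": "id", "type": {"name": None, "kind": "NON_NULL",
                                    "ofType": {"name": "ID", "kind": "SCALAR"}}},
            {"name": "name", "type": {"name": "String", "kind": "SCALAR",
                                      "ofType": None}},
        ]},
        {"name": "__Schema", "kind": "OBJECT", "fields": None},
    ]
}

def fields_of(schema, type_name):
    """List the field names of one object type from an introspection result."""
    for t in schema["types"]:
        if t["name"] == type_name and t.get("fields"):
            return [f["name"] for f in t["fields"]]
    return []

print(fields_of(schema, "Product"))  # ['id', 'name']
```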
If the endpoint disables introspection (lesson 3.42), you fall back to capturing real queries from the site.
Writing queries
query GetProducts {
  products(first: 12) {
    id
    name
    price
    category {
      name
    }
    reviews(first: 5) {
      author
      rating
      text
    }
  }
}
Fetches 12 products, each with selected fields and the first 5 reviews. Two key features:
- You pick the fields. No over-fetching. No under-fetching.
- Nested resources in one request. No N+1 problem, reviews come with the product.
Sending a query from Python
import requests

QUERY = """
query Products($first: Int!) {
  products(first: $first) {
    id
    name
    price
  }
}
"""

r = requests.post(
    "https://practice.scrapingcentral.com/api/graphql",
    json={
        "query": QUERY,
        "variables": {"first": 12},
        "operationName": "Products",
    },
)
data = r.json()
if "errors" in data:
    print("GraphQL errors:", data["errors"])
else:
    for p in data["data"]["products"]:
        print(p["id"], p["name"], p["price"])
Notes:
- Always POST, even though it conceptually "reads."
- variables for parameters; keeps queries reusable.
- operationName when sending multiple queries; the server picks the named one.
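For example, one document can bundle several named operations, and operationName chooses which one runs (the ProductCount operation here is hypothetical; check the schema for real root fields):

```python
# Two operations in one document; the server executes only the one named
# in operationName and ignores the other.
DOCUMENT = """
query ProductList($first: Int!) { products(first: $first) { id name } }
query ProductCount { productsCount }
"""

payload = {
    "query": DOCUMENT,
    "variables": {"first": 12},
    "operationName": "ProductList",  # switch to "ProductCount" to run the other
}
```

POST this payload as JSON exactly like the single-operation example above.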
PHP version
<?php
// composer require guzzlehttp/guzzle
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client();
$query = '
query Products($first: Int!) {
  products(first: $first) {
    id name price
  }
}';

$res = $client->post('https://practice.scrapingcentral.com/api/graphql', [
    'json' => [
        'query' => $query,
        'variables' => ['first' => 12],
        'operationName' => 'Products',
    ],
]);

$data = json_decode($res->getBody()->getContents(), true);
foreach ($data['data']['products'] as $p) {
    echo "{$p['id']} {$p['name']} {$p['price']}\n";
}
Pagination patterns
GraphQL APIs typically use one of:
- Offset: products(offset: 0, limit: 12).
- Cursor (relay-style): products(first: 12, after: "cursor") { edges { cursor node { ... } } pageInfo { hasNextPage endCursor } }.
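An offset loop can be sketched with the transport factored out, so the paging logic is visible on its own (the offset/limit argument names are assumptions; confirm them against the schema):

```python
# Hypothetical offset-style query; real argument names vary per API.
OFFSET_QUERY = """
query Page($offset: Int!, $limit: Int!) {
  products(offset: $offset, limit: $limit) { id name price }
}
"""

def iter_offset(fetch_page, limit=12):
    """Generic offset loop: fetch_page(offset, limit) returns one batch of rows."""
    offset = 0
    while True:
        batch = fetch_page(offset, limit)
        if not batch:          # an empty page means we've paged past the end
            return
        yield from batch
        offset += limit

# Against a live endpoint, fetch_page would POST OFFSET_QUERY with requests;
# a list-backed stub shows the loop's behaviour:
rows = [{"id": i} for i in range(30)]
out = list(iter_offset(lambda off, lim: rows[off:off + lim], limit=12))
print(len(out))  # 30
```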
Relay-style is the standard for serious APIs (Shopify, GitHub):
query Page($cursor: String) {
  products(first: 50, after: $cursor) {
    edges {
      node { id name price }
      cursor
    }
    pageInfo {
      hasNextPage
      endCursor
    }
  }
}
Pagination loop (QUERY is the Page query above; yield must live inside a generator function):
def iter_products():
    cursor = None
    while True:
        r = requests.post(URL, json={
            "query": QUERY,
            "variables": {"cursor": cursor},
        })
        page = r.json()["data"]["products"]
        for edge in page["edges"]:
            yield edge["node"]
        if not page["pageInfo"]["hasNextPage"]:
            break
        cursor = page["pageInfo"]["endCursor"]
Error handling
GraphQL almost always returns 200, even on errors. Errors come in the response body:
{
  "data": {"products": null},
  "errors": [
    {"message": "Field 'foo' doesn't exist on type 'Product'", "path": ["products", "foo"]}
  ]
}
Your scraper:
def gql(query, variables=None):
    r = requests.post(URL, json={"query": query, "variables": variables or {}})
    r.raise_for_status()
    data = r.json()
    if data.get("errors"):
        raise RuntimeError(f"GraphQL: {data['errors']}")
    return data["data"]
Mutations
For writes:
mutation CreateReview($input: CreateReviewInput!) {
  createReview(input: $input) {
    id
    rating
  }
}
gql(MUTATION, variables={"input": {"productId": 1, "rating": 5, "text": "..."}})
Conceptually identical to queries; the convention is to use the mutation keyword for clarity.
Subscriptions
For real-time:
subscription PriceChange {
  priceChanged { productId newPrice }
}
Run over WebSocket, not POST. Lesson 3.43 covers WebSocket scraping.
Auth on GraphQL
Same as REST:
- Authorization: Bearer <token> header.
- Cookies (with a Session).
- API key in header or query string.
Catalog108's /api/graphql accepts the same JWT token issued by /api/auth/login.
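A minimal sketch of wiring that token into every request via a Session (the token value below is a placeholder; the real one comes from the login response, whose exact shape you should confirm in DevTools):

```python
import requests

BASE = "https://practice.scrapingcentral.com"

def graphql_session(token: str) -> requests.Session:
    """Session that attaches the JWT to every GraphQL request it sends."""
    s = requests.Session()
    s.headers["Authorization"] = f"Bearer {token}"
    return s

s = graphql_session("placeholder-jwt")   # real token: POST /api/auth/login first
print(s.headers["Authorization"])        # Bearer placeholder-jwt
# s.post(f"{BASE}/api/graphql", json={"query": "{ products(first: 1) { id } }"})
```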
Why GraphQL is great for scrapers
- One endpoint to learn. Find /graphql, you know the whole API.
- Field selection. Pull only what you need; tiny responses.
- Nested fetches. Get product + reviews + category in one call.
- Introspection. Schema is machine-discoverable.
- Stable schema. Breaking changes are usually versioned carefully.
Why GraphQL is annoying for scrapers
- POST-only. Cache-unfriendly.
- Complex queries can be slow. Some servers throttle by complexity score.
- Persisted queries. Some APIs require precomputed query hashes (lesson 3.42).
- Server-side query limits. Max depth, max nodes, etc.
Hands-on lab
Hit Catalog108's /challenges/api/graphql/playground. Issue an introspection query and dump the schema. Write a query that fetches products(first: 10) { id name price reviews(first: 3) { rating } }. Confirm you get nested data in one call. Then write a paginated loop using cursor-based pagination. Compare it to a REST equivalent; the GraphQL version is structurally tighter.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.