Scraping Paginated APIs

Learn how to handle page-number, offset-based, cursor-based, and Link-header pagination when scraping APIs with Python.

APIs rarely return all results at once. Instead, they split the data across pages, and you have to request each page in turn to collect the complete dataset.

Page-Number Pagination

The simplest pattern: increment a page parameter on each request and stop when the API returns an empty page.

import requests

all_posts = []
page = 1

while True:
    response = requests.get(
        "https://jsonplaceholder.typicode.com/posts",
        params={"_page": page, "_limit": 20},
        timeout=15,
    )
    response.raise_for_status()
    data = response.json()

    if not data:
        break

    all_posts.extend(data)
    print(f"Page {page}: got {len(data)} posts")
    page += 1

print(f"Total: {len(all_posts)} posts")
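
The same loop can be wrapped in a reusable generator. The sketch below is one possible shape, not a fixed recipe: `iter_pages`, `fetch_page`, `max_pages`, and the fake data are hypothetical names introduced here for illustration. In real use, `fetch_page` would wrap the `requests.get` call shown above. The `max_pages` cap is an assumption added as a safety net for APIs that never return an empty page.

```python
from typing import Callable, Iterator, List


def iter_pages(
    fetch_page: Callable[[int], List[dict]],
    start: int = 1,
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield items page by page until an empty page or the safety cap."""
    for page in range(start, start + max_pages):
        items = fetch_page(page)
        if not items:  # an empty page signals the end of the data
            return
        yield from items


# Stand-in for a real HTTP call: 45 fake posts served 20 per page.
FAKE_POSTS = [{"id": i} for i in range(1, 46)]


def fake_fetch(page: int, limit: int = 20) -> List[dict]:
    begin = (page - 1) * limit
    return FAKE_POSTS[begin:begin + limit]


posts = list(iter_pages(fake_fetch))
print(f"Total: {len(posts)} posts")
```

Passing the fetch function in as an argument keeps the pagination logic testable without touching the network.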

Offset-Based Pagination

Some APIs take an offset (how many records to skip) and a limit (how many to return) instead of a page number:

import requests

all_items = []
offset = 0
limit = 50

while True:
    response = requests.get(
        "https://api.example.com/products",
        params={"offset": offset, "limit": limit},
        timeout=15,
    )
    response.raise_for_status()
    data = response.json()

    items = data.get("results", [])
    if not items:
        break

    all_items.extend(items)
    offset += limit

    # Stop if we've reached the total
    if offset >= data.get("total", float("inf")):
        break

print(f"Collected {len(all_items)} items")
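
When the API reports a reliable `total`, the offsets you will request can be worked out in advance, which helps with progress reporting. The helper below is a small sketch of that arithmetic; `offsets_for` is a hypothetical name, and the numbers are made up for illustration.

```python
from typing import List


def offsets_for(total: int, limit: int) -> List[int]:
    """Return every offset needed to cover `total` records, `limit` at a time."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return list(range(0, total, limit))


offsets = offsets_for(total=120, limit=50)
print(offsets)       # [0, 50, 100]
print(len(offsets))  # 3 requests; the last one returns only 20 records
```

If the dataset changes while you are paging, a precomputed list like this can skip or repeat records, so treat it as an estimate rather than a guarantee.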

Cursor-Based Pagination

Many modern APIs (Twitter, Shopify, and Stripe, for example) use cursors. Each response includes an opaque token that points to the next batch, and you send that token back with the following request:

import requests

all_items = []
cursor = None

while True:
    params = {"limit": 100}
    if cursor:
        params["cursor"] = cursor

    response = requests.get(
        "https://api.example.com/orders",
        params=params,
        timeout=15,
    )
    response.raise_for_status()
    data = response.json()

    all_items.extend(data["items"])
    cursor = data.get("next_cursor")

    if not cursor:
        break

print(f"Total orders: {len(all_items)}")
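
A cursor loop never ends if a misbehaving API keeps returning the same token. The sketch below adds a guard against that. It assumes the same response shape as the example above (`items` and `next_cursor`); `iter_cursor`, `fetch`, and the fake pages are hypothetical names used only for illustration.

```python
from typing import Callable, Iterator, Optional


def iter_cursor(fetch: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Follow next_cursor tokens until none is returned."""
    cursor = None
    seen = set()
    while True:
        data = fetch(cursor)
        yield from data.get("items", [])
        cursor = data.get("next_cursor")
        if not cursor:
            return
        if cursor in seen:
            # A repeated token would loop forever, so fail loudly instead.
            raise RuntimeError(f"cursor {cursor!r} repeated; aborting")
        seen.add(cursor)


# Stand-in for the HTTP call: three fake pages keyed by cursor.
FAKE_PAGES = {
    None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "abc"},
    "abc": {"items": [{"id": 3}, {"id": 4}], "next_cursor": "def"},
    "def": {"items": [{"id": 5}], "next_cursor": None},
}


def fake_fetch(cursor: Optional[str]) -> dict:
    return FAKE_PAGES[cursor]


orders = list(iter_cursor(fake_fetch))
print(f"Total orders: {len(orders)}")
```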

Link-Header Pagination

Some APIs put the next URL in the Link HTTP header (GitHub does this):

import requests

url = "https://api.github.com/users/octocat/repos"
all_repos = []

params = {"per_page": 30}  # only needed on the first request

while url:
    response = requests.get(url, params=params, timeout=15)
    response.raise_for_status()
    all_repos.extend(response.json())

    # requests parses the Link header into response.links, keyed by rel.
    # The "next" URL already carries its own query string, so stop sending
    # params after the first request to avoid duplicating per_page.
    url = response.links.get("next", {}).get("url")
    params = None

print(f"Total repos: {len(all_repos)}")
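
If your HTTP client does not parse the Link header for you, a small regular expression is usually safer than splitting on commas, because a URL can itself contain a comma. The sketch below assumes `rel` is the first parameter after each URL, which matches the format GitHub uses. `parse_link_header` is a hypothetical helper, and the sample header is illustrative rather than a captured response.

```python
import re
from typing import Dict

# Matches: <url>; rel="name"
LINK_PATTERN = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def parse_link_header(header: str) -> Dict[str, str]:
    """Map each rel value (next, last, ...) to its URL."""
    return {rel: url for url, rel in LINK_PATTERN.findall(header)}


sample = (
    '<https://api.example.com/items?ids=1,2&page=2>; rel="next", '
    '<https://api.example.com/items?ids=1,2&page=9>; rel="last"'
)
links = parse_link_header(sample)
print(links.get("next"))
```

A missing `next` key means you have reached the last page, so `links.get("next")` returning `None` is the stop condition.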

Pagination Patterns at a Glance

Type	Parameter	Stop Condition
Page-number	`page=1,2,3...`	Empty response
Offset	`offset=0,50,100...`	`offset >= total`
Cursor	`cursor=abc123`	No `next_cursor`
Link header	URL in header	No `rel="next"`

When scraping paginated APIs at scale, use ScraperAPI to handle proxy rotation and avoid hitting rate limits across thousands of paginated requests.

Next Steps

Handle rate limiting between pagination requests
Process responses with async HTTPX for speed
Store paginated results incrementally to avoid data loss
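
As a starting point for the last item, here is one way to store results incrementally: append each page to a JSON Lines file as soon as it arrives, so a crash on page 500 does not lose the first 499. This is a minimal sketch. `append_page`, `load_all`, and the file name are hypothetical, and the two hard-coded pages stand in for real API responses.

```python
import json
import os
import tempfile
from typing import Iterable, List


def append_page(path: str, items: Iterable[dict]) -> int:
    """Append one page of items to a JSON Lines file; return the count written."""
    count = 0
    with open(path, "a", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")
            count += 1
    return count


def load_all(path: str) -> List[dict]:
    """Read every stored item back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


out_path = os.path.join(tempfile.mkdtemp(), "posts.jsonl")
for page_items in ([{"id": 1}, {"id": 2}], [{"id": 3}]):
    append_page(out_path, page_items)

print(len(load_all(out_path)))
```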