Scraping Paginated APIs
Learn how to handle page-based, offset-based, cursor-based, and Link-header pagination when scraping APIs with Python.
APIs rarely return all results at once. Instead they split data across pages. You need to iterate through each page to collect the complete dataset.
Page-Number Pagination
The simplest pattern: increment a page parameter on each request until the API returns an empty page.
import requests

all_posts = []
page = 1

while True:
    response = requests.get(
        "https://jsonplaceholder.typicode.com/posts",
        params={"_page": page, "_limit": 20},
        timeout=15,
    )
    response.raise_for_status()
    data = response.json()

    if not data:
        break

    all_posts.extend(data)
    print(f"Page {page}: got {len(data)} posts")
    page += 1

print(f"Total: {len(all_posts)} posts")
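Waiting for an empty page costs one extra request. If the API always fills every page except the last, you can stop as soon as a page comes back shorter than the limit. The sketch below illustrates that idea under that assumption. `collect_pages`, `fake_fetch`, and `RECORDS` are hypothetical names, and the in-memory list stands in for a real API so the example runs offline.

```python
from typing import Callable, List


def collect_pages(
    fetch_page: Callable[[int, int], List[dict]],
    limit: int = 20,
    max_pages: int = 1000,
) -> List[dict]:
    """Collect items page by page, stopping at the first short or empty page."""
    items: List[dict] = []
    # max_pages is a safety cap in case the API never returns a short page.
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, limit)
        items.extend(batch)
        # Assumption: a page with fewer than `limit` items is the last one.
        if len(batch) < limit:
            break
    return items


# Hypothetical in-memory stand-in for an API holding 45 records.
RECORDS = [{"id": i} for i in range(1, 46)]
calls: List[int] = []


def fake_fetch(page: int, limit: int) -> List[dict]:
    calls.append(page)  # record which pages were requested
    start = (page - 1) * limit
    return RECORDS[start:start + limit]


posts = collect_pages(fake_fetch, limit=20)
print(len(posts), calls)  # 45 [1, 2, 3]
```

With 45 records and a limit of 20, the loop makes three requests instead of four. If an API can return short pages in the middle of a result set, keep the empty-page check instead.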
Offset-Based Pagination
Some APIs use offset and limit instead of page numbers:
import requests

all_items = []
offset = 0
limit = 50

while True:
    response = requests.get(
        "https://api.example.com/products",
        params={"offset": offset, "limit": limit},
        timeout=15,
    )
    response.raise_for_status()
    data = response.json()

    items = data.get("results", [])
    if not items:
        break

    all_items.extend(items)
    offset += limit

    # Stop if we've reached the total
    if offset >= data.get("total", float("inf")):
        break

print(f"Collected {len(all_items)} items")
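When the API reports a total, the offsets you will request are simple arithmetic. The helper below is a small illustration, not part of any library; `page_offsets` is a hypothetical name.

```python
import math
from typing import List


def page_offsets(total: int, limit: int) -> List[int]:
    """Return the offset of every request needed to cover `total` items."""
    return list(range(0, total, limit))


offsets = page_offsets(total=120, limit=50)
print(offsets)               # [0, 50, 100]
print(math.ceil(120 / 50))   # 3 requests; the last one returns 20 items
```

This matches the loop above: after the third request the offset reaches 150, which is at or past the total of 120, so the loop stops.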
Cursor-Based Pagination
Many modern APIs (for example Twitter, Shopify, and Stripe) use cursors. Each response includes a token pointing to the next batch, and you send that token back on the following request:
import requests

all_items = []
cursor = None

while True:
    params = {"limit": 100}
    if cursor:
        params["cursor"] = cursor

    response = requests.get(
        "https://api.example.com/orders",
        params=params,
        timeout=15,
    )
    response.raise_for_status()
    data = response.json()

    all_items.extend(data["items"])

    cursor = data.get("next_cursor")
    if not cursor:
        break

print(f"Total orders: {len(all_items)}")
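To make the token mechanics concrete, here is a minimal offline sketch of the same loop written as a generator, paired with a fake API. Everything in it is an assumption for illustration: `iter_cursor_items`, `fake_fetch`, and `ORDERS` are hypothetical names, and the fake cursor is just the index of the next item. Real APIs usually return opaque tokens that you should pass back unchanged.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_cursor_items(fetch: Callable[[Optional[str]], Dict]) -> Iterator[dict]:
    """Yield items one by one, following next_cursor until it is missing."""
    cursor: Optional[str] = None
    while True:
        data = fetch(cursor)
        yield from data["items"]
        cursor = data.get("next_cursor")
        if not cursor:
            break


# Hypothetical in-memory API holding 7 orders.
ORDERS = [{"id": n} for n in range(1, 8)]


def fake_fetch(cursor: Optional[str], limit: int = 3) -> Dict:
    # In this fake, the cursor encodes the index of the next unread item.
    start = int(cursor) if cursor else 0
    end = start + limit
    next_cursor = str(end) if end < len(ORDERS) else None
    return {"items": ORDERS[start:end], "next_cursor": next_cursor}


orders = list(iter_cursor_items(fake_fetch))
print(len(orders))  # 7
```

A generator lets the caller process or store each item as it arrives instead of holding the whole dataset in memory first.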
Link-Header Pagination
Some APIs put the next URL in the Link HTTP header (GitHub does this):
import requests

url = "https://api.github.com/users/octocat/repos"
params = {"per_page": 30}  # only needed on the first request
all_repos = []

while url:
    response = requests.get(url, params=params, timeout=15)
    response.raise_for_status()
    all_repos.extend(response.json())

    # The next URL already carries its own query string,
    # so don't append per_page a second time
    params = None

    # Parse Link header for next page
    link_header = response.headers.get("Link", "")
    url = None
    for part in link_header.split(","):
        if 'rel="next"' in part:
            url = part.split(";")[0].strip(" <>")
            break

print(f"Total repos: {len(all_repos)}")
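Splitting the header on every comma can misfire if a URL itself contains a comma. The sketch below is a slightly more careful standard-library parser; `next_url_from_link_header` is a hypothetical helper, and the sample header is made up. Note that `requests` also exposes the parsed header as `response.links`, so in practice `response.links.get("next", {}).get("url")` is usually enough.

```python
import re
from typing import Optional


def next_url_from_link_header(header: str) -> Optional[str]:
    """Return the URL tagged rel="next" in a Link header, or None."""
    # Split only on commas that are followed by the "<" of a new link.
    for part in re.split(r",\s*(?=<)", header):
        match = re.match(r"\s*<([^>]+)>(.*)", part)
        if not match:
            continue
        url, params = match.groups()
        rel = re.search(r'rel\s*=\s*"?([^";]+)"?', params)
        # A rel value may list several relation types separated by spaces.
        if rel and "next" in rel.group(1).split():
            return url
    return None


# Made-up header in the style GitHub returns.
sample = (
    '<https://api.example.com/repos?per_page=30&page=2>; rel="next", '
    '<https://api.example.com/repos?per_page=30&page=4>; rel="last"'
)
print(next_url_from_link_header(sample))
# https://api.example.com/repos?per_page=30&page=2
```

On the last page there is no `rel="next"` entry, the function returns `None`, and a `while url:` loop ends.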
Pagination Patterns at a Glance
| Type | Parameter | Stop Condition |
|---|---|---|
| Page-number | page=1,2,3... | Empty response |
| Offset | offset=0,50,100... | offset >= total |
| Cursor | cursor=abc123 | No next_cursor |
| Link header | URL in header | No rel="next" |
When scraping paginated APIs at scale, use ScraperAPI to handle proxy rotation and avoid hitting rate limits across thousands of paginated requests.
Next Steps
- Handle rate limiting between pagination requests
- Process responses with async HTTPX for speed
- Store paginated results incrementally to avoid data loss
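As a starting point for the first and last items above, here is a minimal sketch that pauses between pages and appends each page to a JSON Lines file as soon as it arrives. `save_pages` is a hypothetical helper, and the fixed one-second delay is an assumption; check the API's documented rate limits for a real value.

```python
import json
import time
from typing import Iterable, List


def save_pages(pages: Iterable[List[dict]], path: str, delay: float = 1.0) -> int:
    """Append each page to a JSON Lines file as it arrives.

    Returns the number of items written.
    """
    written = 0
    with open(path, "a", encoding="utf-8") as fh:
        for batch in pages:
            for item in batch:
                fh.write(json.dumps(item) + "\n")
            # Flush after every page so a crash loses at most the page in flight.
            fh.flush()
            written += len(batch)
            # Fixed pause before asking for the next page (assumed polite rate).
            time.sleep(delay)
    return written
```

If `pages` is a generator that performs one request per page, the pause falls between requests because the next page is only fetched when the loop asks for it. For example, `save_pages(my_page_generator(), "orders.jsonl")`, where `my_page_generator` is whichever of the loops above you have wrapped in a generator.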