Pagination: The 5 Common Patterns and How to Detect Them
Every paginated site uses one of five patterns: numbered, offset, cursor, load-more, or unknown-end. Identify which, scrape it correctly, stop at the right time.
What you’ll learn
- Recognise the 5 pagination patterns from URL and DOM evidence.
- Implement a paging loop for each, with correct stop conditions.
- Handle 'unknown total' gracefully without infinite loops.
- Choose between following the rendered Next link and constructing URLs.
Pagination is where scrapers most often go wrong: missing pages, infinite loops, duplicate items, off-by-one bugs. The good news: there are only five common patterns. Once you can spot which one a site uses, the scraping is mechanical.
The five patterns
| Pattern | URL shape | Stop condition | Catalog108 lab |
|---|---|---|---|
| Numbered | `?page=N` | Last page known or empty list | /challenges/static/pagination/numbered |
| Offset / limit | `?offset=20&limit=20` | `total` field or empty page | /challenges/static/pagination/offset |
| Cursor | `?cursor=abc123` | `next_cursor` is empty/null | /challenges/static/pagination/cursor |
| Load-more (HTTP) | "Load more" link triggers a GET | No more "next" link | /challenges/static/pagination/load-more-http |
| Unknown end | Any of above, but total not exposed | Empty page or duplicate detection | /challenges/static/pagination/unknown-end |
Pattern 1: Numbered pagination
The classic. ?page=1, ?page=2, etc. The page renders a list of page numbers OR a "Next" link.
```python
import requests
from bs4 import BeautifulSoup

BASE = "https://practice.scrapingcentral.com"

all_items = []
for page in range(1, 1000):  # 1000 is a safety upper bound, not the real limit
    r = requests.get(
        f"{BASE}/challenges/static/pagination/numbered",
        params={"page": page},
        timeout=10,
    )
    r.raise_for_status()
    soup = BeautifulSoup(r.content, "lxml")
    items = soup.select(".item")
    if not items:
        break
    all_items.extend(item.get_text(strip=True) for item in items)
    # Optional faster exit: check for a visible "Next" link
    if not soup.select_one("a.next"):
        break

print(len(all_items))
```
Two stop conditions:
- An empty item list: the clearest signal that we've gone past the end.
- No "Next" link: an anti-overshoot check, useful when the number of items per page is inconsistent.

Always keep an outer upper bound on the loop too (`range(1, 1000)`). If both stop signals fail (a bug, a layout change), at least your loop terminates.
Pattern 2: Offset / limit
Often used in API-style URLs:
```python
all_items = []
offset = 0
limit = 20
while True:
    r = requests.get(
        f"{BASE}/challenges/static/pagination/offset",
        params={"offset": offset, "limit": limit},
        timeout=10,
    )
    r.raise_for_status()
    items = r.json()["items"]
    if not items:
        break
    all_items.extend(items)
    offset += limit
```
If the response includes a total count, use it for early termination and a progress bar:
```python
data = r.json()
total = data["total"]
print(f"{offset + len(items)}/{total}")
if offset + len(items) >= total:
    break
```
Pattern 3: Cursor-based
The server gives you an opaque next_cursor token. You send it back on the next request; you stop when you get an empty/null cursor.
```python
all_items = []
cursor = None
while True:
    params = {}
    if cursor is not None:
        params["cursor"] = cursor
    r = requests.get(f"{BASE}/challenges/static/pagination/cursor", params=params)
    data = r.json()
    all_items.extend(data["items"])
    cursor = data.get("next_cursor")
    if not cursor:
        break
```
Don't parse or guess the cursor format. Treat it as opaque. The server is the source of truth on "what comes next."
Cursor pagination is the most reliable for huge or rapidly changing datasets: adding or removing items mid-scrape doesn't shift your position the way offset pagination would.
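That drift is easy to demonstrate with a toy in-memory list standing in for the server (a sketch only, no real requests involved):

```python
# Toy in-memory "server": just a list and slicing, no network.
data = [f"item-{i}" for i in range(10)]

def fetch_offset(offset, limit=3):
    return data[offset:offset + limit]

collected = []
collected += fetch_offset(0)    # first page: item-0, item-1, item-2
data.insert(0, "item-new")      # something is published mid-scrape
collected += fetch_offset(3)    # everything shifted down: item-2 repeats

# A deletion shifts the other way and silently skips an item instead.
```

A cursor pinned to "the item after item-2" would be unaffected by either change, which is exactly why servers hand out opaque cursors.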
Pattern 4: Load-more (HTTP)
Some pages have a "Load more" button that does a regular HTTP GET (no JS) for the next chunk. The trick is finding that URL:
```python
from urllib.parse import urljoin

url = f"{BASE}/challenges/static/pagination/load-more-http"
while url:
    r = requests.get(url, timeout=10)
    soup = BeautifulSoup(r.content, "lxml")
    all_items.extend(it.get_text(strip=True) for it in soup.select(".item"))
    next_btn = soup.select_one("a.load-more")
    # Resolve relative hrefs against the current URL
    url = urljoin(url, next_btn["href"]) if next_btn else None
```
You follow the rendered "Load more" link until it disappears. The URL often contains an offset, cursor, or session ID; let the server set it rather than constructing it yourself.
If the button is client-side JS-only (no <a href>), it's not actually static pagination; it's an XHR. Open DevTools → Network → click "Load more" once, find the request, and replicate it. That's API-scraping territory (Sub-Path 4).
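Replicating such an XHR usually reduces to the same empty-page loop. A minimal sketch, with the fetch step injectable so the loop can be exercised without a network; the endpoint URL and the `page`/`items` names are assumptions, so copy the real ones from the Network tab:

```python
import requests

def fetch_all(api_url, fetch=None, max_pages=1000):
    """Page through a JSON endpoint until it returns an empty list.

    `fetch` takes a page number and returns a list of items; the
    default does a real GET with assumed parameter and key names.
    """
    if fetch is None:
        def fetch(page):
            r = requests.get(api_url, params={"page": page}, timeout=10)
            r.raise_for_status()
            return r.json().get("items", [])
    items = []
    for page in range(1, max_pages):
        chunk = fetch(page)
        if not chunk:
            break
        items.extend(chunk)
    return items
```

The injectable `fetch` also makes the stop logic unit-testable with a canned dict of pages.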
Pattern 5: Unknown end
No page count, no total, no cursor that explicitly says "this is the last." You only know you're done when the page returns nothing new.
Two failure modes to avoid:
- An infinite loop if the server returns the same data on overshoot (e.g. `?page=999` returns page 1 content).
- Missing the last page if your stop signal triggers prematurely.
The robust approach: track a fingerprint of what you've seen:
```python
seen_first_item = None
page = 1
while page < 10000:  # outer safety bound
    r = requests.get(url, params={"page": page})
    soup = BeautifulSoup(r.content, "lxml")
    items = soup.select(".item")
    if not items:
        break
    first = items[0].get_text(strip=True)
    if first == seen_first_item:
        break  # server is looping
    seen_first_item = first
    all_items.extend(it.get_text(strip=True) for it in items)
    page += 1
```
For really paranoid scrapes, dedupe at the data level (Lesson 1.33). Two same-content pages in a row is your stop signal.
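A minimal data-level dedupe sketch along those lines: keep a set of item fingerprints and treat a page of all-repeats as the stop signal (the helper name is hypothetical, not part of the lab):

```python
seen = set()
all_items = []

def add_page(items):
    """Record one page of items; return False when every item was a repeat."""
    new = [it for it in items if it not in seen]
    seen.update(new)
    all_items.extend(new)
    return bool(new)

# Inside a paging loop:
#     if not add_page(page_items):
#         break  # the whole page was duplicates: server is looping
```

This is stronger than comparing only the first item, because it also catches partial overlaps between pages.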
How to identify which pattern from the page
Open DevTools, look at the URL when you click "Next" or scroll:
- URL contains `?page=N` → numbered.
- URL contains `?offset=N` → offset.
- URL contains a random-looking `cursor=value` → cursor.
- A button on the page does a GET to a deeper URL → load-more.
- None of the above visible → look at the API the JS calls (DevTools → Network).
Follow rendered links vs. construct URLs
Two philosophies:
- Construct: predict the URL pattern (`?page=N`) and increment. Faster, but assumes you know the pattern.
- Follow: parse the "Next" link from the page itself. Slower (one extra parse per page) but works on weird formats, signed URLs, or session-tied cursors.
For unknown sites, follow first. Once you've confirmed the pattern, switch to construct for speed if needed.
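The core step of the follow philosophy is extracting the absolute Next URL from the page itself. A small sketch; the `a.next` selector is an assumption, so adjust it to the site's markup:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def next_url(html, current_url, selector="a.next"):
    """Return the absolute URL of the Next link, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one(selector)
    if link is None or not link.get("href"):
        return None
    # Next links are often relative; resolve against the current URL
    return urljoin(current_url, link["href"])
```

Drive the loop with `while url:` fetching, extracting items, then `url = next_url(r.text, url)`.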
Polite paging
Add a small delay between page requests; Lesson 1.28 covers polite scraping in depth. For now:
```python
import time

for page in range(1, 100):
    ...
    time.sleep(0.5)  # half a second between page fetches
```
Even a 0.2s delay is enough to avoid hammering most servers. Total time on a 100-page scrape: 20 seconds added. Worth it.
A unified paging helper
```python
import time
import requests

def paginate_numbered(url, params=None, max_pages=10000, sleep=0.5):
    page = 1
    params = dict(params or {})
    while page < max_pages:
        params["page"] = page
        r = requests.get(url, params=params, timeout=10)
        r.raise_for_status()
        yield r
        page += 1
        time.sleep(sleep)
```
A generator that yields each response. The caller checks for emptiness and breaks. Reusable across most numbered-pagination sites.
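The caller's side of that contract can be sketched as a generic consumer that stops at the first empty page (a hypothetical helper; it works with any response-yielding generator):

```python
def collect(pages, extract):
    """Consume a generator of responses, stopping at the first empty page.

    `extract` maps one response to a list of items, so the same consumer
    works for HTML pages (parse, then select) or JSON APIs.
    """
    out = []
    for resp in pages:
        items = extract(resp)
        if not items:
            break
        out.extend(items)
    return out
```

For example: `collect(paginate_numbered(url), lambda r: r.json()["items"])` for a JSON endpoint, or an `extract` that runs BeautifulSoup over `r.content` for HTML.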
Hands-on lab
All five pagination challenges have their own URLs under /challenges/static/pagination/. Pick one (start with numbered), identify the pattern from the URL when you click pagination links in your browser, then write the paging loop. Repeat with offset, cursor, and load-more-http. Finally, test your loop against unknown-end; it should terminate cleanly without an infinite loop.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.