Scraping Infinite Scroll Pages
Learn techniques to scrape infinite scroll pages using Playwright and Selenium. Handle lazy-loaded content and extract all data from endlessly scrolling websites.
Infinite scroll pages load new content as the user scrolls down, replacing traditional pagination. Sites like Twitter, Instagram, Pinterest, and many e-commerce platforms use this pattern. Scraping these pages requires automating the scroll action and waiting for new content to load after each scroll.
Playwright Approach
from playwright.sync_api import sync_playwright
import time
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/scroll")
    page.wait_for_selector(".quote")
    all_quotes = set()
    previous_count = 0
    while True:
        # Scroll to the bottom of the page
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Wait for new content to load
        time.sleep(2)
        # Extract current quotes (the set ignores ones already seen)
        quotes = page.query_selector_all(".quote .text")
        for q in quotes:
            all_quotes.add(q.inner_text())
        # Stop if no new content loaded
        if len(all_quotes) == previous_count:
            break
        previous_count = len(all_quotes)
    print(f"Scraped {len(all_quotes)} quotes")
    browser.close()
Selenium Approach
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import time
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://quotes.toscrape.com/scroll")
time.sleep(2)
# Track the page height to detect when scrolling stops loading content
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)
    new_height = driver.execute_script("return document.body.scrollHeight")
    # Stop once the page no longer grows
    if new_height == last_height:
        break
    last_height = new_height
# Extract everything after all content has loaded
quotes = driver.find_elements(By.CSS_SELECTOR, ".quote .text")
print(f"Found {len(quotes)} quotes")
driver.quit()
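The fixed two-second sleep in the loop above is either wasted time or too short on a slow connection. A minimal sketch of one alternative polls the page height until it grows or a timeout expires. The helper names wait_for_taller_page and scroll_to_end are hypothetical, and the only Selenium call they rely on is driver.execute_script():

```python
import time


def wait_for_taller_page(driver, last_height, timeout=5.0, poll=0.25):
    """Poll document.body.scrollHeight until it exceeds last_height.

    Returns the new height, or None if the page did not grow within
    `timeout` seconds. `driver` is any object exposing Selenium's
    execute_script() method.
    """
    deadline = time.monotonic() + timeout
    while True:
        height = driver.execute_script("return document.body.scrollHeight")
        if height > last_height:
            return height
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


def scroll_to_end(driver, timeout=5.0):
    """Scroll until the page stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        scrolls += 1
        new_height = wait_for_taller_page(driver, last_height, timeout)
        if new_height is None:
            # No growth within the timeout: assume the end was reached
            return scrolls
        last_height = new_height
```

With this in place, the while loop in the script above could be replaced by a single scroll_to_end(driver) call before the find_elements() extraction. The timeout then only costs time on the final scroll, when nothing more loads.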
Smarter Scroll: Wait for Network Idle
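Instead of a fixed time.sleep() delay, you can wait for network requests to finish. Playwright's "networkidle" load state is reached once there have been no network connections for at least 500 ms: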
# Playwright, wait for network to be idle after scrolling
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
page.wait_for_load_state("networkidle")
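Network idle is a page-level signal, so if the page had already reached that state before the scroll, the wait may return right away. A sketch of a more targeted wait checks for the thing you actually care about: more matching elements than before. The helper name scroll_and_wait_for_more and its defaults are hypothetical, and page is assumed to be a Playwright sync Page:

```python
try:
    from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
except ImportError:
    # Fallback so the helper can be imported without Playwright installed
    PlaywrightTimeoutError = TimeoutError


def scroll_and_wait_for_more(page, selector=".quote", timeout_ms=5000):
    """Scroll to the bottom, then wait until more `selector` elements exist.

    Returns True if new items appeared, False if the wait timed out.
    """
    before = page.locator(selector).count()
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    try:
        # Runs in the browser and polls until the element count exceeds `before`
        page.wait_for_function(
            "([sel, n]) => document.querySelectorAll(sel).length > n",
            arg=[selector, before],
            timeout=timeout_ms,
        )
    except PlaywrightTimeoutError:
        return False
    return True
```

A scraping loop can then be written as `while scroll_and_wait_for_more(page): pass`, followed by a single extraction step once it returns False.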
Scrolling Inside a Container
Sometimes the scrollable element is not the page itself but a specific div:
# Playwright, scroll a specific container
page.evaluate("""
const container = document.querySelector('.results-container');
container.scrollTop = container.scrollHeight;
""")
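The snippet above raises a JavaScript error if the selector matches nothing, and the selector is hard-coded. A small sketch of a reusable variant passes the selector as an argument to page.evaluate() and reports whether the container was found. The helper name scroll_container is hypothetical, and '.results-container' is the same placeholder selector as above:

```python
# JavaScript function executed in the page; receives the selector as its argument
SCROLL_CONTAINER_JS = """(selector) => {
    const el = document.querySelector(selector);
    if (!el) return null;            // container not found
    el.scrollTop = el.scrollHeight;  // jump to the bottom of the container
    return el.scrollTop;             // position after scrolling
}"""


def scroll_container(page, selector=".results-container"):
    """Scroll the element matching `selector` to its bottom.

    `page` is assumed to be a Playwright sync Page. Returns the container's
    new scrollTop, or None if no element matched the selector.
    """
    return page.evaluate(SCROLL_CONTAINER_JS, selector)
```

Comparing the returned scrollTop between calls gives a stop condition for containers, in the same way the page height does for whole-page scrolling.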
Setting a Scroll Limit
To avoid scraping forever, set a maximum number of scrolls:
MAX_SCROLLS = 50
for i in range(MAX_SCROLLS):
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    page.wait_for_load_state("networkidle")
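As written, the loop always performs all 50 scrolls, even after the page has run out of content. A sketch that combines the cap with the height check from the Selenium example stops at whichever comes first. The helper name scroll_with_limit and the settle_ms delay are assumptions for illustration, and page is assumed to be a Playwright sync Page:

```python
def scroll_with_limit(page, max_scrolls=50, settle_ms=1000):
    """Scroll up to `max_scrolls` times, stopping early once the page
    height stops growing. Returns the number of scrolls performed.
    """
    last_height = page.evaluate("document.body.scrollHeight")
    for scrolls in range(1, max_scrolls + 1):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Give lazy-loaded content a moment to arrive
        page.wait_for_timeout(settle_ms)
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new loaded: the end of the feed was reached
            return scrolls
        last_height = new_height
    # Cap reached while the page was still growing
    return max_scrolls
```

A return value equal to max_scrolls suggests the feed was cut off by the cap rather than exhausted, which is worth logging.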
Easier Alternative
Infinite scroll scraping is resource-intensive and slow: every batch of items costs a scroll and a wait inside a full browser. If the site's data is available through an underlying API (check the Network tab in DevTools while scrolling), calling that API directly is far more efficient. For sites without a usable API, ScraperAPI offers built-in infinite scroll handling via its render option, saving you from managing browser automation yourself.
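As a sketch of the direct-API route, the example below pages through a JSON endpoint with the standard library only. The endpoint URL and the field names ("quotes", "text", "has_next") are assumptions about the demo site used earlier; confirm them in the Network tab before relying on them. The helper names are hypothetical:

```python
import json
from urllib.request import urlopen

# Assumed endpoint for the demo site; verify it in the DevTools Network tab
API_URL = "https://quotes.toscrape.com/api/quotes?page={page}"


def fetch_json(url):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def fetch_all_quotes(get_json=fetch_json, max_pages=100):
    """Collect quote texts page by page until the API reports no next page.

    `get_json` is injectable so the paging logic can be exercised offline.
    """
    quotes = []
    for page in range(1, max_pages + 1):
        data = get_json(API_URL.format(page=page))
        # Assumed response shape: {"quotes": [{"text": ...}], "has_next": bool}
        quotes.extend(q["text"] for q in data.get("quotes", []))
        if not data.get("has_next"):
            break
    return quotes
```

No browser is launched at all here, and max_pages plays the same role as the scroll limit in the previous section.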
Next Steps
- Handle forms, dropdowns, and click interactions
- Learn browser fingerprinting and stealth techniques
- Intercept network requests to find hidden APIs