How to Scrape Single Page Applications (SPAs)
Learn how to scrape Single Page Applications built with React, Vue, and Angular. Covers rendering techniques, API interception, and practical examples.
Single Page Applications (SPAs) built with React, Vue, or Angular render their content dynamically in the browser, so a standard HTTP request often returns little more than an empty HTML shell. Here is how to extract data from them.
Why SPAs Are Hard to Scrape
import requests
from bs4 import BeautifulSoup

# A plain HTTP request gets only the app shell, not the rendered products
response = requests.get("https://spa-website.com/products", timeout=30)
soup = BeautifulSoup(response.text, "html.parser")
print(soup.find_all("div", class_="product"))  # [] - empty!
The HTML source only contains a root <div> and JavaScript bundles. All content is rendered client-side after JavaScript execution.
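Before choosing a strategy, it can help to confirm that a page really is an empty shell. The following is a minimal sketch using only the standard library; the function name `looks_like_spa_shell` and the 200-character threshold are illustrative assumptions, not an established rule.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that sits outside script, style, noscript, and template tags."""

    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def looks_like_spa_shell(html, min_chars=200):
    """Return True when the raw HTML carries very little visible text.

    min_chars is an arbitrary threshold chosen for illustration; tune it per site.
    """
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    visible = " ".join(parser.chunks)
    return len(visible) < min_chars
```

Passing `response.text` from a plain request to this function gives a rough signal: `True` suggests the content is rendered client-side and one of the strategies below is needed.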
Strategy 1: Find the Hidden API
SPAs fetch their data from backend APIs. Finding and calling these APIs directly is usually the fastest and most reliable approach.
from playwright.sync_api import sync_playwright

api_calls = []

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    def capture_api(response):
        # Record every response that declares a JSON content type
        content_type = response.headers.get("content-type", "")
        if "application/json" in content_type:
            try:
                api_calls.append({
                    "url": response.url,
                    "status": response.status,
                    "data": response.json()
                })
            except Exception:
                # Body missing or not valid JSON; skip this response
                pass

    # Register the listener before navigating so early requests are captured
    page.on("response", capture_api)
    page.goto("https://spa-website.com/products")
    page.wait_for_load_state("networkidle")
    browser.close()

# Now you have the API endpoints
for call in api_calls:
    print(f"API: {call['url']}")
    print(f"Data keys: {list(call['data'].keys()) if isinstance(call['data'], dict) else 'array'}")
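A page often triggers many JSON requests (config, analytics, feature flags), so the captured list may need narrowing. The sketch below assumes records shaped like the `api_calls` entries above and ranks the responses that carry a list of objects, which is frequently where listing data lives. The helper name `find_list_endpoints` and the `min_items` cutoff are hypothetical.

```python
def find_list_endpoints(calls, min_items=2):
    """Return (url, key, item_count) tuples for responses holding a list of objects.

    A top-level JSON array is reported with key None. Results are sorted so the
    largest lists come first.
    """
    hits = []
    for call in calls:
        data = call["data"]
        if isinstance(data, list):
            fields = {None: data}
        elif isinstance(data, dict):
            fields = data
        else:
            fields = {}
        for key, value in fields.items():
            if (
                isinstance(value, list)
                and len(value) >= min_items
                and all(isinstance(item, dict) for item in value)
            ):
                hits.append((call["url"], key, len(value)))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)
```

Calling `find_list_endpoints(api_calls)` after the capture run gives a short list of URLs worth inspecting by hand in DevTools.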
Once you know the API endpoint, call it directly:
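import requests

# Direct API call - no browser needed
# (example endpoint; use the URL and parameters you captured above)
response = requests.get(
    "https://api.spa-website.com/products?page=1&limit=50",
    timeout=30,
)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
data = response.json()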
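Endpoints with `page` and `limit` parameters like the one above usually need to be walked page by page. Here is a minimal sketch of that loop, assuming the API signals the end with a short or empty page; `fetch_all_pages` and the `fetch_page` callback are hypothetical names, and the real stopping rule depends on the API.

```python
def fetch_all_pages(fetch_page, limit=50, max_pages=100):
    """Collect items page by page until a page comes back shorter than `limit`.

    fetch_page(page, limit) must return a list of items for that page.
    max_pages is a safety cap so a misbehaving API cannot loop forever.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, limit)
        items.extend(batch)
        if len(batch) < limit:
            break
    return items
```

In practice `fetch_page` would wrap the `requests.get` call shown above and return whichever field of the JSON body holds the item list. Keeping the HTTP call in a callback also makes the loop easy to test without touching the network.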
Strategy 2: ScraperAPI with Rendering
When you cannot find or access the API directly, use ScraperAPI with JavaScript rendering.
import requests
from bs4 import BeautifulSoup

response = requests.get(
    "https://api.scraperapi.com",  # HTTPS keeps the API key out of plaintext traffic
    params={
        "api_key": "YOUR_SCRAPERAPI_KEY",
        "url": "https://spa-website.com/products",
        "render": "true",
        "wait_for_selector": ".product-card"  # Wait for content
    },
    timeout=90,  # rendered requests are slow; allow extra time
)

soup = BeautifulSoup(response.text, "html.parser")
products = soup.find_all("div", class_="product-card")
print(f"Found {len(products)} products")
Strategy 3: Playwright with Wait Conditions
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://spa-website.com/products")

    # Wait for the actual content to render
    page.wait_for_selector(".product-card", timeout=10000)

    # Handle infinite scroll if needed: scroll until no new cards appear
    prev_count = 0
    for _ in range(50):  # arbitrary cap so an endless feed cannot loop forever
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)
        products = page.query_selector_all(".product-card")
        if len(products) == prev_count:
            break
        prev_count = len(products)

    # Extract data, skipping cards that lack a title or price element
    for product in products:
        title = product.query_selector(".title")
        price = product.query_selector(".price")
        if title and price:
            print(f"{title.inner_text()}: {price.inner_text()}")

    browser.close()
Decision Tree
- Check for __NEXT_DATA__ or embedded JSON in the page source first
- Intercept API calls using browser DevTools or Playwright
- Call APIs directly if they do not require authentication
- Use ScraperAPI with rendering if APIs are protected or signed
- Use full browser automation as a last resort
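The first check in the list above can be sketched with the standard library alone. Next.js pages commonly embed their initial state in a `<script id="__NEXT_DATA__" type="application/json">` tag; the function name `extract_next_data` is hypothetical, and the regular expression is a pragmatic shortcut rather than a full HTML parse.

```python
import json
import re

# Matches the script tag that carries the embedded JSON payload
_NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid=["\']__NEXT_DATA__["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ payload, or None if absent or invalid."""
    match = _NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```

If this returns a dictionary, the page data is typically found under `props` and then `pageProps`, and no browser or API discovery is needed for that page. The keys below that level are site-specific.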
Finding the underlying API is usually the best approach when it is accessible. It is faster than rendering, less likely to break when the page layout changes, and returns structured data.