Scraping SPAs: React, Vue, and Angular Sites
Learn strategies for scraping single-page applications built with React, Vue, and Angular using browser automation tools.
Single-page applications present unique challenges for scrapers. The initial HTML response is typically a bare shell: a mount element such as <div id="root"></div> plus a bundle of JavaScript. Because the content is rendered client-side, a plain HTTP fetch of the page returns little or none of it. Browser automation is the primary tool for handling these sites.
Why SPAs Are Different
When you fetch an SPA with requests:
import requests
resp = requests.get("https://react-spa-example.com")
print(resp.text)
# <html><head>...</head><body><div id="root"></div><script src="bundle.js"></script></body></html>
# No actual content!
The content only appears after JavaScript executes, which requires a real browser environment.
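Before reaching for a browser, it can help to confirm that a response really is an empty shell. The following is a minimal sketch using only the standard library; the mount-point names (id="root", id="app", <app-root>) and the 50-character threshold are assumptions for illustration, not rules every site follows.

```python
from html.parser import HTMLParser

# Assumed mount points: id="root" (common in React), id="app" (common in Vue),
# and the <app-root> tag (Angular CLI default). Real sites may use others.
MOUNT_IDS = {"root", "app"}
MOUNT_TAGS = {"app-root"}
SKIPPED_TAGS = {"script", "style", "noscript", "template"}


class ShellDetector(HTMLParser):
    """Counts visible body text and notes any likely SPA mount point."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.visible_chars = 0
        self.has_mount_point = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        if tag in MOUNT_TAGS or dict(attrs).get("id") in MOUNT_IDS:
            self.has_mount_point = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Ignore text inside <script>, <style>, and similar non-visible tags
        if self.in_body and self.skip_depth == 0:
            self.visible_chars += len(data.strip())


def looks_like_spa_shell(html, max_visible_chars=50):
    """Return True when the HTML has a mount point but almost no visible text."""
    detector = ShellDetector()
    detector.feed(html)
    detector.close()
    return detector.has_mount_point and detector.visible_chars <= max_visible_chars
```

Calling looks_like_spa_shell(resp.text) on the response above would flag it as a shell, which is a reasonable cue to switch to one of the browser-based strategies. A server-rendered page with real content inside the mount element would not be flagged.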
Strategy 1: Wait for Content to Render
The simplest approach is to navigate to the page and wait for the content elements to appear:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://react-spa-example.com/products")

    # Wait for the React app to render product cards
    page.wait_for_selector("[data-testid='product-card']", timeout=15000)

    products = page.query_selector_all("[data-testid='product-card']")
    for product in products:
        name = product.query_selector("h2").inner_text()
        price = product.query_selector(".price").inner_text()
        print(f"{name}: {price}")

    browser.close()
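The loop above raises an AttributeError if any card lacks an h2 or .price element, because query_selector returns None when nothing matches. Here is a small sketch of a more tolerant version. The helper names text_or_none and extract_products are hypothetical, and the choice to skip cards without a name is an assumption about what you want.

```python
def text_or_none(element, selector):
    """Return the stripped text of the first match under element, or None."""
    node = element.query_selector(selector)
    return node.inner_text().strip() if node is not None else None


def extract_products(page, card_selector="[data-testid='product-card']"):
    """Collect name/price pairs, skipping cards that have no name."""
    rows = []
    for card in page.query_selector_all(card_selector):
        name = text_or_none(card, "h2")
        if name is None:
            continue  # Probably a placeholder or skeleton card
        # A missing price is kept as None rather than dropping the product
        rows.append({"name": name, "price": text_or_none(card, ".price")})
    return rows
```

With a Playwright page that has already waited for the cards, extract_products(page) could replace the for loop. The function only relies on query_selector_all, query_selector, and inner_text, so it is easy to exercise with stand-in objects in a unit test.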
Strategy 2: Intercept the API Calls
SPAs fetch data from backend APIs. Intercepting these calls often gives you clean JSON:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    api_data = []

    def capture_response(response):
        # Assumes the endpoint returns a JSON array of product objects
        if "/api/products" in response.url and response.ok:
            api_data.extend(response.json())

    # Register the handler before navigating so no early responses are missed
    page.on("response", capture_response)
    page.goto("https://react-spa-example.com/products")
    page.wait_for_load_state("networkidle")

    print(f"Captured {len(api_data)} products from API")
    for item in api_data:
        print(f" {item['name']}: ${item['price']}")

    browser.close()
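Two details in the capture handler are worth hardening. A substring check on the URL also matches unrelated requests that merely mention the path in a query string, and calling extend on a JSON object (rather than an array) would add its keys instead of its records. The sketch below addresses both; the wrapper key names are hypothetical, so inspect the real responses in your browser's network panel first.

```python
from urllib.parse import urlsplit

# Hypothetical wrapper keys that paginated APIs sometimes use
WRAPPER_KEYS = ("items", "results", "data", "products")


def is_products_endpoint(url, path_prefix="/api/products"):
    """Match on the URL path only, so query strings cannot cause false hits."""
    return urlsplit(url).path.startswith(path_prefix)


def extract_items(payload):
    """Return the list of records from a bare array or a wrapped object."""
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        for key in WRAPPER_KEYS:
            value = payload.get(key)
            if isinstance(value, list):
                return value
    # Unknown shape (error object, null, scalar): contribute nothing
    return []
```

Inside the handler, the condition would become is_products_endpoint(response.url) and response.ok, and the body api_data.extend(extract_items(response.json())).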
Strategy 3: Access Framework Internals
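Each framework keeps component state in internal structures that page-context JavaScript can sometimes reach. The snippets below assume an open Playwright page and only detect whether a React or Vue app is mounted: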
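# React: read the root fiber from React's internal container properties
react_data = page.evaluate("""
() => {
    const root = document.querySelector('#root');
    if (!root) return 'Not React';
    // _reactRootContainer exists only on legacy ReactDOM.render() roots;
    // createRoot() (React 18+) stores the fiber under a key starting with '__reactContainer$'
    const key = Object.keys(root).find(k => k.startsWith('__reactContainer$'));
    const fiber = root._reactRootContainer?._internalRoot?.current ?? (key ? root[key] : null);
    // Navigate the fiber tree to find your data
    // This is brittle and framework-version dependent
    return fiber ? 'React app found' : 'Not React';
}
""")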
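# Vue: look for the instance Vue attaches to its mount element
vue_data = page.evaluate("""
() => {
    const el = document.querySelector('#app');
    // Vue 3 sets __vue_app__ on the mount element; Vue 2 sets __vue__ on the root component's element
    const app = el?.__vue_app__ ?? el?.__vue__;
    // Access Vue component data
    return app ? 'Vue app found' : 'Not Vue';
}
""")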
This approach is fragile and not recommended for production scrapers.
Handling Client-Side Routing
SPAs use client-side routing, so navigation does not trigger full page loads. Wait for content changes instead of page loads:
# Click a navigation link in an SPA
page.click("a[href='/products/electronics']")
# Don't wait for page load, wait for content change
page.wait_for_selector("h1:has-text('Electronics')")
page.wait_for_selector(".product-grid .item")
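One pitfall with selector waits after a client-side route change: if the previous view already contains a matching element (for example, another .product-grid .item), the wait returns immediately with stale content. A generic way around this is to record a value before the click and poll until it differs. The helper below is a sketch with a hypothetical name, wait_for_change, and takes any zero-argument callable.

```python
import time


def wait_for_change(read, previous, timeout=10.0, interval=0.1):
    """Poll read() until it returns something other than previous.

    Returns the new value, or raises TimeoutError once the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = read()
        if current != previous:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(f"value still {previous!r} after {timeout}s")
        time.sleep(interval)


# Possible usage with a Playwright page (not executed here):
# before = page.inner_text("h1")
# page.click("a[href='/products/electronics']")
# heading = wait_for_change(lambda: page.inner_text("h1"), before)
```

Polling from Python costs one round trip to the browser per check, so keep the interval modest. Comparing the heading text is only one option; the URL or the first item's text would work the same way.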
Handling Lazy-Loaded Components
SPAs often load components lazily. Trigger their loading by scrolling or interacting:
# Scroll to trigger lazy loading of below-fold content
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
page.wait_for_selector(".lazy-loaded-section")
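A single scroll is often not enough for infinite-scroll lists, where each scroll loads another batch. A common pattern is to keep scrolling until the page height stops growing. The sketch below keeps the loop independent of Playwright by taking two callables; the name scroll_until_stable and the default round limit and pause are assumptions to tune per site.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, max_rounds=20,
                        pause=0.5, sleep=time.sleep):
    """Scroll until the page height stops growing.

    Returns the number of scrolls performed (at most max_rounds).
    """
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_to_bottom()
        sleep(pause)  # Give lazy-loaded content time to arrive
        height = get_height()
        if height == last_height:
            return rounds  # Nothing new was loaded by this scroll
        last_height = height
    return max_rounds


# Possible usage with a Playwright page (not executed here):
# scroll_until_stable(
#     get_height=lambda: page.evaluate("document.body.scrollHeight"),
#     scroll_to_bottom=lambda: page.evaluate(
#         "window.scrollTo(0, document.body.scrollHeight)"),
# )
```

The max_rounds cap matters: some feeds never end, and without a limit the loop would run until the site stops responding. A fixed pause is a blunt tool, so on slow sites a longer pause or a wait on a specific element may be more dependable.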
Recommended Approach
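For most SPA scraping, API interception (Strategy 2) is usually the most reliable and efficient option, because it yields structured JSON that does not depend on the page's markup. Fall back to waiting for rendered elements (Strategy 1) when the data is not available in a readable API response. If you want to skip browser automation entirely, ScraperAPI offers a JavaScript rendering mode that handles SPA rendering server-side. ScrapingAnt similarly renders JavaScript and returns the fully loaded HTML.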
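As a rough illustration of the hosted-rendering route, the sketch below builds a ScraperAPI request URL with rendering switched on. The endpoint and the api_key, url, and render parameters reflect ScraperAPI's documented query-string interface as understood at the time of writing; treat them as assumptions and confirm against the current documentation. The function name build_render_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names; verify against ScraperAPI's docs
SCRAPERAPI_ENDPOINT = "https://api.scraperapi.com/"


def build_render_url(api_key, target_url):
    """Build a request URL that asks the service to render JavaScript."""
    query = urlencode({"api_key": api_key, "url": target_url, "render": "true"})
    return f"{SCRAPERAPI_ENDPOINT}?{query}"


# Possible usage (not executed here; requires a real key):
# import requests
# html = requests.get(
#     build_render_url("YOUR_API_KEY", "https://react-spa-example.com/products"),
#     timeout=70,
# ).text
```

Using urlencode matters because the target URL is itself a query parameter and must be percent-encoded. Rendered requests take noticeably longer than plain fetches, hence the generous timeout in the usage comment.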
Next Steps
- Manage browser contexts and sessions
- Learn parallel browser scraping
- Explore anti-detection techniques