Scraping SPAs: React, Vue, and Angular Sites - Browser Automation

Learn strategies for scraping single-page applications built with React, Vue, and Angular using browser automation tools.

Single-page applications present unique challenges for scrapers. The initial HTML response is typically a bare shell: a <div id="root"></div> element and a bundle of JavaScript. The content is rendered client-side, so a plain HTTP fetch returns almost nothing worth parsing. Browser automation is the primary tool for handling these sites.

Why SPAs Are Different

When you fetch an SPA with requests:

import requests
resp = requests.get("https://react-spa-example.com")
print(resp.text)
# <html><head>...</head><body><div id="root"></div><script src="bundle.js"></script></body></html>
# No actual content!

The content only appears after JavaScript executes, which requires a real browser environment.
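
A quick way to check whether a site behaves like this is to measure how much visible text the raw HTML carries. The following is a minimal sketch using only the standard library; the names `visible_text_length` and `looks_like_spa_shell` and the 50-character threshold are illustrative assumptions, not an established heuristic.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts characters of visible text, ignoring non-rendered elements."""

    # Elements whose text is not part of the rendered page content
    SKIPPED = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chars += len(data.strip())


def visible_text_length(html: str) -> int:
    """Return the number of visible text characters in an HTML string."""
    parser = _TextCounter()
    parser.feed(html)
    parser.close()
    return parser.chars


def looks_like_spa_shell(html: str, threshold: int = 50) -> bool:
    """Guess whether the HTML is an empty shell that needs JavaScript.

    The threshold is an arbitrary illustrative value; tune it per site.
    """
    return visible_text_length(html) < threshold
```

Passing `resp.text` from the snippet above to `looks_like_spa_shell` gives a rough signal for whether a browser is needed at all.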

Strategy 1: Wait for Content to Render

The simplest approach is to navigate to the page and wait for the content elements to appear:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    page.goto("https://react-spa-example.com/products")

    # Wait for the React app to render product cards
    page.wait_for_selector("[data-testid='product-card']", timeout=15000)

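    products = page.query_selector_all("[data-testid='product-card']")
    for product in products:
        # query_selector returns None when a card lacks the element
        name_el = product.query_selector("h2")
        price_el = product.query_selector(".price")
        if name_el is None or price_el is None:
            continue
        print(f"{name_el.inner_text()}: {price_el.inner_text()}")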

    browser.close()
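
Waiting only for product cards times out when the app legitimately renders an empty state, such as a "no results" message. One way to handle this, sketched below, is to wait for either outcome using a CSS selector list. The helper name `wait_for_cards_or_empty` and the `.empty-state` selector in the usage line are hypothetical; `page` is assumed to be a Playwright Page.

```python
def wait_for_cards_or_empty(page, card_selector, empty_selector, timeout_ms=15000):
    """Wait until cards or an empty-state element render, then return the cards.

    Returns an empty list when the app rendered its empty state instead.
    """
    # A comma-separated CSS selector list matches whichever element appears first
    page.wait_for_selector(f"{card_selector}, {empty_selector}", timeout=timeout_ms)
    return page.query_selector_all(card_selector)
```

For example, `wait_for_cards_or_empty(page, "[data-testid='product-card']", ".empty-state")` returns the rendered cards, or `[]` if the page showed no products.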

Strategy 2: Intercept the API Calls

SPAs fetch data from backend APIs. Intercepting these calls often gives you clean JSON:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    api_data = []

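    def capture_response(response):
        # Keep only successful responses from the product endpoint
        if "/api/products" in response.url and response.ok:
            # Assumes the endpoint returns a JSON array of product objects
            api_data.extend(response.json())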

    page.on("response", capture_response)
    page.goto("https://react-spa-example.com/products")
    page.wait_for_load_state("networkidle")

    print(f"Captured {len(api_data)} products from API")
    for item in api_data:
        print(f"  {item['name']}: ${item['price']}")

    browser.close()
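
API responses do not always arrive as a bare JSON array; many endpoints wrap the list in an object, and calling `extend` with a dict would add its keys rather than its records. A small normalizer can guard against that. This is a sketch: the wrapper keys it tries (`items`, `data`, `results`, `products`) are guesses for illustration, so check the real payload in the browser's network tab first.

```python
from typing import Any

# Hypothetical wrapper keys; replace with whatever the real API uses
CANDIDATE_KEYS = ("items", "data", "results", "products")


def extract_items(payload: Any) -> list:
    """Return the list of records from a JSON payload, or [] if none is found."""
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        for key in CANDIDATE_KEYS:
            value = payload.get(key)
            if isinstance(value, list):
                return value
    return []
```

Inside a response handler, `api_data.extend(extract_items(response.json()))` then works for both payload shapes.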

Strategy 3: Access Framework Internals

React, Vue, and Angular keep component state in internal structures that you can sometimes reach from JavaScript running in the page:

# React: reach the fiber tree through internal DOM properties.
# _reactRootContainer is set by the legacy ReactDOM.render API;
# roots created with createRoot (React 18+) do not have it.
react_data = page.evaluate("""
    () => {
        const root = document.querySelector('#root');
        const fiber = root?._reactRootContainer?._internalRoot?.current;
        // Navigate the fiber tree to find your data
        // This is brittle and framework-version dependent
        return fiber ? 'React app found' : 'Not React';
    }
""")

# Vue 3: the root element carries the app instance as __vue_app__
# (Vue 2 stores the root component on __vue__ instead)
vue_data = page.evaluate("""
    () => {
        const app = document.querySelector('#app')?.__vue_app__;
        // Access Vue component data
        return app ? 'Vue app found' : 'Not Vue';
    }
""")

This approach is fragile and not recommended for production scrapers.
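
The same idea can be sketched for Angular. Development builds expose a global `ng` debugging object with helpers such as `ng.getComponent`; production builds normally omit it, so whether this works depends on the target site. The function name `angular_component_keys` and the default `app-root` selector (the Angular CLI's usual root element) are assumptions, and `page` is assumed to be a Playwright Page.

```python
# JavaScript run inside the page. `ng` is Angular's debug global, which is
# typically present only in development builds (an assumption about the site).
ANGULAR_PROBE_JS = """
(selector) => {
    const el = document.querySelector(selector);
    if (!el || !window.ng || typeof window.ng.getComponent !== 'function') {
        return null;
    }
    const component = window.ng.getComponent(el);
    // Return only property names; component instances may not serialize cleanly
    return component ? Object.keys(component) : null;
}
"""


def angular_component_keys(page, selector="app-root"):
    """Return the component's property names, or None when unavailable."""
    return page.evaluate(ANGULAR_PROBE_JS, selector)
```

Like the React and Vue snippets, this is a probe for exploration, not something to build a production scraper on.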

Handling Client-Side Routing

SPAs use client-side routing, so navigation does not trigger full page loads. Wait for content changes instead of page loads:

# Click a navigation link in an SPA
page.click("a[href='/products/electronics']")

# Don't wait for page load, wait for content change
page.wait_for_selector("h1:has-text('Electronics')")
page.wait_for_selector(".product-grid .item")
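
When the heading text of the next view is not known in advance, another option is to record the current text and wait until it changes. The helper below is a sketch under that assumption; `click_and_wait_for_change` is a hypothetical name, and `page` is assumed to be a Playwright Page. It times out if the new view happens to show the same text as the old one.

```python
def click_and_wait_for_change(page, link_selector, watch_selector, timeout_ms=10000):
    """Click an in-app link, then wait until the watched element's text changes.

    Returns the new text of the watched element.
    """
    before = page.inner_text(watch_selector)
    page.click(link_selector)
    # Poll in the page until the element exists and shows different text
    page.wait_for_function(
        """([selector, previous]) => {
            const el = document.querySelector(selector);
            return el !== null && el.innerText !== previous;
        }""",
        arg=[watch_selector, before],
        timeout=timeout_ms,
    )
    return page.inner_text(watch_selector)
```

For the navigation above, this would be called as `click_and_wait_for_change(page, "a[href='/products/electronics']", "h1")`.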

Handling Lazy-Loaded Components

SPAs often load components lazily. Trigger their loading by scrolling or interacting:

# Scroll to trigger lazy loading of below-fold content
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
page.wait_for_selector(".lazy-loaded-section")
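
A single scroll is often not enough for infinite-scroll lists, where each batch of items extends the page. One common pattern, sketched here, is to keep scrolling until the document height stops growing. The function name `scroll_until_stable` and the default round limit and pause are illustrative choices; `page` is assumed to be a Playwright Page.

```python
def scroll_until_stable(page, max_rounds=20, pause_ms=500):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = page.evaluate("document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give lazy content time to load
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            return round_number
        last_height = new_height
    return max_rounds
```

The `max_rounds` cap keeps the loop from running forever on feeds that never end; a fixed pause is a blunt instrument, so a slow site may need a longer `pause_ms`.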

Recommended Approach

For most SPA scraping, the API interception strategy (Strategy 2) is usually the most reliable and efficient, because it yields structured JSON and does not depend on the rendered markup. If you want to skip browser automation entirely, ScraperAPI offers a JavaScript rendering mode that handles SPA rendering server-side. ScrapingAnt similarly renders JavaScript and returns the fully loaded HTML.

Next Steps

Manage browser contexts and sessions
Learn parallel browser scraping
Explore anti-detection techniques