How to Scrape Single Page Applications (SPAs)

Learn how to scrape Single Page Applications built with React, Vue, and Angular. Covers rendering techniques, API interception, and practical examples.

Single Page Applications (SPAs) built with React, Vue, or Angular render content dynamically in the browser. A plain HTTP request typically returns only an empty HTML shell, because the content is filled in by JavaScript after the page loads. Here is how to extract data from them.

Why SPAs Are Hard to Scrape

import requests
from bs4 import BeautifulSoup

# This returns an empty page for SPAs
response = requests.get("https://spa-website.com/products")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.find_all("div", class_="product"))  # [] - empty!

The HTML source typically contains only a root <div> and references to JavaScript bundles. The content is rendered client-side once that JavaScript runs, so a parser that never executes JavaScript finds nothing to select.
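
The empty-shell symptom described above can be checked for before choosing a strategy. The following is a minimal sketch using only the standard library: it counts text outside <script>, <style>, and <noscript> tags and flags pages with very little of it. The helper name `looks_like_spa_shell`, the 200-character threshold, and the sample markup are all assumptions for illustration, not fixed rules.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts characters of text outside <script>, <style>, and <noscript>."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a user would actually see
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def looks_like_spa_shell(html, min_visible_chars=200):
    """Heuristic: True when the raw HTML carries almost no visible text.

    The threshold is an arbitrary assumption; tune it per site.
    """
    parser = VisibleTextCounter()
    parser.feed(html)
    return parser.visible_chars < min_visible_chars


# Hypothetical shell, similar to what a client-rendered app might serve
SHELL = """<!DOCTYPE html>
<html>
  <head><title>Products</title><script src="/static/js/main.js"></script></head>
  <body><noscript>You need to enable JavaScript.</noscript><div id="root"></div></body>
</html>"""

# Hypothetical server-rendered page with real content in the markup
RENDERED = (
    "<html><body>"
    + "<div class='product'>Widget 19.99</div>" * 20
    + "</body></html>"
)

print(looks_like_spa_shell(SHELL))
print(looks_like_spa_shell(RENDERED))
```

If the check reports a shell, plain `requests` plus BeautifulSoup will not be enough on its own and one of the strategies below is needed.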

Strategy 1: Find the Hidden API

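SPAs fetch their data from backend APIs. Finding those endpoints and calling them directly is usually the fastest and most reliable approach, because it skips rendering entirely. The script below loads the page in a browser and records every JSON response it receives.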

from playwright.sync_api import sync_playwright

api_calls = []

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    def capture_api(response):
        content_type = response.headers.get("content-type", "")
        if "application/json" in content_type:
            try:
                api_calls.append({
                    "url": response.url,
                    "status": response.status,
                    "data": response.json()
                })
            except Exception:
                # Body was not valid JSON or is no longer available; skip it
                pass

    page.on("response", capture_api)
    page.goto("https://spa-website.com/products")
    page.wait_for_load_state("networkidle")
    browser.close()

# Now you have the API endpoints
for call in api_calls:
    print(f"API: {call['url']}")
    print(f"Data keys: {list(call['data'].keys()) if isinstance(call['data'], dict) else 'array'}")
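
A page often triggers many JSON responses (config, analytics, feature flags), so it can help to rank the captured calls by how likely they are to hold the records you want. The sketch below assumes the `api_calls` structure built above (dicts with `url`, `status`, and `data`) and treats any list of dicts, either at the top level or one key deep, as a candidate. The function name `find_record_lists` and the sample URLs are hypothetical.

```python
def find_record_lists(api_calls, min_items=2):
    """Return (url, key, count) tuples for payloads that hold a list of dicts.

    key is None when the list is the top-level JSON value.
    """
    candidates = []
    for call in api_calls:
        data = call["data"]
        if isinstance(data, list):
            fields = [(None, data)]
        elif isinstance(data, dict):
            fields = list(data.items())
        else:
            fields = []
        for key, value in fields:
            if (
                isinstance(value, list)
                and len(value) >= min_items
                and all(isinstance(item, dict) for item in value)
            ):
                candidates.append((call["url"], key, len(value)))
    # Largest lists first: these are the most likely data endpoints
    return sorted(candidates, key=lambda c: c[2], reverse=True)


# Hypothetical captured calls, shaped like the api_calls list above
sample_calls = [
    {"url": "https://spa-website.com/api/config", "status": 200,
     "data": {"theme": "dark"}},
    {"url": "https://spa-website.com/api/products", "status": 200,
     "data": {"items": [{"id": 1}, {"id": 2}, {"id": 3}], "total": 3}},
    {"url": "https://spa-website.com/api/tags", "status": 200,
     "data": [{"name": "a"}, {"name": "b"}]},
]

for url, key, count in find_record_lists(sample_calls):
    print(url, key, count)
```

This is only a heuristic: deeply nested payloads or GraphQL responses would need a recursive search instead.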

Once you know the API endpoint, call it directly:

import requests

# Direct API call - no browser needed
response = requests.get("https://api.spa-website.com/products?page=1&limit=50", timeout=30)
response.raise_for_status()  # Fail loudly on 4xx/5xx instead of parsing an error body
data = response.json()
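
The `page` and `limit` parameters in that URL suggest the endpoint is paginated. A rough sketch of walking every page follows. It assumes that page numbers start at 1, that each call returns a JSON array, and that a batch shorter than `limit` marks the last page; none of this is confirmed for any real API, so check the actual responses first. The names `fetch_all_pages` and `fetch_products_page` are hypothetical.

```python
import time


def fetch_all_pages(fetch_page, limit=50, max_pages=100, delay_seconds=0.0):
    """Call fetch_page(page, limit) until a short batch signals the end.

    max_pages is a safety cap so a misbehaving endpoint cannot loop forever.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, limit)
        items.extend(batch)
        if len(batch) < limit:
            break
        if delay_seconds:
            time.sleep(delay_seconds)  # Be polite between requests
    return items


def fetch_products_page(page, limit):
    """Example fetcher for the endpoint above (assumes it returns a JSON array)."""
    import requests  # Imported here so the helper above has no third-party dependency

    response = requests.get(
        "https://api.spa-website.com/products",
        params={"page": page, "limit": limit},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


# Usage (performs real network calls, so it is left commented out):
# all_products = fetch_all_pages(fetch_products_page, limit=50, delay_seconds=1.0)
```

Passing the fetcher in as a function keeps the paging logic separate from the HTTP details, so the same loop can be reused if the API turns out to need headers or cookies.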

Strategy 2: ScraperAPI with Rendering

When you cannot find or access the API directly, use ScraperAPI with JavaScript rendering.

import requests
from bs4 import BeautifulSoup

response = requests.get(
    "https://api.scraperapi.com",  # HTTPS so the API key is not sent in clear text
    params={
        "api_key": "YOUR_SCRAPERAPI_KEY",
        "url": "https://spa-website.com/products",
        "render": "true",
        "wait_for_selector": ".product-card"  # Wait for content
    }
)

soup = BeautifulSoup(response.text, "html.parser")
products = soup.find_all("div", class_="product-card")
print(f"Found {len(products)} products")

Strategy 3: Playwright with Wait Conditions

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    page.goto("https://spa-website.com/products")

    # Wait for the actual content to render
    page.wait_for_selector(".product-card", timeout=10000)

    # Handle infinite scroll if needed (capped so an endless feed cannot loop forever)
    prev_count = 0
    for _ in range(50):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)
        products = page.query_selector_all(".product-card")
        if len(products) == prev_count:
            break
        prev_count = len(products)

    # Extract data
    for product in products:
        title = product.query_selector(".title")
        price = product.query_selector(".price")
        # query_selector returns None when a card lacks the element
        if title and price:
            print(f"{title.inner_text()}: {price.inner_text()}")

    browser.close()

Decision Tree

1. Check for __NEXT_DATA__ or embedded JSON in the page source first
2. Intercept API calls using browser DevTools or Playwright
3. Call APIs directly if they do not require authentication
4. Use ScraperAPI with rendering if APIs are protected or signed
5. Use full browser automation as a last resort
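
The first check in the list above can be sketched with the standard library alone. Next.js pages commonly embed their initial state in a <script id="__NEXT_DATA__" type="application/json"> tag, and the code below pulls that JSON out of raw HTML. The function name `extract_next_data`, the sample markup, and the `products` key under `pageProps` are hypothetical; real pages will have their own structure.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so accumulate them
        if self._inside:
            self._chunks.append(data)

    @property
    def payload(self):
        return "".join(self._chunks)


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ JSON, or None if the tag is absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    return json.loads(parser.payload) if parser.payload.strip() else None


# Hypothetical page source for illustration
SAMPLE_HTML = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"products": [{"title": "Widget", "price": 19.99}]}}}'
    "</script></body></html>"
)

next_data = extract_next_data(SAMPLE_HTML)
if next_data is not None:
    for product in next_data["props"]["pageProps"]["products"]:
        print(product["title"], product["price"])
```

When this returns data, a single plain HTTP request is enough and neither API interception nor rendering is needed for that page.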

Finding the underlying data source, whether embedded JSON or a backend API, is usually the best approach. It is faster and more reliable than rendering the page, and it returns structured data.