How to Scrape Next.js and React Websites

Learn how to scrape Next.js and React websites effectively. Covers SSR vs CSR detection, API route discovery, and practical scraping techniques.

Next.js and React power a huge portion of modern websites. Scraping them requires understanding their rendering modes and data-fetching patterns. Here is a practical guide.

Determine the Rendering Mode

Next.js can render pages in three ways, and your scraping approach depends on which is used:

Server-Side Rendering (SSR): Full HTML is returned on each request. Standard HTTP scraping works.
Static Site Generation (SSG): HTML pages are pre-built at deploy time. Standard HTTP scraping works.
Client-Side Rendering (CSR): JavaScript builds the page in the browser. Requires browser rendering.

Quick Test

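import requests
from bs4 import BeautifulSoup

# A timeout keeps the check from hanging on a slow server
response = requests.get("https://nextjs-site.com/products", timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# If the content is already in the raw HTML, the page is SSR/SSG
products = soup.find_all("div", class_="product-card")
if products:
    print(f"SSR/SSG detected: {len(products)} products found")
else:
    # No match can also mean the selector is wrong, so confirm it in a browser first
    print("Likely CSR: content missing from raw HTML, browser rendering may be needed")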
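
The quick test above only works when you already know a selector for the content. A selector-free variant can be sketched by measuring how much visible text the raw HTML contains and whether the usual mount element is empty. The function name `guess_rendering_mode`, the 200-character threshold, and the mount ids `__next` and `root` are assumptions for illustration, and the result is a heuristic guess, not a definitive answer.

```python
from bs4 import BeautifulSoup


def guess_rendering_mode(html: str, min_text_chars: int = 200) -> str:
    """Heuristic guess: 'ssr-or-ssg', 'csr', or 'unknown'."""
    soup = BeautifulSoup(html, "html.parser")

    # Common mount points (assumed): Next.js uses #__next, many React apps use #root.
    mount = soup.find(id="__next")
    if mount is None:
        mount = soup.find(id="root")

    # Remove non-visible content so only rendered text is measured.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    visible_text = soup.get_text(" ", strip=True)
    if len(visible_text) >= min_text_chars:
        return "ssr-or-ssg"  # plenty of text arrived in the raw HTML
    if mount is not None and not mount.get_text(strip=True):
        return "csr"  # empty mount element: the browser builds the page
    return "unknown"
```

Pass it `response.text` from a plain `requests.get` call. An `unknown` result means the page should be inspected by hand.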

Scraping the __NEXT_DATA__ JSON

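Next.js sites built with the Pages Router embed page data in a <script id="__NEXT_DATA__"> tag. When the tag is present, this is the easiest way to get structured data. Sites built with the newer App Router generally do not include it, so treat a missing tag as a cue to try the other techniques in this guide.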

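import requests
import json
from bs4 import BeautifulSoup

response = requests.get("https://nextjs-site.com/products", timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

next_data = soup.find("script", id="__NEXT_DATA__")
if next_data and next_data.string:
    data = json.loads(next_data.string)
    # pageProps can be missing or empty on client-rendered pages
    props = data.get("props", {}).get("pageProps", {})
    # Preview the first 500 characters to learn the structure
    print(json.dumps(props, indent=2)[:500])
else:
    print("No __NEXT_DATA__ script found")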
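
The pageProps object is often deeply nested, and the records you want (products, articles, listings) usually sit in a list of objects somewhere inside it. As a rough sketch, a recursive walk can report where those lists live. The helper name `find_record_lists` and the sample data in the usage note are hypothetical.

```python
from typing import Any, Iterator, List, Tuple


def find_record_lists(
    node: Any, path: str = "pageProps"
) -> Iterator[Tuple[str, List[Any]]]:
    """Yield (path, list) for every non-empty list of dicts under `node`."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_record_lists(value, f"{path}.{key}")
    elif isinstance(node, list):
        # A non-empty list made only of dicts looks like a set of records.
        if node and all(isinstance(item, dict) for item in node):
            yield path, node
        # Keep descending: records may contain nested record lists.
        for index, item in enumerate(node):
            yield from find_record_lists(item, f"{path}[{index}]")
```

Calling `list(find_record_lists(props))` on the `props` value from the snippet above returns each candidate list with a dotted path such as `pageProps.catalog.items`, which tells you which keys to index in your scraper.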

Intercepting API Routes

Next.js apps often fetch data from /api/* routes or external APIs. Intercept these for clean JSON data.

from playwright.sync_api import sync_playwright

api_responses = []

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    def capture_api(response):
        if "/api/" in response.url or "graphql" in response.url:
            try:
                api_responses.append({
                    "url": response.url,
                    "data": response.json()
                })
            except Exception:
                # Skip responses whose body is not valid JSON
                pass

    page.on("response", capture_api)
    page.goto("https://nextjs-site.com/products")
    page.wait_for_load_state("networkidle")

    for resp in api_responses:
        print(f"API: {resp['url']}")
        print(f"Data keys: {list(resp['data'].keys()) if isinstance(resp['data'], dict) else 'array'}")

    browser.close()
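
The substring check in `capture_api` matches anywhere in the URL, so a query string such as `?ref=/api/` would also pass. A slightly stricter filter, sketched below, parses the URL and tests only the path, with the response's Content-Type as a second signal. The name `is_data_endpoint` and the prefix list are assumptions to adjust per site. The `/_next/data/` prefix is included because Pages Router sites serve page props as JSON under that path.

```python
from urllib.parse import urlparse

# Path prefixes that usually carry JSON (assumed; tune for the target site).
DATA_PATH_PREFIXES = ("/api/", "/_next/data/")


def is_data_endpoint(url: str, content_type: str = "") -> bool:
    """Return True if the URL or Content-Type suggests a JSON data response."""
    if "json" in content_type.lower():
        return True
    path = urlparse(url).path.lower()
    if path.startswith(DATA_PATH_PREFIXES):
        return True
    # GraphQL endpoints conventionally end in /graphql.
    return path.rstrip("/").endswith("/graphql")
```

Inside the Playwright handler, the condition would become `is_data_endpoint(response.url, response.headers.get("content-type", ""))`.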

CSR Sites: Use ScraperAPI

For client-side rendered React sites, use ScraperAPI with rendering enabled.

import requests
from bs4 import BeautifulSoup

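response = requests.get(
    "https://api.scraperapi.com",  # HTTPS keeps the API key out of plain-text traffic
    params={
        "api_key": "YOUR_SCRAPERAPI_KEY",
        "url": "https://react-csr-site.com/products",
        "render": "true"
    },
    timeout=70  # rendered requests can take much longer than plain ones
)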

soup = BeautifulSoup(response.text, "html.parser")
products = soup.find_all("div", class_="product-card")
print(f"Found {len(products)} products")

Key Strategy

Always check for __NEXT_DATA__ first. It contains structured JSON data that is far easier to parse than HTML. If that is not available, look for API routes. Use browser rendering only as a last resort.
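
That priority order can be written down as a small decision function. This is only a sketch: `pick_strategy` and its return labels are hypothetical names, and `captured_api_urls` stands for endpoints you have already discovered, for example during a one-off interception run.

```python
import json
from typing import List, Optional

from bs4 import BeautifulSoup


def pick_strategy(html: str, captured_api_urls: Optional[List[str]] = None) -> str:
    """Return 'next_data', 'api', or 'browser', in that order of preference."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script", id="__NEXT_DATA__")
    if script is not None and script.string:
        try:
            page_props = json.loads(script.string).get("props", {}).get("pageProps", {})
        except (json.JSONDecodeError, AttributeError):
            page_props = {}
        # An empty pageProps usually means the data is fetched client-side.
        if page_props:
            return "next_data"
    if captured_api_urls:
        return "api"
    return "browser"
```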