Web Scraping with Node.js - Cheerio, Playwright, Crawlee

Learn web scraping with Node.js using Cheerio for HTML parsing, Playwright for browser automation, and Crawlee for full scraping pipelines.

Node.js suits web scraping well: its non-blocking I/O lets a single process keep many HTTP requests in flight, and it has a mature ecosystem of scraping libraries. This guide covers three common approaches.

1. Cheerio - Fast HTML Parsing

Cheerio provides a jQuery-like API for parsing HTML. It is fast and lightweight, but it does not execute JavaScript, so content that a page renders client-side will not appear in the parsed document.

npm install cheerio axios

import axios from 'axios';
import * as cheerio from 'cheerio';

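async function scrapeProducts(url) {
    // Fetch the raw HTML; scripts on the page are not executed
    const { data } = await axios.get(url);
    // Load the markup into Cheerio to get a jQuery-style selector function
    const $ = cheerio.load(data);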

    const products = [];
    $('.product-card').each((i, el) => {
        products.push({
            name: $(el).find('.product-name').text().trim(),
            price: $(el).find('.product-price').text().trim(),
            url: $(el).find('a').attr('href'),
            image: $(el).find('img').attr('src')
        });
    });

    return products;
}

const products = await scrapeProducts('https://example.com/products');
console.log(`Found ${products.length} products`);
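
The `url` and `image` fields above hold the raw `href` and `src` attribute values, which are often relative paths. A minimal sketch of resolving them against the page URL with the built-in `URL` class follows; `toAbsoluteUrl` is a hypothetical helper name, not part of Cheerio.

```javascript
// Hypothetical helper: resolve a possibly-relative href/src against the page URL.
function toAbsoluteUrl(value, pageUrl) {
    if (!value) return null; // the attribute was missing on the element
    try {
        return new URL(value, pageUrl).href;
    } catch {
        return null; // the URL parser rejected the value
    }
}

console.log(toAbsoluteUrl('/items/42', 'https://example.com/products'));
// prints "https://example.com/items/42"
console.log(toAbsoluteUrl('https://cdn.example.net/a.png', 'https://example.com/products'));
// prints "https://cdn.example.net/a.png" (already absolute, so unchanged)
```

Inside `scrapeProducts`, this could wrap the attribute reads, for example `toAbsoluteUrl($(el).find('a').attr('href'), url)`.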

Using Cheerio with ScraperAPI

import axios from 'axios';
import * as cheerio from 'cheerio';

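// Prefer reading the key from the environment over hard-coding it
// (SCRAPERAPI_KEY is an assumed variable name)
const API_KEY = process.env.SCRAPERAPI_KEY ?? 'YOUR_SCRAPERAPI_KEY';

async function scrapeWithAPI(targetUrl) {
    // Use HTTPS so the API key is not sent in clear text
    const { data } = await axios.get('https://api.scraperapi.com', {
        params: {
            api_key: API_KEY,
            url: targetUrl,
            render: true // ask ScraperAPI to render JavaScript before returning the HTML
        }
    });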

    const $ = cheerio.load(data);
    return $('body').text();
}

2. Playwright - Browser Automation

For JavaScript-rendered sites, Playwright drives a real browser (Chromium, Firefox, or WebKit), so the page's scripts run before you extract data.

npm install playwright

import { chromium } from 'playwright';

async function scrapeSPA(url) {
    const browser = await chromium.launch();
    try {
        const page = await browser.newPage();

        await page.goto(url);
        // Wait until the client-side render has produced at least one card
        await page.waitForSelector('.product-card');

        // This callback runs inside the page, so it can only return serializable values
        return await page.evaluate(() => {
            return Array.from(document.querySelectorAll('.product-card')).map(card => ({
                name: card.querySelector('.name')?.textContent?.trim(),
                price: card.querySelector('.price')?.textContent?.trim()
            }));
        });
    } finally {
        // Close the browser even if navigation or the selector wait throws
        await browser.close();
    }
}

Intercepting API Calls with Playwright

import { chromium } from 'playwright';

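async function interceptAPIs(url) {
    const browser = await chromium.launch();
    try {
        const page = await browser.newPage();

        const apiResponses = [];
        const pending = []; // in-flight body reads, awaited before the browser closes
        page.on('response', (response) => {
            if (response.url().includes('/api/') && response.status() === 200) {
                pending.push(
                    response.json()
                        .then((json) => apiResponses.push({ url: response.url(), data: json }))
                        .catch(() => {}) // body was not JSON or could not be read; skip it
                );
            }
        });

        await page.goto(url);
        await page.waitForLoadState('networkidle');
        // Let every queued body read settle so no response is lost on close
        await Promise.all(pending);

        return apiResponses;
    } finally {
        await browser.close();
    }
}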
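
The handler above picks responses by checking for `/api/` anywhere in the URL and relies on `response.json()` throwing for non-JSON bodies. One possible refinement, sketched here under assumptions, is to decide up front from the URL path, status, and `Content-Type` header. `looksLikeJsonApi` is a hypothetical helper, and matching on the path only (not the query string) is a design choice rather than anything the original code requires.

```javascript
// Hypothetical predicate: decide from plain values whether a response
// is worth parsing as JSON. It takes strings and numbers, not a Playwright object.
function looksLikeJsonApi(url, status, contentType = '') {
    const isApiPath = new URL(url).pathname.includes('/api/'); // ignores the query string
    const isSuccess = status >= 200 && status < 300;
    const isJson = /\bjson\b/i.test(contentType); // e.g. "application/json; charset=utf-8"
    return isApiPath && isSuccess && isJson;
}

console.log(looksLikeJsonApi('https://example.com/api/items', 200, 'application/json'));
// prints true
console.log(looksLikeJsonApi('https://example.com/api/items', 200, 'text/html'));
// prints false
```

In the response handler it could be called as `looksLikeJsonApi(response.url(), response.status(), response.headers()['content-type'])`, since Playwright returns header names in lower case.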

3. Crawlee - Full Scraping Framework

Crawlee (by Apify) is a complete scraping framework with request queuing, storage, and error handling.

npm install crawlee

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 100,
    maxConcurrency: 5,

    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('h1').text().trim();
        const products = [];

        $('.product-card').each((i, el) => {
            products.push({
                name: $(el).find('.name').text().trim(),
                price: $(el).find('.price').text().trim()
            });
        });

        await Dataset.pushData({
            url: request.url,
            title,
            products,
            scrapedAt: new Date().toISOString()
        });

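        // Follow pagination links; the request queue skips URLs it has already seen.
        // The label only matters if you later route handlers by request.label.
        await enqueueLinks({
            selector: 'a.next-page',
            label: 'LISTING'
        });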
    }
});

await crawler.run(['https://example.com/products']);
console.log('Crawl complete');

Which Tool to Choose

Tool	Best For	JS Rendering	Speed
Cheerio	Static HTML, simple scraping	No	Very fast
Playwright	SPAs, JS-heavy sites, complex interactions	Yes	Slower (runs a full browser)
Crawlee	Large-scale crawling with queue management	Optional (via PlaywrightCrawler)	Fast with CheerioCrawler, slower with a browser

For sites with anti-bot protection, any of these tools can be routed through ScraperAPI for proxy rotation and anti-bot handling. Because Node.js is event-driven, a single process can keep many HTTP requests in flight at once, which makes it well suited to I/O-bound, high-concurrency scraping.
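
To make the concurrency point concrete, here is a minimal sketch of a promise pool that keeps a fixed number of requests in flight. `mapWithConcurrency` is a hypothetical helper written for illustration, not a function from any of the libraries above; Crawlee's `maxConcurrency` option covers the same need inside the framework.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
// Results come back in input order. If a worker rejects, the returned promise rejects.
async function mapWithConcurrency(items, limit, worker) {
    const results = new Array(items.length);
    let next = 0; // index of the next item to claim

    // Each runner claims items one at a time until none are left.
    async function runner() {
        while (next < items.length) {
            const index = next++;
            results[index] = await worker(items[index], index);
        }
    }

    const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
    await Promise.all(runners);
    return results;
}

// Usage sketch (placeholder URLs, requires a network connection):
// const pages = await mapWithConcurrency(urls, 5, async (url) => (await fetch(url)).text());
```

Because JavaScript runs the runners on a single thread, `next++` needs no locking; the pool only interleaves at `await` points.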