Web Scraping with Node.js - Cheerio, Playwright, Crawlee
Learn web scraping with Node.js using Cheerio for HTML parsing, Playwright for browser automation, and Crawlee for full scraping pipelines.
Node.js is a strong choice for web scraping thanks to its non-blocking I/O and its mature ecosystem of scraping libraries. Here are the three main approaches.
1. Cheerio - Fast HTML Parsing
Cheerio provides a jQuery-like API for parsing HTML. It is fast and lightweight but does not execute JavaScript, so any content the page renders client-side will be missing from the HTML it parses.
npm install cheerio axios
import axios from 'axios';
import * as cheerio from 'cheerio';

async function scrapeProducts(url) {
  // Fetch the raw HTML; no JavaScript on the page is executed.
  const { data } = await axios.get(url);
  const $ = cheerio.load(data);
  const products = [];

  // Collect one record per product card.
  $('.product-card').each((_, el) => {
    products.push({
      name: $(el).find('.product-name').text().trim(),
      price: $(el).find('.product-price').text().trim(),
      url: $(el).find('a').attr('href'),
      image: $(el).find('img').attr('src')
    });
  });

  return products;
}

// Top-level await requires an ES module (.mjs or "type": "module").
const products = await scrapeProducts('https://example.com/products');
console.log(`Found ${products.length} products`);
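The `href` and `src` values extracted above are often relative paths, and prices come back as display strings. A minimal sketch of a post-processing step follows. The `normalizeProduct` helper is hypothetical (it is not part of Cheerio), and the price parsing assumes a format like `$1,299.99`:

```javascript
// Hypothetical helper: cleans up one raw record returned by scrapeProducts().
function normalizeProduct(raw, pageUrl) {
  // new URL(relative, base) resolves relative paths against the page URL.
  const toAbsolute = (value) => (value ? new URL(value, pageUrl).href : null);

  // Assumes prices like "$1,299.99": keep only digits and the decimal point.
  const digits = (raw.price ?? '').replace(/[^0-9.]/g, '');
  const price = digits === '' ? null : Number.parseFloat(digits);

  return {
    name: raw.name,
    price: Number.isNaN(price) ? null : price,
    url: toAbsolute(raw.url),
    image: toAbsolute(raw.image),
  };
}
```

It could be applied with `products.map((p) => normalizeProduct(p, 'https://example.com/products'))`. Sites that use a comma as the decimal separator would need different price handling.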
Using Cheerio with ScraperAPI
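import axios from 'axios';
import * as cheerio from 'cheerio';

// Placeholder: in real code, load the key from an environment variable.
const API_KEY = 'YOUR_SCRAPERAPI_KEY';

async function scrapeWithAPI(targetUrl) {
  // Use HTTPS so the API key is not sent in clear text.
  const { data } = await axios.get('https://api.scraperapi.com', {
    params: {
      api_key: API_KEY,
      url: targetUrl,
      render: true // ask the service to render JavaScript before returning HTML
    }
  });
  const $ = cheerio.load(data);
  return $('body').text();
}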
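The request above boils down to a single GET URL carrying the `api_key`, `url`, and `render` query parameters. Here is a sketch of building that URL by hand while reading the key from the environment instead of hard-coding it. The `buildScraperApiUrl` helper and the `SCRAPERAPI_KEY` variable name are assumptions made for illustration:

```javascript
// Hypothetical helper: builds the request URL from the parameters used above.
function buildScraperApiUrl(targetUrl, { apiKey = process.env.SCRAPERAPI_KEY, render = false } = {}) {
  if (!apiKey) {
    throw new Error('Missing API key: set SCRAPERAPI_KEY or pass apiKey');
  }
  const endpoint = new URL('https://api.scraperapi.com/');
  // searchParams percent-encodes the target URL, so its own query string survives intact.
  endpoint.searchParams.set('api_key', apiKey);
  endpoint.searchParams.set('url', targetUrl);
  if (render) endpoint.searchParams.set('render', 'true');
  return endpoint.href;
}
```

The resulting string can be passed straight to `axios.get()`. Encoding matters here: a target URL with its own `?page=2&sort=asc` would otherwise be split into separate parameters of the outer request.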
2. Playwright - Browser Automation
For JavaScript-rendered sites, Playwright provides full browser control.
npm install playwright
npx playwright install chromium
import { chromium } from 'playwright';

async function scrapeSPA(url) {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // Wait until client-side rendering has produced at least one card.
    await page.waitForSelector('.product-card');

    // This callback runs inside the browser page, not in Node.js.
    const products = await page.evaluate(() => {
      return Array.from(document.querySelectorAll('.product-card')).map(card => ({
        name: card.querySelector('.name')?.textContent?.trim(),
        price: card.querySelector('.price')?.textContent?.trim()
      }));
    });
    return products;
  } finally {
    // Close the browser even if navigation or the selector wait fails.
    await browser.close();
  }
}
Intercepting API Calls with Playwright
import { chromium } from 'playwright';

async function interceptAPIs(url) {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    const pending = [];

    page.on('response', (response) => {
      if (!response.url().includes('/api/') || response.status() !== 200) return;
      // Track each body read so it can finish before the browser closes.
      pending.push(
        response.json()
          .then((data) => ({ url: response.url(), data }))
          .catch(() => null) // body was not valid JSON; skip it
      );
    });

    await page.goto(url);
    await page.waitForLoadState('networkidle');

    const apiResponses = await Promise.all(pending);
    return apiResponses.filter(Boolean);
  } finally {
    await browser.close();
  }
}
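Matching on `/api/` in the URL and discarding parse failures works, but the `content-type` header is a cheaper signal of which responses actually carry JSON. Below is a sketch of a predicate that could be called inside the `response` handler. `isJsonApiResponse` is a hypothetical name, and it relies only on the `url()`, `status()`, and `headers()` methods of Playwright's `Response` object:

```javascript
// Hypothetical predicate: decides whether a response is worth parsing as JSON.
function isJsonApiResponse(response, pathFragment = '/api/') {
  if (response.status() !== 200) return false;
  if (!response.url().includes(pathFragment)) return false;
  // Playwright returns header names in lower case.
  const contentType = response.headers()['content-type'] ?? '';
  // Matches "application/json" and variants such as "application/ld+json; charset=utf-8".
  return /\bapplication\/([\w.-]+\+)?json\b/i.test(contentType);
}
```

Checking the header first avoids reading and parsing response bodies for images, scripts, and HTML fragments that happen to live under the same path.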
3. Crawlee - Full Scraping Framework
Crawlee (by Apify) is a complete scraping framework with request queuing, storage, and error handling.
npm install crawlee
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 100, // stop after 100 pages
  maxConcurrency: 5,        // at most 5 requests in flight

  async requestHandler({ request, $, enqueueLinks }) {
    const title = $('h1').text().trim();
    const products = [];

    $('.product-card').each((i, el) => {
      products.push({
        name: $(el).find('.name').text().trim(),
        price: $(el).find('.price').text().trim()
      });
    });

    // Store one record per page in the default dataset.
    await Dataset.pushData({
      url: request.url,
      title,
      products,
      scrapedAt: new Date().toISOString()
    });

    // Automatically follow pagination links
    await enqueueLinks({
      selector: 'a.next-page',
      label: 'LISTING' // attached to the enqueued requests; useful if you add a router later
    });
  }
});

await crawler.run(['https://example.com/products']);
console.log('Crawl complete');
Which Tool to Choose
| Tool | Best For | JS Rendering | Speed |
|---|---|---|---|
| Cheerio | Static HTML, simple scraping | No | Very fast |
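| Playwright | SPAs, JS-heavy sites, complex interactions | Yes | Slow (runs a full browser) |
| Crawlee | Large-scale crawling with queue management | Optional (Cheerio or Playwright crawlers) | Fast with CheerioCrawler, slower with a browser |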
For protected sites, combine any of these with ScraperAPI for proxy rotation and anti-bot bypass. Scraping is mostly I/O-bound, so Node.js's event-driven architecture lets a single process keep many HTTP requests in flight at once, which makes it a good fit for high-concurrency HTTP scraping.
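To make the concurrency point concrete, here is a minimal sketch of a promise pool that keeps at most `limit` tasks in flight. It illustrates the general idea behind an option like Crawlee's `maxConcurrency`, not Crawlee's actual implementation, and the `mapWithConcurrency` name is hypothetical:

```javascript
// Hypothetical helper: runs worker(item) for every item, at most `limit` at a time.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until the list is exhausted.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, runner));
  return results; // same order as the input
}
```

A call such as `await mapWithConcurrency(urls, 5, (url) => axios.get(url))` would fetch a list of pages five at a time. In this sketch a single rejected task rejects the whole call, so real scrapers usually wrap the worker in a retry or catch.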