Puppeteer Basics for Web Scraping
Get started with Puppeteer for web scraping in Node.js. Learn to launch headless Chrome, navigate pages, and extract data from dynamic websites.
Puppeteer is a Node.js library developed by the Chrome team that provides a high-level API for controlling Chrome or Chromium, running headless by default. It predates Playwright, whose design it inspired, and it remains popular in the JavaScript ecosystem for web scraping and testing.
Installation
npm install puppeteer
This automatically downloads a compatible browser build (Chromium in older releases, Chrome for Testing in recent ones). For a smaller install without a bundled browser:
npm install puppeteer-core
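Because puppeteer-core does not download a browser, the launch call has to be pointed at one that is already installed. The following is a minimal sketch of that, not a canonical setup: the helper name buildLaunchOptions and the CHROME_PATH environment variable are assumptions made for illustration, while executablePath is the launch option that tells Puppeteer which binary to start.

```javascript
// Build launch options for puppeteer-core, which ships without a browser.
// CHROME_PATH is a hypothetical environment variable used for illustration.
function buildLaunchOptions(env = process.env) {
  const executablePath = env.CHROME_PATH;
  if (!executablePath) {
    throw new Error('Set CHROME_PATH to a Chrome or Chromium binary');
  }
  return { headless: true, executablePath };
}

// Usage (requires `npm install puppeteer-core` and a locally installed browser):
// const puppeteer = require('puppeteer-core');
// const browser = await puppeteer.launch(buildLaunchOptions());
```

Failing early with a clear message is a design choice here; otherwise a missing path only surfaces as a launch error.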
Basic Scraping Example
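const puppeteer = require('puppeteer');

(async () => {
  // 'new' opts in to Chrome's new headless mode; from Puppeteer v22,
  // headless: true does the same.
  const browser = await puppeteer.launch({ headless: 'new' });
  try {
    const page = await browser.newPage();
    // The /js/ variant renders quotes client-side, so wait for the network to settle.
    await page.goto('https://quotes.toscrape.com/js/', {
      waitUntil: 'networkidle2'
    });
    // This callback runs inside the page and returns plain, serializable objects.
    const quotes = await page.evaluate(() => {
      const elements = document.querySelectorAll('.quote');
      return Array.from(elements).map(el => ({
        text: el.querySelector('.text').innerText,
        author: el.querySelector('.author').innerText
      }));
    });
    quotes.forEach(q => console.log(`${q.text}, ${q.author}`));
  } finally {
    // Close the browser even if scraping throws, so no Chrome process is left behind.
    await browser.close();
  }
})();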
Key Methods
| Method | Purpose |
|---|---|
| page.goto(url) | Navigate to a URL |
| page.waitForSelector(sel) | Wait for element to appear |
| page.$(sel) | Find one element (returns ElementHandle) |
| page.$$(sel) | Find all matching elements |
| page.evaluate(fn) | Run JavaScript in the page context |
| page.click(sel) | Click an element |
| page.type(sel, text) | Type text into an input |
| page.screenshot() | Take a screenshot |
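As a rough sketch of how several of these methods combine, the helper below types a query into a search box, submits it, waits for results, and reads their text through ElementHandles. The function name searchAndCollect and all three selectors are hypothetical and would need to match the target site.

```javascript
// Hypothetical helper: fill a search box, submit, and collect result titles.
// `page` is a Puppeteer Page; all selectors are illustrative assumptions.
async function searchAndCollect(page, query) {
  await page.waitForSelector('input[name="q"]');
  await page.type('input[name="q"]', query);
  await page.click('button[type="submit"]');

  // Wait until at least one result is rendered before querying the DOM.
  await page.waitForSelector('.result-title');

  // page.$$ resolves to an array of ElementHandles; handle.evaluate runs
  // its callback in the page with the element as the first argument.
  const handles = await page.$$('.result-title');
  const titles = [];
  for (const handle of handles) {
    titles.push(await handle.evaluate(el => el.textContent.trim()));
  }
  return titles;
}
```

For simple text extraction, a single page.evaluate call is usually cheaper than one round trip per handle. Handles are more useful when you also need to click or type into the matched elements.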
Extracting Data with page.evaluate
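The page.evaluate method runs a function inside the browser context and returns its result to Node.js, so the return value must be serializable (plain objects, arrays, strings, numbers). This is the primary way to extract data: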
// Extract all links from a page
const links = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('a')).map(a => ({
    text: a.innerText.trim(),
    href: a.href
  }));
});
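One detail that often trips people up: because the callback executes in the browser, it cannot read Node.js variables through closure. Values have to be passed as extra arguments to page.evaluate, and they must be serializable. The sketch below illustrates this with a hypothetical helper, extractTexts, whose name and selector parameter are assumptions for the example.

```javascript
// Hypothetical helper: return the trimmed text of every element matching
// `selector`. The selector is passed as an argument because the callback
// runs in the browser and cannot close over Node.js variables.
async function extractTexts(page, selector) {
  return page.evaluate(sel => {
    return Array.from(document.querySelectorAll(sel)).map(el =>
      el.textContent.trim()
    );
  }, selector);
}

// Usage, given an open Puppeteer page:
// const authors = await extractTexts(page, '.author');
```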
Handling Navigation and Waiting
// Click a link and wait for navigation
await Promise.all([
  page.waitForNavigation({ waitUntil: 'networkidle2' }),
  page.click('a.next-page')
]);

// Wait for a specific element
await page.waitForSelector('.results-loaded');

// Wait for a function to return true
await page.waitForFunction(
  () => document.querySelectorAll('.item').length > 10
);
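The click-and-wait pattern above extends naturally to paginated sites. The following is a sketch under stated assumptions rather than a general-purpose crawler: scrapeAllPages and its scrapePage callback are hypothetical names, and the a.next-page selector is simply reused from the snippet above.

```javascript
// Hypothetical pagination loop. `scrapePage` is a caller-supplied async
// function that extracts items from the current page; 'a.next-page' is an
// assumption about the target site's markup.
async function scrapeAllPages(page, scrapePage, maxPages = 10) {
  const results = [];
  for (let i = 0; i < maxPages; i++) {
    results.push(...(await scrapePage(page)));

    // page.$ resolves to null when no element matches, which ends the loop.
    const next = await page.$('a.next-page');
    if (!next) break;

    // Start waiting before clicking so the navigation event is not missed.
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'networkidle2' }),
      page.click('a.next-page'),
    ]);
  }
  return results;
}
```

The maxPages cap is a safety net against sites whose "next" link never disappears.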
Setting a Custom User Agent
await page.setUserAgent(
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ' +
  'AppleWebKit/537.36 (KHTML, like Gecko) ' +
  'Chrome/120.0.0.0 Safari/537.36'
);
Using Proxies
const browser = await puppeteer.launch({
  headless: 'new',
  args: ['--proxy-server=http://proxy-server.com:8080']
});
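Many proxies require a username and password. Chromium's --proxy-server flag does not take embedded credentials, so a common approach is to answer the authentication challenge with page.authenticate. The helpers below are a hedged sketch: the names proxyArg and useProxyCredentials, and the host, port, and credential values, are placeholders for illustration.

```javascript
// Build the Chromium flag for an HTTP proxy. Host and port are placeholders.
function proxyArg(host, port) {
  return `--proxy-server=http://${host}:${port}`;
}

// Supply credentials on a page. page.authenticate answers HTTP
// authentication challenges, including those issued by a proxy.
async function useProxyCredentials(page, username, password) {
  await page.authenticate({ username, password });
}

// Usage, assuming `puppeteer` has been required as in the examples above:
// const browser = await puppeteer.launch({
//   args: [proxyArg('proxy-server.com', 8080)]
// });
// const page = await browser.newPage();
// await useProxyCredentials(page, 'user', 'secret');
```

Call useProxyCredentials before the first page.goto so the credentials are in place when the proxy challenges the request.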
When to Choose Puppeteer
Puppeteer is a solid choice if you are already in the Node.js ecosystem. Its primary target is Chrome and Chromium; recent releases also support Firefox, but not WebKit. If you need WebKit support, or if you want an API with more built-in auto-waiting, consider Playwright instead.
Managed Alternative
For large-scale scraping without managing browser instances, services such as ScraperAPI and ScrapingAnt offer REST APIs that return rendered HTML. This removes the need to run and scale your own Puppeteer browsers in production.
Next Steps
- Compare Puppeteer vs Playwright vs Selenium
- Learn headless vs headed browser scraping
- Intercept network requests for direct data access