Puppeteer Basics for Web Scraping

Get started with Puppeteer for web scraping in Node.js. Learn to launch headless Chrome, navigate pages, and extract data from dynamic websites.

Puppeteer is a Node.js library maintained by the Chrome team that provides a high-level API for controlling Chrome or Chromium, headless by default, over the DevTools Protocol. It predates Playwright, which was started by engineers who had previously worked on Puppeteer, and it remains popular in the JavaScript ecosystem for web scraping and testing.

Installation

npm install puppeteer

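This also downloads a compatible browser build automatically (Chromium in older releases, Chrome for Testing in recent ones). For a smaller install that skips the download and drives a browser already on the machine: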

npm install puppeteer-core
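
Because puppeteer-core ships without a browser, the launch call has to be told where one lives. The following is a minimal sketch of that, assuming the path comes from an environment variable; `coreLaunchOptions` and `CHROME_PATH` are names made up for this example, while `executablePath` is the launch option Puppeteer uses for this purpose.

```javascript
// Sketch: launch options for puppeteer-core, which has no bundled browser.
// CHROME_PATH is a hypothetical environment variable chosen for this example.
function coreLaunchOptions(env = process.env) {
  const executablePath = env.CHROME_PATH;
  if (!executablePath) {
    // Fail early with a readable message instead of a launch error later.
    throw new Error('Set CHROME_PATH to a Chrome or Chromium binary');
  }
  return { executablePath };
}

// Usage (requires puppeteer-core to be installed):
//   const puppeteer = require('puppeteer-core');
//   const browser = await puppeteer.launch(coreLaunchOptions());
```

Keeping the path outside the code makes the same script usable on machines where the browser is installed in different locations.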

Basic Scraping Example

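const puppeteer = require('puppeteer');

(async () => {
  // 'new' opts in to Chrome's new headless mode on Puppeteer releases that
  // accept it; from v22 on, `headless: true` selects that mode instead.
  const browser = await puppeteer.launch({ headless: 'new' });

  try {
    const page = await browser.newPage();

    // networkidle2: navigation counts as finished once there have been
    // no more than 2 network connections for at least 500 ms.
    await page.goto('https://quotes.toscrape.com/js/', {
      waitUntil: 'networkidle2'
    });

    // This callback runs inside the page; only serializable data comes back.
    const quotes = await page.evaluate(() => {
      const elements = document.querySelectorAll('.quote');
      return Array.from(elements).map(el => ({
        text: el.querySelector('.text').innerText,
        author: el.querySelector('.author').innerText
      }));
    });

    quotes.forEach(q => console.log(`${q.text}, ${q.author}`));
  } finally {
    // Close the browser even if a step above throws, so no process is left behind.
    await browser.close();
  }
})().catch(err => {
  console.error(err);
  process.exitCode = 1;
});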

Key Methods

Method	Purpose
`page.goto(url)`	Navigate to a URL
`page.waitForSelector(sel)`	Wait for element to appear
`page.$(sel)`	Find the first matching element (returns an ElementHandle, or null if none matches)
`page.$$(sel)`	Find all matching elements (returns an array of ElementHandles)
`page.evaluate(fn)`	Run JavaScript in the page context
`page.click(sel)`	Click an element
`page.type(sel, text)`	Type text into an input
`page.screenshot()`	Take a screenshot
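
Several of these methods are typically used together. The sketch below shows one way to combine them: it fills a search box, submits it, and saves a screenshot of the result page. The helper name `searchAndCapture` and all selectors passed to it are hypothetical; only the `page.*` calls are Puppeteer methods.

```javascript
// Sketch: fill a search box, submit it, and capture the result page.
// `searchAndCapture` and every selector passed to it are hypothetical.
async function searchAndCapture(page, { inputSelector, submitSelector, query, screenshotPath }) {
  // Make sure the input exists before typing into it.
  await page.waitForSelector(inputSelector);
  await page.type(inputSelector, query);

  // Start waiting for the navigation before the click that triggers it.
  await Promise.all([
    page.waitForNavigation(),
    page.click(submitSelector),
  ]);

  // fullPage captures the whole scrollable page, not just the viewport.
  await page.screenshot({ path: screenshotPath, fullPage: true });
  return page.url();
}

// Usage, given a `page` from browser.newPage():
//   await searchAndCapture(page, {
//     inputSelector: '#q', submitSelector: 'button[type=submit]',
//     query: 'puppeteer', screenshotPath: 'results.png',
//   });
```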

Extracting Data with `page.evaluate`

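The `page.evaluate` method serializes a function, runs it inside the page, and returns the result to Node.js. The function cannot see variables from your Node.js scope (pass them as extra arguments instead), and its return value must be serializable, so return plain objects and strings rather than DOM nodes. This is the primary way to extract data: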

// Extract all links from a page
const links = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('a')).map(a => ({
    text: a.innerText.trim(),
    href: a.href
  }));
});
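
A variation on the example above, as a sketch: the selector is passed in from Node.js as an argument to `page.evaluate`, and the raw result is cleaned up back in Node.js. `extractLinks` is an illustrative name, and the filtering rules (http(s) only, no duplicate URLs) are assumptions about what a scraper might want rather than anything Puppeteer requires.

```javascript
// Sketch: pass a Node-side value into the page and post-process the result.
// `extractLinks` and its parameters are illustrative names, not Puppeteer APIs.
async function extractLinks(page, selector = 'a') {
  // Extra arguments to page.evaluate are serialized and handed to the page
  // function (here as `sel`); the function cannot read Node variables directly.
  const raw = await page.evaluate((sel) => {
    return Array.from(document.querySelectorAll(sel)).map((a) => ({
      text: (a.textContent || '').trim(),
      href: a.href,
    }));
  }, selector);

  // Back in Node: keep only http(s) links and drop duplicates by URL.
  const seen = new Set();
  return raw.filter((link) => {
    if (!/^https?:\/\//.test(link.href) || seen.has(link.href)) return false;
    seen.add(link.href);
    return true;
  });
}

// Usage: const links = await extractLinks(page, 'nav a');
```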

Handling Navigation and Waiting

// Click a link and wait for navigation
await Promise.all([
  page.waitForNavigation({ waitUntil: 'networkidle2' }),
  page.click('a.next-page')
]);

// Wait for a specific element
await page.waitForSelector('.results-loaded');

// Wait for a function to return true
await page.waitForFunction(
  () => document.querySelectorAll('.item').length > 10
);
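
Each of these waits rejects if its condition is not met before the timeout (30 seconds by default), which matters when an element is optional. The sketch below wraps `page.waitForSelector` so that a timeout yields `null` instead of aborting the run. `waitForOptional` is a hypothetical helper, and checking `err.name` is an assumption about how Puppeteer labels its timeout errors; an `instanceof` check against Puppeteer's exported `TimeoutError` class is the alternative if the library is imported.

```javascript
// Sketch: wait for an optional element without failing the whole run.
// `waitForOptional` is an illustrative helper, not part of Puppeteer.
async function waitForOptional(page, selector, timeoutMs = 5000) {
  try {
    // `timeout` is given in milliseconds.
    return await page.waitForSelector(selector, { timeout: timeoutMs });
  } catch (err) {
    // Assumption: Puppeteer's timeout errors carry the name 'TimeoutError'.
    if (err && err.name === 'TimeoutError') return null;
    throw err; // anything else is a real failure
  }
}

// Usage:
//   const banner = await waitForOptional(page, '.cookie-banner', 3000);
//   if (banner) await banner.click();
```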

Setting a Custom User Agent

await page.setUserAgent(
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ' +
  'AppleWebKit/537.36 (KHTML, like Gecko) ' +
  'Chrome/120.0.0.0 Safari/537.36'
);
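
The user agent is only one of the values a site receives, so it is often set alongside related page settings. A minimal sketch, assuming you also want a matching `Accept-Language` header and a desktop-sized viewport; `applyIdentity` and its default values are illustrative choices, while `setUserAgent`, `setExtraHTTPHeaders`, and `setViewport` are existing page methods.

```javascript
// Sketch: apply a user agent together with related page settings.
// `applyIdentity` and its defaults are hypothetical choices for this example.
async function applyIdentity(page, {
  userAgent,
  acceptLanguage = 'en-US,en;q=0.9',
  viewport = { width: 1366, height: 768 },
}) {
  await page.setUserAgent(userAgent);
  // Sent with every request the page makes from now on.
  await page.setExtraHTTPHeaders({ 'Accept-Language': acceptLanguage });
  await page.setViewport(viewport);
}

// Usage: call before page.goto() so the first request already carries the values.
//   await applyIdentity(page, { userAgent: 'Mozilla/5.0 ...' });
```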

Using Proxies

const browser = await puppeteer.launch({
  headless: 'new',
  args: ['--proxy-server=http://proxy-server.com:8080']
});
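
Chrome's `--proxy-server` flag does not accept a username and password embedded in the URL; for authenticated proxies, Puppeteer provides `page.authenticate` to supply them. The sketch below splits a single proxy URL into the launch argument and the credentials. `proxyConfig` and the `PROXY_URL` variable in the usage comment are hypothetical names.

```javascript
// Sketch: split a proxy URL into a launch flag and optional credentials.
// `proxyConfig` is an illustrative helper, not part of Puppeteer.
function proxyConfig(proxyUrl) {
  const u = new URL(proxyUrl);
  // The flag gets scheme, host, and port only; credentials are handled separately.
  const args = [`--proxy-server=${u.protocol}//${u.host}`];
  const credentials = u.username
    ? {
        username: decodeURIComponent(u.username),
        password: decodeURIComponent(u.password),
      }
    : null;
  return { args, credentials };
}

// Usage (PROXY_URL is a hypothetical environment variable):
//   const { args, credentials } = proxyConfig(process.env.PROXY_URL);
//   const browser = await puppeteer.launch({ args });
//   const page = await browser.newPage();
//   if (credentials) await page.authenticate(credentials);
```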

When to Choose Puppeteer

Puppeteer is a solid choice if you are already in the Node.js ecosystem. However, it is built primarily around Chrome and Chromium; Firefox is also supported in recent releases, but WebKit is not. If you need WebKit, or if you want an API with built-in auto-waiting on actions, consider Playwright instead.

Managed Alternative

For large-scale scraping without managing browser instances, services such as ScraperAPI and ScrapingAnt offer REST APIs that return rendered HTML. For pages that only need rendering, this can remove the need to run and scale your own Puppeteer browsers in production.

Next Steps

Compare Puppeteer vs Playwright vs Selenium
Learn headless vs headed browser scraping
Intercept network requests for direct data access