Puppeteer Basics for Web Scraping

Get started with Puppeteer for web scraping in Node.js. Learn to launch headless Chrome, navigate pages, and extract data from dynamic websites.

Browser Automation · #12 · beginner · 3 min read

Puppeteer is a Node.js library maintained by the Chrome team that provides a high-level API for controlling Chrome or Chromium, headless by default, over the DevTools Protocol. Playwright was later built by engineers who had worked on Puppeteer, and Puppeteer remains popular in the JavaScript ecosystem for web scraping and testing.

Installation

npm install puppeteer

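This also downloads a compatible browser build automatically (Chrome for Testing in recent versions, Chromium in older ones). For a smaller install that skips the browser download and drives a Chrome or Chromium you already have installed: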

npm install puppeteer-core
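
Because puppeteer-core ships without a browser, you have to tell it which one to launch. A minimal sketch, assuming Chrome or Chromium is already installed and that its path is supplied through a CHROME_PATH environment variable (a name chosen for this example, not something Puppeteer reads on its own):

```javascript
// Build launch options for puppeteer-core, which never downloads a browser.
// CHROME_PATH is a hypothetical environment variable used only in this sketch.
function buildLaunchOptions(env = process.env) {
  const executablePath = env.CHROME_PATH;
  if (!executablePath) {
    throw new Error('Set CHROME_PATH to the Chrome or Chromium binary');
  }
  return { executablePath, headless: true };
}

async function launchInstalledChrome() {
  // Required lazily so the options helper works without the package loaded.
  const puppeteer = require('puppeteer-core');
  return puppeteer.launch(buildLaunchOptions());
}
```

Calling `const browser = await launchInstalledChrome();` then gives you the same browser object as puppeteer.launch in the full package.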

Basic Scraping Example

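const puppeteer = require('puppeteer');

(async () => {
  // In Puppeteer v22 and later, `headless: true` already selects the new
  // headless mode and the 'new' value is deprecated.
  const browser = await puppeteer.launch({ headless: 'new' });
  try {
    const page = await browser.newPage();

    // networkidle2: done once no more than 2 connections stay open for 500 ms
    await page.goto('https://quotes.toscrape.com/js/', {
      waitUntil: 'networkidle2'
    });

    // Runs inside the page; only serializable data comes back to Node.js
    const quotes = await page.evaluate(() => {
      const elements = document.querySelectorAll('.quote');
      return Array.from(elements).map(el => ({
        text: el.querySelector('.text').innerText,
        author: el.querySelector('.author').innerText
      }));
    });

    quotes.forEach(q => console.log(`${q.text}, ${q.author}`));
  } finally {
    // Close the browser even if navigation or extraction throws
    await browser.close();
  }
})();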

Key Methods

Method                      Purpose
page.goto(url)              Navigate to a URL
page.waitForSelector(sel)   Wait for an element to appear
page.$(sel)                 Find the first matching element (returns an ElementHandle, or null if none)
page.$$(sel)                Find all matching elements (returns an array of ElementHandles)
page.evaluate(fn)           Run JavaScript in the page context
page.click(sel)             Click an element
page.type(sel, text)        Type text into an input
page.screenshot()           Take a screenshot
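
A short sketch of how the handle-based methods fit together. The selectors (`h1`, `.item`), the screenshot path, and the function name are placeholders for this example. Note that page.$ resolves to null when nothing matches, so the result needs a check before use:

```javascript
// Summarise a page using page.$, page.$$ and page.screenshot.
// Selectors and the output path are placeholders for this sketch.
async function summarisePage(page) {
  const heading = await page.$('h1'); // ElementHandle, or null if no match
  let title = null;
  if (heading) {
    // ElementHandle.evaluate runs the callback in the page with the element
    title = await heading.evaluate(el => el.textContent.trim());
    await heading.dispose(); // release the handle once it is no longer needed
  }

  const items = await page.$$('.item'); // array of handles, possibly empty
  await page.screenshot({ path: 'page.png', fullPage: true });

  return { title, itemCount: items.length };
}
```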

Extracting Data with page.evaluate

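The page.evaluate method runs a function inside the browser context, where it has access to the DOM, and sends the return value back to Node.js. The function cannot see variables from your Node.js scope, so pass them in as extra arguments, and it can only return serializable values such as strings, numbers, arrays, and plain objects. This is the primary way to extract data: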

// Extract all links from a page
const links = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('a')).map(a => ({
    text: a.innerText.trim(),
    href: a.href
  }));
});
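
Values from Node.js reach the page function as extra arguments to page.evaluate. A minimal sketch, where the helper name extractTexts is made up for this example:

```javascript
// Collect text from every element matching a selector chosen in Node.js.
// The selector travels into the page as an argument, not through a closure.
async function extractTexts(page, selector) {
  return page.evaluate((sel) => {
    // This body runs in the browser: `document` exists, Node.js variables do not.
    return Array.from(document.querySelectorAll(sel))
      .map(el => el.textContent.trim());
  }, selector);
}
```

For instance, `const authors = await extractTexts(page, '.author');` would gather the author names from the quotes page used earlier.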

Handling Navigation and Waiting

// Click a link and wait for navigation
await Promise.all([
  page.waitForNavigation({ waitUntil: 'networkidle2' }),
  page.click('a.next-page')
]);

// Wait for a specific element
await page.waitForSelector('.results-loaded');

// Wait for a function to return true
await page.waitForFunction(
  () => document.querySelectorAll('.item').length > 10
);
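
The waiting methods reject with a timeout error (after 30 seconds by default) when the condition never becomes true. A minimal sketch of treating that as "not found" instead of letting the script crash. The helper name and the check on err.name are choices made for this example:

```javascript
// Wait for a selector, returning false on timeout instead of throwing.
async function appearsWithin(page, selector, timeoutMs = 5000) {
  try {
    await page.waitForSelector(selector, { timeout: timeoutMs });
    return true;
  } catch (err) {
    // Puppeteer's timeout errors are named 'TimeoutError'; rethrow anything else.
    if (err && err.name === 'TimeoutError') return false;
    throw err;
  }
}
```

A scraper could then branch on the result, for example `if (!(await appearsWithin(page, '.results-loaded'))) { /* skip this page */ }`.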

Setting a Custom User Agent

await page.setUserAgent(
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ' +
  'AppleWebKit/537.36 (KHTML, like Gecko) ' +
  'Chrome/120.0.0.0 Safari/537.36'
);
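
A user agent is often set together with matching request headers, so that the two do not contradict each other. A small sketch using page.setExtraHTTPHeaders; the helper name and the option names are assumptions made for this example:

```javascript
// Apply a user agent together with a matching Accept-Language header.
async function applyIdentity(page, { userAgent, acceptLanguage }) {
  await page.setUserAgent(userAgent);
  // Extra headers are sent with every request the page makes
  await page.setExtraHTTPHeaders({ 'Accept-Language': acceptLanguage });
}
```

Call it before page.goto so the first request already carries both values.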

Using Proxies

const browser = await puppeteer.launch({
  headless: 'new',
  args: ['--proxy-server=http://proxy-server.com:8080']
});
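
Chrome ignores a username and password embedded in the --proxy-server value, so an authenticated proxy needs page.authenticate as well. A sketch under that assumption; the helper names and the example proxy URL are hypothetical:

```javascript
// Split a proxy URL into the Chrome flag and the credentials.
function parseProxy(proxyUrl) {
  const u = new URL(proxyUrl);
  return {
    arg: `--proxy-server=${u.protocol}//${u.host}`,
    credentials: u.username
      ? {
          username: decodeURIComponent(u.username),
          password: decodeURIComponent(u.password),
        }
      : null,
  };
}

async function launchWithProxy(proxyUrl) {
  const puppeteer = require('puppeteer'); // required lazily for this sketch
  const { arg, credentials } = parseProxy(proxyUrl);
  const browser = await puppeteer.launch({ args: [arg] });
  const page = await browser.newPage();
  // page.authenticate answers the proxy's HTTP authentication challenge
  if (credentials) await page.authenticate(credentials);
  return { browser, page };
}
```

Usage would look like `const { browser, page } = await launchWithProxy('http://user:secret@proxy-server.com:8080');` followed by the usual page.goto calls.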

When to Choose Puppeteer

Puppeteer is a solid choice if you are already in the Node.js ecosystem and mainly target Chrome or Chromium. Recent versions can also drive Firefox, but WebKit is not supported. If you need WebKit, or if you want built-in auto-waiting on actions such as click, consider Playwright instead.

Managed Alternative

For large-scale scraping without managing browser instances, services such as ScraperAPI and ScrapingAnt offer REST APIs that return rendered HTML. This can remove the need to run and maintain Puppeteer yourself in production.

Next Steps

  • Compare Puppeteer vs Playwright vs Selenium
  • Learn headless vs headed browser scraping
  • Intercept network requests for direct data access