Parallel Browser Scraping

Learn to run multiple browser instances in parallel for high-speed web scraping using Playwright async API and Selenium with threading.

Scraping pages one at a time is slow, because most of each page load is spent waiting on the network. If you need to scrape hundreds or thousands of pages, running several browser instances or pages in parallel lets those waits overlap and can cut total run time substantially. Playwright's async API and Selenium with threading or multiprocessing make this possible.

Playwright Async: Multiple Pages in Parallel

Playwright's async API with asyncio is the cleanest way to scrape in parallel:

import asyncio
from playwright.async_api import async_playwright

async def scrape_page(browser, url):
    page = await browser.new_page()
    try:
        await page.goto(url, timeout=30000)
        await page.wait_for_selector("body")
        title = await page.title()
        return {"url": url, "title": title}
    except Exception as e:
        return {"url": url, "error": str(e)}
    finally:
        await page.close()

async def main():
    urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)]

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        # Scrape all pages concurrently
        tasks = [scrape_page(browser, url) for url in urls]
        results = await asyncio.gather(*tasks)

        for result in results:
            print(result)

        await browser.close()

asyncio.run(main())
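
The `timeout=30000` in the example above bounds a single call, not the whole per-URL task. If you also want one overall deadline per URL, `asyncio.wait_for` can wrap any coroutine. The following is a minimal sketch, not a definitive implementation: `fake_scrape`, `with_deadline`, `run_all`, and the example.com URLs are hypothetical names introduced for illustration, and `fake_scrape` sleeps instead of driving a browser so the sketch runs without Playwright.

```python
import asyncio

async def fake_scrape(url, delay):
    # Hypothetical stand-in for scrape_page(browser, url): sleeps instead of loading a page
    await asyncio.sleep(delay)
    return {"url": url, "title": f"Title of {url}"}

async def with_deadline(coro, url, seconds):
    # Cancel the wrapped coroutine and return an error record if it exceeds the deadline
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return {"url": url, "error": f"timed out after {seconds}s"}

async def run_all(jobs, seconds):
    # jobs is a list of (url, simulated_delay) pairs; gather keeps results in input order
    tasks = [with_deadline(fake_scrape(url, delay), url, seconds) for url, delay in jobs]
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    jobs = [("https://example.com/fast", 0.01), ("https://example.com/slow", 1.0)]
    for result in asyncio.run(run_all(jobs, seconds=0.2)):
        print(result)
```

In a real scraper you would pass `scrape_page(browser, url)` as the coroutine. Because `scrape_page` closes its page in a `finally` block, the page is still cleaned up when the deadline cancels the task.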

Controlling Concurrency with Semaphores

Running too many pages in parallel can overwhelm your system or trigger rate limits. Use a semaphore to limit concurrency:

import asyncio
from playwright.async_api import async_playwright

MAX_CONCURRENT = 5

async def scrape_page(semaphore, browser, url):
    async with semaphore:
        page = await browser.new_page()
        try:
            await page.goto(url, timeout=30000)
            await page.wait_for_selector("body")
            title = await page.title()
            content = await page.inner_text("body")
            return {"url": url, "title": title, "length": len(content)}
        except Exception as e:
            return {"url": url, "error": str(e)}
        finally:
            await page.close()

async def main():
    urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 51)]
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        tasks = [scrape_page(semaphore, browser, url) for url in urls]
        results = await asyncio.gather(*tasks)

        successful = [r for r in results if "error" not in r]
        print(f"Successfully scraped {len(successful)}/{len(urls)} pages")

        await browser.close()

asyncio.run(main())
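
To see what the semaphore actually does, it can help to measure how many tasks are in flight at once. The sketch below is an illustration under stated assumptions: `fake_fetch`, `run`, and the `stats` dictionary are hypothetical, and the page load is simulated with `asyncio.sleep` so the code runs without a browser.

```python
import asyncio

async def fake_fetch(semaphore, stats, url):
    # Hypothetical stand-in for a page load; records how many tasks hold the semaphore
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # simulated network wait
        stats["active"] -= 1
        return url

async def run(urls, limit):
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    # All tasks are created up front, but only `limit` can be inside the semaphore at once
    results = await asyncio.gather(*(fake_fetch(semaphore, stats, u) for u in urls))
    return results, stats["peak"]

if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(20)]
    results, peak = asyncio.run(run(urls, limit=5))
    print(f"{len(results)} results, peak concurrency {peak}")
```

The peak never exceeds the limit, however many URLs are queued. This is the same guarantee `MAX_CONCURRENT` gives the Playwright version above.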

Selenium with ThreadPoolExecutor

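Selenium's Python bindings have no async API, but you can run several drivers in a thread pool. A WebDriver instance is not thread-safe, so each task below creates and quits its own driver: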

from concurrent.futures import ThreadPoolExecutor, as_completed
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def scrape_url(url):
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(options=options)

    try:
        # Set the implicit wait before navigating; it applies to later element lookups
        driver.implicitly_wait(5)
        driver.get(url)
        title = driver.title
        return {"url": url, "title": title}
    except Exception as e:
        return {"url": url, "error": str(e)}
    finally:
        driver.quit()

urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)]

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(scrape_url, url): url for url in urls}
    for future in as_completed(futures):
        result = future.result()
        print(result)

Multiple Browser Contexts (Lightweight Parallelism)

Instead of separate browser instances, use contexts within one browser for lower memory usage:

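import asyncio
from playwright.async_api import async_playwright

async def scrape_with_context(browser, url):
    # Each context gets its own cookies and storage
    context = await browser.new_context()
    page = await context.new_page()
    try:
        await page.goto(url, timeout=30000)
        return await page.title()
    finally:
        await context.close()  # also closes the context's pages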

Performance Tips

  • Block unnecessary resources (images, CSS, fonts) to reduce bandwidth and speed up page loads
  • Reuse browser instances instead of launching new ones for each URL
  • Use contexts instead of separate browsers when proxy configuration is the same
  • Set timeouts to prevent slow pages from blocking the entire pipeline
  • Monitor memory as each browser page consumes significant RAM
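
The first tip can be sketched with Playwright's request routing. This is a minimal example under assumptions: `BLOCKED_TYPES` and `block_heavy_resources` are hypothetical names, and the set of blocked types is a starting point to tune per site, not a recommendation that fits every page.

```python
BLOCKED_TYPES = {"image", "stylesheet", "font", "media"}

async def block_heavy_resources(route):
    # Abort requests for resource types the scraper does not need; let the rest through
    if route.request.resource_type in BLOCKED_TYPES:
        await route.abort()
    else:
        await route.continue_()

# Usage with Playwright, registered before navigating:
#     await page.route("**/*", block_heavy_resources)
#     await page.goto(url, timeout=30000)
```

Blocking stylesheets can change how a page renders, so it is worth checking that the selectors you wait for still appear with blocking enabled.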

Scaling Beyond a Single Machine

For large-scale parallel scraping beyond what one machine can handle, consider a hosted service such as ScraperAPI, which runs concurrent requests across its own proxy pool. ScrapingAnt similarly offers high-concurrency scraping without requiring you to manage browser processes yourself.

Next Steps

  • Scrape with Playwright in Python at scale
  • Set up Selenium Grid for distributed scraping
  • Learn anti-detection techniques for parallel scraping