Scraping with Selenium Grid - Browser Automation

Learn to set up and use Selenium Grid for distributed, parallel web scraping across multiple machines and browser instances.

Selenium Grid lets you run browser automation across multiple machines. Instead of running everything on a single computer, you distribute scraping tasks to a pool of browser nodes. This is the standard approach for scaling Selenium-based scraping beyond what a single machine can handle.

How Selenium Grid Works

Selenium Grid has two components:

Hub: The central server that receives requests and routes them to available nodes
Nodes: Machines running browser instances that execute the actual scraping

Your scraper connects only to the Hub. The Hub queues each new session request and assigns it to a Node that has a free browser slot.
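To make the routing idea concrete, here is a toy model of a hub handing sessions to nodes. It is only a sketch: the Node and Hub classes, the "most free slots" policy, and the names are assumptions made for illustration, and they do not reproduce Selenium Grid's real distributor logic.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """A machine offering a fixed number of browser slots (hypothetical model)."""
    name: str
    max_sessions: int
    active: int = 0

    def has_free_slot(self):
        return self.active < self.max_sessions


@dataclass
class Hub:
    """Routes session requests to nodes and queues them when every slot is busy."""
    nodes: list
    queue: deque = field(default_factory=deque)

    def request_session(self, client):
        candidates = [n for n in self.nodes if n.has_free_slot()]
        if not candidates:
            # No free slot anywhere: the request waits in the queue
            self.queue.append(client)
            return None
        # Policy chosen for this sketch: prefer the node with the most free slots
        node = max(candidates, key=lambda n: n.max_sessions - n.active)
        node.active += 1
        return node.name

    def release_session(self, node_name):
        # Free the slot, then give it to the oldest queued request, if any
        node = next(n for n in self.nodes if n.name == node_name)
        node.active -= 1
        if self.queue:
            client = self.queue.popleft()
            return client, self.request_session(client)
        return None


hub = Hub([Node("chrome-1", 1), Node("chrome-2", 1)])
assignments = [hub.request_session(f"scraper-{i}") for i in range(3)]
print(assignments)  # prints ['chrome-1', 'chrome-2', None]
```

The third request gets no node because both slots are taken; it stays queued until a session is released, which is roughly what a scraper sees as a slow webdriver.Remote() call when the Grid is full.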

Quick Setup with Docker

The fastest way to run Selenium Grid is with Docker Compose:

# docker-compose.yml
version: "3"
services:
  selenium-hub:
    image: selenium/hub:4.15.0
    ports:
      - "4442:4442"
      - "4443:4443"
      - "4444:4444"

  chrome-node:
    image: selenium/node-chrome:4.15.0
    depends_on:
      - selenium-hub
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
      - SE_NODE_MAX_SESSIONS=4
    shm_size: "2gb"

  firefox-node:
    image: selenium/node-firefox:4.15.0
    depends_on:
      - selenium-hub
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
      - SE_NODE_MAX_SESSIONS=4
    shm_size: "2gb"

Start the grid:

docker-compose up -d --scale chrome-node=3

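This creates a hub, three Chrome nodes, and one Firefox node. Each node is configured for up to four concurrent sessions. Note that the node images may cap concurrency at the number of available CPU cores unless SE_NODE_OVERRIDE_MAX_SESSIONS=true is also set, so check the dashboard for the actual slot count.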
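The containers take a few seconds to register with the hub, so a scraper started immediately after the command above can fail to get a session. A minimal sketch of a readiness check follows. It assumes the hub's /status endpoint is reachable on localhost:4444 and that its JSON body carries a value.ready flag; the function names, attempt count, and delay are illustrative choices.

```python
import json
import time
import urllib.request

STATUS_URL = "http://localhost:4444/status"  # hub port published by the Compose file


def fetch_status(url=STATUS_URL, timeout=5):
    """Return the parsed JSON body of the Grid status endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def wait_until_ready(fetch=fetch_status, attempts=30, delay=2.0):
    """Poll until the Grid reports ready. Return True on success, False on timeout."""
    for _ in range(attempts):
        try:
            if fetch().get("value", {}).get("ready") is True:
                return True
        except (OSError, ValueError):
            # Hub not reachable yet, or the body was not valid JSON: try again
            pass
        time.sleep(delay)
    return False
```

Calling wait_until_ready() before creating the first driver lets a script exit early with a clear message when the Grid never comes up. The fetch parameter exists so the polling logic can be exercised without a running hub.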

Connecting Your Scraper to the Grid

from selenium import webdriver
from selenium.webdriver.common.by import By

# Connect to the Selenium Grid hub
options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Remote(
    command_executor="http://localhost:4444/wd/hub",
    options=options
)

try:
    # Set the implicit wait first so it applies to every element lookup
    driver.implicitly_wait(10)
    driver.get("https://quotes.toscrape.com")

    quotes = driver.find_elements(By.CSS_SELECTOR, ".quote")
    for quote in quotes:
        text = quote.find_element(By.CSS_SELECTOR, ".text").text
        author = quote.find_element(By.CSS_SELECTOR, ".author").text
        print(f"{text}, {author}")
finally:
    driver.quit()

Parallel Scraping with Selenium Grid

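Use a thread pool to open several Grid sessions at once. Each worker creates and closes its own driver, because a WebDriver instance should not be shared between threads: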

from concurrent.futures import ThreadPoolExecutor, as_completed
from selenium import webdriver
from selenium.webdriver.common.by import By

GRID_URL = "http://localhost:4444/wd/hub"

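def scrape_page(url):
    """Scrape one page in its own Grid session and return a result dict."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")

    driver = None
    try:
        # Create the session inside the try block so that a failed or
        # timed-out session request is reported like any other error
        driver = webdriver.Remote(
            command_executor=GRID_URL,
            options=options
        )
        driver.implicitly_wait(10)
        driver.get(url)
        title = driver.title

        quotes = driver.find_elements(By.CSS_SELECTOR, ".quote .text")
        texts = [q.text for q in quotes]

        return {"url": url, "title": title, "quotes": len(texts)}
    except Exception as e:
        return {"url": url, "error": str(e)}
    finally:
        # Always release the Grid slot, even after an error
        if driver is not None:
            driver.quit()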

urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)]

with ThreadPoolExecutor(max_workers=6) as executor:
    futures = {executor.submit(scrape_page, url): url for url in urls}
    for future in as_completed(futures):
        result = future.result()
        if "error" in result:
            print(f"FAILED: {result['url']}, {result['error']}")
        else:
            print(f"OK: {result['url']}, {result['quotes']} quotes")
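Failures on a Grid are often transient (a node restarting, a session timing out), so it can be worth re-submitting only the URLs that failed. The sketch below is one way to do that. It assumes a scrape function shaped like scrape_page, meaning it returns a dict with a "url" key and an "error" key on failure and does not raise. The helper name scrape_with_retries and its defaults are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_with_retries(scrape, urls, max_workers=6, max_attempts=3):
    """Run scrape(url) in parallel, re-submitting URLs whose result has an "error" key.

    Returns a dict mapping each URL to its most recent result.
    """
    results = {}
    pending = list(urls)
    for _ in range(max_attempts):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            batch = list(executor.map(scrape, pending))
        for result in batch:
            results[result["url"]] = result
        # Only URLs that failed in this round are tried again
        pending = [r["url"] for r in batch if "error" in r]
    return results
```

With the code above, scrape_with_retries(scrape_page, urls) would replace the ThreadPoolExecutor loop. A URL that still fails after max_attempts keeps its last error dict in the returned mapping, so nothing is silently dropped.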

Monitoring the Grid

Selenium Grid 4 provides a web dashboard at http://localhost:4444/ui. You can see active sessions, node status, and queue information.

You can also query the Grid status via API:

curl http://localhost:4444/status
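The status response is JSON, so it is easy to turn into a capacity summary for logging or alerting. The following sketch assumes the Grid 4 layout in which nodes appear under value.nodes and each slot has a session field that is null when the slot is free. Treat those field names as assumptions and compare them with the real output of the curl command before relying on them. The sample dict is made up for the example.

```python
def summarize_status(status):
    """Count nodes, total slots, and busy slots in a parsed /status response."""
    value = status.get("value", {})
    nodes = value.get("nodes", [])
    slots = [slot for node in nodes for slot in node.get("slots", [])]
    # A slot is treated as busy when it carries a non-empty session object
    busy = sum(1 for slot in slots if slot.get("session"))
    return {
        "ready": bool(value.get("ready")),
        "nodes": len(nodes),
        "slots": len(slots),
        "busy": busy,
    }


# Hand-written sample in the assumed shape: one node, two slots, one in use
sample = {
    "value": {
        "ready": True,
        "nodes": [
            {"slots": [{"session": None}, {"session": {"sessionId": "abc"}}]}
        ],
    }
}
print(summarize_status(sample))
# prints {'ready': True, 'nodes': 1, 'slots': 2, 'busy': 1}
```

A busy count that stays equal to the slot count for long periods suggests the thread pool is larger than the Grid can serve and requests are waiting in the queue.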

Scaling Considerations

Each browser session uses roughly 200-500MB of RAM
Set SE_NODE_MAX_SESSIONS based on your node's available memory
Use shm_size: "2gb" in Docker to avoid browser crashes; Docker's default /dev/shm size of 64 MB is too small for Chrome and Firefox
Consider Kubernetes for auto-scaling node pools
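The memory guideline above translates into simple arithmetic for choosing SE_NODE_MAX_SESSIONS. The helper below is a rough sketch: it budgets for the upper end of the 200-500MB range and sets aside a reserve for the operating system and the node process. The function name and the 1024 MB reserve are assumptions, not measured values.

```python
def max_sessions_for(node_ram_mb, per_session_mb=500, reserved_mb=1024):
    """Estimate how many browser sessions fit in a node's memory."""
    usable_mb = max(node_ram_mb - reserved_mb, 0)
    return usable_mb // per_session_mb


print(max_sessions_for(4096))  # prints 6
print(max_sessions_for(8192))  # prints 14
```

For a 4 GB node this gives (4096 - 1024) // 500 = 6 sessions. Heavy pages can exceed the per-session estimate, so it is safer to start lower and raise the value while watching memory use on the node.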

Managed Alternatives

Running Selenium Grid means maintaining infrastructure: managing Docker containers, monitoring the hub, and recovering from node failures. Managed services such as ScraperAPI and ScrapingAnt offer hosted scraping through an HTTP API, so there is no Grid to operate. They handle browser management, scaling, and proxy rotation for you.

Next Steps

Learn browser automation anti-detection techniques
Compare Playwright vs Selenium vs Puppeteer
Explore parallel scraping with Playwright async