Scraping with Selenium Grid
Learn to set up and use Selenium Grid for distributed, parallel web scraping across multiple machines and browser instances.
Selenium Grid lets you run browser automation across multiple machines. Instead of running everything on a single computer, you distribute scraping tasks to a pool of browser nodes. This is the standard approach for scaling Selenium-based scraping beyond what a single machine can handle.
How Selenium Grid Works
Selenium Grid has two components:
- Hub: The central server that receives requests and routes them to available nodes
- Nodes: Machines running browser instances that execute the actual scraping
Your scraper connects to the Hub, which assigns work to available Nodes.
Quick Setup with Docker
The fastest way to run Selenium Grid is with Docker Compose:
# docker-compose.yml
version: "3"
services:
  selenium-hub:
    image: selenium/hub:4.15.0
    ports:
      - "4442:4442"
      - "4443:4443"
      - "4444:4444"

  chrome-node:
    image: selenium/node-chrome:4.15.0
    depends_on:
      - selenium-hub
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
      - SE_NODE_MAX_SESSIONS=4
    shm_size: "2gb"

  firefox-node:
    image: selenium/node-firefox:4.15.0
    depends_on:
      - selenium-hub
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
      - SE_NODE_MAX_SESSIONS=4
    shm_size: "2gb"
Start the grid:
docker-compose up -d --scale chrome-node=3
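This creates a hub, three Chrome nodes, and one Firefox node, each node configured for up to four concurrent sessions. Note that the official Docker images may cap sessions at the number of available CPU cores unless SE_NODE_OVERRIDE_MAX_SESSIONS=true is also set, so check the dashboard to confirm the slot count you actually get.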
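To size client-side concurrency, it helps to work out how many session slots this layout advertises. The following is a rough sketch only: the `grid_capacity` helper and its input shape are hypothetical, and the numbers assume every node really offers its configured SE_NODE_MAX_SESSIONS.

```python
def grid_capacity(nodes):
    """Return (slots per browser, total slots) for a hypothetical grid layout.

    `nodes` maps a browser name to (replica_count, max_sessions_per_replica).
    """
    per_browser = {
        browser: replicas * max_sessions
        for browser, (replicas, max_sessions) in nodes.items()
    }
    return per_browser, sum(per_browser.values())


# Layout from the compose file, started with --scale chrome-node=3
layout = {"chrome": (3, 4), "firefox": (1, 4)}
per_browser, total = grid_capacity(layout)
print(per_browser)  # {'chrome': 12, 'firefox': 4}
print(total)        # 16
```

A scraper that only requests Chrome sessions can therefore keep at most 12 sessions busy at once under these assumptions; any further requests wait in the Grid's queue.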
Connecting Your Scraper to the Grid
from selenium import webdriver
from selenium.webdriver.common.by import By

# Connect to the Selenium Grid hub
options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Remote(
    command_executor="http://localhost:4444/wd/hub",
    options=options
)

try:
    driver.get("https://quotes.toscrape.com")
    driver.implicitly_wait(10)

    quotes = driver.find_elements(By.CSS_SELECTOR, ".quote")
    for quote in quotes:
        text = quote.find_element(By.CSS_SELECTOR, ".text").text
        author = quote.find_element(By.CSS_SELECTOR, ".author").text
        print(f"{text}, {author}")
finally:
    driver.quit()
Parallel Scraping with Selenium Grid
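Use a thread pool to open several Grid sessions at once. Each worker creates and quits its own driver, because a WebDriver instance should not be shared between threads: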
from concurrent.futures import ThreadPoolExecutor, as_completed
from selenium import webdriver
from selenium.webdriver.common.by import By

GRID_URL = "http://localhost:4444/wd/hub"

def scrape_page(url):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")

    driver = None
    try:
        # Create the session inside the try block so that a Grid connection
        # or queue failure is reported as an error result instead of raising.
        driver = webdriver.Remote(
            command_executor=GRID_URL,
            options=options
        )
        driver.get(url)
        driver.implicitly_wait(10)

        title = driver.title
        quotes = driver.find_elements(By.CSS_SELECTOR, ".quote .text")
        texts = [q.text for q in quotes]

        return {"url": url, "title": title, "quotes": len(texts)}
    except Exception as e:
        return {"url": url, "error": str(e)}
    finally:
        # Always release the Grid slot, but only if a session was created
        if driver is not None:
            driver.quit()

urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)]

# Keep max_workers at or below the number of Chrome slots on the Grid
with ThreadPoolExecutor(max_workers=6) as executor:
    futures = {executor.submit(scrape_page, url): url for url in urls}

    for future in as_completed(futures):
        result = future.result()
        if "error" in result:
            print(f"FAILED: {result['url']}, {result['error']}")
        else:
            print(f"OK: {result['url']}, {result['quotes']} quotes")
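Pages that fail once often succeed on a second try, for example when a node restarts mid-session. One way to handle that is a small retry wrapper around the scraping function. This is a minimal sketch, not part of Selenium: `scrape_with_retries`, its parameters, and the doubling delay are all assumptions, and it relies only on the convention used above that a failed result is a dict with an "error" key.

```python
import time


def scrape_with_retries(scrape, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call scrape(url) until it returns a result without an "error" key.

    `scrape` is any callable with the same return shape as scrape_page.
    The delay doubles after each failed attempt (base_delay, 2 * base_delay, ...).
    `sleep` is injectable so the wrapper can be exercised without waiting.
    """
    result = {"url": url, "error": "not attempted"}
    for attempt in range(attempts):
        result = scrape(url)
        if "error" not in result:
            return result
        # Back off before the next attempt, but not after the last one
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return result
```

To use it with the thread pool, submit `executor.submit(scrape_with_retries, scrape_page, url)` in place of `executor.submit(scrape_page, url)`. The last error result is returned if every attempt fails, so the reporting loop works unchanged.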
Monitoring the Grid
Selenium Grid 4 provides a web dashboard at http://localhost:4444/ui. You can see active sessions, node status, and queue information.
You can also query the Grid status via API:
curl http://localhost:4444/status
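The same endpoint can be read from Python, for instance to check free capacity before starting a batch. The sketch below uses only the standard library. The helper names are hypothetical, and the payload shape (a "value" object with "ready" and a list of "nodes", each holding "slots" whose "session" is null when free) is an assumption about the Grid 4 response that you should verify against your own Grid's output.

```python
import json
from urllib.request import urlopen


def summarize_status(payload):
    """Summarize an already-parsed Grid status payload.

    Assumed shape: payload["value"] has "ready" and "nodes"; each node has
    "slots", and a slot's "session" is None when the slot is free.
    """
    value = payload.get("value", {})
    nodes = value.get("nodes", [])
    slots = [slot for node in nodes for slot in node.get("slots", [])]
    busy = sum(1 for slot in slots if slot.get("session"))
    return {
        "ready": bool(value.get("ready", False)),
        "nodes": len(nodes),
        "slots": len(slots),
        "busy": busy,
        "free": len(slots) - busy,
    }


def fetch_status(url="http://localhost:4444/status", timeout=5):
    """Fetch and parse the status endpoint; requires a running Grid."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# With the Grid running:
#   print(summarize_status(fetch_status()))
```

Keeping the parsing separate from the HTTP call makes it easy to check the summary logic against a saved response.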
Scaling Considerations
- Each browser session uses roughly 200-500MB of RAM
- Set SE_NODE_MAX_SESSIONS based on your node's available memory
- Use shm_size: "2gb" in Docker to avoid crashes from shared memory limits
- Consider Kubernetes for auto-scaling node pools
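As a back-of-the-envelope check, the memory figures above can be turned into a session budget per node. This is an illustrative sketch only: `max_sessions_for_memory` is a hypothetical helper, the 500MB default takes the upper end of the range above, and the 1024MB reserve for the OS and node process is an assumption to tune for your own hosts.

```python
def max_sessions_for_memory(total_mb, per_session_mb=500, reserved_mb=1024):
    """Estimate how many browser sessions fit in a node's memory.

    per_session_mb defaults to the upper end of the 200-500MB range;
    reserved_mb is an assumed allowance for the OS and the node process.
    """
    usable = total_mb - reserved_mb
    if usable <= 0 or per_session_mb <= 0:
        return 0
    return usable // per_session_mb


print(max_sessions_for_memory(4096))  # 6
print(max_sessions_for_memory(8192))  # 14
```

Under these assumptions, the SE_NODE_MAX_SESSIONS=4 setting used earlier fits comfortably on a node with 4GB of RAM.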
Managed Alternatives
Running Selenium Grid means maintaining infrastructure, managing Docker containers, and handling node failures yourself. Managed services such as ScraperAPI and ScrapingAnt offer comparable distributed scraping through API calls, with no infrastructure to maintain. They handle browser management, scaling, and proxy rotation for you.
Next Steps
- Learn browser automation anti-detection techniques
- Compare Playwright vs Selenium vs Puppeteer
- Explore parallel scraping with Playwright async