Dockerizing Your Web Scraper
Learn how to containerize your Python web scraper with Docker for consistent, portable deployment anywhere.
Docker packages your scraper and all of its dependencies into an image. A container started from that image runs the same way on your laptop, a VPS, or in the cloud.
Why Docker for Scrapers?
- Consistent environment across dev and production
- Easy deployment to any cloud provider
- Dependency isolation (no conflicts with system packages)
- Simple horizontal scaling (run multiple containers)
Basic Scraper Dockerfile
# Dockerfile
FROM python:3.12-slim
WORKDIR /app
# Install dependencies first (better caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy scraper code
COPY . .
# Create data directory
RUN mkdir -p /app/data
CMD ["python", "main.py"]
# requirements.txt
requests==2.32.3
beautifulsoup4==4.12.3
lxml==5.2.2
# main.py
import json
import time

import requests
from bs4 import BeautifulSoup


def scrape():
    url = "https://news.ycombinator.com"
    resp = requests.get(url, timeout=30)
    # Fail on HTTP errors instead of parsing an error page
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    items = []
    for link in soup.select(".titleline > a"):
        items.append({"title": link.text, "url": link.get("href", "")})
    output_path = "/app/data/latest.json"
    with open(output_path, "w") as f:
        json.dump(items, f, indent=2)
    # flush=True so the line reaches `docker logs` without buffering delays
    print(f"Scraped {len(items)} items", flush=True)


if __name__ == "__main__":
    # Scrape once an hour; log errors and keep the loop alive
    while True:
        try:
            scrape()
        except Exception as e:
            print(f"Error: {e}", flush=True)
        time.sleep(3600)
Build and Run
# Build the image
docker build -t my-scraper .
# Run with a volume for persistent data
docker run -d \
  --name scraper \
  -v "$(pwd)/data:/app/data" \
  --restart unless-stopped \
  my-scraper
# View logs
docker logs -f scraper
Dockerfile for Playwright Scrapers
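Browser-based scrapers need a larger image that ships with Chromium. Keep the image tag and the playwright version in requirements.txt on the same release (1.44.0 here) so the package finds the browser builds it expects: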
# Dockerfile.playwright
FROM mcr.microsoft.com/playwright/python:v1.44.0-jammy
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]
# requirements.txt
playwright==1.44.0
playwright-stealth==1.0.6
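The Dockerfile above runs main.py, but no browser-based version of it is shown. A minimal sketch of what it might look like with Playwright's sync API, reusing the Hacker News URL and selector from the basic scraper. The function names `links_to_items` and `scrape_with_browser` are hypothetical, and playwright-stealth is left out to keep the example short.

```python
def links_to_items(links):
    """Convert (title, href) pairs into the dict shape the basic scraper writes."""
    return [{"title": title.strip(), "url": href or ""} for title, href in links]


def scrape_with_browser(url="https://news.ycombinator.com"):
    # Imported here so links_to_items stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # The Playwright image already contains Chromium, so no install step is needed.
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=30_000)  # timeout is in milliseconds
            links = [
                (a.inner_text(), a.get_attribute("href"))
                for a in page.locator(".titleline > a").all()
            ]
        finally:
            browser.close()
    return links_to_items(links)
```

Calling `scrape_with_browser()` from main.py returns the same list of title/url dicts as the requests-based scraper, so the JSON-writing code can stay unchanged.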
Docker Compose for Multi-Service Setup
# docker-compose.yml
version: "3.8"
services:
  scraper:
    build: .
    restart: unless-stopped
    volumes:
      - ./data:/app/data
    environment:
      - PROXY_URL=http://user:pass@proxy.example.com:8080
      - SCRAPE_INTERVAL=3600
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: scraper
      POSTGRES_USER: scraper
      POSTGRES_PASSWORD: secret
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
# Start all services in the background
docker compose up -d
# Follow the scraper's logs
docker compose logs -f scraper
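The Compose file passes PROXY_URL and SCRAPE_INTERVAL to the scraper container, while the main.py shown earlier hardcodes its interval and uses no proxy. A minimal sketch of how the scraper might read those variables, assuming the names from the Compose file. `load_config` and its defaults are hypothetical, not part of the original scraper.

```python
import os


def load_config(env=None):
    """Read scraper settings from environment variables.

    PROXY_URL and SCRAPE_INTERVAL match the names in docker-compose.yml;
    the fallback values are assumptions for illustration.
    """
    env = os.environ if env is None else env
    proxy_url = env.get("PROXY_URL", "")
    interval = int(env.get("SCRAPE_INTERVAL", "3600"))
    # requests takes a scheme -> proxy URL mapping; None means "no proxy".
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None
    return {"proxies": proxies, "interval": interval}
```

In main.py this could be used as `config = load_config()`, then `requests.get(url, timeout=30, proxies=config["proxies"])` and `time.sleep(config["interval"])`. Avoid printing the config, since the proxy URL contains credentials.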
Multi-Stage Build (Smaller Image)
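# Stage 1: install dependencies into a separate prefix
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# Stage 2: copy only the installed packages into a clean runtime image
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
CMD ["python", "main.py"]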
Tips
- Use .dockerignore to exclude venv/, .git/, and data/ from the build context
- Pin dependency versions in requirements.txt for reproducible builds
- Use --restart unless-stopped for automatic recovery after crashes
- Mount volumes for data persistence; container filesystems are ephemeral
- Use environment variables for configuration (proxy URLs, API keys)
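On the volume and restart tips: a container can be stopped or crash while it is writing latest.json, which would leave a truncated file on the mounted volume. One way to guard against that, shown here as a sketch, is to write to a temporary file in the same directory and rename it into place. `write_json_atomic` is a hypothetical helper, not part of the scraper above.

```python
import json
import os
import tempfile


def write_json_atomic(items, output_path):
    """Write items as JSON so readers never see a half-written file."""
    directory = os.path.dirname(output_path) or "."
    os.makedirs(directory, exist_ok=True)
    # The temp file must be on the same filesystem (the mounted volume)
    # for os.replace to be an atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(items, f, indent=2)
        # mkstemp creates the file as 0600; relax it so the host user can read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, output_path)
    except BaseException:
        # Clean up the temp file if writing or renaming failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

In `scrape()`, the `open(...)`/`json.dump(...)` pair could then be swapped for `write_json_atomic(items, "/app/data/latest.json")`.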