Dockerizing Your Web Scraper - Deployment

Learn how to containerize your Python web scraper with Docker for consistent, portable deployment anywhere.

Docker packages your scraper and all its dependencies into a container that runs the same way everywhere: on your laptop, on a VPS, or in the cloud.

Why Docker for Scrapers?

Consistent environment across dev and production
Easy deployment to any cloud provider
Dependency isolation (no conflicts with system packages)
Simple horizontal scaling (run multiple containers)

Basic Scraper Dockerfile

# Dockerfile
FROM python:3.12-slim

# Send print() output straight to `docker logs` instead of buffering it
ENV PYTHONUNBUFFERED=1

WORKDIR /app

# Install dependencies first (better caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy scraper code
COPY . .

# Create data directory
RUN mkdir -p /app/data

CMD ["python", "main.py"]

# requirements.txt
requests==2.32.3
beautifulsoup4==4.12.3
lxml==5.2.2

# main.py
import requests
from bs4 import BeautifulSoup
import json
import time
import os

def scrape():
    url = "https://news.ycombinator.com"
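    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(resp.text, "lxml")  # lxml is pinned in requirements.txt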

    items = []
    for link in soup.select(".titleline > a"):
        items.append({"title": link.text, "url": link.get("href", "")})

    output_path = "/app/data/latest.json"
    with open(output_path, "w") as f:
        json.dump(items, f, indent=2)
    print(f"Scraped {len(items)} items")

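if __name__ == "__main__":
    # Interval in seconds; override it with the SCRAPE_INTERVAL environment variable
    interval = int(os.environ.get("SCRAPE_INTERVAL", "3600"))
    while True:
        try:
            scrape()
        except Exception as e:
            print(f"Error: {e}")
        time.sleep(interval)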

Build and Run

# Build the image
docker build -t my-scraper .

# Run with a volume for persistent data
docker run -d \
    --name scraper \
    -v "$(pwd)/data:/app/data" \
    --restart unless-stopped \
    my-scraper

# View logs
docker logs -f scraper
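
docker logs shows whatever the container writes to stdout and stderr. As a hedged sketch, main.py could replace its print calls with the standard logging module so that every line carries a timestamp and a level. The configure_logging helper and the "scraper" logger name are illustrative assumptions, not part of the original code:

```python
import logging
import sys


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes timestamped lines to stdout."""
    logger = logging.getLogger("scraper")  # hypothetical logger name
    logger.setLevel(level)
    # Attach the handler only once, even if this function is called again
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


if __name__ == "__main__":
    log = configure_logging()
    log.info("Scraped %d items", 30)
    try:
        raise RuntimeError("example failure")
    except RuntimeError:
        # log.exception records the message plus the full traceback
        log.exception("Scrape failed")
```

StreamHandler flushes after every record, so these lines should appear in docker logs without waiting for Python's output buffer to fill.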

Dockerfile for Playwright Scrapers

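Browser-based scrapers need an image that ships Chromium and its system libraries. The official Playwright image includes them; keep its tag in step with the playwright version pinned in requirements.txt: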

# Dockerfile.playwright
FROM mcr.microsoft.com/playwright/python:v1.44.0-jammy

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "main.py"]

# requirements.txt
playwright==1.44.0
playwright-stealth==1.0.6
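
The Dockerfile above runs main.py, which is not shown for the browser-based case. A minimal sketch of what it might contain, assuming the same Hacker News page and selector as the requests example; the scrape_titles function name is hypothetical, and playwright-stealth is left out to keep the example short:

```python
import json


def scrape_titles(url: str = "https://news.ycombinator.com") -> list:
    """Render the page in headless Chromium and collect the title links."""
    # Imported inside the function so this module can still be imported
    # on a machine where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=30_000)  # timeout is in milliseconds
            items = [
                {
                    "title": link.inner_text(),
                    "url": link.get_attribute("href") or "",
                }
                for link in page.locator(".titleline > a").all()
            ]
        finally:
            # Always release the browser process, even if navigation fails
            browser.close()
    return items


if __name__ == "__main__":
    print(json.dumps(scrape_titles(), indent=2))
```

Inside the Playwright image the browsers are already installed, so no separate playwright install step should be needed.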

Docker Compose for Multi-Service Setup

# docker-compose.yml
services:
  scraper:
    build: .
    restart: unless-stopped
    volumes:
      - ./data:/app/data
    environment:
      - PROXY_URL=http://user:pass@proxy.example.com:8080
      - SCRAPE_INTERVAL=3600

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: scraper
      POSTGRES_USER: scraper
      POSTGRES_PASSWORD: secret
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

docker compose up -d
docker compose logs -f scraper
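
The scraper service above receives PROXY_URL and SCRAPE_INTERVAL as environment variables. A minimal sketch of how the Python code might read them follows; the Settings class and the load_settings and requests_proxies helpers are hypothetical names introduced for illustration:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    proxy_url: Optional[str]  # None means "connect directly"
    scrape_interval: int      # seconds between scrape runs


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read configuration from the environment, with safe defaults."""
    env = os.environ if env is None else env
    proxy_url = env.get("PROXY_URL") or None  # treat an empty string as unset
    interval = int(env.get("SCRAPE_INTERVAL", "3600"))
    if interval <= 0:
        raise ValueError("SCRAPE_INTERVAL must be a positive number of seconds")
    return Settings(proxy_url=proxy_url, scrape_interval=interval)


def requests_proxies(settings: Settings) -> dict:
    """Build the mapping that requests accepts in its proxies= argument."""
    if settings.proxy_url is None:
        return {}
    return {"http": settings.proxy_url, "https": settings.proxy_url}


if __name__ == "__main__":
    settings = load_settings()
    print(settings.scrape_interval, requests_proxies(settings))
```

The result could then be passed along as requests.get(url, timeout=30, proxies=requests_proxies(settings)). Accepting an optional mapping instead of reading os.environ directly keeps the function easy to test without touching the real environment.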

Multi-Stage Build (Smaller Image)

FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
CMD ["python", "main.py"]

Tips

Use .dockerignore to exclude venv/, .git/, and data/ from the build context
Pin dependency versions in requirements.txt for reproducible builds
Use --restart unless-stopped for automatic recovery after crashes
Mount volumes for data persistence; container filesystems are ephemeral
Use environment variables for configuration (proxy URLs, API keys)