Dockerizing Python Scrapers
A reproducible Docker image is the unit of deployment for a modern scraper. Multi-stage builds, slim base images, and the runtime surface area you actually need.
What you’ll learn
- Write a production-grade Dockerfile for a Python scraper.
- Use multi-stage builds to shrink the image.
- Handle dependencies, secrets, and Playwright/browser binaries cleanly.
A scraper that works on your laptop and breaks on the server is the most common production failure. Docker fixes this by shipping a single immutable artifact: code + interpreter + libs + system packages, identical everywhere.
This lesson covers the right Dockerfile for a Python scraper: not the one auto-generated by tooling, but the one written by someone who's done this ten times.
The naive Dockerfile (anti-pattern)
FROM python:3.12
WORKDIR /app
COPY . /app
RUN pip install -r requirements.txt
CMD ["python", "scraper.py"]
Three problems: the base image is ~1GB, every code change reinstalls dependencies (no layer cache), and dev tools ship to production.
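You can see the damage for yourself: docker history prints every layer with its size. A quick sketch, assuming you build the naive image and tag it scraper:naive:
docker build -t scraper:naive .
docker history scraper:naive   # look for the ~1GB base layers and the fat pip-install layer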
The production Dockerfile
# Stage 1: build dependencies
FROM python:3.12-slim AS builder
# System deps needed only for building wheels
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
libxml2-dev libxslt1-dev \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /build
# Install deps in a virtualenv we can copy across stages
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Stage 2: minimal runtime
FROM python:3.12-slim
# Runtime-only system deps
RUN apt-get update && apt-get install -y --no-install-recommends \
libxml2 libxslt1.1 ca-certificates \
tini \
&& rm -rf /var/lib/apt/lists/*
# Copy the virtualenv from the builder
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
ENV PYTHONUNBUFFERED=1 PYTHONDONTWRITEBYTECODE=1
# Non-root user
RUN useradd -m -u 1000 scraper
USER scraper
WORKDIR /home/scraper/app
COPY --chown=scraper:scraper . .
# tini handles signals cleanly
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["python", "-m", "scraper"]
What's worth noticing:
- Multi-stage build. Compile-time deps (build-essential, dev headers) live only in the builder stage. The runtime image doesn't carry them. Image typically drops from 1.2GB to ~250MB.
- python:3.12-slim instead of python:3.12. Saves ~700MB on its own.
- requirements.txt copied before code. Code changes invalidate only the final COPY layer, not the dependency install.
- Non-root user. Don't run scrapers as root in containers, even if the container is "sandboxed."
- tini as PID 1. Handles signal forwarding (Ctrl+C, SIGTERM from orchestrators) cleanly. Python doesn't do this well alone.
- PYTHONUNBUFFERED=1 so stdout streams immediately (essential for Docker log capture).
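Two quick checks that the non-root user and tini are actually wired in (a sketch; assumes the image is tagged scraper:dev):
docker run --rm scraper:dev id      # expect uid=1000(scraper), not uid=0(root)
docker run --rm -it scraper:dev     # hit Ctrl+C: with tini as PID 1 it exits within a second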
Layer caching, order matters
Docker executes Dockerfile instructions top to bottom and caches each resulting layer. A layer's cache is invalidated when its inputs change, and everything after it rebuilds too. Put slow-changing layers first:
# Slow-changing: base + system deps
FROM python:3.12-slim
RUN apt-get install -y ...
# Medium: Python deps
COPY requirements.txt .
RUN pip install ...
# Fast-changing: code
COPY . .
Editing one line of code rebuilds only the last layer in a few seconds. Editing requirements.txt rebuilds the pip-install layer. Editing the Dockerfile itself rebuilds from the changed instruction down.
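You can watch the cache at work: BuildKit prints CACHED next to layers it reuses. A sketch of a code-only rebuild (the path scraper/__main__.py is a hypothetical stand-in for your entrypoint module):
docker build -t scraper:dev .     # first build: every layer runs
touch scraper/__main__.py         # simulate a code-only edit
docker build -t scraper:dev .     # apt and pip layers print CACHED; only COPY . . re-runs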
Pinning everything
# requirements.txt, pin exact versions
scrapy==2.11.2
httpx==0.27.0
beautifulsoup4==4.12.3
lxml==5.2.2
playwright==1.45.0
For production, use a lockfile from pip-tools or uv:
pip-compile requirements.in --output-file requirements.txt
requirements.in lists your direct deps; requirements.txt is the locked transitive resolution. Same for uv pip compile. Without locks, a transitive dep update can break your build silently.
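The uv equivalent, a one-liner (assumes uv is installed):
uv pip compile requirements.in -o requirements.txt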
Playwright in Docker
Playwright needs system libraries for the browsers. Two options:
# Option 1: use Microsoft's official image
FROM mcr.microsoft.com/playwright/python:v1.45.0-jammy
# Already has Chromium, Firefox, WebKit and all system deps
# Option 2: install in your own image
FROM python:3.12-slim
RUN pip install playwright==1.45.0
RUN playwright install --with-deps chromium # downloads browser + apt deps
The Microsoft image is bigger (~2GB) but skips the install dance. For lighter scrapers that only use one browser, building your own image saves space.
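Whichever option you pick, verify the browser actually launches inside the container before shipping. A minimal smoke test using Playwright's sync API (a sketch; the URL is just an example):
# smoke_test.py: confirm Chromium starts headless inside the container
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()   # headless by default
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())             # prints "Example Domain" if the browser works
    browser.close()
Run it inside the image: docker run --rm scraper:dev python smoke_test.py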
Secrets, don't bake them in
# BAD: secrets in the image, visible to anyone with `docker history`
ENV API_KEY=sk-abc123...
# GOOD: pass at runtime
# docker run -e API_KEY=... scraper:latest
# Or use docker secrets / K8s secrets / Vault / SSM
Treat the image as a public artifact. Secrets enter at runtime, not build time.
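When a build-time secret is unavoidable (a private package index, say), BuildKit secret mounts expose it to a single RUN instruction without writing it into any layer. A sketch, assuming credentials live in a netrc file:
# In the Dockerfile (requires BuildKit):
RUN --mount=type=secret,id=netrc,target=/root/.netrc \
    pip install --no-cache-dir -r requirements.txt
# At build time:
# docker build --secret id=netrc,src=$HOME/.netrc -t scraper:dev .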
.dockerignore
Mirror your .gitignore plus build artifacts:
.git
__pycache__
*.pyc
.venv
.pytest_cache
*.log
docs/
tests/
.env
Smaller build context = faster builds, smaller images, no accidental leak of .env files.
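To confirm the context really is small, BuildKit's plain progress output reports the transfer size. A sketch, assuming the BuildKit builder:
docker build --progress=plain -t scraper:dev . 2>&1 | grep "transferring context"
# expect a few MB, not hundreds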
Healthchecks
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD curl -fsS http://localhost:8000/health || exit 1
If your scraper exposes a /health endpoint (often on the same port as Prometheus metrics), Docker can use this to detect zombies: processes that are alive but stuck. Note that the slim runtime image above doesn't include curl, so add it to the runtime apt-get line if you use this check. Kubernetes ignores Docker's HEALTHCHECK and uses its own liveness probes, but they can hit the same endpoint.
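If your scraper doesn't already expose one, a /health endpoint is a few lines of stdlib run in a daemon thread beside the main loop. A sketch; the port matches the HEALTHCHECK above:
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status = 200 if self.path == "/health" else 404
        self.send_response(status)
        self.end_headers()
        if status == 200:
            self.wfile.write(b"ok")

    def log_message(self, *args):   # keep health pings out of the scraper's logs
        pass

def start_health_server(port: int = 8000) -> None:
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()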
Build and ship
# Local build
docker build -t scraper:dev .
# Multi-arch (amd64 + arm64) for cloud and Apple Silicon
docker buildx build --platform linux/amd64,linux/arm64 \
-t myreg/scraper:1.4.2 \
--push .
Tag images with semver or git SHA. latest is for development. Production deployments reference exact tags so rollbacks are trivial.
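A common convention, a sketch: tag with the short git SHA so every image maps to exactly one commit:
docker build -t myreg/scraper:$(git rev-parse --short HEAD) .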
What to try
Take your existing Catalog108 scraper. Write the production Dockerfile above. Build it. Compare:
- docker images scraper:dev size before vs after multi-stage.
- Build time on a code-only change.
- Run the image, hit Ctrl+C, does it shut down cleanly within a second? If not, tini isn't wired in.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.