CI/CD for Scrapers (GitHub Actions for Python and PHP)
Automated test, build, and deploy pipelines for scraping projects: the pipeline that catches selector breakage before it hits production.
What you’ll learn
- Write a GitHub Actions workflow that tests, builds, and pushes a scraper image.
- Add target-fixture tests that catch HTML changes.
- Deploy to staging on merge to main; gate production behind manual approval.
Code changes break scrapers in two ways: bugs in your logic, and accidental drift in target HTML expectations. A good CI pipeline catches both before they hit production.
GitHub Actions is the default; the patterns transfer to GitLab CI, CircleCI, Jenkins, and others.
Pipeline shape
push / PR → lint → unit tests → fixture tests → build image → push to registry
                                                      ↓
                                              deploy to staging
                                                      ↓
                                            manual approval → prod
Lint and unit tests take seconds. Fixture tests take tens of seconds. Build and push take minutes. Staging and prod are policy gates.
Python workflow
.github/workflows/scraper.yml:
name: scraper
on:
  push: {branches: [main]}
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.12", cache: "pip"}
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: ruff check .
      - run: ruff format --check .
      - run: mypy scraper/
      - run: pytest tests/ --maxfail=3 -q

  build:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write   # lets GITHUB_TOKEN push to ghcr.io
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          platforms: linux/amd64,linux/arm64
          push: true
          tags: |
            ghcr.io/${{ github.repository }}/scraper:${{ github.sha }}
            ghcr.io/${{ github.repository }}/scraper:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - run: |
          # --fail makes the step fail on HTTP errors instead of passing silently
          curl --fail -X POST -H "Authorization: Bearer ${{ secrets.DEPLOY_TOKEN }}" \
            "https://deploy.internal/staging/scraper?tag=${{ github.sha }}"

  deploy-prod:
    needs: deploy-staging
    environment: production  # GitHub Environments → manual approval required
    runs-on: ubuntu-latest
    steps:
      - run: |
          curl --fail -X POST -H "Authorization: Bearer ${{ secrets.DEPLOY_TOKEN }}" \
            "https://deploy.internal/prod/scraper?tag=${{ github.sha }}"
Key features:
- Cached pip dependencies speed up reruns.
- Multi-arch build (amd64 + arm64); see the verification one-liner below.
- The GHA build cache dramatically speeds up Docker builds: with no dependency changes, a build typically drops from ~5 min to ~30 s.
- GitHub Environments gate prod behind required reviewers.
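To check that both architectures actually landed, docker buildx imagetools inspect lists every platform in the pushed manifest (substitute your own repository path):

docker buildx imagetools inspect ghcr.io/OWNER/REPO/scraper:latest
# expect manifest entries for linux/amd64 and linux/arm64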
PHP / Symfony workflow
name: scraper-php
on:
  push: {branches: [main]}
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env: {POSTGRES_PASSWORD: ci}
        ports: ["5432:5432"]
        options: --health-cmd pg_isready --health-interval 5s
    steps:
      - uses: actions/checkout@v4
      - uses: shivammathur/setup-php@v2
        with:
          php-version: "8.3"
          extensions: pdo_pgsql, intl, zip
          tools: composer
          coverage: none
      - run: composer install --prefer-dist --no-progress
      - run: vendor/bin/phpstan analyse src
      - run: vendor/bin/php-cs-fixer check
      - run: php bin/phpunit
        env:
          DATABASE_URL: postgresql://postgres:ci@localhost:5432/postgres

  build:
    needs: test
    if: github.ref == 'refs/heads/main'
    # ... same as the Python build job
GitHub Actions' services: block runs Postgres as a sidecar container, so your tests run against a real database.
Fixture-based target tests
The most valuable scraper test isn't a unit test; it's a fixture test: given this saved HTML, the parser extracts these items.
# tests/fixtures/catalog108_product_2026_05_12.html ← saved snapshot
# tests/test_parser.py
from pathlib import Path

from scraper.parser import parse_product  # wherever parse_product lives


def test_product_parser():
    html = (Path(__file__).parent / "fixtures/catalog108_product_2026_05_12.html").read_text()
    items = list(parse_product(html))
    assert items == [{
        "title": "Stainless Blender",
        "price_cents": 4999,
        "in_stock": True,
        "url": "https://practice.scrapingcentral.com/products/1042",
    }]
When you change parsing logic, this test guards against regressions. When the target changes, the test fails on the next fixture refresh, so the breakage is caught locally rather than in production.
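For context, a minimal sketch of what such a parse_product could look like, assuming BeautifulSoup and made-up selectors (.product-card, .title, and so on stand in for whatever the target actually uses):

# scraper/parser.py — illustrative sketch; all selectors are assumptions
from bs4 import BeautifulSoup

def parse_product(html: str):
    soup = BeautifulSoup(html, "html.parser")
    for card in soup.select(".product-card"):        # hypothetical selector
        price_text = card.select_one(".price").text  # e.g. "$49.99"
        yield {
            "title": card.select_one(".title").text.strip(),
            "price_cents": int(round(float(price_text.strip("$ ")) * 100)),
            "in_stock": card.select_one(".in-stock") is not None,
            "url": card.select_one("a")["href"],
        }

Keeping the extraction in one generator like this means a failing fixture test pinpoints exactly which field's selector drifted.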
Refresh fixtures periodically:
# scripts/refresh_fixtures.py
import httpx

for url in FIXTURE_URLS:  # the pages whose parsed output the tests pin
    resp = httpx.get(url)
    resp.raise_for_status()  # never overwrite a fixture with an error page
    (FIXTURE_DIR / safe_filename(url)).write_text(resp.text)
Run it weekly, e.g. from the scheduled workflow sketched below. PRs that update fixtures and parser together signal a real change on the target.
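One way to schedule that, sketched with the third-party peter-evans/create-pull-request action; the cron expression and branch name are placeholders:

# .github/workflows/refresh-fixtures.yml
name: refresh-fixtures
on:
  schedule:
    - cron: "0 6 * * 1"   # Mondays, 06:00 UTC
  workflow_dispatch:       # allow manual refreshes too
jobs:
  refresh:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.12", cache: "pip"}
      - run: pip install -r requirements.txt
      - run: python scripts/refresh_fixtures.py
      - uses: peter-evans/create-pull-request@v6
        with:
          branch: chore/refresh-fixtures
          title: "chore: refresh HTML fixtures"
          commit-message: "chore: refresh HTML fixtures"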
Smoke tests against staging
After deploy-staging:
  smoke:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.12", cache: "pip"}
      - run: pip install -r requirements.txt
      - run: |
          # Hit the staging health endpoint
          curl --fail https://staging-scraper.internal/health
          # Run a small live scrape; assert items > 0
          python -m scraper.smoke --target staging
Point deploy-prod at needs: smoke (instead of needs: deploy-staging) so a failed live smoke halts the pipeline before anything is promoted to prod.
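A plausible shape for that scraper.smoke module; scrape_sample and the base URLs are assumptions, not part of the workflow above:

# scraper/smoke.py — hedged sketch; scrape_sample and the URLs are assumptions
import argparse
import sys

from scraper.runner import scrape_sample  # assumed helper: scrapes a few pages

BASE_URLS = {
    "staging": "https://staging-scraper.internal",
    "prod": "https://scraper.internal",   # hypothetical prod endpoint
}

def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--target", choices=BASE_URLS, default="staging")
    args = parser.parse_args()

    items = scrape_sample(BASE_URLS[args.target])
    if not items:
        print("smoke failed: scrape returned 0 items", file=sys.stderr)
        return 1
    print(f"smoke ok: {len(items)} items extracted")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())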
Secrets in GitHub Actions
Use ${{ secrets.NAME }}. Never echo secrets into logs; GitHub masks them automatically, but only for exact string matches. For production secrets, prefer OIDC federation with your cloud provider over long-lived tokens.
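If production credentials live in AWS, for example, the official aws-actions/configure-aws-credentials action exchanges the workflow's OIDC token for short-lived credentials; the role ARN below is a placeholder:

deploy-prod:
  environment: production
  runs-on: ubuntu-latest
  permissions:
    id-token: write    # required for the job to request an OIDC token
    contents: read
  steps:
    - uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: arn:aws:iam::123456789012:role/scraper-deploy  # placeholder
        aws-region: eu-west-1
    # later steps hold short-lived AWS credentials; no stored long-lived secret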
Reproducibility from CI
A CI build should be reproducible: same input, same output. Things that break this:
- Unpinned base images (python:3.12 may drift). Pin to digests for maximum reproducibility: python:3.12@sha256:... (see the Dockerfile sketch below).
- Latest-version dependencies on each install. Use a lockfile (requirements.txt from pip-compile, composer.lock).
- System time, randomness, and network access. Avoid these in builds; test setup should mock them.
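A minimal Dockerfile sketch combining both pins; the digest is elided, and the file layout is an assumption about the project:

# Dockerfile — base pinned to a digest, dependencies from a lockfile
FROM python:3.12@sha256:...   # fill in the real digest

WORKDIR /app
COPY requirements.txt .
# requirements.txt comes from pip-compile, so every version is exact
RUN pip install --no-cache-dir -r requirements.txt

COPY scraper/ scraper/
CMD ["python", "-m", "scraper"]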
Speeding up CI
| Optimization | Typical gain |
|---|---|
| Cache pip / composer deps | 30–60 s |
| Cache Docker layers (GHA cache) | 2–5 min |
| Run lint/test/build in parallel jobs | up to 3× wall time |
| Skip CI on docs-only changes (see below) | 100% |
| Self-hosted runners for big builds | 2–3× |
A well-tuned scraper CI completes in 3–5 minutes for a typical commit.
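The docs-only skip from the table is a two-line trigger filter; the ignored paths are examples:

on:
  push:
    branches: [main]
    paths-ignore: ["docs/**", "**.md"]
  pull_request:
    paths-ignore: ["docs/**", "**.md"]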
What to try
Set up the Python workflow above on your Catalog108 scraper. Then deliberately break a CSS selector and commit. The fixture test should fail loudly in CI. Fix it; the PR turns green. That's the loop you want every day.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.