CI/CD for Scrapers (GitHub Actions for Python and PHP)
Automated test, build, and deploy pipelines for scraping projects: the pipeline that catches selector breakage before it hits production.
What you’ll learn
- Write a GitHub Actions workflow that tests, builds, and pushes a scraper image.
- Add target-fixture tests that catch HTML changes.
- Deploy to staging on merge to main; gate production behind manual approval.
Code changes break scrapers in two ways: bugs in your logic, and accidental drift in target HTML expectations. A good CI pipeline catches both before they hit production.
GitHub Actions is the default; the patterns transfer to GitLab CI, CircleCI, Jenkins, and others.
Pipeline shape
push / PR → lint → unit tests → fixture tests → build image → push to registry
                                                      ↓
                                              deploy to staging
                                                      ↓
                                            manual approval → prod
Lint and unit tests take seconds. Fixture tests take tens of seconds. Build and push take minutes. Staging and prod are policy gates.
Python workflow
.github/workflows/scraper.yml:
name: scraper
on:
  push: {branches: [main]}
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.12", cache: "pip"}
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: ruff check .
      - run: ruff format --check .
      - run: mypy scraper/
      - run: pytest tests/ --maxfail=3 -q

  build:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write   # lets GITHUB_TOKEN push to ghcr.io
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          platforms: linux/amd64,linux/arm64
          push: true
          tags: |
            ghcr.io/${{ github.repository }}/scraper:${{ github.sha }}
            ghcr.io/${{ github.repository }}/scraper:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - run: |
          # --fail makes the step fail on HTTP errors instead of passing silently
          curl --fail -X POST -H "Authorization: Bearer ${{ secrets.DEPLOY_TOKEN }}" \
            "https://deploy.internal/staging/scraper?tag=${{ github.sha }}"

  deploy-prod:
    needs: deploy-staging
    environment: production  # GitHub Environments → manual approval required
    runs-on: ubuntu-latest
    steps:
      - run: |
          curl --fail -X POST -H "Authorization: Bearer ${{ secrets.DEPLOY_TOKEN }}" \
            "https://deploy.internal/prod/scraper?tag=${{ github.sha }}"
Key features:
- Cached pip dependencies speed up reruns.
- Multi-arch build (amd64 + arm64); see the verification one-liner below.
- The GHA build cache dramatically speeds up Docker builds: with no dependency changes, a build typically drops from ~5 min to ~30 s.
- GitHub Environments gate prod behind required reviewers.
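To check that both architectures actually landed, docker buildx imagetools inspect lists every platform in the pushed manifest (substitute your own repository path):

docker buildx imagetools inspect ghcr.io/OWNER/REPO/scraper:latest
# expect manifest entries for linux/amd64 and linux/arm64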
PHP / Symfony workflow
name: scraper-php
on:
  push: {branches: [main]}
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env: {POSTGRES_PASSWORD: ci}
        ports: ["5432:5432"]
        options: --health-cmd pg_isready --health-interval 5s
    steps:
      - uses: actions/checkout@v4
      - uses: shivammathur/setup-php@v2
        with:
          php-version: "8.3"
          extensions: pdo_pgsql, intl, zip
          tools: composer
          coverage: none
      - run: composer install --prefer-dist --no-progress
      - run: vendor/bin/phpstan analyse src
      - run: vendor/bin/php-cs-fixer check
      - run: php bin/phpunit
        env:
          DATABASE_URL: postgresql://postgres:ci@localhost:5432/postgres

  build:
    needs: test
    if: github.ref == 'refs/heads/main'
    # ... same as the Python build job
GitHub Actions' services: block runs Postgres as a sidecar container, so your tests run against a real database.
Fixture-based target tests
The most valuable scraper test isn't a unit test; it's a fixture test: given this saved HTML, the parser extracts these items.
# tests/fixtures/catalog108_product_2026_05_12.html ← saved snapshot
# tests/test_parser.py
from pathlib import Path

from scraper.parser import parse_product  # wherever parse_product lives


def test_product_parser():
    html = (Path(__file__).parent / "fixtures/catalog108_product_2026_05_12.html").read_text()
    items = list(parse_product(html))
    assert items == [{
        "title": "Stainless Blender",
        "price_cents": 4999,
        "in_stock": True,
        "url": "https://practice.scrapingcentral.com/products/1042",
    }]
When you change parsing logic, this test guards against regressions. When the target changes, the test fails on the next fixture refresh, so the breakage is caught locally rather than in production.
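For context, a minimal sketch of what such a parse_product could look like, assuming BeautifulSoup and made-up selectors (.product-card, .title, and so on stand in for whatever the target actually uses):

# scraper/parser.py — illustrative sketch; all selectors are assumptions
from bs4 import BeautifulSoup

def parse_product(html: str):
    soup = BeautifulSoup(html, "html.parser")
    for card in soup.select(".product-card"):        # hypothetical selector
        price_text = card.select_one(".price").text  # e.g. "$49.99"
        yield {
            "title": card.select_one(".title").text.strip(),
            "price_cents": int(round(float(price_text.strip("$ ")) * 100)),
            "in_stock": card.select_one(".in-stock") is not None,
            "url": card.select_one("a")["href"],
        }

Keeping the extraction in one generator like this means a failing fixture test pinpoints exactly which field's selector drifted.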
Refresh fixtures periodically:
# scripts/refresh_fixtures.py
import httpx

for url in FIXTURE_URLS:  # the pages whose parsed output the tests pin
    resp = httpx.get(url)
    resp.raise_for_status()  # never overwrite a fixture with an error page
    (FIXTURE_DIR / safe_filename(url)).write_text(resp.text)
Run it weekly, e.g. from the scheduled workflow sketched below. PRs that update fixtures and parser together signal a real change on the target.
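One way to schedule that, sketched with the third-party peter-evans/create-pull-request action; the cron expression and branch name are placeholders:

# .github/workflows/refresh-fixtures.yml
name: refresh-fixtures
on:
  schedule:
    - cron: "0 6 * * 1"   # Mondays, 06:00 UTC
  workflow_dispatch:       # allow manual refreshes too
jobs:
  refresh:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.12", cache: "pip"}
      - run: pip install -r requirements.txt
      - run: python scripts/refresh_fixtures.py
      - uses: peter-evans/create-pull-request@v6
        with:
          branch: chore/refresh-fixtures
          title: "chore: refresh HTML fixtures"
          commit-message: "chore: refresh HTML fixtures"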
Smoke tests against staging
After deploy-staging:
  smoke:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.12", cache: "pip"}
      - run: pip install -r requirements.txt
      - run: |
          # Hit the staging health endpoint
          curl --fail https://staging-scraper.internal/health
          # Run a small live scrape; assert items > 0
          python -m scraper.smoke --target staging
Point deploy-prod at needs: smoke (instead of needs: deploy-staging) so a failed live smoke halts the pipeline before anything is promoted to prod.
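A plausible shape for that scraper.smoke module; scrape_sample and the base URLs are assumptions, not part of the workflow above:

# scraper/smoke.py — hedged sketch; scrape_sample and the URLs are assumptions
import argparse
import sys

from scraper.runner import scrape_sample  # assumed helper: scrapes a few pages

BASE_URLS = {
    "staging": "https://staging-scraper.internal",
    "prod": "https://scraper.internal",   # hypothetical prod endpoint
}

def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--target", choices=BASE_URLS, default="staging")
    args = parser.parse_args()

    items = scrape_sample(BASE_URLS[args.target])
    if not items:
        print("smoke failed: scrape returned 0 items", file=sys.stderr)
        return 1
    print(f"smoke ok: {len(items)} items extracted")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())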
Secrets in GitHub Actions
Use ${{ secrets.NAME }}. Never echo secrets into logs; GitHub masks them automatically, but only for exact string matches. For production secrets, prefer OIDC federation with your cloud provider over long-lived tokens.
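If production credentials live in AWS, for example, the official aws-actions/configure-aws-credentials action exchanges the workflow's OIDC token for short-lived credentials; the role ARN below is a placeholder:

deploy-prod:
  environment: production
  runs-on: ubuntu-latest
  permissions:
    id-token: write    # required for the job to request an OIDC token
    contents: read
  steps:
    - uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: arn:aws:iam::123456789012:role/scraper-deploy  # placeholder
        aws-region: eu-west-1
    # later steps hold short-lived AWS credentials; no stored long-lived secret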
Reproducibility from CI
A CI build should be reproducible: same input, same output. Things that break this:
- Unpinned base images (python:3.12 may drift). Pin to digests for maximum reproducibility: python:3.12@sha256:... (see the Dockerfile sketch below).
- Latest-version dependencies on each install. Use a lockfile (requirements.txt from pip-compile, composer.lock).
- System time, randomness, and network access. Avoid these in builds; test setup should mock them.
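A minimal Dockerfile sketch combining both pins; the digest is elided, and the file layout is an assumption about the project:

# Dockerfile — base pinned to a digest, dependencies from a lockfile
FROM python:3.12@sha256:...   # fill in the real digest

WORKDIR /app
COPY requirements.txt .
# requirements.txt comes from pip-compile, so every version is exact
RUN pip install --no-cache-dir -r requirements.txt

COPY scraper/ scraper/
CMD ["python", "-m", "scraper"]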
Speeding up CI
| Optimization | Typical gain |
|---|---|
| Cache pip / composer deps | 30–60 s |
| Cache Docker layers (GHA cache) | 2–5 min |
| Run lint/test/build in parallel jobs | up to 3× wall time |
| Skip CI on docs-only changes (see below) | 100% |
| Self-hosted runners for big builds | 2–3× |
A well-tuned scraper CI completes in 3–5 minutes for a typical commit.
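The docs-only skip from the table is a two-line trigger filter; the ignored paths are examples:

on:
  push:
    branches: [main]
    paths-ignore: ["docs/**", "**.md"]
  pull_request:
    paths-ignore: ["docs/**", "**.md"]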
What to try
Set up the Python workflow above on your Catalog108 scraper. Then deliberately break a CSS selector and commit. The fixture test should fail loudly in CI. Fix it; the PR turns green. That's the loop you want every day.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.