CI/CD for Web Scrapers - Deployment

Set up continuous integration and deployment for your web scrapers using GitHub Actions with automated testing and deployment.

Scrapers break when websites change their structure. A CI/CD pipeline catches these breakages early by running tests automatically and deploying updates with confidence.

What to Test in a Scraper

Test Type	What It Catches
Unit tests	Parsing logic bugs after code changes
Integration tests	Site structure changes that break selectors
Smoke tests	Complete pipeline failures
Schema validation	Missing or malformed data fields

Writing Testable Scraper Code

Separate fetching from parsing so you can test parsing independently:

# scraper.py
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass

@dataclass
class Product:
    name: str
    price: float
    url: str

def fetch_page(url: str) -> str:
    """Fetch a URL and return HTML."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

def parse_products(html: str) -> list[Product]:
    """Parse product data from HTML."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select(".product-card"):
        name = item.select_one(".product-name")
        price = item.select_one(".product-price")
        link = item.select_one("a")
        if name and price and link:
            products.append(Product(
                name=name.text.strip(),
                price=float(price.text.strip().replace("$", "")),
                url=link.get("href", ""),
            ))
    return products

# test_scraper.py
import pytest
from scraper import parse_products, Product

SAMPLE_HTML = """
<div class="product-card">
    <a href="/product/1">
        <span class="product-name">Widget A</span>
        <span class="product-price">$29.99</span>
    </a>
</div>
<div class="product-card">
    <a href="/product/2">
        <span class="product-name">Widget B</span>
        <span class="product-price">$49.99</span>
    </a>
</div>
"""

def test_parse_products():
    products = parse_products(SAMPLE_HTML)
    assert len(products) == 2
    assert products[0].name == "Widget A"
    assert products[0].price == 29.99
    assert products[1].url == "/product/2"

def test_parse_empty_html():
    products = parse_products("<html><body></body></html>")
    assert len(products) == 0

def test_product_schema():
    products = parse_products(SAMPLE_HTML)
    for p in products:
        assert isinstance(p.name, str) and len(p.name) > 0
        assert isinstance(p.price, float) and p.price > 0
        assert p.url.startswith("/")

GitHub Actions Workflow

# .github/workflows/scraper-ci.yml
name: Scraper CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    # Run integration tests daily to detect site changes
    - cron: "0 8 * * *"

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest

      - name: Run unit tests
        run: pytest tests/ -v

      - name: Run smoke test
        run: python -c "from scraper import fetch_page; print('Smoke test passed')"

  deploy:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to VPS
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.VPS_HOST }}
          username: scraper
          key: ${{ secrets.VPS_SSH_KEY }}
          script: |
            cd /home/scraper/my-scraper
            git pull origin main
            source venv/bin/activate
            pip install -r requirements.txt
            sudo systemctl restart scraper

Integration Test for Live Sites

Run this on a schedule to detect when a target site changes:

# tests/test_integration.py
import pytest
from scraper import fetch_page, parse_products

@pytest.mark.integration
def test_live_scrape():
    """Test against the actual live site."""
    html = fetch_page("https://target-site.com/products")
    products = parse_products(html)

    assert len(products) > 0, "No products found - site structure may have changed"
    for p in products:
        assert p.name, "Product name is empty"
        assert p.price > 0, f"Invalid price for {p.name}"

Deployment Strategies

Strategy	Complexity	Safety
Direct SSH deploy	Low	Low
Docker push + pull	Medium	Medium
Blue-green with health check	High	High

Tips

Run integration tests on a daily schedule, not just on push
Save HTML snapshots of target pages in your test fixtures
Set up Slack/email alerts when scheduled integration tests fail
Use environment variables for secrets, never commit API keys