Scraping Central is reader-supported. When you buy through links on our site, we may earn an affiliate commission.

CI/CD for Web Scrapers

Set up continuous integration and deployment for your web scrapers using GitHub Actions with automated testing and deployment.

Deployment · #10intermediate3 min read
Share:WhatsAppLinkedIn

Scrapers break when websites change their structure. A CI/CD pipeline catches these breakages early by running tests automatically and deploying updates with confidence.

What to Test in a Scraper

Test Type What It Catches
Unit tests Parsing logic bugs after code changes
Integration tests Site structure changes that break selectors
Smoke tests Complete pipeline failures
Schema validation Missing or malformed data fields

Writing Testable Scraper Code

Separate fetching from parsing so you can test parsing independently:

# scraper.py
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass

@dataclass
class Product:
    name: str
    price: float
    url: str

def fetch_page(url: str) -> str:
    """Fetch a URL and return HTML."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

def parse_products(html: str) -> list[Product]:
    """Parse product data from HTML."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select(".product-card"):
        name = item.select_one(".product-name")
        price = item.select_one(".product-price")
        link = item.select_one("a")
        if name and price and link:
            products.append(Product(
                name=name.text.strip(),
                price=float(price.text.strip().replace("$", "")),
                url=link.get("href", ""),
            ))
    return products
# test_scraper.py
import pytest
from scraper import parse_products, Product

SAMPLE_HTML = """
<div class="product-card">
    <a href="/product/1">
        <span class="product-name">Widget A</span>
        <span class="product-price">$29.99</span>
    </a>
</div>
<div class="product-card">
    <a href="/product/2">
        <span class="product-name">Widget B</span>
        <span class="product-price">$49.99</span>
    </a>
</div>
"""

def test_parse_products():
    products = parse_products(SAMPLE_HTML)
    assert len(products) == 2
    assert products[0].name == "Widget A"
    assert products[0].price == 29.99
    assert products[1].url == "/product/2"

def test_parse_empty_html():
    products = parse_products("<html><body></body></html>")
    assert len(products) == 0

def test_product_schema():
    products = parse_products(SAMPLE_HTML)
    for p in products:
        assert isinstance(p.name, str) and len(p.name) > 0
        assert isinstance(p.price, float) and p.price > 0
        assert p.url.startswith("/")

GitHub Actions Workflow

# .github/workflows/scraper-ci.yml
name: Scraper CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    # Run integration tests daily to detect site changes
    - cron: "0 8 * * *"

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest

      - name: Run unit tests
        run: pytest tests/ -v

      - name: Run smoke test
        run: python -c "from scraper import fetch_page; print('Smoke test passed')"

  deploy:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to VPS
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.VPS_HOST }}
          username: scraper
          key: ${{ secrets.VPS_SSH_KEY }}
          script: |
            cd /home/scraper/my-scraper
            git pull origin main
            source venv/bin/activate
            pip install -r requirements.txt
            sudo systemctl restart scraper

Integration Test for Live Sites

Run this on a schedule to detect when a target site changes:

# tests/test_integration.py
import pytest
from scraper import fetch_page, parse_products

@pytest.mark.integration
def test_live_scrape():
    """Test against the actual live site."""
    html = fetch_page("https://target-site.com/products")
    products = parse_products(html)

    assert len(products) > 0, "No products found - site structure may have changed"
    for p in products:
        assert p.name, "Product name is empty"
        assert p.price > 0, f"Invalid price for {p.name}"

Deployment Strategies

Strategy Complexity Safety
Direct SSH deploy Low Low
Docker push + pull Medium Medium
Blue-green with health check High High

Tips

  • Run integration tests on a daily schedule, not just on push
  • Save HTML snapshots of target pages in your test fixtures
  • Set up Slack/email alerts when scheduled integration tests fail
  • Use environment variables for secrets, never commit API keys