CI/CD for Web Scrapers
Set up continuous integration and deployment for your web scrapers using GitHub Actions with automated testing and deployment.
Scrapers break when websites change their structure. A CI/CD pipeline catches these breakages early by running tests automatically and deploying updates with confidence.
What to Test in a Scraper
| Test Type | What It Catches |
|---|---|
| Unit tests | Parsing logic bugs after code changes |
| Integration tests | Site structure changes that break selectors |
| Smoke tests | Complete pipeline failures |
| Schema validation | Missing or malformed data fields |
Writing Testable Scraper Code
Separate fetching from parsing so you can test parsing independently:
# scraper.py
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: float
    url: str


def fetch_page(url: str) -> str:
    """Fetch a URL and return HTML."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def parse_products(html: str) -> list[Product]:
    """Parse product data from HTML."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select(".product-card"):
        name = item.select_one(".product-name")
        price = item.select_one(".product-price")
        link = item.select_one("a")
        if name and price and link:
            products.append(Product(
                name=name.text.strip(),
                price=float(price.text.strip().replace("$", "")),
                url=link.get("href", ""),
            ))
    return products
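The price conversion in `parse_products` only strips a `$` sign, so a value such as `$1,299.00` would raise a `ValueError`. A minimal sketch of a more tolerant conversion is shown below. The helper name `parse_price` is hypothetical (it is not part of the original `scraper.py`), and it assumes US-style formatting with `,` as the thousands separator and `.` as the decimal point.

```python
import re
from typing import Optional

# Hypothetical helper, not part of the original scraper.py.
# Matches a run of digits with optional thousands commas and a decimal part.
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text: str) -> Optional[float]:
    """Return the first number found in a price string, or None if there is none.

    Assumes "," is a thousands separator and "." is the decimal point.
    """
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))
```

Returning `None` instead of raising lets the caller decide whether a missing price should skip the product or fail the run.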
# tests/test_scraper.py
import pytest
from scraper import parse_products

SAMPLE_HTML = """
<div class="product-card">
  <a href="/product/1">
    <span class="product-name">Widget A</span>
    <span class="product-price">$29.99</span>
  </a>
</div>
<div class="product-card">
  <a href="/product/2">
    <span class="product-name">Widget B</span>
    <span class="product-price">$49.99</span>
  </a>
</div>
"""


def test_parse_products():
    products = parse_products(SAMPLE_HTML)
    assert len(products) == 2
    assert products[0].name == "Widget A"
    # Compare floats with a tolerance rather than exact equality
    assert products[0].price == pytest.approx(29.99)
    assert products[1].url == "/product/2"


def test_parse_empty_html():
    products = parse_products("<html><body></body></html>")
    assert len(products) == 0


def test_product_schema():
    products = parse_products(SAMPLE_HTML)
    for p in products:
        assert isinstance(p.name, str) and len(p.name) > 0
        assert isinstance(p.price, float) and p.price > 0
        assert p.url.startswith("/")
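The schema checks in `test_product_schema` could also be factored into a reusable function, so the same rules can run in unit tests, in live tests, and before data is stored. The following is a sketch under that assumption. `validate_product` is a hypothetical name, and the `Product` class is repeated here only to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class Product:  # mirrors the Product dataclass in scraper.py
    name: str
    price: float
    url: str


def validate_product(product: Product) -> list[str]:
    """Return a list of problems found; an empty list means the record is valid."""
    errors = []
    if not isinstance(product.name, str) or not product.name.strip():
        errors.append("name is missing or empty")
    if not isinstance(product.price, float) or product.price <= 0:
        errors.append(f"price must be a positive float, got {product.price!r}")
    # Same rule as the test above: relative URLs starting with "/"
    if not isinstance(product.url, str) or not product.url.startswith("/"):
        errors.append(f"url looks malformed: {product.url!r}")
    return errors
```

A test can then assert `validate_product(p) == []` for each parsed product, and the failure message lists every broken field at once instead of stopping at the first one.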
GitHub Actions Workflow
# .github/workflows/scraper-ci.yml
name: Scraper CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    # Run integration tests daily (08:00 UTC) to detect site changes
    - cron: "0 8 * * *"

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest

      # "python -m pytest" adds the repository root to sys.path,
      # so files under tests/ can import scraper.py
      - name: Run unit tests
        run: python -m pytest tests/ -v -m "not integration"

      # Only the scheduled run hits the live site
      - name: Run integration tests
        if: github.event_name == 'schedule'
        run: python -m pytest tests/ -v -m integration

      - name: Run smoke test
        run: python -c "from scraper import fetch_page; print('Smoke test passed')"

  deploy:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to VPS
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.VPS_HOST }}
          username: scraper
          key: ${{ secrets.VPS_SSH_KEY }}
          script: |
            set -e  # stop at the first failing command
            cd /home/scraper/my-scraper
            git pull origin main
            source venv/bin/activate
            pip install -r requirements.txt
            sudo systemctl restart scraper
Integration Test for Live Sites
Run this on a schedule to detect when a target site changes:
# tests/test_integration.py
import pytest
from scraper import fetch_page, parse_products


@pytest.mark.integration
def test_live_scrape():
    """Test against the actual live site."""
    html = fetch_page("https://target-site.com/products")
    products = parse_products(html)
    assert len(products) > 0, "No products found - site structure may have changed"
    for p in products:
        assert p.name, "Product name is empty"
        assert p.price > 0, f"Invalid price for {p.name}"
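The `integration` marker used above is a custom marker, and pytest warns about markers it does not know. One way to register it is a `pytest_configure` hook in a `conftest.py` file. The sketch below assumes the file sits at the repository root, and the marker description text is only an example.

```python
# conftest.py (assumed to live at the repository root)


def pytest_configure(config):
    """Register the custom marker so pytest does not warn about it."""
    config.addinivalue_line(
        "markers",
        "integration: tests that fetch the live target site",
    )
```

With the marker registered, `pytest -m integration` selects only the live tests and `pytest -m "not integration"` excludes them, which keeps network access out of ordinary push and pull request runs.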
Deployment Strategies
| Strategy | Complexity | Safety |
|---|---|---|
| Direct SSH deploy | Low | Low |
| Docker push + pull | Medium | Medium |
| Blue-green with health check | High | High |
Tips
- Run integration tests on a daily schedule, not just on push
- Save HTML snapshots of target pages in your test fixtures
- Set up Slack/email alerts when scheduled integration tests fail
- Use environment variables for secrets, and never commit API keys
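The snapshot tip can be sketched with two small helpers: one that stores fetched HTML as a fixture file and one that reads it back in a test. The function names and the `tests/fixtures` directory are assumptions made for illustration, not part of the original project layout.

```python
from pathlib import Path

# Assumed fixture location; adjust to your repository layout.
FIXTURE_DIR = Path("tests/fixtures")


def save_snapshot(name: str, html: str, fixture_dir: Path = FIXTURE_DIR) -> Path:
    """Write HTML to <fixture_dir>/<name>.html and return the file path."""
    fixture_dir.mkdir(parents=True, exist_ok=True)
    path = fixture_dir / f"{name}.html"
    path.write_text(html, encoding="utf-8")
    return path


def load_snapshot(name: str, fixture_dir: Path = FIXTURE_DIR) -> str:
    """Read a previously saved snapshot back as a string."""
    return (fixture_dir / f"{name}.html").read_text(encoding="utf-8")
```

A unit test could then call `parse_products(load_snapshot("products"))` to exercise the parser against real markup without touching the network. When a scheduled integration test fails, saving a fresh snapshot gives you a reproducible input for fixing the selectors.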