Scheduling Scrapers with Cron Jobs - Deployment

Learn how to schedule your Python web scrapers to run automatically using cron jobs on Linux and macOS.

Cron is the simplest way to run your scraper on a schedule. It ships with macOS and most Linux distributions (minimal container images may need it installed) and requires no additional infrastructure.

Cron Schedule Syntax

┌──────── minute (0-59)
│ ┌────── hour (0-23)
│ │ ┌──── day of month (1-31)
│ │ │ ┌── month (1-12)
│ │ │ │ ┌ day of week (0-7, 0 and 7 = Sunday)
│ │ │ │ │
* * * * *  command

Common examples:

# Every hour
0 * * * *

# Every day at 6 AM
0 6 * * *

# Every 15 minutes
*/15 * * * *

# Every Monday at 9 AM
0 9 * * 1

# Twice daily at 8 AM and 8 PM
0 8,20 * * *
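
To make the field syntax concrete, here is a minimal Python sketch that expands individual fields and checks whether a schedule fires at a given time. The helper names expand_field and cron_matches are illustrative and not part of cron or any library. The sketch only approximates cron: it handles numbers, *, lists, ranges, and steps, but not month or weekday names or shortcuts such as @hourly.

```python
from datetime import datetime


def expand_field(field, low, high):
    """Expand one cron field such as '*/15', '8,20' or '1-5' into sorted ints."""
    values = set()
    for part in field.split(","):
        # An optional "/step" suffix thins out the range, e.g. "*/15".
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(base)
        if not low <= start <= end <= high:
            raise ValueError(f"{part!r} is outside {low}-{high}")
        values.update(range(start, end + 1, step))
    return sorted(values)


def cron_matches(expression, moment):
    """Return True if a five-field cron expression fires at `moment`."""
    minute, hour, day, month, weekday = expression.split()
    # Cron accepts both 0 and 7 for Sunday; fold 7 into 0.
    weekdays = {value % 7 for value in expand_field(weekday, 0, 7)}
    day_ok = moment.day in expand_field(day, 1, 31)
    weekday_ok = moment.isoweekday() % 7 in weekdays
    # When both day fields are restricted, cron fires if either one matches.
    if day != "*" and weekday != "*":
        date_ok = day_ok or weekday_ok
    else:
        date_ok = day_ok and weekday_ok
    return (
        moment.minute in expand_field(minute, 0, 59)
        and moment.hour in expand_field(hour, 0, 23)
        and moment.month in expand_field(month, 1, 12)
        and date_ok
    )


print(expand_field("*/15", 0, 59))  # [0, 15, 30, 45]
print(cron_matches("0 9 * * 1", datetime(2024, 1, 1, 9, 0)))  # True (a Monday)
```

Checking a schedule this way before installing it can catch mistakes such as swapping the minute and hour fields.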

Setting Up a Cron Job

# Edit your crontab
crontab -e

# Add this line to run your scraper every hour.
# The logs/ directory must already exist: the shell opens cron.log before main.py starts.
0 * * * * /home/scraper/my-scraper/venv/bin/python /home/scraper/my-scraper/main.py >> /home/scraper/my-scraper/logs/cron.log 2>&1

A Cron-Friendly Scraper

Design your scraper to run once and exit (not loop):

#!/usr/bin/env python3
# main.py - Designed for cron execution

import requests
from bs4 import BeautifulSoup
import json
import logging
from datetime import datetime
from pathlib import Path

# Set up logging
LOG_DIR = Path(__file__).parent / "logs"
LOG_DIR.mkdir(exist_ok=True)

logging.basicConfig(
    filename=LOG_DIR / "scraper.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)

def scrape_and_save():
    url = "https://news.ycombinator.com"
    logging.info(f"Starting scrape of {url}")

    response = requests.get(url, timeout=30, headers={
        "User-Agent": "Mozilla/5.0 (compatible; MyCrawler/1.0)"
    })
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    items = []
    for link in soup.select(".titleline > a"):
        items.append({"title": link.text, "url": link.get("href", "")})

    # Save with timestamp
    data_dir = Path(__file__).parent / "data"
    data_dir.mkdir(exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output = data_dir / f"hn_{timestamp}.json"

    with open(output, "w") as f:
        json.dump(items, f, indent=2)

    logging.info(f"Saved {len(items)} items to {output}")

if __name__ == "__main__":
    try:
        scrape_and_save()
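    except Exception:
        # logging.exception also records the traceback, which helps debug unattended runs
        logging.exception("Scrape failed")
        raise  # re-raise so the process exits with a non-zero status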

Important Cron Gotchas

1. Always Use Absolute Paths

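Cron starts jobs with a minimal environment rather than your interactive shell: the working directory is usually your home directory, not the project folder, and PATH is typically limited to /usr/bin:/bin, so a virtualenv's python is not found: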

# WRONG - relative paths fail in cron
*/30 * * * * python main.py

# RIGHT - absolute paths everywhere
*/30 * * * * /home/scraper/my-scraper/venv/bin/python /home/scraper/my-scraper/main.py

2. Set Environment Variables

# Add environment variables at the top of your crontab.
# Cron does not read ~/.bashrc or ~/.profile, so set anything the scraper needs here.
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin
PROXY_URL=http://user:pass@proxy.example.com:8080

0 * * * * /home/scraper/my-scraper/venv/bin/python /home/scraper/my-scraper/main.py
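
Variables set in the crontab reach the scraper through its process environment. The following is a minimal sketch of how main.py might pick up PROXY_URL. The helper name proxies_from_env is hypothetical, and the returned dictionary follows the shape that the requests library accepts for its proxies argument.

```python
import os


def proxies_from_env(environ=os.environ):
    """Build a requests-style proxies mapping from PROXY_URL, or None if unset."""
    proxy_url = environ.get("PROXY_URL")
    if not proxy_url:
        # No proxy configured: requests treats proxies=None as "connect directly".
        return None
    # Route both plain and TLS traffic through the same proxy.
    return {"http": proxy_url, "https": proxy_url}


# Hypothetical usage inside scrape_and_save():
#   response = requests.get(url, timeout=30, proxies=proxies_from_env())
```

Reading the value at run time keeps the proxy credentials out of the source code, so they live only in the crontab.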

3. Prevent Overlapping Runs

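Use flock to ensure only one instance runs at a time. With -n, a run that finds the lock already held exits immediately instead of waiting. flock is part of util-linux on Linux; macOS does not include it by default: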

0 * * * * /usr/bin/flock -n /tmp/scraper.lock /home/scraper/my-scraper/venv/bin/python /home/scraper/my-scraper/main.py
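
Where the flock command is not available, the same guard can be approximated inside the script with Python's standard fcntl module (Unix only). This is a sketch under that assumption, not a drop-in replacement for the crontab line. The name single_instance is illustrative.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Yield True if this process got the lock, False if another run holds it."""
    handle = open(lock_path, "w")
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting,
            # mirroring the -n flag of the flock command.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        # Closing the file releases the lock, even if the scrape raised.
        handle.close()


# Hypothetical usage in main.py:
#   with single_instance("/tmp/scraper.lock") as acquired:
#       if acquired:
#           scrape_and_save()
```

Because the operating system drops the lock when the process exits, a crashed run does not leave a stale lock behind.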

Log Rotation

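Prevent logs from filling your disk. On Linux, a logrotate rule such as the following keeps four weekly, compressed archives of each log file: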

# /etc/logrotate.d/scraper
/home/scraper/my-scraper/logs/*.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}
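
On systems without logrotate, the scraper's own log file can be capped from inside Python instead. The sketch below swaps in the standard library's RotatingFileHandler, which rotates by size rather than by week. The function name build_rotating_logger and the size limits are assumptions chosen for illustration. It does not cover cron.log, which is written by the shell redirect rather than by Python.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(log_path, max_bytes=1_000_000, backup_count=4,
                          name="scraper"):
    """Return a logger whose file is rotated once it reaches max_bytes."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Keeps log_path plus up to backup_count older files (.1, .2, ...).
    handler = RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=backup_count
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Call the function once per process, since each call attaches another handler to the named logger.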

Monitoring Cron Jobs

Check if your cron job is running:

# View cron execution logs (Debian/Ubuntu; RHEL-based systems log to /var/log/cron)
grep CRON /var/log/syslog

# List your crontab
crontab -l
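
The system log shows that cron started the job, not that the scraper produced anything. A complementary check, sketched here, looks at how old the newest output file is. The helper names and the two-hour threshold are assumptions. The hn_*.json pattern matches the file names written by main.py above.

```python
import time
from pathlib import Path


def newest_output_age(data_dir, pattern="hn_*.json", now=None):
    """Return the age in seconds of the newest matching file, or None if none exist."""
    now = time.time() if now is None else now
    mtimes = [path.stat().st_mtime for path in Path(data_dir).glob(pattern)]
    if not mtimes:
        return None
    return now - max(mtimes)


def is_stale(data_dir, max_age_seconds=2 * 3600, now=None):
    """True if no output exists or the newest file is older than the threshold."""
    age = newest_output_age(data_dir, now=now)
    return age is None or age > max_age_seconds
```

For an hourly job, a threshold of about two intervals tolerates one slow run without hiding a real outage.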

For production scrapers, consider adding email or Slack notifications on failure to catch issues early.
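
As one possible shape for such a notification, the sketch below posts a short message to a Slack incoming webhook using only the standard library. The SLACK_WEBHOOK_URL variable, the function names, and the message wording are assumptions. The webhook URL itself comes from your own Slack workspace configuration.

```python
import json
import os
import urllib.request


def build_failure_payload(job_name, error):
    """Format the JSON body that Slack incoming webhooks accept."""
    return {"text": f"{job_name} failed: {type(error).__name__}: {error}"}


def notify_failure(job_name, error, webhook_url=None,
                   opener=urllib.request.urlopen):
    """Post a failure message; return False when no webhook is configured."""
    # SLACK_WEBHOOK_URL is an assumed variable name; set it in your crontab.
    webhook_url = webhook_url or os.environ.get("SLACK_WEBHOOK_URL")
    if not webhook_url:
        return False
    body = json.dumps(build_failure_payload(job_name, error)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    # A timeout keeps a slow webhook from hanging the cron job.
    opener(request, timeout=10).close()
    return True
```

Calling notify_failure from the except block in main.py, with the caught exception and before re-raising, sends the alert while the run still exits with an error status.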