
Scheduling Scrapers with Cron Jobs

Learn how to schedule your Python web scrapers to run automatically using cron jobs on Linux and macOS.

Deployment · #4 · Beginner · 3 min read

Cron is the simplest way to run your scraper on a schedule. It ships with macOS and nearly every Linux distribution (minimal container images are the usual exception), so it requires no additional infrastructure.

Cron Schedule Syntax

┌──────── minute (0-59)
│ ┌────── hour (0-23)
│ │ ┌──── day of month (1-31)
│ │ │ ┌── month (1-12)
│ │ │ │ ┌ day of week (0-7, 0 and 7 = Sunday)
│ │ │ │ │
* * * * *  command

Common examples:

# Every hour
0 * * * *

# Every day at 6 AM
0 6 * * *

# Every 15 minutes
*/15 * * * *

# Every Monday at 9 AM
0 9 * * 1

# Twice daily at 8 AM and 8 PM
0 8,20 * * *
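
To sanity-check a schedule before installing it, the five fields can be evaluated against a timestamp in Python. The following is a minimal sketch, not cron's actual parser: the names `cron_matches` and `_expand` are made up for this illustration, and it handles only numbers, `*`, lists, ranges, and steps (no month or weekday names, no `@hourly`-style shortcuts).

```python
from datetime import datetime


def _expand(field, low, high):
    """Expand one cron field into the set of integers it allows."""
    allowed = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step = int(step) if step else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start, end = (int(x) for x in base.split("-"))
        else:
            start = end = int(base)
        allowed.update(range(start, end + 1, step))
    return allowed


def cron_matches(expression, moment):
    """Return True if `moment` falls on a minute the expression selects."""
    minute, hour, dom, month, dow = expression.split()
    # Cron counts Sunday as both 0 and 7; fold 7 back onto 0.
    dow_allowed = {d % 7 for d in _expand(dow, 0, 7)}
    dom_ok = moment.day in _expand(dom, 1, 31)
    dow_ok = moment.isoweekday() % 7 in dow_allowed
    # When both day fields are restricted, cron runs the job if either matches.
    # Simplification: any field other than a bare "*" counts as restricted here.
    if dom != "*" and dow != "*":
        day_ok = dom_ok or dow_ok
    else:
        day_ok = dom_ok and dow_ok
    return (
        moment.minute in _expand(minute, 0, 59)
        and moment.hour in _expand(hour, 0, 23)
        and moment.month in _expand(month, 1, 12)
        and day_ok
    )
```

For example, `cron_matches("0 9 * * 1", datetime(2024, 1, 1, 9, 0))` checks the "Every Monday at 9 AM" entry against a timestamp that falls on a Monday.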

Setting Up a Cron Job

# Edit your crontab
crontab -e

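# Add this line to run your scraper every hour.
# Create the logs/ directory first: the shell opens cron.log before Python starts,
# so the job fails if the directory does not exist yet.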
0 * * * * /home/scraper/my-scraper/venv/bin/python /home/scraper/my-scraper/main.py >> /home/scraper/my-scraper/logs/cron.log 2>&1

A Cron-Friendly Scraper

Design your scraper to run once and exit rather than loop; cron takes care of the repetition:

#!/usr/bin/env python3
# main.py - Designed for cron execution

import requests
from bs4 import BeautifulSoup
import json
import logging
from datetime import datetime
from pathlib import Path

# Set up logging
LOG_DIR = Path(__file__).parent / "logs"
LOG_DIR.mkdir(exist_ok=True)

logging.basicConfig(
    filename=LOG_DIR / "scraper.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)

def scrape_and_save():
    url = "https://news.ycombinator.com"
    logging.info(f"Starting scrape of {url}")

    response = requests.get(url, timeout=30, headers={
        "User-Agent": "Mozilla/5.0 (compatible; MyCrawler/1.0)"
    })
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    items = []
    for link in soup.select(".titleline > a"):
        items.append({"title": link.text, "url": link.get("href", "")})

    # Save with timestamp
    data_dir = Path(__file__).parent / "data"
    data_dir.mkdir(exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output = data_dir / f"hn_{timestamp}.json"

    with open(output, "w", encoding="utf-8") as f:
        json.dump(items, f, indent=2)

    logging.info(f"Saved {len(items)} items to {output}")

if __name__ == "__main__":
    try:
        scrape_and_save()
    except Exception:
        # Log the full traceback, then re-raise so cron sees a non-zero exit status
        logging.exception("Scrape failed")
        raise

Important Cron Gotchas

1. Always Use Absolute Paths

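Cron runs commands with a minimal environment: a short PATH, no activated virtualenv, and your home directory as the working directory. Bare command names and relative paths that work in your terminal can therefore fail: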

# WRONG - relative paths fail in cron
*/30 * * * * python main.py

# RIGHT - absolute paths everywhere
*/30 * * * * /home/scraper/venv/bin/python /home/scraper/main.py

2. Set Environment Variables

# Add environment variables at the top of your crontab
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin
PROXY_URL=http://user:pass@proxy.example.com:8080

0 * * * * /home/scraper/venv/bin/python /home/scraper/main.py
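
Variables set at the top of the crontab reach the script through its process environment. As a rough sketch of how `main.py` might pick up `PROXY_URL`, the helper below (the name `proxies_from_env` is hypothetical) turns it into the mapping that `requests` accepts for its `proxies` argument, and returns `None` when the variable is unset so the same script still runs outside cron.

```python
import os


def proxies_from_env(environ=os.environ):
    """Build a requests-style proxies mapping from PROXY_URL, if it is set."""
    proxy_url = environ.get("PROXY_URL", "").strip()
    if not proxy_url:
        # No proxy configured: passing None means "no explicit proxies".
        return None
    # Route both plain and TLS traffic through the same proxy.
    return {"http": proxy_url, "https": proxy_url}


# Possible use inside scrape_and_save():
#   response = requests.get(url, timeout=30, proxies=proxies_from_env())
```

Taking the environment as a parameter keeps the helper easy to exercise with a plain dictionary instead of the real environment.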

3. Prevent Overlapping Runs

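Use flock to ensure only one instance runs at a time. With -n, a run that finds the lock already held exits immediately instead of queuing behind it. flock is part of util-linux on Linux; macOS does not ship it by default: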

0 * * * * /usr/bin/flock -n /tmp/scraper.lock /home/scraper/venv/bin/python /home/scraper/main.py
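
Where the flock command is unavailable, a similar guard can live inside the script itself. This is a sketch using the standard-library `fcntl` module (Unix only); the helper name `acquire_lock` is an assumption for illustration, and the lock path simply mirrors the one in the crontab line above.

```python
import fcntl


def acquire_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    lock_file = open(lock_path, "w")
    try:
        # LOCK_NB makes this fail immediately instead of waiting, like flock -n.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    return lock_file


# Possible use at the top of main.py:
#   lock = acquire_lock("/tmp/scraper.lock")
#   if lock is None:
#       raise SystemExit("Previous run still in progress; skipping this one.")
#   scrape_and_save()
```

The caller has to keep the returned file object alive for the whole run, because the operating system releases the lock when the file is closed or the process exits.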

Log Rotation

Prevent logs from filling your disk with a logrotate rule. This one rotates weekly and keeps four compressed archives:

# /etc/logrotate.d/scraper
/home/scraper/my-scraper/logs/*.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}
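
On machines where you cannot add files under /etc/logrotate.d, the scraper's own log can be capped from Python instead. The sketch below swaps in the standard library's `RotatingFileHandler`, which rotates by size rather than by week; the function name `build_logger` and the size limit are assumptions. It covers only `scraper.log`, not the `cron.log` file written by the shell redirect.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_logger(log_dir, max_bytes=1_000_000, backup_count=4):
    """Return a logger whose file is rolled over at roughly max_bytes."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        log_dir / "scraper.log", maxBytes=max_bytes, backupCount=backup_count
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    )
    logger = logging.getLogger("scraper")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    logger.addHandler(handler)
    return logger
```

With `backup_count=4`, the handler keeps `scraper.log` plus up to four older files (`scraper.log.1` through `scraper.log.4`) and deletes anything older.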

Monitoring Cron Jobs

Check if your cron job is running:

# View cron execution logs (Debian/Ubuntu path; RHEL-family systems use /var/log/cron)
grep CRON /var/log/syslog

# List your crontab
crontab -l

For production scrapers, consider adding email or Slack notifications on failure to catch issues early.
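
A failure notification can be a few lines of standard-library Python. The sketch below posts a message to a Slack incoming webhook, which accepts a JSON body with a `text` field. The environment variable name `SLACK_WEBHOOK_URL` and both function names are assumptions for illustration, not part of the scraper above.

```python
import json
import os
import urllib.request


def build_slack_request(webhook_url, message):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify_failure(error, environ=os.environ):
    """Send the error to Slack if a webhook is configured; return True if sent."""
    webhook_url = environ.get("SLACK_WEBHOOK_URL")  # hypothetical variable name
    if not webhook_url:
        return False
    request = build_slack_request(webhook_url, f"Scraper failed: {error}")
    with urllib.request.urlopen(request, timeout=10):
        pass  # the response body is not needed
    return True
```

One way to wire this in is to call `notify_failure(e)` from the `except` block in `main.py` before re-raising, and to set the webhook URL at the top of the crontab alongside the other environment variables.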