Scheduling Scrapers with Cron Jobs
Learn how to schedule your Python web scrapers to run automatically using cron jobs on Linux and macOS.
Cron is the simplest way to run your scraper on a schedule. It ships with macOS and nearly every Linux distribution (minimal container images may need it installed), so it requires no additional infrastructure.
Cron Schedule Syntax
┌──────── minute (0-59)
│ ┌────── hour (0-23)
│ │ ┌──── day of month (1-31)
│ │ │ ┌── month (1-12)
│ │ │ │ ┌ day of week (0-7, 0 and 7 = Sunday)
│ │ │ │ │
* * * * * command
Common examples:
# Every hour
0 * * * *
# Every day at 6 AM
0 6 * * *
# Every 15 minutes
*/15 * * * *
# Every Monday at 9 AM
0 9 * * 1
# Twice daily at 8 AM and 8 PM
0 8,20 * * *
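To make the field syntax concrete, here is a minimal sketch that expands a single cron field into the values it matches. The function name `expand_field` is hypothetical, and the sketch covers only the numeric forms used above (`*`, lists, ranges, and steps); it does not handle month or weekday names, or the rule that 7 also means Sunday.

```python
def expand_field(field, low, high):
    """Return the sorted values one cron field matches within [low, high]."""
    values = set()
    for part in field.split(","):
        # Split off an optional step, e.g. "*/15" or "0-30/10".
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(base)
        values.update(range(start, end + 1, step))
    return sorted(values)


# Minute field of "*/15 * * * *"
print(expand_field("*/15", 0, 59))  # [0, 15, 30, 45]
# Hour field of "0 8,20 * * *"
print(expand_field("8,20", 0, 23))  # [8, 20]
```

A job fires when the current minute, hour, and month all fall in their expanded sets and the day fields match.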
Setting Up a Cron Job
# Edit your crontab
crontab -e
# Add this line to run your scraper every hour.
# The logs/ directory must already exist: the shell opens cron.log before Python starts.
0 * * * * /home/scraper/my-scraper/venv/bin/python /home/scraper/my-scraper/main.py >> /home/scraper/my-scraper/logs/cron.log 2>&1
A Cron-Friendly Scraper
Design your scraper to run once and exit rather than loop internally; cron supplies the repetition:
#!/usr/bin/env python3
# main.py - Designed for cron execution: run once, log, exit
import json
import logging
from datetime import datetime
from pathlib import Path

import requests
from bs4 import BeautifulSoup

# Set up logging next to this file, so the location does not depend on cron's working directory
LOG_DIR = Path(__file__).parent / "logs"
LOG_DIR.mkdir(exist_ok=True)
logging.basicConfig(
    filename=LOG_DIR / "scraper.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)


def scrape_and_save():
    url = "https://news.ycombinator.com"
    logging.info(f"Starting scrape of {url}")
    response = requests.get(url, timeout=30, headers={
        "User-Agent": "Mozilla/5.0 (compatible; MyCrawler/1.0)"
    })
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Collect the title and target URL of each story link
    items = []
    for link in soup.select(".titleline > a"):
        items.append({"title": link.text, "url": link.get("href", "")})

    # Save with timestamp so each run writes a new file
    data_dir = Path(__file__).parent / "data"
    data_dir.mkdir(exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output = data_dir / f"hn_{timestamp}.json"
    with open(output, "w", encoding="utf-8") as f:
        json.dump(items, f, indent=2)
    logging.info(f"Saved {len(items)} items to {output}")


if __name__ == "__main__":
    try:
        scrape_and_save()
    except Exception:
        # Log the full traceback, then re-raise so the process exits non-zero
        logging.exception("Scrape failed")
        raise
Important Cron Gotchas
1. Always Use Absolute Paths
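Cron runs jobs with a minimal environment (a short PATH, no shell profile, no activated virtualenv) and starts them in your home directory, not the project directory: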
# WRONG - relative paths fail in cron
*/30 * * * * python main.py
# RIGHT - absolute paths everywhere
*/30 * * * * /home/scraper/venv/bin/python /home/scraper/main.py
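The same rule applies inside the script: any file it opens should be located relative to the script itself, not the current working directory. A minimal sketch, assuming a hypothetical helper named `project_path`:

```python
from pathlib import Path


def project_path(script_file, *parts):
    """Build an absolute path under the directory that contains script_file."""
    return Path(script_file).resolve().parent.joinpath(*parts)


# In main.py this would be called with __file__, for example:
#   data_dir = project_path(__file__, "data")
#   log_file = project_path(__file__, "logs", "scraper.log")
```

Because the result is anchored to the script's location, it stays the same whether the job is started by cron or by hand from another directory.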
2. Set Environment Variables
# Add environment variables at the top of your crontab.
# Values are taken literally: $PATH or ~ on the right-hand side is not expanded.
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin
PROXY_URL=http://user:pass@proxy.example.com:8080
0 * * * * /home/scraper/venv/bin/python /home/scraper/main.py
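On the Python side, the script can read those variables from its environment. The sketch below assumes the `PROXY_URL` variable from the crontab above; the helper name `proxies_from_env` is hypothetical, and the returned dictionary follows the `proxies` argument format that `requests` accepts.

```python
import os


def proxies_from_env(env=None):
    """Return a requests-style proxies dict, or None if PROXY_URL is unset."""
    env = os.environ if env is None else env
    proxy_url = env.get("PROXY_URL")
    if not proxy_url:
        return None
    # Route both plain and TLS traffic through the same proxy.
    return {"http": proxy_url, "https": proxy_url}


# Possible use inside scrape_and_save():
#   response = requests.get(url, timeout=30, proxies=proxies_from_env())
```

Returning `None` when the variable is missing lets the same script run without a proxy during local testing.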
3. Prevent Overlapping Runs
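Use flock to ensure only one instance runs at a time. With -n, a run that finds the lock already held exits immediately instead of waiting. (flock is part of util-linux on Linux; macOS does not include it by default.)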
0 * * * * /usr/bin/flock -n /tmp/scraper.lock /home/scraper/venv/bin/python /home/scraper/main.py
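Where the flock command is unavailable, a similar guard can live inside the script. This is a sketch using the standard library's `fcntl.flock` on Unix-like systems, not a drop-in replacement for the crontab line above; the function name `acquire_lock` and the lock path are assumptions.

```python
import fcntl


def acquire_lock(lock_path):
    """Try to take an exclusive, non-blocking lock on lock_path.

    Returns the open file handle on success (keep it open for the whole
    run), or None if another process already holds the lock.
    """
    handle = open(lock_path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


# Possible use at the top of main.py:
#   lock = acquire_lock("/tmp/scraper.lock")
#   if lock is None:
#       raise SystemExit("previous run still in progress")
```

The operating system releases the lock when the handle is closed or the process exits, so a crashed run does not leave a stale lock behind.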
Log Rotation
Prevent logs from filling your disk with a logrotate rule that keeps four weekly, compressed archives:
# /etc/logrotate.d/scraper
/home/scraper/my-scraper/logs/*.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}
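For the script's own scraper.log, rotation can also be handled inside Python. The sketch below swaps in the standard library's `RotatingFileHandler`, which rotates by file size instead of weekly; the function name `build_logger` and the size limit are assumptions. The cron.log file written by the shell redirect would still need logrotate.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(log_file, max_bytes=1_000_000, backups=4):
    """Return a logger that rolls log_file over once it exceeds max_bytes."""
    logger = logging.getLogger("scraper")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    handler = RotatingFileHandler(
        log_file, maxBytes=max_bytes, backupCount=backups, encoding="utf-8"
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    return logger


# Possible use in main.py, in place of logging.basicConfig(...):
#   log = build_logger(LOG_DIR / "scraper.log")
#   log.info("Starting scrape")
```

With `backups=4`, at most four older files (scraper.log.1 through scraper.log.4) are kept alongside the current one.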
Monitoring Cron Jobs
Check if your cron job is running:
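# View cron execution logs (Debian/Ubuntu path; RHEL-family systems typically use /var/log/cron,
# and journald-only systems may need journalctl instead)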
grep CRON /var/log/syslog
# List your crontab
crontab -l
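The cron log shows that a job was started, not that it produced anything. One complementary check is the age of the newest output file. This is a sketch that assumes the `hn_*.json` naming used by main.py above; the function names and the two-hour threshold are hypothetical.

```python
import time
from pathlib import Path


def newest_file_age(data_dir, pattern="hn_*.json"):
    """Return the age in seconds of the newest matching file, or None if none exist."""
    files = list(Path(data_dir).glob(pattern))
    if not files:
        return None
    newest_mtime = max(f.stat().st_mtime for f in files)
    return time.time() - newest_mtime


def is_stale(data_dir, max_age_seconds=2 * 3600):
    """True if no output exists or the newest output is older than the limit."""
    age = newest_file_age(data_dir)
    return age is None or age > max_age_seconds
```

For an hourly job, a limit of about two intervals tolerates one slow run while still flagging a job that has silently stopped.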
For production scrapers, consider adding email or Slack notifications on failure to catch issues early.
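As one possible shape for a failure notification, the sketch below posts a message to a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names, the job name, and the `SLACK_WEBHOOK_URL` environment variable are assumptions; only the standard library is used.

```python
import json
import urllib.request


def build_alert(job_name, error):
    """Build the JSON payload describing a failed run."""
    return {"text": f"{job_name} failed: {type(error).__name__}: {error}"}


def send_alert(webhook_url, payload, opener=urllib.request.urlopen):
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener(request, timeout=10) as response:
        return response.status


# Possible use in main.py's except block (with "import os" at the top):
#   send_alert(os.environ["SLACK_WEBHOOK_URL"], build_alert("hn-scraper", e))
```

The `opener` parameter exists only so the function can be exercised without a network call; in normal use the default is enough.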