Scheduling: cron, Airflow, Prefect, Symfony Scheduler
From a cron line on a VPS to a workflow orchestrator with DAGs and retries, the scheduling tools you'll actually pick from.
What you’ll learn
- Choose between cron, Airflow, Prefect, and Symfony Scheduler for a given workload.
- Express dependencies between scrape jobs.
- Handle retries, backfills, and timezone correctness.
Every scraper needs to run on a schedule. The right scheduler depends on workload complexity, and most teams over-engineer this. Cron handles 80% of cases just fine.
The tradeoff space
| Tool | Best for | Complexity |
|---|---|---|
| cron / systemd timers | 1–20 jobs, no dependencies | Trivial |
| Kubernetes CronJob | Same as cron, but in K8s | Low |
| Symfony Scheduler | PHP/Symfony shop, integrated with Messenger | Low |
| Prefect | Dependency graphs, Python, modern UI | Medium |
| Airflow | Complex DAGs, mature ecosystem | High |
| Dagster | Data-asset-centric DAGs | Medium |
| Temporal | Long-running workflows, exactly-once semantics | High |
A common pattern: cron for simple jobs, Prefect or Airflow once dependencies start to matter (scrape → transform → load → notify).
cron, the underrated default
# /etc/cron.d/scrapers
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin
MAILTO=alerts@example.com
# Catalog108 daily at 3am UTC
0 3 * * * scraper /usr/local/bin/python -m scraper.daily >> /var/log/scrapers/daily.log 2>&1
# Hourly health stats
0 * * * * scraper /usr/local/bin/python -m scraper.health
Issues with naive cron:
- No automatic retry.
- No alert on failure (`MAILTO` is the historic mechanism, and it's often ignored).
- No prevention of overlapping runs.

Fixes:
- `flock -n` as a mutex: `flock -n /tmp/daily.lock python -m scraper.daily`.
- A wrapper script that emits a Prometheus Pushgateway metric on success/failure.
- Cronitor or healthchecks.io: services that alert you when an expected run never checked in (a dead man's switch).

The cron + flock + healthchecks.io combo gets you 90% of "production scheduling" with three lines of config.
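The wrapper half of that combo can be sketched in Python: `run_with_lock` below takes an exclusive non-blocking lock, runs the job, and optionally pings a check URL on success. The lock path, job command, and check URL are placeholders, not fixed names.

```python
import fcntl
import subprocess
import sys
import urllib.request

def run_with_lock(lock_path, argv, ping_url=None):
    """Run argv under an exclusive non-blocking lock; ping ping_url on success."""
    with open(lock_path, "w") as lock:
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # A previous run still holds the lock: skip rather than overlap.
            print("previous run still in progress, skipping", file=sys.stderr)
            return 0
        code = subprocess.call(argv)
        if code == 0 and ping_url:
            urllib.request.urlopen(ping_url, timeout=10)  # dead man's switch ping
        return code

if __name__ == "__main__":
    # Hypothetical invocation; point ping_url at your healthchecks.io check.
    sys.exit(run_with_lock("/tmp/daily.lock",
                           [sys.executable, "-m", "scraper.daily"]))
```

Cron then calls this wrapper instead of the job directly; the lock covers overlap, and the missing-ping alert covers silent failure.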
systemd timers, cron's better-equipped cousin
# /etc/systemd/system/scraper-daily.service
[Service]
Type=oneshot
User=scraper
ExecStart=/usr/local/bin/python -m scraper.daily
# /etc/systemd/system/scraper-daily.timer
[Timer]
OnCalendar=*-*-* 03:00:00
RandomizedDelaySec=300
Persistent=true
[Install]
WantedBy=timers.target
Advantages over cron:
- `Persistent=true` runs missed jobs after a reboot.
- `RandomizedDelaySec` staggers thundering-herd starts.
- Native integration with `journalctl` logs.
- Failed runs leave `systemctl status` showing red.
Symfony Scheduler
Symfony 6.3+ ships Scheduler. Define jobs in code:
#[AsSchedule]
class ScrapingSchedule implements ScheduleProviderInterface
{
public function getSchedule(): Schedule
{
return (new Schedule())
->add(
RecurringMessage::cron('0 3 * * *', new ScrapeCatalog108Message()),
RecurringMessage::cron('0 * * * *', new HealthCheckMessage()),
RecurringMessage::every('5 minutes', new QueueCoordinatorMessage())
)
->stateful($this->lockFactory) // multi-instance safe
->processOnlyLastMissedRun();
}
}
The scheduler dispatches Messenger messages; handlers run in the worker fleet. Crucially, `stateful()` plus a lock factory makes it safe to run multiple scheduler instances: only one fires each occurrence.
Run with: php bin/console messenger:consume scheduler_default --time-limit=3600
Prefect (modern Python orchestrator)
Prefect 2/3 reads like normal Python:
from prefect import flow, task
@task(retries=3, retry_delay_seconds=60)
def scrape(url):
return fetch(url)
@task
def store(items):
write_to_postgres(items)
@flow
def daily_scrape():
urls = discover_urls()
items = [scrape(u) for u in urls]
store(items)
daily_scrape.serve(name="daily", cron="0 3 * * *")
You get retries, parameterized runs, a UI, observability, and DAGs as code. `serve()` starts a long-running process, so no separate orchestrator host is needed in development.
Prefect Cloud is the hosted option; Prefect Server is self-hosted. For Python-heavy teams that need DAGs, this is the sweet spot.
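Conceptually, `@task(retries=3, retry_delay_seconds=60)` is a retry loop around the function. A plain-Python sketch of the idea (not Prefect's actual implementation; `retrying` and `flaky_scrape` are invented names for illustration):

```python
import time
from functools import wraps

def retrying(retries=3, delay=0.0):
    """Re-run the wrapped function up to `retries` extra times on exception."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise  # out of attempts: propagate the last error
                    time.sleep(delay)
        return wrapper
    return decorator

calls = {"n": 0}

@retrying(retries=2, delay=0.0)
def flaky_scrape():
    calls["n"] += 1
    if calls["n"] < 3:  # fail twice, succeed on the third attempt
        raise ConnectionError("transient failure")
    return "ok"
```

What Prefect adds on top of this loop is state tracking, scheduling, and the UI, which is what you're really buying.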
Airflow (industry standard for complex DAGs)
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
with DAG(
"catalog108_daily",
start_date=datetime(2026, 1, 1),
schedule="0 3 * * *",
catchup=False,
default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
discover = PythonOperator(task_id="discover", python_callable=discover_urls)
scrape = PythonOperator(task_id="scrape", python_callable=scrape_all)
store = PythonOperator(task_id="store", python_callable=store_results)
notify = PythonOperator(task_id="notify", python_callable=send_summary)
discover >> scrape >> store >> notify
Airflow's strengths: mature, vast plugin ecosystem, battle-tested at scale, strong SLA / lineage / monitoring features. Weaknesses: heavier, slower iteration, scheduler intricacies.
If you're already running Airflow for data engineering, putting scrapers there is natural. For greenfield projects, Prefect feels lighter.
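The `>>` chaining above is operator overloading that records edges, which the scheduler then topologically sorts. A toy sketch of the mechanism (not Airflow's internals; `Step` and `run_order` are invented names):

```python
from graphlib import TopologicalSorter

class Step:
    def __init__(self, name):
        self.name = name
        self.upstream = set()  # names of steps that must run first

    def __rshift__(self, other):
        other.upstream.add(self.name)
        return other  # returning `other` lets a >> b >> c chain

def run_order(steps):
    """Flatten the dependency graph into a valid execution order."""
    graph = {s.name: s.upstream for s in steps}
    return list(TopologicalSorter(graph).static_order())

discover, scrape, store, notify = (Step(n) for n in
                                   ["discover", "scrape", "store", "notify"])
discover >> scrape >> store >> notify
```

Returning the right-hand operand from `__rshift__` is the trick that makes a linear chain read like an arrow diagram.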
Timezones, the universal gotcha
0 3 * * * ... # 3am where?
Cron runs in the system timezone. Set `TZ=UTC` in the crontab or trust the system setting. Mixing timezones (server in UTC, schedule expressed in Europe/Paris) is the #1 source of "why didn't it run?"
Same in Symfony Scheduler:
RecurringMessage::cron('0 3 * * *', new Message(), timezone: 'Europe/Paris')
In Airflow and Prefect, schedules are timezone-aware via the timezone parameter on the DAG / flow. Daylight saving time transitions create "this day has 23 hours" edge cases; be explicit, and document the timezone in the DAG name if it matters.
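The 23-hour day is easy to demonstrate with `zoneinfo`: the same 3am wall-clock schedule in Europe/Paris maps to different UTC instants in winter (CET, UTC+1) and summer (CEST, UTC+2). The specific dates below are illustrative:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

PARIS = ZoneInfo("Europe/Paris")

def utc_offset_at(local):
    """UTC offset in effect for a naive Paris wall-clock datetime."""
    return local.replace(tzinfo=PARIS).utcoffset()

winter = utc_offset_at(datetime(2026, 1, 15, 3, 0))  # CET
summer = utc_offset_at(datetime(2026, 7, 15, 3, 0))  # CEST
# "3am Paris" is 02:00 UTC in winter but 01:00 UTC in summer.
```

A schedule pinned to UTC never shifts; a schedule pinned to Europe/Paris shifts its UTC firing time twice a year, and exactly one of those is what you want.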
Backfills
Backfill = "this scrape was supposed to run for the last 10 days but didn't; run them now." Airflow has first-class backfill (airflow dags backfill ...); Prefect supports it via parameterised runs. With cron/Symfony Scheduler you script it manually.
`schedule_interval` semantics in Airflow are notoriously tricky: the run for 2026-05-12 actually starts at the END of that interval. Read the docs once and trust the convention.
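For cron or Symfony Scheduler, a manual backfill is just a loop over the missed logical dates. A minimal sketch, where the actual job invocation is a placeholder you'd swap for your scraper's date-taking command:

```python
from datetime import date, timedelta

def missed_dates(last_success, today):
    """Logical dates strictly between the last successful run and today."""
    days = (today - last_success).days
    return [last_success + timedelta(days=i) for i in range(1, days)]

def backfill(last_success, today, run):
    """Invoke `run` once per missed logical date, oldest first."""
    for d in missed_dates(last_success, today):
        run(d)  # e.g. subprocess.call([..., "--date", d.isoformat()])
```

Passing the logical date explicitly, rather than letting the job compute "today", is what makes the rerun reproducible.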
Picking one
A practical rule:
- <20 jobs, no inter-job dependencies: cron / systemd timers + healthchecks.io.
- In Symfony / PHP shop: Symfony Scheduler with Messenger.
- In K8s: K8s CronJobs (cleaner than node-level cron).
- DAGs, dependencies, lineage, Python-heavy: Prefect for new projects, Airflow if the org already uses it.
Over-orchestration is a tax. Start with cron; graduate when the pain is real.
What to try
Schedule your Catalog108 scraper to run hourly using each of:
- cron + flock + healthchecks.io.
- Symfony Scheduler with `every('1 hour')`.
- Prefect's `serve(cron='0 * * * *')`.
Pick the one whose ergonomics feel right for your stack. The cron version is the smallest config; Prefect's UI is the easiest to operate.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.