
Intermediate · 5 min read

Scheduling: cron, Airflow, Prefect, Symfony Scheduler

From a cron line on a VPS to a workflow orchestrator with DAGs and retries, the scheduling tools you'll actually pick from.

What you’ll learn

  • Choose between cron, Airflow, Prefect, and Symfony Scheduler for a given workload.
  • Express dependencies between scrape jobs.
  • Handle retries, backfills, and timezone correctness.

Every scraper needs to run on a schedule. The right scheduler depends on workload complexity, and most teams over-engineer this. Cron handles 80% of cases just fine.

The tradeoff space

Tool | Best for | Complexity
cron / systemd timers | 1–20 jobs, no dependencies | Trivial
Kubernetes CronJob | Same as cron, but in K8s | Low
Symfony Scheduler | PHP/Symfony shop, integrated with Messenger | Low
Prefect | Dependency graphs, Python, modern UI | Medium
Airflow | Complex DAGs, mature ecosystem | High
Dagster | Data-asset-centric DAGs | Medium
Temporal | Long-running workflows, exactly-once semantics | High

A common pattern: cron for simple jobs, Prefect or Airflow once dependencies start to matter (scrape → transform → load → notify).

cron, the underrated default

# /etc/cron.d/scrapers
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin
MAILTO=alerts@example.com

# Catalog108 daily at 3am UTC
0 3 * * * scraper /usr/local/bin/python -m scraper.daily >> /var/log/scrapers/daily.log 2>&1

# Hourly health stats
0 * * * * scraper /usr/local/bin/python -m scraper.health

Issues with naive cron:

  • No automatic retry.
  • No alert on failure (MAILTO is the historic way, often ignored).
  • No prevention of overlapping runs.

Fixes:

  • flock -n as a mutex: flock -n /tmp/daily.lock python -m scraper.daily exits immediately if the previous run still holds the lock.
  • A wrapper script that emits a Prometheus Pushgateway metric on success/failure.
  • Cronitor / healthchecks.io: dead-man's-switch services that alert you when an expected run fails to check in.

The cron + flock + healthchecks.io combo gets you 90% of "production scheduling" with three lines of config.
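
Here's a minimal sketch of the wrapper in Python, assuming a healthchecks.io check (the ping UUID is a placeholder) and the scraper.daily module from the crontab above:

# run_daily.py - lock, run, ping: the whole "production scheduling" wrapper
import fcntl
import subprocess
import sys
import urllib.request

PING_URL = "https://hc-ping.com/your-check-uuid"  # placeholder: your healthchecks.io check URL

def main() -> int:
    with open("/tmp/daily.lock", "w") as lock:
        try:
            # Non-blocking lock: bail out if the previous run is still going.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return 0
        result = subprocess.run([sys.executable, "-m", "scraper.daily"])
        # healthchecks.io convention: ping the bare URL on success, /fail on failure.
        suffix = "" if result.returncode == 0 else "/fail"
        urllib.request.urlopen(PING_URL + suffix, timeout=10)
        return result.returncode

if __name__ == "__main__":
    raise SystemExit(main())

Point the crontab at this script instead of calling scraper.daily directly, and the locking, exit-code handling, and dead-man ping all live in one place.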

systemd timers, cron's better-equipped cousin

# /etc/systemd/system/scraper-daily.service
[Service]
Type=oneshot
User=scraper
ExecStart=/usr/local/bin/python -m scraper.daily

# /etc/systemd/system/scraper-daily.timer
[Timer]
OnCalendar=*-*-* 03:00:00
RandomizedDelaySec=300
Persistent=true

[Install]
WantedBy=timers.target

Advantages over cron:

  • Persistent=true runs missed jobs after a reboot.
  • RandomizedDelaySec staggers thundering herds of jobs that would otherwise all fire at the same instant.
  • Native integration with journalctl logs.
  • Failed runs leave systemctl status showing red.
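
Enable with systemctl enable --now scraper-daily.timer; systemctl list-timers then shows the last and next run of every timer on the host.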

Symfony Scheduler

Symfony 6.3+ ships Scheduler. Define jobs in code:

#[AsSchedule]
class ScrapingSchedule implements ScheduleProviderInterface
{
    public function __construct(
        private CacheInterface $cache,
        private LockFactory $lockFactory,
    ) {
    }

    public function getSchedule(): Schedule
    {
        return (new Schedule())
            ->add(
                RecurringMessage::cron('0 3 * * *', new ScrapeCatalog108Message()),
                RecurringMessage::cron('0 * * * *', new HealthCheckMessage()),
                RecurringMessage::every('5 minutes', new QueueCoordinatorMessage()),
            )
            ->stateful($this->cache)                            // remember missed runs across restarts
            ->lock($this->lockFactory->createLock('scheduler')) // multi-instance safe
            ->processOnlyLastMissedRun(true);
    }
}

The scheduler dispatches Messenger messages; handlers run in the worker fleet. Crucially, stateful() remembers missed occurrences across restarts, and lock() with a shared lock store makes it safe to run multiple scheduler instances: only one fires each occurrence.

Run with: php bin/console messenger:consume scheduler_default --time-limit=3600
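
In production, run that consumer under systemd or supervisord so it restarts when the time limit expires; --time-limit exists precisely so long-lived workers recycle cleanly instead of accumulating state.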

Prefect (modern Python orchestrator)

Prefect 2/3 reads like normal Python:

from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def scrape(url):
    return fetch(url)

@task
def store(items):
    write_to_postgres(items)

@flow
def daily_scrape():
    urls = discover_urls()
    items = [scrape(u) for u in urls]
    store(items)

if __name__ == "__main__":
    daily_scrape.serve(name="daily", cron="0 3 * * *")

You get retries, parameterized runs, a UI, observability, and DAGs as code. serve() starts a long-running process; no separate orchestrator host is needed in development.

Prefect Cloud is the hosted option; Prefect Server is self-hosted. For Python-heavy teams that need DAGs, this is the sweet spot.

Airflow (industry standard for complex DAGs)

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

with DAG(
    "catalog108_daily",
    start_date=datetime(2026, 1, 1),
    schedule="0 3 * * *",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:

    discover = PythonOperator(task_id="discover", python_callable=discover_urls)
    scrape = PythonOperator(task_id="scrape", python_callable=scrape_all)
    store = PythonOperator(task_id="store", python_callable=store_results)
    notify = PythonOperator(task_id="notify", python_callable=send_summary)

    discover >> scrape >> store >> notify

Airflow's strengths: mature, vast plugin ecosystem, battle-tested at scale, strong SLA / lineage / monitoring features. Weaknesses: heavier, slower iteration, scheduler intricacies.

If you're already running Airflow for data engineering, putting scrapers there is natural. For greenfield projects, Prefect feels lighter.

Timezones, the universal gotcha

0 3 * * * ...  # 3am where?

Cron interprets schedules in the system timezone (cronie also honors a per-crontab CRON_TZ variable if you need a different one). Mixing timezones (server in UTC, schedule written with Europe/Paris in mind) is the #1 source of "why didn't it run?"

Same in Symfony Scheduler:

RecurringMessage::cron('0 3 * * *', new Message(), new \DateTimeZone('Europe/Paris'))

In Airflow, the schedule's timezone comes from a timezone-aware (pendulum) start_date; in Prefect, cron schedules take a timezone parameter. Daylight-saving transitions create "this day has 23 hours" edge cases: be explicit, and document the timezone in the DAG name if it matters.
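
A minimal sketch of pinning an Airflow DAG to Paris time (the DAG name and placeholder task are illustrative):

import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    "catalog108_daily_paris",
    # Airflow reads the DAG's timezone from the aware start_date,
    # so "0 3 * * *" means 03:00 Europe/Paris, DST shifts included.
    start_date=pendulum.datetime(2026, 1, 1, tz="Europe/Paris"),
    schedule="0 3 * * *",
    catchup=False,
) as dag:
    EmptyOperator(task_id="placeholder")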

Backfills

Backfill = "this scrape was supposed to run for the last 10 days but didn't; run them now." Airflow has first-class backfill (airflow dags backfill ...); Prefect supports it via parameterized runs. With cron / Symfony Scheduler you script it manually.

schedule_interval semantics in Airflow are notoriously tricky: the run stamped 2026-05-12 actually starts at the END of that interval, just after midnight on 2026-05-13. Read the docs once and trust the convention.
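
To see those bounds from inside a task, Airflow exposes them in the task context; a small sketch (the task itself is illustrative):

from airflow.decorators import task
from airflow.operators.python import get_current_context

@task
def log_interval():
    ctx = get_current_context()
    # For the run stamped 2026-05-12 on a daily schedule:
    #   data_interval_start = 2026-05-12 00:00
    #   data_interval_end   = 2026-05-13 00:00
    # and the scheduler only triggers the run after data_interval_end.
    print(ctx["data_interval_start"], ctx["data_interval_end"])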

Picking one

A practical rule:

  • <20 jobs, no inter-job dependencies: cron / systemd timers + healthchecks.io.
  • In Symfony / PHP shop: Symfony Scheduler with Messenger.
  • In K8s: K8s CronJobs (cleaner than node-level cron).
  • DAGs, dependencies, lineage, Python-heavy: Prefect for new projects, Airflow if the org already uses it.

Over-orchestration is a tax. Start with cron; graduate when the pain is real.

What to try

Schedule your Catalog108 scraper to run hourly using each of:

  1. cron + flock + healthchecks.io.
  2. Symfony Scheduler with every('1 hour').
  3. Prefect's serve(cron='0 * * * *').

Pick the one whose ergonomics feel right for your stack. The cron version is the smallest config; Prefect's UI is the easiest to operate.

Quiz, check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.


For a Symfony shop running 20 simple scheduled scrapes with no inter-job dependencies, which scheduler is the right default?
