
Advanced · 4 min read

Frontera and Scrapy Cluster

Frontiers: the queue-of-URLs abstraction for huge crawls. Frontera and Scrapy Cluster are the two major Python implementations.

What you’ll learn

  • Understand the 'frontier' concept for large crawls.
  • Use Frontera as a backend for distributed Scrapy spiders.
  • Compare Frontera, Scrapy Cluster, and scrapy-redis.

For huge crawls (millions of URLs, multi-day runs, web-scale targets), the URL queue itself becomes the central abstraction, the "frontier." Frontera and Scrapy Cluster are the two mature Python implementations.

What's a "frontier"?

In web crawling jargon, the frontier is the set of URLs known to exist but not yet fetched. It's:

  • Persistent (survives restarts).
  • Prioritized (some URLs are more important).
  • Deduplicated (don't fetch the same URL twice).
  • Politeness-aware (respect per-host limits).

scrapy-redis covers a basic version. Frontera and Scrapy Cluster are designed for crawls where the frontier has special needs: billions of URLs, complex priority logic, custom backends.
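The four properties above can be sketched as a toy in-memory frontier. Everything here (class and method names included) is illustrative, not Frontera's API; a real frontier persists this state to Redis, SQL, or HBase:

```python
import heapq
import time

class ToyFrontier:
    """Toy frontier: prioritized, deduplicated, politeness-aware.
    (Not persistent -- a real frontier stores this state in a database.)"""

    def __init__(self, per_host_delay=1.0):
        self.heap = []           # (-priority, url) min-heap -> highest priority first
        self.seen = set()        # dedup: never enqueue a URL twice
        self.last_fetch = {}     # host -> timestamp of last fetch (politeness)
        self.per_host_delay = per_host_delay

    def add(self, url, priority=0.0):
        if url in self.seen:
            return False                      # already known: dedup
        self.seen.add(url)
        heapq.heappush(self.heap, (-priority, url))
        return True

    def next_batch(self, n=10):
        """Pop up to n URLs whose hosts are not rate-limited right now."""
        batch, deferred = [], []
        now = time.monotonic()
        while self.heap and len(batch) < n:
            neg_prio, url = heapq.heappop(self.heap)
            host = url.split('/')[2]
            if now - self.last_fetch.get(host, 0.0) < self.per_host_delay:
                deferred.append((neg_prio, url))   # host busy, retry later
                continue
            self.last_fetch[host] = now
            batch.append(url)
        for item in deferred:                      # push deferred URLs back
            heapq.heappush(self.heap, item)
        return batch
```

Note how politeness lives inside the frontier itself: `next_batch` hands out at most one URL per host per delay window, regardless of priorities.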

Frontera

frontera is a Python framework (originally from Scrapinghub) for managing crawl frontiers. It has a plug-in architecture with three components:

  • Backend. Where URLs are stored (Redis, HBase, SQL, etc.).
  • Strategy. How URLs are prioritized.
  • Manager. Coordinates the backend, strategy, and crawler.

Integration with Scrapy

# settings.py (Scrapy side)
SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
FRONTERA_SETTINGS = 'myproject.frontera_settings'

# frontera_settings.py (Frontera side)
BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFO'
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontier.db'  # swap for Postgres in production
MAX_NEXT_REQUESTS = 10                # batch size handed back to the spider
SPIDER_LOG_CONSUMER_BATCH_SIZE = 512

The Frontera scheduler replaces Scrapy's default. URL discovery from the spider flows through the frontier; URL retrieval (next batch to crawl) comes back from it.

Strategies

Built-in strategies:

  • BFS (breadth-first), fan out from seed URLs.
  • DFS (depth-first).
  • Discovery, explore by topical similarity.

Custom strategies implement BaseCrawlingStrategy, score URLs based on your logic (e.g. "prefer pages with newer content," "prioritize high-value domains").
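A custom strategy boils down to a scoring function. Here is a plain-Python sketch of the kind of logic you would put inside a `BaseCrawlingStrategy` subclass; the domain list and weights are invented for illustration, and the wiring into Frontera (feeding scores to the backend) is omitted:

```python
from urllib.parse import urlparse

# Hypothetical priority weights -- in Frontera this logic would live inside
# a BaseCrawlingStrategy subclass and the score would accompany each request.
HIGH_VALUE_DOMAINS = {"example-news.com": 0.9, "example-shop.com": 0.7}

def score_url(url, depth):
    """Score in [0, 1]: prefer high-value domains, penalize crawl depth."""
    host = urlparse(url).netloc
    base = HIGH_VALUE_DOMAINS.get(host, 0.3)   # unknown hosts get a floor score
    return max(0.0, base - 0.05 * depth)       # deeper pages matter less
```

The backend then pops requests in score order, so "prefer pages with newer content" or "prioritize high-value domains" is just a different `score_url`.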

Scrapy Cluster

A different approach: scrapy-cluster (originally from IST Research) is a microservices stack for distributed Scrapy:

  • Kafka for message bus.
  • Redis for state/dedup.
  • Scrapy spiders as consumers.
  • REST API to submit crawl jobs.

Architecture:

Client (REST POST)
  ↓
Kafka "demand" topic
  ↓
N Scrapy instances pull, crawl
  ↓
Kafka "results" topic
  ↓
Result consumers (Postgres, ES, etc.)

The complexity is real but the throughput ceiling is high. Internet-scale crawls (Common Crawl-style) use architectures like this.
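A crawl job enters the stack as a JSON message on the demand topic. A sketch of building such a payload follows; the field names match Scrapy Cluster's documented crawl API (`url`, `appid`, `crawlid`), but verify them against your deployment's schema before relying on them:

```python
import json
import uuid

def make_crawl_job(url, appid, priority=1, maxdepth=0):
    """Build a crawl-request message for the Kafka demand topic.

    Field names follow Scrapy Cluster's crawl API convention; check your
    deployment's schema, since monitors validate incoming messages.
    """
    return {
        "url": url,
        "appid": appid,                # identifies the submitting application
        "crawlid": uuid.uuid4().hex,   # lets you track or stop this crawl later
        "priority": priority,
        "maxdepth": maxdepth,          # 0 = fetch only the submitted URL
    }

# Serialized, then POSTed to the REST service (or produced straight to Kafka):
job = make_crawl_job("https://example.com", appid="demo", maxdepth=2)
payload = json.dumps(job)
```

The `crawlid` is the key design idea: because jobs are messages rather than local scheduler state, any consumer can report on, extend, or cancel a crawl by referencing its id.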

scrapy-redis vs Frontera vs Scrapy Cluster

Concern           scrapy-redis            Frontera                      Scrapy Cluster
Setup complexity  Low                     Medium                        High
Max scale         ~10M URLs               ~1B URLs                      Web-scale
Priority logic    Limited                 Pluggable                     Pluggable
Backends          Redis only              Multiple (SQL, HBase)         Kafka + Redis
Best for          Small-medium clusters   Large priority-aware crawls   Web-scale archives

Most production scraping never reaches Frontera, let alone Scrapy Cluster. scrapy-redis covers 90% of cases.

When you genuinely need Frontera

  • Crawls with >10M frontier URLs.
  • Domain-prioritization logic ("scrape news sites every hour, blogs every day").
  • Resume long crawls reliably.
  • Multiple priority queues per crawl.

When you don't:

  • "Crawl this catalogue of 100k products." scrapy-redis is simpler.
  • One-shot crawls.

Politeness in Frontera

Frontera understands per-host budgets:

DELAY_ON_EMPTY = 5.0  # seconds to sleep when queue is empty
HOST_LIMIT = 50_000  # max URLs per host
HOST_DELAY_ENFORCE_BY_DOMAIN = True

Useful for crawls spanning thousands of domains, where each host needs its own rate.
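Frontier-side enforcement amounts to tracking a next-allowed-fetch time and a URL count per host. A standalone sketch of that idea (not Frontera internals; names are illustrative):

```python
import time
from urllib.parse import urlparse

class HostBudget:
    """Per-host politeness: each host gets its own minimum delay and URL cap."""

    def __init__(self, default_delay=1.0, max_urls_per_host=50_000):
        self.default_delay = default_delay
        self.max_urls = max_urls_per_host
        self.next_allowed = {}   # host -> earliest permitted next fetch time
        self.counts = {}         # host -> URLs fetched so far

    def may_fetch(self, url, now=None):
        """Return True and reserve a slot if this host is fetchable now."""
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        if self.counts.get(host, 0) >= self.max_urls:
            return False                     # host budget exhausted
        if now < self.next_allowed.get(host, 0.0):
            return False                     # too soon for this host
        self.counts[host] = self.counts.get(host, 0) + 1
        self.next_allowed[host] = now + self.default_delay
        return True
```

Because the frontier owns this state, every worker in the cluster sees the same per-host budget; no single crawler can accidentally hammer one domain.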

Custom backends

For huge frontiers, SQLite isn't enough. Frontera supports SQL backends (Postgres) and HBase. For multi-million URL frontiers, an HBase backend on a small cluster is common: Frontera reads URLs in batches, marks them in-flight, and marks them completed.

BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'
HBASE_THRIFT_HOST = 'hbase.local'
HBASE_THRIFT_PORT = 9090

The frontier becomes its own service, backed by a real database, accessible from many crawler workers.
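The read-batch / in-flight / completed cycle can be sketched against any table-like store. The schema below is illustrative (Frontera's actual HBase layout differs), but the state machine — queued, in_flight, done — is the same idea:

```python
import sqlite3

# Illustrative frontier-as-a-table; a production frontier would use
# Postgres or HBase, but the claim cycle is identical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frontier (url TEXT PRIMARY KEY, state TEXT, score REAL)")

def add(url, score=0.0):
    conn.execute("INSERT OR IGNORE INTO frontier VALUES (?, 'queued', ?)",
                 (url, score))

def claim_batch(n):
    """Move the top-n queued URLs to in_flight (one transaction) and return them."""
    rows = conn.execute(
        "SELECT url FROM frontier WHERE state='queued' "
        "ORDER BY score DESC LIMIT ?", (n,)).fetchall()
    urls = [r[0] for r in rows]
    conn.executemany("UPDATE frontier SET state='in_flight' WHERE url=?",
                     [(u,) for u in urls])
    conn.commit()
    return urls

def mark_done(url):
    conn.execute("UPDATE frontier SET state='done' WHERE url=?", (url,))
    conn.commit()
```

The in_flight state is what makes long crawls resumable: on restart, anything still in_flight can be safely re-queued instead of being lost or refetched blindly.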

Modern alternatives

The landscape has shifted since Frontera's heyday (~2015-2018):

  • Lighter custom frontier services + Kafka for new projects.
  • Cloud-native crawlers (Apify, Bright Data crawler hosting).
  • Scrapy + scrapy-redis + ad-hoc priority logic for medium scale.

Frontera and Scrapy Cluster are still maintained but used less than they were. For most projects, scrapy-redis plus custom Redis-backed priorities, with Celery (or a similar task queue) for orchestration, is the pragmatic stack.

What to learn from these tools

Even if you never deploy Frontera, the concepts apply:

  • Frontier as a service, separated from crawler.
  • Priority queues as first-class.
  • Dedup at scale via Bloom filters or HBase row keys.
  • Per-host budgets enforced by the frontier, not the crawler.

These patterns appear in any large crawl, built explicitly or accidentally.
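The "dedup at scale" pattern above is usually a Bloom filter: constant memory, no false negatives, and a tunable false-positive rate. A minimal toy version (parameters are illustrative; production crawls size the filter to the expected URL count or use a Redis Bloom module):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter for URL dedup: may occasionally say 'seen' for a
    new URL (false positive), but never says 'new' for a seen one."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions by salting the hash with an index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(url))
```

The trade-off is deliberate: a rare false positive means skipping one fetchable URL, which is cheap; storing a billion exact URL strings is not.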

Hands-on lab

Conceptual exercise (Frontera is complex to install):

  1. Sketch out the frontier abstraction for a 1M-URL crawl. What stores URLs? What dedup? What priority?
  2. Compare your sketch to scrapy-redis (what's missing?) and Frontera (what's added?).
  3. Decide: for your real scraping projects, which level fits?

For most readers: scrapy-redis. For some: Frontera. For very few: Scrapy Cluster. Knowing where the line is saves you from over-engineering.

Quiz: check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.


What is the 'frontier' in web crawling terminology?
