Frontera and Scrapy Cluster
Frontiers: the queue-of-URLs abstraction for huge crawls. Frontera and Scrapy Cluster are the two major implementations for Python.
What you’ll learn
- Understand the 'frontier' concept for large crawls.
- Use Frontera as a backend for distributed Scrapy spiders.
- Compare Frontera, Scrapy Cluster, and scrapy-redis.
For huge crawls (millions of URLs, multi-day runs, web-scale targets), the URL queue itself becomes the central abstraction: the "frontier." Frontera and Scrapy Cluster are the two mature Python implementations.
What's a "frontier"?
In web crawling jargon, the frontier is the set of URLs known to exist but not yet fetched. It's:
- Persistent (survives restarts).
- Prioritized (some URLs are more important).
- Deduplicated (don't fetch the same URL twice).
- Politeness-aware (respect per-host limits).
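The four properties above can be sketched as a toy data structure. This is an illustration only (real frontiers persist to Redis, SQL, or HBase); the class and method names are mine, not from any library:

```python
import heapq

class Frontier:
    """Toy in-memory frontier: a prioritized, deduplicated queue of URLs.
    Illustrative only -- a real frontier also persists across restarts."""

    def __init__(self):
        self._heap = []      # (negative priority, url) min-heap -> max-priority first
        self._seen = set()   # dedup: never enqueue the same URL twice

    def add(self, url, priority=0.0):
        if url in self._seen:
            return False     # already known -> skipped (deduplication)
        self._seen.add(url)
        heapq.heappush(self._heap, (-priority, url))
        return True

    def next_batch(self, n=2):
        """Hand the crawler its next batch, highest priority first."""
        batch = []
        while self._heap and len(batch) < n:
            _, url = heapq.heappop(self._heap)
            batch.append(url)
        return batch

f = Frontier()
f.add("https://example.com/", priority=1.0)
f.add("https://example.com/about", priority=0.2)
f.add("https://example.com/", priority=9.9)   # duplicate -> ignored
print(f.next_batch())                          # highest-priority URLs first
```

Persistence and politeness are deliberately left out here; those are exactly the parts that make real frontier implementations hard.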
scrapy-redis covers a basic version. Frontera and Scrapy Cluster are designed for crawls where the frontier has special needs: billions of URLs, complex priority logic, custom backends.
Frontera
Frontera is a Python framework (originally from Scrapinghub) for managing crawl frontiers. It has a plug-in architecture:
- Backend. Where URLs are stored (Redis, HBase, SQL, etc.).
- Strategy. How URLs are prioritized.
- Manager. Coordinates the backend, strategy, and crawler.
Integration with Scrapy
```python
# settings.py (Scrapy)
SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}
# Replace Scrapy's default scheduler with Frontera's
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
# Point Scrapy at the Frontera settings module
FRONTERA_SETTINGS = 'myproject.frontera_settings'
```

```python
# frontera_settings.py
BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFOBackend'
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontier.db'  # fine for testing; use a real DB at scale
MAX_NEXT_REQUESTS = 10               # URLs handed to the spider per batch
SPIDER_LOG_CONSUMER_BATCH_SIZE = 512
```
The Frontera scheduler replaces Scrapy's default. URL discovery from the spider flows through the frontier; URL retrieval (next batch to crawl) comes back from it.
Strategies
Built-in strategies:
- BFS (breadth-first): fan out from seed URLs.
- DFS (depth-first): follow each branch deep before backtracking.
- Discovery: explore by topical similarity.
Custom strategies implement BaseCrawlingStrategy, score URLs based on your logic (e.g. "prefer pages with newer content," "prioritize high-value domains").
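The kind of scoring logic a custom strategy applies can be sketched as a plain function, independent of Frontera's API. Everything here is hypothetical: the domain list, the weights, and the freshness signal are stand-ins for whatever your crawl actually values:

```python
from urllib.parse import urlparse

# Hypothetical list of domains we care about most.
HIGH_VALUE_DOMAINS = {"news.example.com", "gov.example.org"}

def score(url, last_modified_days=None):
    """Score a URL from 0.0 (ignore) to 1.0 (fetch soon).
    Illustrates the two example policies: prefer newer content,
    prioritize high-value domains."""
    s = 0.1                                  # baseline for any known URL
    if urlparse(url).netloc in HIGH_VALUE_DOMAINS:
        s += 0.5                             # high-value domain boost
    if last_modified_days is not None and last_modified_days < 7:
        s += 0.3                             # freshness boost
    return round(min(s, 1.0), 2)

print(score("https://news.example.com/x", last_modified_days=2))  # 0.9
print(score("https://other.example/old-page"))                    # 0.1
```

In a real strategy, a function like this would run inside the `links_extracted`/`page_crawled` hooks and feed the score into the frontier's scheduling call.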
Scrapy Cluster
A different approach: scrapy-cluster (originally from IST Research) is a microservices stack for distributed Scrapy:
- Kafka for message bus.
- Redis for state/dedup.
- Scrapy spiders as consumers.
- REST API to submit crawl jobs.
Architecture:
Client (REST POST)
↓
Kafka "demand" topic
↓
N Scrapy instances pull, crawl
↓
Kafka "results" topic
↓
Result consumers (Postgres, ES, etc.)
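The flow above can be simulated in a few lines with in-process queues standing in for the Kafka topics. This is purely a model of the message flow, not Scrapy Cluster's actual API; the message fields and function names are invented for illustration:

```python
from queue import Queue

# Toy stand-ins for the two Kafka topics.
demand, results = Queue(), Queue()

def submit(url, appid="demo"):
    """What the REST API does, simplified: publish a crawl job."""
    demand.put({"url": url, "appid": appid})

def worker(fetch):
    """One Scrapy instance: pull a demand message, crawl, publish the result."""
    job = demand.get()
    results.put({"url": job["url"], "body": fetch(job["url"])})

submit("https://example.com/")
worker(lambda url: f"<html>fetched {url}</html>")   # fake fetcher
print(results.get()["url"])                          # -> https://example.com/
```

The point of the real architecture is that each arrow is a durable, horizontally scalable boundary: you add throughput by adding consumers, and Kafka absorbs bursts on both sides.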
The complexity is real but the throughput ceiling is high. Internet-scale crawls (Common Crawl-style) use architectures like this.
scrapy-redis vs Frontera vs Scrapy Cluster
| Concern | scrapy-redis | Frontera | Scrapy Cluster |
|---|---|---|---|
| Setup complexity | Low | Medium | High |
| Max scale | ~10M URLs | ~1B URLs | Web-scale |
| Priority logic | Limited | Pluggable | Pluggable |
| Backends | Redis only | Multiple (SQL, HBase) | Kafka + Redis |
| Best for | Small-medium clusters | Large priority-aware crawls | Web-scale archives |
Most production scraping never reaches Frontera, let alone Scrapy Cluster. scrapy-redis covers 90% of cases.
When you genuinely need Frontera
- Crawls with >10M frontier URLs.
- Domain-prioritization logic ("scrape news sites every hour, blogs every day").
- Resume long crawls reliably.
- Multiple priority queues per crawl.
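The domain-prioritization case ("news every hour, blogs every day") reduces to a schedule keyed by next-due time. A minimal sketch, with the revisit budgets and domain names as made-up examples:

```python
import heapq

# Hypothetical revisit budgets, in seconds, per domain class.
REVISIT = {"news": 3600, "blog": 86400}   # news hourly, blogs daily

heap = []                                  # (next_due_time, domain, kind)
for domain, kind in [("cnn.example", "news"), ("myblog.example", "blog")]:
    heapq.heappush(heap, (0, domain, kind))

def next_fetch(now):
    """Pop the most overdue domain and reschedule it per its budget."""
    due, domain, kind = heapq.heappop(heap)
    heapq.heappush(heap, (now + REVISIT[kind], domain, kind))
    return domain

print(next_fetch(now=0))   # both due at t=0; heap order decides the tie
```

Frontera lets you express this inside a strategy; with scrapy-redis you would bolt it on yourself, which is exactly the kind of accidental frontier the comparison table warns about.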
When you don't:
- "Crawl this catalogue of 100k products." scrapy-redis is simpler.
- One-shot crawls.
Politeness in Frontera
Frontera understands per-host budgets:
```python
DELAY_ON_EMPTY = 5.0   # seconds to sleep when the queue is empty
HOST_LIMIT = 50_000    # max URLs per host
HOST_DELAY_ENFORCE_BY_DOMAIN = True
```
Useful for crawls spanning thousands of domains, where each host gets its own rate.
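What "each host has its own rate" means mechanically can be shown with a small per-host budget tracker. A toy version of what the frontier enforces; the class and defaults are illustrative, not Frontera code:

```python
import time
from collections import defaultdict

class HostBudget:
    """Per-host politeness: at most `limit` fetches total and at most
    one fetch per `delay` seconds for each host."""

    def __init__(self, limit=50_000, delay=1.0):
        self.limit, self.delay = limit, delay
        self.count = defaultdict(int)                   # fetches so far per host
        self.last = defaultdict(lambda: float("-inf"))  # last fetch time per host

    def allow(self, host, now=None):
        now = time.monotonic() if now is None else now
        if self.count[host] >= self.limit:
            return False            # host budget exhausted
        if now - self.last[host] < self.delay:
            return False            # too soon: respect the per-host delay
        self.count[host] += 1
        self.last[host] = now
        return True

b = HostBudget(limit=2, delay=1.0)
print(b.allow("example.com", now=0.0))   # True
print(b.allow("example.com", now=0.5))   # False: within the delay window
print(b.allow("example.com", now=2.0))   # True
print(b.allow("example.com", now=9.0))   # False: limit of 2 reached
```

Because the frontier owns this state, every worker in the cluster sees the same budgets; putting the logic in individual crawlers would let N workers hit a host N times as fast.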
Custom backends
For huge frontiers, SQLite isn't enough. Frontera supports SQL backends (Postgres) and HBase. For multi-million-URL frontiers, an HBase backend on a small cluster is common: Frontera reads URLs in batches, marks them in-flight, and marks them completed.
```python
BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'
HBASE_THRIFT_HOST = 'hbase.local'
HBASE_THRIFT_PORT = 9090
```
The frontier becomes its own service, backed by a real database, accessible from many crawler workers.
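The batch/in-flight/completed cycle is worth internalizing, because it's what makes long crawls resumable. A toy frontier service sketching the pattern (all names invented; a real backend would store this in HBase or Postgres, with lease timeouts instead of an explicit requeue call):

```python
class FrontierService:
    """Toy batch-lease frontier: hand out batches, track in-flight URLs,
    and recover the lease if a worker dies."""

    def __init__(self, urls):
        self.pending = list(urls)
        self.in_flight = set()
        self.done = set()

    def get_batch(self, n):
        batch, self.pending = self.pending[:n], self.pending[n:]
        self.in_flight.update(batch)      # leased to a worker
        return batch

    def complete(self, url):
        self.in_flight.discard(url)
        self.done.add(url)

    def requeue_stale(self):
        """If a worker dies, its leased URLs go back to pending."""
        self.pending.extend(sorted(self.in_flight))
        self.in_flight.clear()

svc = FrontierService(["u1", "u2", "u3"])
batch = svc.get_batch(2)     # worker leases u1 and u2
svc.complete(batch[0])       # u1 finished...
svc.requeue_stale()          # ...but the worker crashed before u2
print(svc.pending)           # u2 is back in the queue, nothing is lost
```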
Modern alternatives
The landscape has shifted since Frontera's heyday (~2015-2018):
- Lighter custom frontiers + Kafka for new projects.
- Cloud-native crawlers (Apify, Bright Data crawler hosting).
- Scrapy + scrapy-redis + ad-hoc priority logic for medium scale.
Frontera and Scrapy Cluster are still maintained but used less than they were. For most projects, scrapy-redis + custom Redis-backed priorities + Symfony Messenger or Celery is the pragmatic stack.
What to learn from these tools
Even if you never deploy Frontera, the concepts apply:
- Frontier as a service, separated from crawler.
- Priority queues as first-class.
- Dedup at scale via Bloom filters or HBase row keys.
- Per-host budgets enforced by the frontier, not the crawler.
These patterns appear in any large crawl, built explicitly or accidentally.
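Of the patterns above, Bloom-filter dedup is the easiest to demystify with code. A minimal implementation (sized arbitrarily for illustration): constant memory, no false negatives, a tunable false-positive rate.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for URL dedup. `size` bits, `hashes` probes
    per URL; real deployments size these from the expected URL count."""

    def __init__(self, size=8192, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size // 8)

    def _positions(self, url):
        # Derive `hashes` independent bit positions from one URL.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        # All probe bits set -> "probably seen"; any bit clear -> definitely new.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(url))

bf = BloomFilter()
bf.add("https://example.com/")
print("https://example.com/" in bf)    # True: no false negatives
print("https://example.com/x" in bf)   # almost certainly False
```

The trade-off is that a "yes" is only probabilistic, which is fine for a frontier: a false positive just means one URL is never fetched, and the rate is tunable by spending more bits.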
Hands-on lab
Conceptual exercise (Frontera is complex to install):
- Sketch out the frontier abstraction for a 1M-URL crawl. What stores URLs? What dedup? What priority?
- Compare your sketch to scrapy-redis (what's missing?) and Frontera (what's added?).
- Decide: for your real scraping projects, which level fits?
For most readers: scrapy-redis. For some: Frontera. For very few: Scrapy Cluster. Knowing where the line is saves you from over-engineering.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.