Symfony Lock and Rate Limiter for Polite Scraping
Two Symfony components that turn 'be polite to the target' from intention into enforcement. Distributed locks for one-scraper-per-domain; rate limiters for request-per-second caps.
What you’ll learn
- Use Symfony Lock to ensure only one scraper instance runs at a time.
- Apply RateLimiter to throttle requests per domain.
- Honour robots.txt programmatically alongside these controls.
Politeness is not a feeling. It's enforcement. Two Symfony components turn "we should be polite" into code: Lock ensures only one scraper runs at a time per target; RateLimiter caps requests per second.
The Lock component
A lock prevents concurrent execution. Two flavors matter for scraping:
- Per-machine locks: `flock` over a file. Cheap, simple.
- Distributed locks: Redis, Postgres, Memcached. Survive across machines.
use Symfony\Component\Lock\LockFactory;
use Symfony\Component\Lock\Store\RedisStore;

// $redis is a connected \Redis (or Predis) client
$store = new RedisStore($redis);
$factory = new LockFactory($store);

$lock = $factory->createLock('scrape-catalog108', ttl: 3600);

if (!$lock->acquire()) {
    return; // another instance is running
}

try {
    // do the scrape
} finally {
    $lock->release();
}
ttl: 3600 is the safety net: if the lock holder crashes, Redis expires the key after an hour and another worker can take over. Choose a TTL longer than the expected scrape duration.
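If the duration is hard to predict, a shorter TTL plus periodic refresh() is safer than one huge TTL. A minimal sketch ($pages and fetchPage() are illustrative):

$lock = $factory->createLock('scrape-catalog108', ttl: 600);

if ($lock->acquire()) {
    try {
        foreach ($pages as $page) {
            $this->fetchPage($page); // hypothetical per-page fetch
            $lock->refresh();        // push the TTL back out to 600s
        }
    } finally {
        $lock->release();
    }
}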
Symfony configuration
# config/packages/lock.yaml
framework:
    lock:
        scraping: '%env(REDIS_URL)%'
Each named store gets its own LockFactory service, autowirable by argument name. Inject:
public function __construct(
    private LockFactory $scrapingLockFactory,
) {}
When to use which lock
| Lock type | Use when |
|---|---|
| flock (file) | Single machine; same Unix host |
| Redis | Multiple machines, same Redis cluster |
| Doctrine (Postgres) | You already have Postgres; no extra infra |
| Combined (lock.factory) | Auto-pick based on configured stores |
For distributed scraping, Redis is the default. Postgres works fine if Redis isn't around.
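And for the single-machine case, the file store needs no infrastructure at all; a minimal sketch:

use Symfony\Component\Lock\LockFactory;
use Symfony\Component\Lock\Store\FlockStore;

// lock files live in the system temp dir by default
$factory = new LockFactory(new FlockStore());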
The RateLimiter component
Caps how often an operation can happen. Symfony ships several policies out of the box; two matter for scraping:
- Fixed window: N requests per period. Simple, but allows bursts at window edges.
- Token bucket: N tokens replenished at rate R. Smoother.
Configure:
framework:
    rate_limiter:
        catalog108_scrape:
            policy: 'token_bucket'
            limit: 60
            rate: { interval: '1 minute', amount: 60 }
Inject:
public function __construct(
    private RateLimiterFactory $catalog108ScrapeLimiter,
) {}

public function fetch(string $url): string
{
    $limiter = $this->catalog108ScrapeLimiter->create('catalog108');
    $limiter->reserve()->wait(); // blocks until a token is available

    return $this->http->request('GET', $url)->getContent();
}
reserve()->wait() reserves a token and blocks for as long as needed to honour it. (consume()->wait() only sleeps until a retry is possible; it doesn't hold a token.) For non-blocking, call consume() and check isAccepted(), then either retry later or re-queue.
Per-domain limiters
Different domains, different limits:
$limiter = $this->factory->create($domain);
The key passed to create() namespaces the tokens: one bucket per host, all sharing the same configured policy and backing storage.
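Concretely, with the catalog108_scrape factory injected earlier (the 'unknown-host' fallback is illustrative):

$host = parse_url($url, PHP_URL_HOST) ?: 'unknown-host';

// independent buckets: throttling example.com doesn't slow example.org
$this->catalog108ScrapeLimiter->create($host)->reserve()->wait();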
Wait vs queue
For Messenger handlers, blocking via wait() is usually wrong: you're tying up a worker. Better: check whether a token is available; if not, re-dispatch the message with a delay.
use Symfony\Component\Messenger\Stamp\DelayStamp;

public function __invoke(ScrapeProductMessage $msg): void
{
    $limiter = $this->factory->create($this->host($msg->url));

    if (!$limiter->consume()->isAccepted()) {
        $this->bus->dispatch(
            $msg,
            [new DelayStamp(2000)], // try again in 2 seconds
        );

        return;
    }

    // do the fetch
}
This pattern keeps workers free for other work while honouring the limit.
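A refinement: rather than a fixed two-second delay, ask the limiter when the next token is due. A sketch using RateLimit::getRetryAfter(), inside the same handler:

$limit = $limiter->consume();

if (!$limit->isAccepted()) {
    // delay exactly until the next token becomes available
    $waitMs = max(0, ($limit->getRetryAfter()->getTimestamp() - time()) * 1000);
    $this->bus->dispatch($msg, [new DelayStamp($waitMs)]);

    return;
}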
robots.txt: read it, enforce it
There's no first-party Symfony component for robots.txt, but it's a few lines with spatie/robots-txt:
use Spatie\Robots\Robots;

$url = 'https://practice.scrapingcentral.com/account/dashboard';

$robots = Robots::create('CatalogScraper/1.0');

if (!$robots->mayIndex($url)) {
    $this->logger->info('robots.txt disallows; skipping', ['url' => $url]);

    return;
}
Cache the robots.txt fetch per domain (it doesn't change often). A small middleware-style decorator on HttpClient can enforce it automatically.
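Such a decorator might look like this. A minimal sketch, assuming spatie/robots-txt for parsing; the class name RobotsAwareHttpClient and the in-process array cache are illustrative (a production version would use a PSR-6 cache with a TTL):

use Spatie\Robots\RobotsTxt;
use Symfony\Contracts\HttpClient\HttpClientInterface;
use Symfony\Contracts\HttpClient\ResponseInterface;
use Symfony\Contracts\HttpClient\ResponseStreamInterface;

final class RobotsAwareHttpClient implements HttpClientInterface
{
    /** @var array<string, RobotsTxt> parsed robots.txt, keyed by host */
    private array $robotsCache = [];

    public function __construct(
        private HttpClientInterface $inner,
        private string $userAgent = 'CatalogScraper/1.0',
    ) {
    }

    public function request(string $method, string $url, array $options = []): ResponseInterface
    {
        $host = parse_url($url, PHP_URL_HOST);
        $scheme = parse_url($url, PHP_URL_SCHEME) ?: 'https';

        // fetch robots.txt once per host for the lifetime of this process
        $this->robotsCache[$host] ??= RobotsTxt::create(
            $this->inner->request('GET', "$scheme://$host/robots.txt")->getContent(false)
        );

        if (!$this->robotsCache[$host]->allows($url, $this->userAgent)) {
            throw new \RuntimeException("robots.txt disallows $url");
        }

        return $this->inner->request($method, $url, $options);
    }

    public function stream($responses, ?float $timeout = null): ResponseStreamInterface
    {
        return $this->inner->stream($responses, $timeout);
    }

    public function withOptions(array $options): static
    {
        // delegate option handling to the wrapped client
        $clone = clone $this;
        $clone->inner = $this->inner->withOptions($options);

        return $clone;
    }
}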
Combining the three
A polite scrape combines all three:
public function scrape(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);

    // 1. robots.txt
    if (!$this->robots->mayIndex($url)) {
        return null;
    }

    // 2. per-host rate limit
    $limiter = $this->limiterFactory->create($host);
    $limiter->reserve()->wait();

    // 3. per-domain lock (one scraper at a time)
    $lock = $this->lockFactory->createLock("scrape-$host", ttl: 3600);
    if (!$lock->acquire(blocking: true)) {
        return null;
    }

    try {
        return $this->http->request('GET', $url)->getContent();
    } finally {
        $lock->release();
    }
}
This wrapper is the contract: it respects robots.txt, respects the rate limit, and ensures only one scraper of this host runs at a time. A few lines of orchestration with serious safety properties.
Rate limit signaling: the 429 loop
If a target returns 429 with a Retry-After header, that's the source of truth. Override your limiter for that window:
if ($response->getStatusCode() === 429) {
    // pass false so HttpClient doesn't throw on the 4xx status
    $headers = $response->getHeaders(false);
    $retryAfter = (int) ($headers['retry-after'][0] ?? 60);
    sleep($retryAfter); // or re-queue with a DelayStamp
}
The target is telling you to slow down. Listen.
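One wrinkle: Retry-After may be a number of seconds or an HTTP-date. A small helper covers both forms (the name retryAfterSeconds is illustrative):

private function retryAfterSeconds(string $value): int
{
    if (ctype_digit($value)) {
        return (int) $value; // delta-seconds form
    }

    // HTTP-date form, e.g. "Wed, 21 Oct 2025 07:28:00 GMT"
    $ts = strtotime($value);

    return $ts !== false ? max(0, $ts - time()) : 60;
}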
When not to use these
For one-off scripts: skip. Lock and RateLimiter are infrastructure. A 50-line script doesn't need them.
For low-volume scrapers running once a day: rate-limiting is overkill. Just sleep a second between requests.
For systems that absolutely cannot block on locks: design for at-least-once delivery and use UPSERT for idempotency instead of mutual exclusion.
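For example, a Postgres upsert makes a duplicate scrape harmless instead of dangerous. A sketch, assuming a Doctrine DBAL connection and a scraped_products table (both hypothetical):

// re-running the same scrape overwrites rather than duplicates
$this->connection->executeStatement(
    'INSERT INTO scraped_products (url, payload, scraped_at)
     VALUES (:url, :payload, NOW())
     ON CONFLICT (url) DO UPDATE
     SET payload = EXCLUDED.payload, scraped_at = NOW()',
    ['url' => $url, 'payload' => $payload],
);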
Hands-on lab
In your Symfony project:
- Configure a `catalog108_scrape` rate limiter at 30 req/min.
- Configure a Redis-backed lock store.
- Wrap HttpClient calls in a service that: checks robots.txt, consumes a rate-limit token, acquires a per-host lock.
- Run two `messenger:consume` workers in parallel. Watch them coordinate: only one should fetch from the target at a time.
That coordination, achieved without any scraper-specific code, is what Symfony's infrastructure components give you.
Practice this lesson on Catalog108, our first-party scraping sandbox.