Rate Limiting in Web Scraping - How to Be Polite
Learn how to implement rate limiting in your web scrapers. Covers delays, backoff strategies, and respectful scraping practices.
Aggressive scraping gets you blocked and can harm target servers. Smart rate limiting keeps your scrapers running reliably while being respectful.
Why Rate Limiting Matters
- Avoid IP bans: most sites block IPs that send too many requests
- Prevent server overload: excessive traffic can degrade the target site
- Legal protection: polite scraping is less likely to attract legal attention
- Better data quality: rushed scraping leads to missed pages and errors
Basic Delay Implementation
import random
import time

import requests


def scrape_with_delay(urls, min_delay=1, max_delay=3):
    """Fetch each URL in turn, pausing a random interval between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Random delay between requests (no wait before the first one)
            time.sleep(random.uniform(min_delay, max_delay))
        resp = requests.get(url, timeout=10)
        results.append(resp.text)
    return results
Exponential Backoff
When you get rate-limited (HTTP 429), back off exponentially.
import random
import time

import requests


def fetch_with_backoff(url, max_retries=5):
    """Retry on HTTP 429, doubling the wait (plus jitter) after each attempt."""
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 429:
            # Raise on other 4xx/5xx errors; otherwise the request succeeded
            resp.raise_for_status()
            return resp
        # Waits of about 1s, 2s, 4s, ... plus up to 1s of random jitter
        wait_time = (2 ** attempt) + random.uniform(0, 1)
        print(f"Rate limited. Waiting {wait_time:.1f}s...")
        time.sleep(wait_time)
    raise RuntimeError(f"Failed after {max_retries} retries")
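A 429 response often carries a Retry-After header that says how long to wait, either as a number of seconds or as an HTTP date. The sketch below shows one way to read it and fall back to the computed backoff when the header is missing or unparseable. The helper name `retry_after_seconds` and its `fallback` and `now` parameters are assumptions made for illustration, not part of `requests`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, fallback, now=None):
    """Return how many seconds to wait, based on a Retry-After header value.

    header_value: raw header string, or None if the header is absent.
    fallback: delay to use when the header is missing or cannot be parsed.
    now: current time as an aware datetime (injectable for testing).
    """
    if not header_value:
        return fallback
    value = header_value.strip()
    # Form 1: a delay in whole seconds, e.g. "120"
    if value.isdigit():
        return float(value)
    # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return fallback
    if when.tzinfo is None:
        # Treat a date without a timezone as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Inside `fetch_with_backoff`, the wait could then be chosen with `retry_after_seconds(resp.headers.get("Retry-After"), wait_time)`, so the server's hint takes priority over the exponential schedule.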
Respecting robots.txt
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/page"):
    # Allowed by robots.txt
    crawl_delay = rp.crawl_delay("*")  # None if no Crawl-delay directive is set
    if crawl_delay is not None:
        print(f"Crawl delay: {crawl_delay}s")
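The crawl delay from robots.txt can feed straight into the scraper's sleep interval. The sketch below parses an inline robots.txt body (made up for illustration, so nothing is fetched over the network) and picks a delay. The helper `polite_delay` and its `default_delay` parameter are hypothetical names.

```python
from urllib.robotparser import RobotFileParser


def polite_delay(rp, user_agent="*", default_delay=2.0):
    """Use the site's Crawl-delay if one is declared, else default_delay."""
    delay = rp.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_delay


# Example robots.txt content, invented for this sketch
robots_txt = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse() takes an iterable of lines

print(rp.can_fetch("*", "https://example.com/page"))       # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
print(polite_delay(rp))                                    # 5.0
```

The returned value can be passed as `min_delay` to a function like `scrape_with_delay` so the scraper never goes faster than the site asks.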
Rate Limiting Strategies
| Strategy | Implementation | Best For |
|---|---|---|
| Fixed delay | time.sleep(2) | Simple scrapers |
| Random delay | time.sleep(random.uniform(1, 3)) | Most use cases |
| Exponential backoff | Double delay on each retry | Handling 429 errors |
| Token bucket | Allow N requests per minute | Production scrapers |
| Adaptive | Slow down when errors increase | Large-scale crawling |
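The adaptive strategy in the last row can be sketched as a delay that grows when requests fail and shrinks slowly while they succeed. The class name `AdaptiveDelay`, the multipliers, and the choice to treat 429 and 5xx responses as errors are all assumptions for illustration; real crawlers tune these per site.

```python
class AdaptiveDelay:
    """Delay that grows on errors and decays gradually on successes."""

    def __init__(self, base=1.0, max_delay=60.0, increase=2.0, decrease=0.9):
        self.base = base            # floor: never go faster than this
        self.max_delay = max_delay  # ceiling: never wait longer than this
        self.increase = increase    # multiplier applied after an error
        self.decrease = decrease    # multiplier applied after a success
        self.current = base

    def record(self, status_code):
        """Update the delay from a response status and return the new value."""
        if status_code == 429 or status_code >= 500:
            self.current = min(self.max_delay, self.current * self.increase)
        else:
            self.current = max(self.base, self.current * self.decrease)
        return self.current


delay = AdaptiveDelay(base=1.0, max_delay=8.0)
for status in (200, 429, 429, 503, 503, 200):
    print(status, delay.record(status))
```

After each response, the scraper would sleep for the value returned by `record()` before sending the next request.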
Token Bucket Rate Limiter
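import time
from threading import Lock


class RateLimiter:
    """Simplified limiter: enforces a minimum interval between requests
    (equivalent to a token bucket that holds a single token)."""

    def __init__(self, requests_per_second=1):
        self.rate = requests_per_second
        self.last_request = float("-inf")  # so the first call never waits
        self.lock = Lock()

    def wait(self):
        # Holding the lock while sleeping serializes callers across threads
        with self.lock:
            now = time.monotonic()  # monotonic clock is immune to system time changes
            elapsed = now - self.last_request
            wait_time = max(0, (1 / self.rate) - elapsed)
            time.sleep(wait_time)
            self.last_request = time.monotonic()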
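The class above spaces requests evenly. A token bucket in the fuller sense also allows short bursts: tokens build up to a fixed capacity while the scraper is idle, and each request spends one. The following is a minimal sketch of that idea, not a production implementation. The `TokenBucket` name, the `capacity` parameter, and the injectable `clock` and `sleep` arguments are assumptions added for illustration and testing.

```python
import time
from threading import Lock


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate=1.0, capacity=5, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock
        self.sleep = sleep
        self.updated = clock()
        self.lock = Lock()

    def _refill(self):
        # Add tokens for the time elapsed since the last update, capped at capacity
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def acquire(self):
        """Block until one token is available, then consume it."""
        with self.lock:
            self._refill()
            while self.tokens < 1:
                # Wait just long enough for the missing fraction of a token
                self.sleep((1 - self.tokens) / self.rate)
                self._refill()
            self.tokens -= 1
```

Calling `bucket.acquire()` before each `requests.get(...)` lets the first `capacity` requests go out immediately and then settles to the configured rate.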
Let ScraperAPI Handle It
Managed services such as ScraperAPI and ScrapingAnt handle rate limiting for you. They distribute requests across their proxy pools and pace them for each target site.
Best Practices
- Start slow: begin with 1 request every 2-3 seconds
- Check robots.txt for crawl delay guidelines
- Monitor HTTP 429 responses: they mean you are going too fast
- Use random delays: fixed intervals look robotic
- Scrape during off-peak hours: less load on the server, fewer blocks
- Log your request rate: track requests per minute for debugging
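For the last point, a sliding-window counter is one simple way to track requests per minute. This is a sketch only; the `RequestRateTracker` class and its injectable `clock` parameter are hypothetical names chosen for illustration.

```python
import time
from collections import deque


class RequestRateTracker:
    """Count the requests seen within a sliding time window."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self.timestamps = deque()

    def record(self):
        """Note one request and return the count within the current window."""
        now = self.clock()
        self.timestamps.append(now)
        # Drop timestamps that have aged out of the window
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps)
```

Calling `tracker.record()` after each request and writing the result to a log, for example with `logging.info("requests in last minute: %d", count)`, gives a running view of how fast the scraper is actually going.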