Polite Scraping, robots.txt, Delays, Rate Limits
Stay welcome on the sites you scrape. Respect robots.txt, throttle yourself, identify cleanly, and recognise when you're being told to slow down.
What you’ll learn
- Read and respect `robots.txt`.
- Implement per-domain rate limits with delays and jitter.
- Recognise and honour `Retry-After` and `429` responses.
- Identify your scraper via User-Agent and a contact email.
A scraper's relationship with the target site is parasitic at worst, symbiotic at best. Politeness is what keeps you in the second category. This lesson is about being a good citizen technically; the legal and ethical layers come later in the curriculum.
robots.txt, what it is, what it does
robots.txt is a plaintext file at the root of a site (/robots.txt) declaring which paths automated agents are allowed to access:
User-agent: *
Disallow: /admin
Disallow: /private/
Crawl-delay: 1
User-agent: Googlebot
Allow: /
Sitemap: https://practice.scrapingcentral.com/sitemap.xml
The directives you'll see:
- `User-agent:` names the bot the rules apply to; `*` means everyone.
- `Disallow:` lists paths the bot must not request (matched by path prefix).
- `Allow:` declares explicit exceptions to `Disallow` rules.
- `Crawl-delay:` suggests a number of seconds between requests (informal; not all sites publish it, and not all crawlers honour it).
- `Sitemap:` declares sitemap URLs.
robots.txt is advisory, not enforced. Ignoring it isn't a crime in most jurisdictions, but it's the clearest signal the site has given about its preferences. Honour it unless you have a compelling reason.
Parsing robots.txt in Python
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url("https://practice.scrapingcentral.com/robots.txt")
rp.read()
allowed = rp.can_fetch("MyScraper/1.0", "https://practice.scrapingcentral.com/admin/")
print("Allowed:", allowed) # False if Disallow: /admin
Stdlib, no install needed. For more advanced parsing (caching, async fetch), the protego library is a richer alternative.
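The same parser can also be fed rules you've already downloaded (or a test fixture) via `parse()`, which keeps unit tests off the network. A small sketch, reusing rules from the example above:

```python
from urllib.robotparser import RobotFileParser

# parse() takes an iterable of lines, so no network fetch is needed
ROBOTS = """\
User-agent: *
Disallow: /admin
Disallow: /private/
Crawl-delay: 1
""".splitlines()

rp = RobotFileParser()
rp.parse(ROBOTS)

print(rp.can_fetch("MyScraper/1.0", "/admin/users"))  # False
print(rp.can_fetch("MyScraper/1.0", "/products"))     # True
print(rp.crawl_delay("MyScraper/1.0"))                # 1
```

Handy for checking your scraper's skip logic against a fixed robots.txt before pointing it at a live site.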
Parsing in PHP
require 'vendor/autoload.php';
use Sokil\Robots\Parser;
$parser = new Parser();
$parser->parse(file_get_contents('https://practice.scrapingcentral.com/robots.txt'));
$allowed = $parser->isAllowed('MyScraper/1.0', '/admin/');
Or just parse it yourself; robots.txt is simple enough that a 30-line PHP function handles most cases.
Delays between requests
The single most important politeness measure: don't slam the server.
import time
for url in urls:
fetch(url)
time.sleep(1.0)
Crude but effective. A one-second delay is a good default for unknown sites; 0.5 s is fine for most. An aggressive scrape can go up to 5 req/s if the target clearly handles it; machine-gun pace (no delay at all) is asking for a ban.
Better: token bucket / leaky bucket
For finer control (especially when scraping in parallel):
import time
from collections import deque
class RateLimiter:
def __init__(self, max_per_period, period):
self.max = max_per_period
self.period = period
self.calls = deque()
def acquire(self):
now = time.time()
# Drop expired entries
while self.calls and now - self.calls[0] > self.period:
self.calls.popleft()
if len(self.calls) >= self.max:
sleep_for = self.period - (now - self.calls[0])
time.sleep(sleep_for)
return self.acquire()
self.calls.append(now)
rate = RateLimiter(max_per_period=10, period=60)
for url in urls:
rate.acquire()
fetch(url)
This caps requests at no more than 10 in any 60-second window, a sliding-window relative of the token bucket; it's smoother than fixed delays and more honest with the server.
For more sophisticated rate limiting (per-host, async-aware), the aiolimiter and ratelimit libraries are good.
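A classic token bucket additionally allows short bursts up to a fixed capacity, and keeping one bucket per hostname gives the per-domain limits mentioned at the start of the lesson. A sketch (the class names are my own, not from any library):

```python
import time
from urllib.parse import urlsplit

class TokenBucket:
    """Allows bursts up to `capacity` requests; refills at `rate` tokens/second."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        while True:
            now = time.monotonic()
            # Refill based on elapsed time, capped at capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the next token to arrive
            time.sleep((1 - self.tokens) / self.rate)

class PerDomainLimiter:
    """One independent bucket per hostname, so one slow site doesn't throttle the rest."""
    def __init__(self, rate=1.0, capacity=2):
        self.rate, self.capacity, self.buckets = rate, capacity, {}

    def acquire(self, url):
        host = urlsplit(url).netloc
        bucket = self.buckets.setdefault(host, TokenBucket(self.rate, self.capacity))
        bucket.acquire()
```

Call `limiter.acquire(url)` before each fetch; requests to different hosts never block each other.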
Add jitter
Always sleep with a small random jitter:
import random
time.sleep(1.0 + random.uniform(0, 0.3))
This prevents synchronized request waves (multiple scrapers, multiple workers) from arriving simultaneously and looking like a coordinated attack.
Honour 429 and Retry-After
When the server says "slow down":
import requests, time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

r = requests.get(url)
if r.status_code == 429:
    ra = r.headers.get("Retry-After", "60")
    try:
        wait = float(ra)  # delta-seconds form
    except ValueError:    # HTTP-date form: seconds until that date
        wait = (parsedate_to_datetime(ra) - datetime.now(timezone.utc)).total_seconds()
    time.sleep(max(0.0, wait))  # then retry
Ignoring 429 is the surest way to get IP-banned. The server is being polite by warning you; reciprocate. Note that `Retry-After` can be either a number of seconds or an HTTP date; handle both, or fall back to a sensible default.
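When a 429 arrives with no `Retry-After` header at all, a common fallback is exponential backoff with full jitter: wait a random time between zero and a doubling cap. A minimal sketch (the helper name is illustrative, not a library function):

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=5):
    """Yield one delay per retry: uniform in [0, min(cap, base * 2**i)]."""
    for i in range(attempts):
        yield random.uniform(0.0, min(cap, base * 2 ** i))

for attempt, delay in enumerate(backoff_delays()):
    print(f"retry {attempt}: sleeping up to {delay:.2f}s")
    # time.sleep(delay)  # then re-issue the request
```

Full jitter (randomising over the whole window, not just adding a little noise) also stops multiple workers from retrying in lockstep.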
Identify yourself
A custom User-Agent with contact info is the polite default:
UA = "MyResearchScraper/1.0 (+mailto:alice@example.com)"
The (+mailto:...) syntax is convention. If the site owner has a complaint or wants to whitelist you, they can email. Anonymous mass scraping is the noisiest and most blockable option.
Pair this with honesty: don't claim to be a browser if you're not. The honest path is "I'm a scraper, here's how to reach me." For sites that explicitly welcome scrapers (most public datasets), this often earns more generous rate limits than disguising yourself as a browser.
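In practice that means setting the header once on your session; the standard `From` request header is another conventional place for a contact address. A sketch (the UA string and addresses are placeholders):

```python
import requests

session = requests.Session()
session.headers["User-Agent"] = "MyResearchScraper/1.0 (+mailto:alice@example.com)"
session.headers["From"] = "alice@example.com"  # standard HTTP header for a contact address

# Every request made through this session now carries both headers.
```

Setting headers at the session level keeps individual `session.get(...)` calls clean and guarantees you never fire an anonymous request by accident.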
A polite scraper boilerplate
import time, random, requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
class PoliteScraper:
def __init__(self, base_url, min_delay=1.0, ua=None, contact=None):
self.base_url = base_url
self.min_delay = min_delay
self.last_request = 0
self.session = requests.Session()
self.session.headers["User-Agent"] = ua or f"PoliteScraper/1.0 (+{contact or 'noreply@example.com'})"
retry = Retry(total=4, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503, 504],
respect_retry_after_header=True)
self.session.mount("https://", HTTPAdapter(max_retries=retry))
def get(self, path, **kwargs):
elapsed = time.time() - self.last_request
if elapsed < self.min_delay:
time.sleep(self.min_delay - elapsed + random.uniform(0, 0.2))
r = self.session.get(self.base_url + path, timeout=10, **kwargs)
self.last_request = time.time()
return r
This gives you an identifiable UA, retries with backoff, automatic respect for Retry-After, and a throttled inter-request delay with jitter: the shape every production scraper should have.
When ignoring robots.txt is defensible
Three legitimate scenarios:
- You're scraping data the operator publishes elsewhere openly, e.g. a JSON API they document but robots.txt excludes by overreach.
- The robots.txt is obviously stale or misconfigured, e.g. it disallows everything to every bot even though the site clearly courts search traffic.
- You're doing genuinely benign research: a single-shot academic analysis at low volume.
Even in these cases, throttling, identifying yourself, and being prepared to stop on request are the right defaults. "I read robots.txt and chose not to honour it because [reason]" is a defensible position; "I never read it" is not.
Hands-on lab
Fetch /robots.txt. Parse it with urllib.robotparser (or your PHP equivalent). For 10 sample URLs (some allowed, some Disallowed), verify your scraper would skip the disallowed ones. Then implement a simple per-URL delay and verify by timing 5 requests that they're spaced as expected.
Hands-on lab
Practice this lesson on Catalog108, our first-party scraping sandbox.
Open lab target → /robots.txt
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.