Polite Scraping, robots.txt, Delays, Rate Limits
Stay welcome on the sites you scrape. Respect robots.txt, throttle yourself, identify cleanly, and recognise when you're being told to slow down.
What you’ll learn
- Read and respect `robots.txt`.
- Implement per-domain rate limits with delays and jitter.
- Recognise and honour `Retry-After` and `429` responses.
- Identify your scraper via User-Agent and a contact email.
A scraper's relationship with the target site is parasitic at worst, symbiotic at best. Politeness is what keeps you in the second category. This lesson is about being a good citizen technically; the legal and ethical layers come later in the curriculum.
robots.txt, what it is, what it does
robots.txt is a plaintext file at the root of a site (/robots.txt) declaring which paths automated agents are allowed to access:
User-agent: *
Disallow: /admin
Disallow: /private/
Crawl-delay: 1
User-agent: Googlebot
Allow: /
Sitemap: https://practice.scrapingcentral.com/sitemap.xml
The directives you'll see:
- `User-agent:` names the bot the rules apply to; `*` means everyone.
- `Disallow:` lists paths the bot must not request (matched by path prefix).
- `Allow:` declares explicit exceptions to `Disallow` rules.
- `Crawl-delay:` suggests a number of seconds between requests (informal; not all sites publish it, and not all crawlers honour it).
- `Sitemap:` declares sitemap URLs.
robots.txt is advisory, not enforced. Ignoring it isn't a crime in most jurisdictions, but it's the clearest signal the site has given about its preferences. Honour it unless you have a compelling reason.
Parsing robots.txt in Python
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url("https://practice.scrapingcentral.com/robots.txt")
rp.read()
allowed = rp.can_fetch("MyScraper/1.0", "https://practice.scrapingcentral.com/admin/")
print("Allowed:", allowed) # False if Disallow: /admin
Stdlib, no install needed. For more advanced parsing (caching, async fetch), the protego library is a richer alternative.
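The same parser can also be fed rules you've already downloaded (or a test fixture) via `parse()`, which keeps unit tests off the network. A small sketch, reusing rules from the example above:

```python
from urllib.robotparser import RobotFileParser

# parse() takes an iterable of lines, so no network fetch is needed
ROBOTS = """\
User-agent: *
Disallow: /admin
Disallow: /private/
Crawl-delay: 1
""".splitlines()

rp = RobotFileParser()
rp.parse(ROBOTS)

print(rp.can_fetch("MyScraper/1.0", "/admin/users"))  # False
print(rp.can_fetch("MyScraper/1.0", "/products"))     # True
print(rp.crawl_delay("MyScraper/1.0"))                # 1
```

Handy for checking your scraper's skip logic against a fixed robots.txt before pointing it at a live site.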
Parsing in PHP
require 'vendor/autoload.php';
use Sokil\Robots\Parser;
$parser = new Parser();
$parser->parse(file_get_contents('https://practice.scrapingcentral.com/robots.txt'));
$allowed = $parser->isAllowed('MyScraper/1.0', '/admin/');
Or just parse it yourself; robots.txt is simple enough that a 30-line PHP function handles most cases.
Delays between requests
The single most important politeness measure: don't slam the server.
import time
for url in urls:
fetch(url)
time.sleep(1.0)
Crude but effective. A one-second delay is a good default for unknown sites; 0.5 s is fine for most. An aggressive scrape can go up to 5 req/s if the target clearly handles it; machine-gun pace (no delay at all) is asking for a ban.
Better: token bucket / leaky bucket
For finer control (especially when scraping in parallel):
import time
from collections import deque
class RateLimiter:
def __init__(self, max_per_period, period):
self.max = max_per_period
self.period = period
self.calls = deque()
def acquire(self):
now = time.time()
# Drop expired entries
while self.calls and now - self.calls[0] > self.period:
self.calls.popleft()
if len(self.calls) >= self.max:
sleep_for = self.period - (now - self.calls[0])
time.sleep(sleep_for)
return self.acquire()
self.calls.append(now)
rate = RateLimiter(max_per_period=10, period=60)
for url in urls:
rate.acquire()
fetch(url)
This caps requests at no more than 10 in any 60-second window, a sliding-window relative of the token bucket; it's smoother than fixed delays and more honest with the server.
For more sophisticated rate limiting (per-host, async-aware), the aiolimiter and ratelimit libraries are good.
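A classic token bucket additionally allows short bursts up to a fixed capacity, and keeping one bucket per hostname gives the per-domain limits mentioned at the start of the lesson. A sketch (the class names are my own, not from any library):

```python
import time
from urllib.parse import urlsplit

class TokenBucket:
    """Allows bursts up to `capacity` requests; refills at `rate` tokens/second."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        while True:
            now = time.monotonic()
            # Refill based on elapsed time, capped at capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the next token to arrive
            time.sleep((1 - self.tokens) / self.rate)

class PerDomainLimiter:
    """One independent bucket per hostname, so one slow site doesn't throttle the rest."""
    def __init__(self, rate=1.0, capacity=2):
        self.rate, self.capacity, self.buckets = rate, capacity, {}

    def acquire(self, url):
        host = urlsplit(url).netloc
        bucket = self.buckets.setdefault(host, TokenBucket(self.rate, self.capacity))
        bucket.acquire()
```

Call `limiter.acquire(url)` before each fetch; requests to different hosts never block each other.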
Add jitter
Always sleep with a small random jitter:
import random
time.sleep(1.0 + random.uniform(0, 0.3))
This prevents synchronized request waves (multiple scrapers, multiple workers) from arriving simultaneously and looking like a coordinated attack.
Honour 429 and Retry-After
When the server says "slow down":
import requests, time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

r = requests.get(url)
if r.status_code == 429:
    ra = r.headers.get("Retry-After", "60")
    try:
        wait = float(ra)  # delta-seconds form
    except ValueError:    # HTTP-date form: seconds until that date
        wait = (parsedate_to_datetime(ra) - datetime.now(timezone.utc)).total_seconds()
    time.sleep(max(0.0, wait))  # then retry
Ignoring 429 is the surest way to get IP-banned. The server is being polite by warning you; reciprocate. Note that `Retry-After` can be either a number of seconds or an HTTP date; handle both, or fall back to a sensible default.
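When a 429 arrives with no `Retry-After` header at all, a common fallback is exponential backoff with full jitter: wait a random time between zero and a doubling cap. A minimal sketch (the helper name is illustrative, not a library function):

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=5):
    """Yield one delay per retry: uniform in [0, min(cap, base * 2**i)]."""
    for i in range(attempts):
        yield random.uniform(0.0, min(cap, base * 2 ** i))

for attempt, delay in enumerate(backoff_delays()):
    print(f"retry {attempt}: sleeping up to {delay:.2f}s")
    # time.sleep(delay)  # then re-issue the request
```

Full jitter (randomising over the whole window, not just adding a little noise) also stops multiple workers from retrying in lockstep.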
Identify yourself
A custom User-Agent with contact info is the polite default:
UA = "MyResearchScraper/1.0 (+mailto:alice@example.com)"
The (+mailto:...) syntax is convention. If the site owner has a complaint or wants to whitelist you, they can email. Anonymous mass scraping is the noisiest and most blockable option.
Pair this with honesty: don't claim to be a browser if you're not. The honest path is "I'm a scraper, here's how to reach me." For sites that explicitly welcome scrapers (most public datasets), this often earns more generous rate limits than disguising yourself as a browser.
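In practice that means setting the header once on your session; the standard `From` request header is another conventional place for a contact address. A sketch (the UA string and addresses are placeholders):

```python
import requests

session = requests.Session()
session.headers["User-Agent"] = "MyResearchScraper/1.0 (+mailto:alice@example.com)"
session.headers["From"] = "alice@example.com"  # standard HTTP header for a contact address

# Every request made through this session now carries both headers.
```

Setting headers at the session level keeps individual `session.get(...)` calls clean and guarantees you never fire an anonymous request by accident.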
A polite scraper boilerplate
import time, random, requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
class PoliteScraper:
def __init__(self, base_url, min_delay=1.0, ua=None, contact=None):
self.base_url = base_url
self.min_delay = min_delay
self.last_request = 0
self.session = requests.Session()
self.session.headers["User-Agent"] = ua or f"PoliteScraper/1.0 (+{contact or 'noreply@example.com'})"
retry = Retry(total=4, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503, 504],
respect_retry_after_header=True)
self.session.mount("https://", HTTPAdapter(max_retries=retry))
def get(self, path, **kwargs):
elapsed = time.time() - self.last_request
if elapsed < self.min_delay:
time.sleep(self.min_delay - elapsed + random.uniform(0, 0.2))
r = self.session.get(self.base_url + path, timeout=10, **kwargs)
self.last_request = time.time()
return r
This gives you an identifiable UA, retries with backoff, automatic respect for Retry-After, and a throttled inter-request delay with jitter: the shape every production scraper should have.
When ignoring robots.txt is defensible
Three legitimate scenarios:
- You're scraping data the operator publishes elsewhere openly, e.g. a JSON API they document but robots.txt excludes by overreach.
- The robots.txt is obviously stale or misconfigured, e.g. it disallows everything to every bot even though the site clearly courts search traffic.
- You're doing genuinely benign research: a single-shot academic analysis at low volume.
Even in these cases, throttling, identifying yourself, and being prepared to stop on request are the right defaults. "I read robots.txt and chose not to honour it because [reason]" is a defensible position; "I never read it" is not.
Hands-on lab
Fetch /robots.txt. Parse it with urllib.robotparser (or your PHP equivalent). For 10 sample URLs (some allowed, some Disallowed), verify your scraper would skip the disallowed ones. Then implement a simple per-URL delay and verify by timing 5 requests that they're spaced as expected.
Hands-on lab
Practice this lesson on Catalog108, our first-party scraping sandbox.
Open lab target → /robots.txt
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.