
Error Handling and Retries in Scrapers

Build robust scrapers with proper error handling, automatic retries, exponential backoff, and graceful failure recovery.

Python Scraping · #11 · Intermediate · 3 min read

Web scraping is inherently unreliable. Servers go down, connections time out, and pages change without warning. A production scraper must handle all of these gracefully.

Common Errors in Scraping

Error           | Cause                  | Solution
ConnectionError | Server unreachable     | Retry with backoff
Timeout         | Slow response          | Set timeout, retry
HTTP 403        | Blocked/Forbidden      | Rotate user agents or use proxies
HTTP 429        | Rate limited           | Slow down, add delays
HTTP 500        | Server error           | Retry later
AttributeError  | HTML structure changed | Validate selectors
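
The HTTP rows of the table above can be sketched as a small lookup function. This is only an illustration: `classify_status` and the action names it returns are hypothetical, not part of any library.

```python
def classify_status(status_code):
    """Map an HTTP status code to a coarse action, following the table above."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code == 429:
        return "slow_down"  # rate limited: add delays before retrying
    if status_code == 403:
        return "change_identity"  # blocked: rotate user agent or proxy
    if status_code >= 500:
        return "retry_later"  # server-side problem, may clear up on its own
    return "give_up"  # other codes (e.g. 404) are unlikely to succeed on retry


for code in (200, 403, 429, 503, 404):
    print(code, classify_status(code))
```

Keeping this decision in one place makes it easier to adjust the policy later without touching the request code.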

Basic Error Handling

import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raises exception for 4xx/5xx
    except requests.exceptions.Timeout:
        print(f"Timeout: {url}")
        return None
    except requests.exceptions.HTTPError as e:
        print(f"HTTP Error {e.response.status_code}: {url}")
        return None
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.select_one("title")
    return title.get_text() if title else "No title found"

result = scrape_page("https://quotes.toscrape.com/")
print(result)
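
The function above guards a single selector with `if title else`. When a scraper extracts several fields, the same check can be centralized so that a changed HTML structure is reported instead of surfacing later as an AttributeError. A minimal sketch, assuming the extracted values are collected in a dict; `missing_fields`, `record`, and the field names are hypothetical.

```python
def missing_fields(record, required):
    """Return the names of required fields that are absent or empty."""
    return [name for name in required if not record.get(name)]


# Example record as it might look after a selector stopped matching
record = {"title": "Quotes to Scrape", "author": None}

missing = missing_fields(record, ["title", "author"])
if missing:
    print(f"Selector check failed, missing: {missing}")
```

A non-empty result is a hint that the page layout changed and the selectors need another look.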

Retry with Exponential Backoff

import time
import requests


def fetch_with_retry(url, max_retries=3, base_delay=1):
    """Fetch url, retrying on 429, 5xx, and network errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
        except requests.exceptions.RequestException as e:
            reason = f"Request failed: {e}"
        else:
            if response.status_code == 200:
                return response
            if response.status_code == 429:
                reason = "Rate limited"
            elif response.status_code >= 500:
                reason = f"Server error {response.status_code}"
            else:
                # Other status codes (mostly 4xx) will not succeed on retry
                print(f"Client error {response.status_code} for {url}")
                return None

        # Sleep only if another attempt follows
        if attempt < max_retries - 1:
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt + 1}: {reason}. Retrying in {delay}s...")
            time.sleep(delay)

    print(f"All {max_retries} attempts failed for {url}")
    return None
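
The loop above computes its delay as `base_delay * (2 ** attempt)`. One common refinement, not part of the original code, is to cap the delay and add random jitter so that many scrapers do not all retry at the same moment. A minimal sketch under those assumptions; `backoff_delay`, `max_delay`, and `jitter` are illustrative names.

```python
import random


def backoff_delay(attempt, base_delay=1, max_delay=60, jitter=True):
    """Delay in seconds before the retry that follows attempt number `attempt` (0-based)."""
    # Exponential growth, capped so the wait never exceeds max_delay
    delay = min(max_delay, base_delay * (2 ** attempt))
    if jitter:
        # "Full jitter": pick a random point between 0 and the capped delay
        delay = random.uniform(0, delay)
    return delay


# Without jitter the schedule is deterministic
print([backoff_delay(a, jitter=False) for a in range(5)])  # [1, 2, 4, 8, 16]
```

Such a helper could replace the inline delay calculation in a retry loop; whether jitter is worth it depends on how many requests run in parallel.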

Using the tenacity Library

The tenacity library provides a clean decorator-based retry mechanism.

pip install tenacity

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
    retry=retry_if_exception_type(requests.exceptions.RequestException),
    reraise=True,  # re-raise the last exception instead of tenacity.RetryError
)
def fetch_url(url):
    response = requests.get(url, timeout=10)
    # HTTPError is a RequestException, so 4xx/5xx responses are retried too
    response.raise_for_status()
    return response


try:
    resp = fetch_url("https://quotes.toscrape.com/")
    print(f"Success: {resp.status_code}")
except requests.exceptions.RequestException as e:
    print(f"Failed after retries: {e}")

Using requests Session with Retry Adapter

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


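def create_session(retries=3, backoff_factor=0.5):
    """Build a Session whose adapter retries failed requests automatically."""
    session = requests.Session()
    retry_strategy = Retry(
        total=retries,  # maximum number of retries per request
        backoff_factor=backoff_factor,  # scales the exponential delay between retries
        status_forcelist=[429, 500, 502, 503, 504],  # also retry on these status codes
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    # Apply the retry policy to both URL schemes
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = create_session()
try:
    response = session.get("https://quotes.toscrape.com/", timeout=10)
    print(response.status_code)
except requests.exceptions.RequestException as e:
    # Raised once retries are exhausted (e.g. RetryError, ConnectionError)
    print(f"Request failed after retries: {e}")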

Tips

  • Always set a timeout on every request; never let a request hang indefinitely.
  • Use exponential backoff to avoid hammering a struggling server.
  • Log errors with the URL so you can reprocess failed pages later.
  • Proxy services like ScraperAPI and ScrapingAnt handle retries and IP rotation automatically, reducing the error-handling burden on your code.
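
The tip about logging errors with the URL can be sketched with the standard `logging` module. This is one possible shape, not a prescribed one: `record_failure`, `save_failures`, `failed_urls`, and the example URL path are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("scraper")

# URLs that failed during this run, kept for a later reprocessing pass
failed_urls = []


def record_failure(url, error):
    """Log the failure together with its URL and remember the URL."""
    logger.warning("Failed %s: %s", url, error)
    failed_urls.append(url)


def save_failures(path):
    """Write the failed URLs to a text file, one per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(failed_urls))


record_failure("https://quotes.toscrape.com/page/2/", "Timeout")
print(failed_urls)
```

Calling `save_failures("failed_urls.txt")` at the end of a run leaves a file that a later run can read and retry.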

Next Steps

  • Learn to handle login-protected pages and authentication
  • Build scrapers that manage cookies and sessions