

Error Handling and Retries in Web Scraping

Build robust web scrapers with proper error handling, retry logic, and failure recovery. Python code examples and best practices.

Production scrapers must handle errors gracefully. Network failures, blocked requests, and unexpected HTML changes are inevitable.

Common Scraping Errors

| Error | HTTP Code | Cause | Solution |
|---|---|---|---|
| Timeout | N/A | Slow server | Increase timeout, retry |
| Forbidden | 403 | Bot detection | Rotate proxy/headers |
| Rate Limited | 429 | Too many requests | Back off and retry |
| Not Found | 404 | Page moved/deleted | Log and skip |
| Server Error | 500/503 | Server issue | Retry with backoff |
| Connection Error | N/A | Network issue | Retry |
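
One way to make the table above actionable is to turn it into a small lookup that tells the scraper what to do with each response. The sketch below is illustrative only: the `Action` enum and `classify` function are hypothetical names, and treating every unlisted status code as "skip" is an assumption rather than a rule from the table.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    OK = "ok"          # response is usable
    RETRY = "retry"    # transient failure: back off and try again
    ROTATE = "rotate"  # likely bot detection: change proxy/headers first
    SKIP = "skip"      # permanent failure: log it and move on


def classify(status_code: Optional[int]) -> Action:
    """Map an HTTP status code to an action.

    Pass None for timeouts and connection errors, which have no status code.
    """
    if status_code is None:
        return Action.RETRY
    if 200 <= status_code < 300:
        return Action.OK
    if status_code == 403:
        return Action.ROTATE
    if status_code in (429, 500, 503):
        return Action.RETRY
    # Assumption: anything else (404 included) is not worth retrying.
    return Action.SKIP
```

Keeping this decision in one function means the retry loop, the logging, and any proxy rotation can all agree on how a given status is treated.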

Basic Error Handling

import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return BeautifulSoup(resp.text, "html.parser")
    except requests.exceptions.Timeout:
        print(f"Timeout: {url}")
    except requests.exceptions.HTTPError as e:
        print(f"HTTP Error {e.response.status_code}: {url}")
    except requests.exceptions.ConnectionError:
        print(f"Connection failed: {url}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    return None

Retry with Exponential Backoff

import time
import random
import requests

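def scrape_with_retry(url, max_retries=3):
    """Fetch url, retrying transient failures with exponential backoff."""
    for attempt in range(max_retries):
        # Delay doubles each attempt (1s, 2s, 4s); jitter spreads out retries.
        wait = (2 ** attempt) + random.uniform(0, 1)
        try:
            resp = requests.get(url, timeout=30)

            if resp.status_code == 200:
                return resp
            elif resp.status_code == 429:
                print(f"Rate limited, waiting {wait:.1f}s")
            elif resp.status_code in (500, 502, 503):
                print(f"Server error {resp.status_code}, waiting {wait:.1f}s")
            else:
                # Non-retryable status (e.g. 403, 404): give up immediately.
                print(f"Got {resp.status_code} for {url}")
                return None

        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")

        # Back off before the next attempt, but not after the last one.
        if attempt < max_retries - 1:
            time.sleep(wait)

    print(f"All retries exhausted for {url}")
    return None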
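
To see what the `2 ** attempt` schedule looks like in practice, the sketch below computes the delays without making any requests. The `backoff_delays` helper is a hypothetical name, and the `cap` parameter is an added assumption (the retry function above has no upper bound) to show how a ceiling keeps long retry chains from waiting minutes.

```python
import random


def backoff_delays(max_retries=3, base=2, cap=30.0, jitter=1.0):
    """Return the wait (in seconds) before each retry attempt."""
    delays = []
    for attempt in range(max_retries):
        # Exponential growth, limited to `cap` seconds (cap is an assumption).
        delay = min(cap, base ** attempt)
        # Random jitter keeps many workers from retrying in lockstep.
        delays.append(delay + random.uniform(0, jitter))
    return delays


print(backoff_delays(jitter=0))  # [1.0, 2.0, 4.0]
```

With the default jitter of up to one second, three attempts wait roughly 1 to 2, 2 to 3, and 4 to 5 seconds.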

Using the tenacity Library

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import requests

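# Note: raise_for_status() raises HTTPError, which is a RequestException,
# so this also retries 4xx responses such as 404. Once the attempts run out,
# tenacity raises RetryError; pass reraise=True to get the original exception.
@retry(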
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type(requests.exceptions.RequestException)
)
def fetch(url):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp

Handling Parse Errors

def safe_extract(soup, selector, attribute=None, default="N/A"):
    element = soup.select_one(selector)
    if element is None:
        return default
    if attribute:
        return element.get(attribute, default)
    return element.text.strip()

# Usage
title = safe_extract(soup, "h1.product-title")
price = safe_extract(soup, "span.price")
image = safe_extract(soup, "img.main", attribute="src")

Let ScraperAPI Handle Retries

ScraperAPI and ScrapingAnt automatically retry failed requests with different proxies, eliminating most transient errors.

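# ScraperAPI retries automatically. API_KEY and url are defined elsewhere.
resp = requests.get(
    "http://api.scraperapi.com",
    # Pass the target as a parameter so requests URL-encodes it.
    params={"api_key": API_KEY, "url": url, "autoparse": "true"},
    timeout=60,  # allow extra time for retries on the service side
)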

Best Practices

  1. Always set timeouts: never let requests hang indefinitely.
  2. Log all errors with URLs and status codes for debugging.
  3. Implement circuit breakers: stop scraping if the error rate exceeds a threshold.
  4. Save progress: checkpoint which URLs are done so you can resume.
  5. Monitor success rates: a sudden drop indicates something changed.
  6. Use structured error handling: different errors need different strategies.
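
Items 3 and 4 are the least obvious to implement, so here is a minimal sketch of both using only the standard library. Everything in it is an assumption for illustration: the `CircuitBreaker` class, the `run` loop, the `done.json` checkpoint file, and the default window of 20 requests with a 50% failure threshold. The `fetch` argument stands for any function that returns `None` on failure, such as `scrape_page` above.

```python
import json
from collections import deque
from pathlib import Path


class CircuitBreaker:
    """Trips when the failure rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window=20, threshold=0.5):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold

    def record(self, success):
        self.results.append(bool(success))

    @property
    def tripped(self):
        # Wait for a full window so a few early failures do not stop the run.
        if len(self.results) < self.results.maxlen:
            return False
        return self.results.count(False) / len(self.results) > self.threshold


def load_done(path):
    """Read the set of finished URLs, or an empty set on the first run."""
    p = Path(path)
    return set(json.loads(p.read_text())) if p.exists() else set()


def save_done(path, done):
    Path(path).write_text(json.dumps(sorted(done)))


def run(urls, fetch, checkpoint="done.json", breaker=None):
    """Scrape urls, skipping finished ones and stopping if errors pile up."""
    if breaker is None:
        breaker = CircuitBreaker()
    done = load_done(checkpoint)
    for url in urls:
        if url in done:
            continue  # finished in a previous run
        ok = fetch(url) is not None
        breaker.record(ok)
        if ok:
            done.add(url)
            save_done(checkpoint, done)  # checkpoint after every success
        if breaker.tripped:
            print("Error rate too high, stopping")
            break
    return done
```

Calling `run(urls, scrape_page)` a second time picks up where the first run stopped, because URLs already listed in the checkpoint file are skipped. Rewriting the whole file after every success is simple but slow for large crawls; appending to a log or using SQLite would be a reasonable alternative.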