Error Handling and Retries in Web Scraping

Build robust web scrapers with proper error handling, retry logic, and failure recovery, using Python code examples and best practices.

Production scrapers must handle errors gracefully. Network failures, blocked requests, and unexpected HTML changes are inevitable.

Common Scraping Errors

Error	HTTP Code	Cause	Solution
Timeout	N/A	Slow server	Increase timeout, retry
Forbidden	403	Bot detection	Rotate proxy/headers
Rate Limited	429	Too many requests	Back off and retry
Not Found	404	Page moved/deleted	Log and skip
Server Error	500/502/503	Server issue	Retry with backoff
Connection Error	N/A	Network issue	Retry
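
The table above can be turned into a small dispatch helper so the rest of the scraper reacts to each status code the same way everywhere. This is only a minimal sketch: the `Action` enum and the `classify_status` function are hypothetical names (not part of any library), and the groupings simply mirror the table.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical set of reactions a scraper can take for a response."""
    OK = "ok"          # use the response
    RETRY = "retry"    # transient problem: back off and try again
    ROTATE = "rotate"  # likely bot detection: change proxy/headers
    SKIP = "skip"      # permanent problem: log and move on


# Status codes treated as transient, matching the table above.
RETRYABLE = {429, 500, 502, 503}


def classify_status(status_code):
    """Map an HTTP status code to the action suggested by the table."""
    if 200 <= status_code < 300:
        return Action.OK
    if status_code in RETRYABLE:
        return Action.RETRY
    if status_code == 403:
        return Action.ROTATE
    # Everything else (404 and other unexpected codes) is assumed permanent.
    return Action.SKIP
```

Keeping the mapping in one place means a change in strategy, such as treating another code as retryable, only needs to be made once.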

Basic Error Handling

import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return BeautifulSoup(resp.text, "html.parser")
    except requests.exceptions.Timeout:
        print(f"Timeout: {url}")
    except requests.exceptions.HTTPError as e:
        print(f"HTTP Error {e.response.status_code}: {url}")
    except requests.exceptions.ConnectionError:
        print(f"Connection failed: {url}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    return None

Retry with Exponential Backoff

import time
import random
import requests

def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        # Exponential backoff (1s, 2s, 4s, ...) plus jitter so that
        # parallel workers do not all retry at the same moment
        wait = (2 ** attempt) + random.uniform(0, 1)
        try:
            resp = requests.get(url, timeout=30)

            if resp.status_code == 200:
                return resp
            elif resp.status_code == 429:
                print(f"Rate limited on attempt {attempt + 1}")
            elif resp.status_code in (500, 502, 503):
                print(f"Server error {resp.status_code} on attempt {attempt + 1}")
            else:
                # Other statuses (e.g. 403, 404) are not fixed by retrying
                print(f"Got {resp.status_code} for {url}")
                return None

        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")

        # Sleep only if another attempt will follow
        if attempt < max_retries - 1:
            time.sleep(wait)

    print(f"All retries exhausted for {url}")
    return None
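
Servers that answer 429 or 503 sometimes include a `Retry-After` header saying how long to wait. The sketch below shows one way the backoff delay could take that hint into account. The helper names `retry_after_seconds` and `backoff_delay`, and the `base` and `cap` defaults, are assumptions made for illustration; only the standard library is used.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After value (seconds or HTTP-date) into seconds, or None."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Not a recognizable date: ignore the hint
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Exponential backoff with jitter, at least the server's hint, at most `cap`."""
    delay = base * (2 ** attempt) + random.uniform(0, 1)
    hinted = retry_after_seconds(retry_after)
    if hinted is not None:
        delay = max(delay, hinted)
    return min(delay, cap)
```

With `requests`, the header would typically be read as `resp.headers.get("Retry-After")` and passed in as `retry_after`. Capping the delay is a design choice here: it keeps one slow host from stalling the whole run, at the cost of occasionally retrying sooner than the server asked.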

Using the tenacity Library

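from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import requests

# Makes up to 3 attempts, waiting between 2s and 30s with exponential growth.
# Note: raise_for_status() raises HTTPError, a RequestException subclass, so
# 4xx responses such as 404 are retried too. Once attempts run out, tenacity
# raises RetryError unless reraise=True is passed to @retry.
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type(requests.exceptions.RequestException)
)
def fetch(url):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp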

Handling Parse Errors

def safe_extract(soup, selector, attribute=None, default="N/A"):
    element = soup.select_one(selector)
    if element is None:
        return default
    if attribute:
        return element.get(attribute, default)
    return element.text.strip()

# Usage
title = safe_extract(soup, "h1.product-title")
price = safe_extract(soup, "span.price")
image = safe_extract(soup, "img.main", attribute="src")
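
A missing element is only one kind of parse failure. Text that is present but not in the expected shape is another. As a rough sketch, the extracted string can be validated before it is stored. The `parse_price` helper below is a hypothetical name, and it assumes US-style formatting (comma as thousands separator, dot as decimal point).

```python
import re
from decimal import Decimal

# One digit, then digits or commas, then an optional decimal part.
PRICE_PATTERN = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text, default=None):
    """Turn text like '$1,299.00' into a Decimal; return `default` otherwise."""
    if not text:
        return default
    match = PRICE_PATTERN.search(text)
    if match is None:
        # No digits at all, e.g. the "N/A" placeholder
        return default
    return Decimal(match.group().replace(",", ""))
```

Combined with the helper above, `parse_price(safe_extract(soup, "span.price"))` would yield `None` when the element is missing, because the `"N/A"` default contains no digits.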

Let ScraperAPI Handle Retries

Services such as ScraperAPI and ScrapingAnt automatically retry failed requests through different proxies on their side, so many transient errors never reach your code.

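# ScraperAPI retries automatically
# API_KEY and url are placeholders for your key and the target page.
# Passing params lets requests URL-encode the target URL correctly.
resp = requests.get(
    "http://api.scraperapi.com",
    params={"api_key": API_KEY, "url": url, "autoparse": "true"},
    timeout=70,  # generous, since the service retries before responding
)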

Best Practices

Always set timeouts: never let requests hang indefinitely.
Log all errors: include URLs and status codes for debugging.
Implement circuit breakers: stop scraping if the error rate exceeds a threshold.
Save progress: checkpoint which URLs are done so you can resume.
Monitor success rates: a sudden drop indicates something changed.
Use structured error handling: different errors need different strategies.
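
Two of the practices above, circuit breakers and checkpointing, can be sketched with the standard library alone. Everything named here (`CircuitBreaker`, `load_done`, `save_done`, `crawl`, the `done.json` file, and the window and threshold defaults) is a hypothetical illustration rather than an established API. `fetch` stands for any function that returns `None` on failure, like the scraping functions shown earlier.

```python
import json
import os
from collections import deque


class CircuitBreaker:
    """Opens when the failure rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window=20, threshold=0.5, min_samples=10):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, success):
        self.results.append(bool(success))

    @property
    def error_rate(self):
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    @property
    def is_open(self):
        # Require a minimum sample size so one early failure does not halt the run.
        return len(self.results) >= self.min_samples and self.error_rate > self.threshold


def load_done(path):
    """Return the set of URLs already processed (empty on the first run)."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return set(json.load(f))


def save_done(path, done):
    """Write to a temp file, then rename, so a crash cannot leave a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)


def crawl(urls, fetch, checkpoint_path="done.json", breaker=None):
    """Fetch each URL once, skipping finished ones and stopping if the breaker opens."""
    if breaker is None:
        breaker = CircuitBreaker()
    done = load_done(checkpoint_path)
    for url in urls:
        if url in done:
            continue  # finished in a previous run
        if breaker.is_open:
            print(f"Error rate {breaker.error_rate:.0%}, stopping")
            break
        ok = fetch(url) is not None
        breaker.record(ok)
        if ok:
            done.add(url)
            save_done(checkpoint_path, done)
    return done
```

Rewriting the checkpoint after every success is simple but does more I/O than strictly needed. For large crawls, saving every N URLs or appending to a log file may be a better trade-off.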