Error Handling and Retries in Web Scraping
Build robust web scrapers with proper error handling, retry logic, and failure recovery. Python code examples and best practices.
Production scrapers must handle errors gracefully. Network failures, blocked requests, and unexpected HTML changes are inevitable.
Common Scraping Errors
| Error | HTTP Code | Cause | Solution |
|---|---|---|---|
| Timeout | N/A | Slow server | Increase timeout, retry |
| Forbidden | 403 | Bot detection | Rotate proxy/headers |
| Rate Limited | 429 | Too many requests | Back off and retry |
| Not Found | 404 | Page moved/deleted | Log and skip |
| Server Error | 500/503 | Server issue | Retry with backoff |
| Connection Error | N/A | Network issue | Retry |
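The table can be read as a small decision function. The sketch below is one possible encoding of it, using only the standard library; the `Action` enum and `action_for_status` name are hypothetical, chosen for illustration, and the fallback for unlisted status codes is an assumption.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for the strategies listed in the table."""
    OK = "ok"
    RETRY = "retry with backoff"
    ROTATE = "rotate proxy/headers"
    SKIP = "log and skip"


def action_for_status(status_code):
    """Map an HTTP status code to the strategy suggested in the table."""
    if 200 <= status_code < 300:
        return Action.OK
    if status_code == 403:
        return Action.ROTATE
    if status_code == 404:
        return Action.SKIP
    if status_code == 429 or 500 <= status_code < 600:
        return Action.RETRY
    # Assumption: anything not covered by the table is logged and skipped
    return Action.SKIP


print(action_for_status(503))
```

Timeouts and connection errors never produce a status code, so they would be handled in the exception path rather than by this function.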
Basic Error Handling
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    """Fetch a page and return the parsed HTML, or None if the request failed."""
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()  # raises HTTPError on 4xx/5xx responses
        return BeautifulSoup(resp.text, "html.parser")
    except requests.exceptions.Timeout:
        print(f"Timeout: {url}")
    except requests.exceptions.HTTPError as e:
        print(f"HTTP Error {e.response.status_code}: {url}")
    except requests.exceptions.ConnectionError:
        print(f"Connection failed: {url}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    return None
Retry with Exponential Backoff
import time
import random
import requests

def scrape_with_retry(url, max_retries=3):
    """Return the response on HTTP 200, or None once retries are exhausted."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=30)
            if resp.status_code == 200:
                return resp
            elif resp.status_code == 429:
                # Rate limited: exponential backoff plus random jitter
                wait = (2 ** attempt) + random.uniform(0, 1)
                print(f"Rate limited, waiting {wait:.1f}s")
                time.sleep(wait)
            elif resp.status_code in (500, 502, 503):
                # Transient server error: back off, then retry
                time.sleep(2 ** attempt)
            else:
                # Other statuses (e.g. 403, 404) are not retried here
                print(f"Got {resp.status_code} for {url}")
                return None
        except requests.exceptions.RequestException as e:
            # Timeouts and connection errors: back off, then retry
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(2 ** attempt)
    print(f"All retries exhausted for {url}")
    return None
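Servers that answer 429 or 503 sometimes send a `Retry-After` header saying how long to wait. The helper below is a minimal sketch of how the wait could be computed: it honors a numeric `Retry-After` value when present and otherwise falls back to capped exponential backoff with full jitter. The name `backoff_delay` and the `base` and `cap` defaults are assumptions for illustration, and the HTTP-date form of the header is deliberately not parsed.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Return the number of seconds to wait before the next attempt.

    attempt     -- zero-based retry counter
    retry_after -- raw Retry-After header value, or None if absent
    """
    if retry_after is not None:
        try:
            # Numeric form: "Retry-After: 5" means wait five seconds
            return min(cap, max(0.0, float(retry_after)))
        except (TypeError, ValueError):
            pass  # HTTP-date form is not handled in this sketch
    # Full jitter: pick a random wait between 0 and the capped exponential
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


print(backoff_delay(0, retry_after="5"))
```

Inside a retry loop this could be called as `time.sleep(backoff_delay(attempt, resp.headers.get("Retry-After")))`. Randomizing the whole interval, rather than adding a small offset, spreads retries out when many workers are rate limited at the same moment.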
Using the tenacity Library
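from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import requests

# After the final failed attempt tenacity raises RetryError;
# pass reraise=True to get the original exception instead.
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type(requests.exceptions.RequestException)
)
def fetch(url):
    resp = requests.get(url, timeout=30)
    # HTTPError is a RequestException, so 4xx/5xx responses are retried too
    resp.raise_for_status()
    return resp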
Handling Parse Errors
def safe_extract(soup, selector, attribute=None, default="N/A"):
    """Return the text (or an attribute) of the first match, or default if missing."""
    element = soup.select_one(selector)
    if element is None:
        return default
    if attribute:
        return element.get(attribute, default)
    return element.text.strip()

# Usage (soup is a BeautifulSoup object, e.g. the result of scrape_page)
title = safe_extract(soup, "h1.product-title")
price = safe_extract(soup, "span.price")
image = safe_extract(soup, "img.main", attribute="src")
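Extracted strings often need a second defensive step before they can be used as data, because a placeholder such as "N/A" or an unexpected format will break a naive numeric conversion. The sketch below shows one way to convert price text; the `parse_price` name is hypothetical, and it assumes US-style formatting (comma as thousands separator, dot as decimal point).

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(text, default=None):
    """Convert scraped price text such as "$1,299.99" to a Decimal.

    Returns default when no number can be found, so a missing or
    malformed price never raises inside the scraping loop.
    """
    if not text:
        return default
    # First run of digits, allowing thousands commas and a decimal part
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return default
    try:
        return Decimal(match.group().replace(",", ""))
    except InvalidOperation:
        return default


print(parse_price("$1,299.99"))
```

Returning `None` rather than a sentinel string keeps missing values distinguishable from real ones downstream.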
Let ScraperAPI Handle Retries
ScraperAPI and ScrapingAnt automatically retry failed requests with different proxies, eliminating most transient errors.
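# ScraperAPI retries automatically
resp = requests.get(
    "http://api.scraperapi.com",
    # Passing params lets requests URL-encode the target URL
    params={"api_key": API_KEY, "url": url, "autoparse": "true"},
    timeout=70,  # allow time for the service's own retries
)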
Best Practices
- Always set timeouts: never let requests hang indefinitely
- Log all errors with URLs and status codes for debugging
- Implement circuit breakers: stop scraping if the error rate exceeds a threshold
- Save progress: checkpoint which URLs are done so you can resume
- Monitor success rates: a sudden drop indicates something changed
- Use structured error handling: different errors need different strategies
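The circuit breaker and checkpoint ideas above can be sketched with the standard library alone. Everything here is illustrative: `CircuitBreaker`, `load_done`, `mark_done`, and `crawl` are hypothetical names, the window and threshold defaults are arbitrary, and `fetch` stands for any function that returns `None` on failure (such as `scrape_with_retry`).

```python
import os
from collections import deque


class CircuitBreaker:
    """Opens when the failure rate over the last `window` requests is too high."""

    def __init__(self, window=20, threshold=0.5, min_samples=10):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, success):
        self.results.append(bool(success))

    @property
    def failure_rate(self):
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    @property
    def is_open(self):
        # Require a minimum sample size so one early failure does not trip it
        return (len(self.results) >= self.min_samples
                and self.failure_rate > self.threshold)


def load_done(path):
    """Read the set of already-scraped URLs from a checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def mark_done(path, url):
    """Append one finished URL to the checkpoint file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(url + "\n")


def crawl(urls, fetch, checkpoint_path, breaker=None):
    """Scrape urls, skipping finished ones and stopping if the breaker opens."""
    if breaker is None:
        breaker = CircuitBreaker()
    done = load_done(checkpoint_path)
    for url in urls:
        if url in done:
            continue
        if breaker.is_open:
            print(f"Failure rate {breaker.failure_rate:.0%}, stopping early")
            break
        ok = fetch(url) is not None
        breaker.record(ok)
        if ok:
            mark_done(checkpoint_path, url)
            done.add(url)
    return done
```

Because failed URLs are never written to the checkpoint, rerunning `crawl` with the same file retries only what is still missing. The same `failure_rate` value can be logged periodically to spot a sudden drop in success rate.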