Error Handling and Retries in Scrapers
Build robust scrapers with proper error handling, automatic retries, exponential backoff, and graceful failure recovery.
Web scraping is inherently unreliable. Servers go down, connections time out, and pages change without warning. A production scraper must handle all of these gracefully.
Common Errors in Scraping
| Error | Cause | Solution |
|---|---|---|
| `ConnectionError` | Server unreachable | Retry with backoff |
| `Timeout` | Slow response | Set timeout, retry |
| HTTP 403 | Blocked/Forbidden | Rotate user agents or use proxies |
| HTTP 429 | Rate limited | Slow down, add delays |
| HTTP 500 | Server error | Retry later |
| `AttributeError` | HTML structure changed | Validate selectors |
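The "Solution" column above amounts to one decision per response: accept it, retry it, or give up. Here is a minimal sketch of that decision. The function name `classify_status` and its return labels are illustrative only, and the set of retryable codes is an assumption you may want to adjust for your target site.

```python
# Hypothetical helper: map an HTTP status code to a scraper action.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}  # assumed set of transient failures


def classify_status(status_code: int) -> str:
    """Return "ok", "retry", or "give_up" for an HTTP status code."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code in RETRYABLE_STATUSES:
        return "retry"
    # 403, 404 and other client errors rarely succeed when repeated unchanged
    return "give_up"


for code in (200, 403, 429, 500):
    print(code, classify_status(code))
```

A 403 lands in "give_up" here because repeating the identical request seldom helps; changing the user agent or proxy first is a separate step.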
Basic Error Handling
```python
import requests
from bs4 import BeautifulSoup


def scrape_page(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raises exception for 4xx/5xx
    except requests.exceptions.Timeout:
        print(f"Timeout: {url}")
        return None
    except requests.exceptions.HTTPError as e:
        print(f"HTTP Error {e.response.status_code}: {url}")
        return None
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.select_one("title")
    return title.get_text() if title else "No title found"


result = scrape_page("https://quotes.toscrape.com/")
print(result)
```
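The `AttributeError` row in the table usually means a selector stopped matching: `select_one()` returned `None`, and the code then called a method on it. One way to validate selectors is to wrap each lookup in a small helper that fails with a descriptive message. The sketch below is one possible shape. `SelectorError` and `require` are hypothetical names, and the `span.text` selector is only an assumed example.

```python
class SelectorError(Exception):
    """Raised when an expected element is missing from the page (hypothetical)."""


def require(element, selector):
    """Return element unchanged, or raise if the selector matched nothing."""
    if element is None:
        raise SelectorError(
            f"Selector {selector!r} matched nothing; the page layout may have changed"
        )
    return element


# Intended use with BeautifulSoup (not executed here):
#   quote = require(soup.select_one("span.text"), "span.text").get_text()

try:
    require(None, "span.text")  # simulate select_one() finding no match
except SelectorError as e:
    print(f"Validation failed: {e}")
```

Raising a dedicated exception makes layout changes show up in logs as such, instead of as a generic `AttributeError` deep in the parsing code.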
Retry with Exponential Backoff
```python
import time
import requests


def fetch_with_retry(url, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return response
            if response.status_code == 429:
                print(f"Rate limited: {url}")
            elif response.status_code >= 500:
                print(f"Server error {response.status_code}: {url}")
            else:
                # 4xx errors (except 429), don't retry
                print(f"Client error {response.status_code} for {url}")
                return None
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")

        # Back off exponentially (1s, 2s, 4s, ...), but do not sleep after the last attempt
        if attempt < max_retries - 1:
            delay = base_delay * (2 ** attempt)
            print(f"Retrying in {delay}s...")
            time.sleep(delay)

    print(f"All {max_retries} attempts failed for {url}")
    return None
```
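Plain doubling has two practical weaknesses: the delay grows without limit, and many clients that failed together will retry together. A common refinement is to cap the delay and add random jitter. The sketch below shows one way to compute such delays. `backoff_delay`, `max_delay`, and the "full jitter" strategy are assumptions added for illustration.

```python
import random


def backoff_delay(attempt, base_delay=1.0, max_delay=30.0, jitter=True, rng=random):
    """Delay in seconds before retry number `attempt` (0-based)."""
    delay = min(max_delay, base_delay * (2 ** attempt))
    if jitter:
        # "Full jitter": pick uniformly between 0 and the capped delay
        delay = rng.uniform(0, delay)
    return delay


# Without jitter the schedule is deterministic
print([backoff_delay(n, jitter=False) for n in range(6)])
# → [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

In a retry loop, `time.sleep(backoff_delay(attempt))` would replace the fixed `base_delay * (2 ** attempt)` computation.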
Using the tenacity Library
The tenacity library provides a clean decorator-based retry mechanism.
```bash
pip install tenacity
```
```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
    # HTTPError from raise_for_status() is a RequestException subclass,
    # so 4xx responses are retried here as well as network failures
    retry=retry_if_exception_type(requests.exceptions.RequestException),
)
def fetch_url(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response


try:
    resp = fetch_url("https://quotes.toscrape.com/")
    print(f"Success: {resp.status_code}")
except Exception as e:
    # By default tenacity raises RetryError once all attempts are used up
    print(f"Failed after retries: {e}")
```
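To make the decorator approach less of a black box, here is a minimal standard-library sketch of the same idea: a decorator that calls the wrapped function again after an exponentially growing wait. This is not how tenacity is implemented. `retry_on` and its parameters (`should_retry`, `sleep`) are hypothetical names. The predicate is there so that errors which will not go away, such as a 404, can be re-raised immediately instead of retried.

```python
import functools
import time


def retry_on(should_retry, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry the decorated function while should_retry(exc) is true."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Re-raise on the final attempt or for non-retryable errors
                    if attempt == max_attempts or not should_retry(exc):
                        raise
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


calls = []


# sleep is replaced with a no-op so the demo runs instantly
@retry_on(lambda exc: isinstance(exc, ConnectionError), sleep=lambda s: None)
def flaky():
    """Stand-in for a network call: fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("simulated outage")
    return "ok"


print(flaky(), "after", len(calls), "calls")
# → ok after 3 calls
```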
Using requests Session with Retry Adapter
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def create_session(retries=3, backoff_factor=0.5):
    session = requests.Session()
    retry_strategy = Retry(
        total=retries,
        backoff_factor=backoff_factor,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = create_session()
response = session.get("https://quotes.toscrape.com/", timeout=10)
print(response.status_code)
```
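The `status_forcelist` above includes 429, and servers that rate-limit often state how long to wait in a `Retry-After` header, either as a number of seconds or as an HTTP date. urllib3's `Retry` is documented to honor that header by default (its `respect_retry_after_header` option), but a hand-written retry loop has to parse it itself. The sketch below shows one way to do that with the standard library. `parse_retry_after` and its `default` and `now` parameters are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, default=1.0, now=None):
    """Convert a Retry-After header value to a wait time in seconds."""
    if not value:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default wait
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means no further waiting is needed
    return max(0.0, (when - now).total_seconds())


print(parse_retry_after("120"))  # → 120.0
print(parse_retry_after(None))   # → 1.0
```

With `requests`, the header would be read as `response.headers.get("Retry-After")` and the result passed to `time.sleep()`.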
Tips
- Always set a `timeout` on every request; never let a request hang indefinitely.
- Use exponential backoff to avoid hammering a struggling server.
- Log errors with the URL so you can reprocess failed pages later.
- Proxy services like ScraperAPI and ScrapingAnt handle retries and IP rotation automatically, reducing the error-handling burden on your code.
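The tip about logging failed URLs can be sketched roughly as follows. The function `scrape_all`, the injected `fetch` callable, and the `failed_urls.txt` file name are all hypothetical. In a real scraper, `fetch` would be whichever request function you already use.

```python
import logging
import os
import tempfile

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("scraper")


def scrape_all(urls, fetch, failed_path="failed_urls.txt"):
    """Call fetch(url) for each URL; record URLs that fail for a later rerun."""
    results, failed = {}, []
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:
            # Keep going: one bad page should not stop the whole run
            logger.error("Failed %s: %s", url, exc)
            failed.append(url)
    if failed:
        with open(failed_path, "w", encoding="utf-8") as f:
            f.write("\n".join(failed) + "\n")
    return results, failed


def fake_fetch(url):
    """Stand-in for a real request function; fails for one URL."""
    if "bad" in url:
        raise TimeoutError("simulated timeout")
    return f"<html>{url}</html>"


# Write the failure list to a temporary directory for this demo
demo_path = os.path.join(tempfile.mkdtemp(), "failed_urls.txt")
results, failed = scrape_all(
    ["https://example.com/ok", "https://example.com/bad"], fake_fetch, demo_path
)
print(failed)
```

A later run can read the file back and pass its lines to `scrape_all` again, so only the failed pages are requested a second time.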
Next Steps
- Learn to handle login-protected pages and authentication
- Build scrapers that manage cookies and sessions