Scraping with aiohttp
Use aiohttp for high-performance async web scraping in Python. Learn session management, connection pooling, and concurrent page fetching.
aiohttp is a mature async HTTP client and server library for Python. It excels at making large numbers of concurrent requests with fine-grained control over connection pooling and session management.
Installation
pip install aiohttp beautifulsoup4
Basic aiohttp Scraper
import asyncio
import aiohttp
from bs4 import BeautifulSoup


async def fetch_page(session, url):
    """Fetch one listing page and return its quotes as a list of dicts."""
    async with session.get(url) as response:
        # Raise on 4xx/5xx instead of silently parsing an error page
        response.raise_for_status()
        html = await response.text()
    soup = BeautifulSoup(html, "html.parser")
    quotes = []
    for quote in soup.select("div.quote"):
        quotes.append({
            "text": quote.select_one("span.text").get_text(),
            "author": quote.select_one("small.author").get_text(),
        })
    return quotes


async def main():
    urls = [
        f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)
    ]
    # Cap the pool at 10 simultaneous connections
    connector = aiohttp.TCPConnector(limit=10)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch_page(session, url) for url in urls]
        # return_exceptions=True returns errors as values instead of raising the first one
        results = await asyncio.gather(*tasks, return_exceptions=True)
    all_quotes = []
    for result in results:
        if isinstance(result, list):
            all_quotes.extend(result)
        else:
            print(f"Error: {result}")
    print(f"Scraped {len(all_quotes)} quotes")


asyncio.run(main())
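The scraper above prints each error without saying which URL produced it. Because `asyncio.gather` returns results in the same order as its inputs, the two can be paired up again. The following is a minimal sketch of that idea; `gather_by_url` and `fake_fetch` are hypothetical names introduced for illustration, and `fake_fetch` only stands in for a real coroutine such as `fetch_page` so the example runs without network access.

```python
import asyncio


async def gather_by_url(fetch, urls):
    """Run fetch(url) for every URL; return (successes, failures) keyed by URL."""
    results = await asyncio.gather(
        *(fetch(url) for url in urls), return_exceptions=True
    )
    ok, failed = {}, {}
    # gather() preserves input order, so zip() pairs each result with its URL.
    for url, result in zip(urls, results):
        if isinstance(result, BaseException):
            failed[url] = result
        else:
            ok[url] = result
    return ok, failed


async def fake_fetch(url):
    """Hypothetical stand-in for a real page fetcher (no network involved)."""
    if url.endswith("/bad"):
        raise ValueError("simulated failure")
    return [f"quote from {url}"]


if __name__ == "__main__":
    ok, failed = asyncio.run(
        gather_by_url(fake_fetch, ["https://example.test/1", "https://example.test/bad"])
    )
    for url, error in failed.items():
        print(f"Error for {url}: {error!r}")
    print(f"{len(ok)} page(s) fetched")
```

With aiohttp, `fetch` could be something like `lambda url: fetch_page(session, url)`, so the shared session stays in scope while each failure remains attached to its URL.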
Connection Pooling and Limits
aiohttp gives you direct control over TCP connections:
import asyncio
import aiohttp


async def main():
    # Limit total connections and per-host connections
    connector = aiohttp.TCPConnector(
        limit=20,           # Total concurrent connections
        limit_per_host=5,   # Max connections per host
        ttl_dns_cache=300,  # Cache DNS lookups for 5 minutes (value is in seconds)
    )
    # Set default timeout for all requests:
    # total covers the whole request, connect covers acquiring a connection
    timeout = aiohttp.ClientTimeout(total=30, connect=10)
    async with aiohttp.ClientSession(
        connector=connector,
        timeout=timeout,
        headers={"User-Agent": "ScrapingCentral Bot/1.0"},
    ) as session:
        # All requests share this session
        pass


asyncio.run(main())
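One way to see `limit_per_host` in action without touching a real site is to point the client at a throwaway local server and count how many requests it handles at once. The sketch below assumes aiohttp's own `aiohttp.web` server is acceptable for this purpose; `measure_peak_concurrency` is a hypothetical helper name, and the 0.05 second handler delay is an arbitrary stand-in for a slow page.

```python
import asyncio
import socket

import aiohttp
from aiohttp import web


async def measure_peak_concurrency(limit_per_host, n_requests=6):
    """Return the highest number of requests the local server handled at once."""
    active = 0
    peak = 0

    async def handler(request):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.05)  # simulate a slow page
        active -= 1
        return web.Response(text="ok")

    app = web.Application()
    app.router.add_get("/", handler)
    runner = web.AppRunner(app)
    await runner.setup()

    # Bind to port 0 so the OS picks a free port for this throwaway server.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    port = sock.getsockname()[1]
    await web.SockSite(runner, sock).start()

    try:
        connector = aiohttp.TCPConnector(limit=20, limit_per_host=limit_per_host)
        async with aiohttp.ClientSession(connector=connector) as session:

            async def fetch():
                async with session.get(f"http://127.0.0.1:{port}/") as response:
                    return await response.text()

            # Launch every request at once; the connector queues the excess.
            await asyncio.gather(*(fetch() for _ in range(n_requests)))
    finally:
        await runner.cleanup()
    return peak


if __name__ == "__main__":
    print("peak concurrent requests:", asyncio.run(measure_peak_concurrency(2)))
```

Although all six requests are started together, the server should never see more than `limit_per_host` of them in flight, because the connector makes the remaining requests wait for a free connection.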
Handling Errors Gracefully
import asyncio
import aiohttp


async def safe_fetch(session, url, retries=3):
    """Return the page body, or None if every attempt fails."""
    for attempt in range(retries):
        try:
            async with session.get(url) as response:
                if response.status == 200:
                    return await response.text()
                elif response.status == 429:
                    # Rate limited: back off exponentially (1s, 2s, 4s, ...)
                    wait = 2 ** attempt
                    print(f"Rate limited on {url}, waiting {wait}s")
                    await asyncio.sleep(wait)
                else:
                    # Other statuses are treated as permanent; do not retry
                    print(f"Status {response.status} for {url}")
                    return None
        # Timeouts surface as asyncio.TimeoutError, which is not a ClientError
        except (aiohttp.ClientError, asyncio.TimeoutError) as e:
            print(f"Attempt {attempt + 1} failed: {e!r}")
            await asyncio.sleep(1)
    return None
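The `2 ** attempt` wait in `safe_fetch` can be extended in two common ways: honoring a `Retry-After` header when the server sends one, and adding random jitter so that many concurrent tasks do not all retry at the same instant. The helper below is a sketch of that calculation only; `backoff_delay` and its `base`, `cap`, and `jitter` parameters are hypothetical names, not part of aiohttp.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0, jitter=True):
    """Seconds to wait before retrying after failed attempt number `attempt` (0-based)."""
    # Prefer the server's own hint when it is a plain number of seconds.
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except (TypeError, ValueError):
            pass  # e.g. the HTTP-date form of Retry-After: fall back to backoff
    delay = min(cap, base * 2 ** attempt)
    # "Full jitter": pick a random point in [0, delay] to spread retries out.
    return random.uniform(0, delay) if jitter else delay


if __name__ == "__main__":
    for attempt in range(4):
        print(attempt, backoff_delay(attempt, jitter=False))
```

Inside the 429 branch this could be used as `await asyncio.sleep(backoff_delay(attempt, response.headers.get("Retry-After")))`. With `jitter=False` and the default `base`, it produces the same 1, 2, 4 second sequence as the original code, capped at `cap` seconds.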
aiohttp vs HTTPX
| Feature | aiohttp | HTTPX |
|---|---|---|
| Maturity | Older, battle-tested | Newer, modern API |
| Sync mode | No (async only) | Yes (both) |
| HTTP/2 | No | Yes (opt-in via `http2=True`) |
| Connection control | Very granular | Good |
| Server capability | Yes | No |
Tips
- Use a single `ClientSession` for all requests; creating a new session per request is expensive because each session has its own connection pool.
- The `TCPConnector(limit_per_host=N)` parameter is your best tool for polite scraping.
- For scraping sites that require JavaScript rendering or CAPTCHA solving, route aiohttp requests through ScrapingAnt.
- Always close sessions properly using `async with` context managers.
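On the polite-scraping tip: `limit_per_host` caps how many requests run at the same time, but it does not space them out. If a site also needs a minimum gap between request starts, a small asyncio-only throttle is one option. This is a sketch under that assumption; `Throttle` and `demo` are hypothetical names, and the `asyncio.sleep(0)` in `demo` marks where a real scraper would call `session.get(...)`.

```python
import asyncio
import time


class Throttle:
    """Let callers through at most once every `min_interval` seconds."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = asyncio.Lock()
        self._next_allowed = 0.0

    async def __aenter__(self):
        # The lock serializes callers so each one waits for its own slot.
        async with self._lock:
            wait = self._next_allowed - time.monotonic()
            if wait > 0:
                await asyncio.sleep(wait)
            self._next_allowed = time.monotonic() + self.min_interval
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return None  # nothing to release; the slot was consumed on entry


async def demo(n=4, min_interval=0.05):
    """Start n fake requests through the throttle and return their start times."""
    # Create the throttle inside the running loop so its lock binds to it.
    throttle = Throttle(min_interval)
    starts = []

    async def polite_fetch():
        async with throttle:
            starts.append(time.monotonic())
            await asyncio.sleep(0)  # placeholder for the real request

    await asyncio.gather(*(polite_fetch() for _ in range(n)))
    return starts


if __name__ == "__main__":
    starts = asyncio.run(demo())
    print([round(b - a, 3) for a, b in zip(starts, starts[1:])])
```

The throttle only delays the start of each request and does not hold the lock while the request runs, so it combines with the connector limits rather than replacing them.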
Next Steps
- Learn to store scraped data in CSV and JSON files
- Combine aiohttp with a database pipeline for production scrapers