Scraping with aiohttp - Python Scraping

Use aiohttp for high-performance async web scraping in Python. Learn session management, connection pooling, and concurrent page fetching.

aiohttp is a mature async HTTP client and server library for Python. It excels at making large numbers of concurrent requests with fine-grained control over connection pooling and session management.

Installation

pip install aiohttp beautifulsoup4

Basic aiohttp Scraper

import asyncio
import aiohttp
from bs4 import BeautifulSoup


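async def fetch_page(session, url):
    """Fetch one page and return its quotes as a list of dicts."""
    async with session.get(url) as response:
        # Raise on 4xx/5xx so gather() reports the failure
        # instead of silently parsing an error page
        response.raise_for_status()
        html = await response.text()

    # The body is fully read, so parse after releasing the connection
    soup = BeautifulSoup(html, "html.parser")
    quotes = []
    for quote in soup.select("div.quote"):
        quotes.append({
            "text": quote.select_one("span.text").get_text(),
            "author": quote.select_one("small.author").get_text(),
        })
    return quotes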


async def main():
    urls = [
        f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)
    ]

    connector = aiohttp.TCPConnector(limit=10)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch_page(session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    all_quotes = []
    for result in results:
        if isinstance(result, list):
            all_quotes.extend(result)
        else:
            print(f"Error: {result}")

    print(f"Scraped {len(all_quotes)} quotes")


asyncio.run(main())
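The scraper above tells successes from failures with isinstance(result, list), which works because fetch_page always returns a list. A more general way to sort the output of gather(..., return_exceptions=True) is to test for exception objects instead. The sketch below shows that idea with stand-in coroutines and no network access; split_results, ok, and boom are hypothetical names used only for illustration.

```python
import asyncio


def split_results(results):
    """Separate return values from exceptions in gather() output.

    Hypothetical helper for illustration; not part of asyncio or aiohttp.
    """
    successes, failures = [], []
    for result in results:
        # With return_exceptions=True, failures arrive as exception objects
        if isinstance(result, BaseException):
            failures.append(result)
        else:
            successes.append(result)
    return successes, failures


async def demo():
    # Stand-ins for fetch_page(): one succeeds, one raises
    async def ok(value):
        return [value]

    async def boom():
        raise ValueError("simulated fetch failure")

    results = await asyncio.gather(ok(1), boom(), ok(2), return_exceptions=True)
    return split_results(results)


if __name__ == "__main__":
    pages, errors = asyncio.run(demo())
    print(f"{len(pages)} pages ok, {len(errors)} failed")
```

Checking for BaseException rather than for a specific return type keeps the loop correct even if a fetch function later returns something other than a list.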

Connection Pooling and Limits

aiohttp gives you direct control over TCP connections:

import asyncio
import aiohttp


async def main():
    # Limit total connections and per-host connections
    connector = aiohttp.TCPConnector(
        limit=20,          # Total concurrent connections
        limit_per_host=5,  # Max concurrent connections per host
        ttl_dns_cache=300, # Cache DNS lookups for 5 minutes
    )

    # Set default timeout for all requests
    timeout = aiohttp.ClientTimeout(total=30, connect=10)

    async with aiohttp.ClientSession(
        connector=connector,
        timeout=timeout,
        headers={"User-Agent": "ScrapingCentral Bot/1.0"},
    ) as session:
        # All requests share this session's pool, timeout, and headers
        async with session.get("https://quotes.toscrape.com/") as response:
            print(response.status)


asyncio.run(main())
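One way to picture how limit and limit_per_host interact is as two nested gates: a request needs a free slot for its host and a free slot in the global pool before it can proceed. The sketch below models that with plain asyncio semaphores and simulated requests. It is a conceptual model only, not how aiohttp implements its connector, and TwoLevelLimiter and fake_request are hypothetical names.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


class TwoLevelLimiter:
    """Conceptual model of a global cap plus a per-host cap (hypothetical class)."""

    def __init__(self, limit=20, limit_per_host=5):
        self._total = asyncio.Semaphore(limit)
        self._limit_per_host = limit_per_host
        self._per_host = {}

    def _host_semaphore(self, url):
        host = urlsplit(url).netloc
        if host not in self._per_host:
            self._per_host[host] = asyncio.Semaphore(self._limit_per_host)
        return self._per_host[host]

    async def run(self, url, work):
        # Take the per-host slot first so tasks queued behind a busy host
        # do not occupy global slots while they wait
        async with self._host_semaphore(url):
            async with self._total:
                return await work(url)


async def demo():
    limiter = TwoLevelLimiter(limit=4, limit_per_host=2)
    in_flight = defaultdict(int)
    peak = defaultdict(int)

    async def fake_request(url):
        # Simulated request: track concurrency instead of touching the network
        host = urlsplit(url).netloc
        for key in (host, "total"):
            in_flight[key] += 1
            peak[key] = max(peak[key], in_flight[key])
        await asyncio.sleep(0.01)
        for key in (host, "total"):
            in_flight[key] -= 1
        return url

    hosts = ("a.example", "b.example", "c.example")
    urls = [f"https://{host}/page/{i}" for host in hosts for i in range(5)]
    await asyncio.gather(*(limiter.run(url, fake_request) for url in urls))
    return dict(peak)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, three hosts with a per-host cap of 2 could reach 6 simultaneous requests, but the global cap of 4 holds the total below that.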

Handling Errors Gracefully

import asyncio
import aiohttp


async def safe_fetch(session, url, retries=3):
    """Return the page body, or None if every attempt fails."""
    for attempt in range(retries):
        try:
            async with session.get(url) as response:
                if response.status == 200:
                    return await response.text()
                elif response.status == 429:
                    # Rate limited: back off exponentially (1s, 2s, 4s, ...)
                    wait = 2 ** attempt
                    print(f"Rate limited on {url}, waiting {wait}s")
                    await asyncio.sleep(wait)
                else:
                    # Any other status is treated as permanent; do not retry
                    print(f"Status {response.status} for {url}")
                    return None
        except (aiohttp.ClientError, asyncio.TimeoutError) as e:
            # A total-timeout expiry raises asyncio.TimeoutError,
            # which is not a subclass of ClientError
            print(f"Attempt {attempt + 1} failed: {e!r}")
            await asyncio.sleep(1)
    return None
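The fixed 2 ** attempt wait in safe_fetch can be refined in two common ways: honor the server's Retry-After header when it gives a delay in seconds, and add random jitter so that many concurrent tasks do not all retry at the same instant. The helper below is a minimal sketch of that calculation. backoff_delay and its parameters are hypothetical names, and the HTTP-date form of Retry-After is deliberately left unhandled.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before retrying after 0-based `attempt`.

    Hypothetical helper for illustration; not part of aiohttp.
    """
    if retry_after is not None:
        try:
            # Retry-After may carry a delay in seconds; clamp it to [0, cap]
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            # HTTP-date values are not parsed in this sketch;
            # fall through to exponential backoff
            pass
    # Exponential ceiling: base, 2*base, 4*base, ... limited by cap
    ceiling = min(cap, base * 2 ** attempt)
    # "Full jitter": choose a random delay between 0 and the ceiling
    return rng.uniform(0, ceiling)


if __name__ == "__main__":
    for attempt in range(4):
        print(attempt, round(backoff_delay(attempt), 2))
```

Inside safe_fetch, the 429 branch could then compute its wait as backoff_delay(attempt, response.headers.get("Retry-After")).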

aiohttp vs HTTPX

Feature	aiohttp	HTTPX
Maturity	Older, battle-tested	Newer, modern API
Sync mode	No (async only)	Yes (both)
HTTP/2	No	Yes (opt-in via the httpx[http2] extra)
Connection control	Very granular	Good
Server capability	Yes	No

Tips

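Use a single ClientSession for all requests; creating a new session per request is expensive because each session carries its own connection pool.
The TCPConnector(limit_per_host=N) parameter caps concurrent connections to each host, which makes it a good first tool for polite scraping.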
For scraping sites that require JavaScript rendering or CAPTCHA solving, route aiohttp requests through ScrapingAnt.
Always close sessions properly using async with context managers.
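Note that limit_per_host caps how many connections are open at once, not how often requests start. If a site also needs a minimum gap between requests, a small throttle can be awaited before each session.get() call. The sketch below uses only asyncio and simulated requests. MinIntervalThrottle and fake_request are hypothetical names, and the 0.05 second interval is chosen only to keep the demo fast.

```python
import asyncio
import time


class MinIntervalThrottle:
    """Allow one request start per `interval` seconds (hypothetical helper)."""

    def __init__(self, interval):
        self._interval = interval
        self._lock = asyncio.Lock()
        self._next_start = 0.0

    async def wait(self):
        # The lock serializes waiters so each one gets its own time slot
        async with self._lock:
            now = time.monotonic()
            delay = self._next_start - now
            if delay > 0:
                await asyncio.sleep(delay)
            # Schedule the next slot relative to this one to avoid drift
            self._next_start = max(now, self._next_start) + self._interval


async def demo():
    throttle = MinIntervalThrottle(interval=0.05)
    starts = []

    async def fake_request(i):
        # In a real scraper, session.get(url) would follow this wait
        await throttle.wait()
        starts.append(time.monotonic())
        await asyncio.sleep(0.01)

    await asyncio.gather(*(fake_request(i) for i in range(4)))
    return starts


if __name__ == "__main__":
    starts = asyncio.run(demo())
    print([round(t - starts[0], 2) for t in starts])
```

For per-host pacing, one throttle instance per hostname could be kept in a dict, combined with limit_per_host on the connector.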

Next Steps

Learn to store scraped data in CSV and JSON files
Combine aiohttp with a database pipeline for production scrapers