
Scraping with aiohttp

Use aiohttp for high-performance async web scraping in Python. Learn session management, connection pooling, and concurrent page fetching.

Python Scraping · #8 · Intermediate · 3 min read

aiohttp is a mature async HTTP client and server library for Python. It excels at making large numbers of concurrent requests with fine-grained control over connection pooling and session management.

Installation

pip install aiohttp beautifulsoup4

Basic aiohttp Scraper

import asyncio
import aiohttp
from bs4 import BeautifulSoup


async def fetch_page(session, url):
    async with session.get(url) as response:
        # Raise on 4xx/5xx so an error page is not parsed as an empty result
        response.raise_for_status()
        html = await response.text()
        soup = BeautifulSoup(html, "html.parser")
        quotes = []
        for quote in soup.select("div.quote"):
            quotes.append({
                "text": quote.select_one("span.text").get_text(),
                "author": quote.select_one("small.author").get_text(),
            })
        return quotes


async def main():
    urls = [
        f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)
    ]

    connector = aiohttp.TCPConnector(limit=10)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch_page(session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    all_quotes = []
    for result in results:
        if isinstance(result, list):
            all_quotes.extend(result)
        else:
            print(f"Error: {result}")

    print(f"Scraped {len(all_quotes)} quotes")


asyncio.run(main())
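
The scraper above prints errors without saying which URL failed. Because asyncio.gather returns results in the same order as its inputs, each result can be zipped back to its URL. The following is a minimal sketch of that idea: fake_fetch is a hypothetical stand-in for fetch_page so the example runs without network access, and split_results is an illustrative helper name, not part of aiohttp.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for fetch_page(session, url); no network involved.
    await asyncio.sleep(0)
    if url.endswith("/bad/"):
        raise ValueError(f"could not fetch {url}")
    return [{"text": "example quote", "author": "example author"}]


def split_results(urls, results):
    """Pair each URL with its outcome; gather() preserves input order."""
    quotes, failures = [], {}
    for url, result in zip(urls, results):
        if isinstance(result, BaseException):
            failures[url] = result
        else:
            quotes.extend(result)
    return quotes, failures


async def main():
    urls = ["https://example.com/page/1/", "https://example.com/bad/"]
    results = await asyncio.gather(
        *(fake_fetch(url) for url in urls), return_exceptions=True
    )
    return split_results(urls, results)


if __name__ == "__main__":
    quotes, failures = asyncio.run(main())
    for url, error in failures.items():
        print(f"Failed {url}: {error!r}")
    print(f"Scraped {len(quotes)} quotes")
```

In a real scraper, fetch_page(session, url) would take the place of fake_fetch, and the failures dict gives you a list of URLs to retry or log.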

Connection Pooling and Limits

aiohttp gives you direct control over TCP connections:

import asyncio
import aiohttp


async def main():
    # Limit total connections and per-host connections
    connector = aiohttp.TCPConnector(
        limit=20,          # Total concurrent connections
        limit_per_host=5,  # Max connections per domain
        ttl_dns_cache=300, # Cache DNS lookups for 5 minutes
    )

    # Set default timeout for all requests
    timeout = aiohttp.ClientTimeout(total=30, connect=10)

    async with aiohttp.ClientSession(
        connector=connector,
        timeout=timeout,
        headers={"User-Agent": "ScrapingCentral Bot/1.0"},
    ) as session:
        # All requests made through this session share the pool and defaults
        pass


asyncio.run(main())

Handling Errors Gracefully

import asyncio
import aiohttp


async def safe_fetch(session, url, retries=3):
    for attempt in range(retries):
        try:
            async with session.get(url) as response:
                if response.status == 200:
                    return await response.text()
                elif response.status == 429:
                    wait = 2 ** attempt
                    print(f"Rate limited on {url}, waiting {wait}s")
                    await asyncio.sleep(wait)
                else:
                    print(f"Status {response.status} for {url}")
                    return None
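        except (aiohttp.ClientError, asyncio.TimeoutError) as e:
            # Request timeouts can surface as asyncio.TimeoutError, which is not a ClientError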
            print(f"Attempt {attempt + 1} failed: {e}")
            await asyncio.sleep(1)
    return None
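
The 429 branch above waits 2 ** attempt seconds, which grows without bound and ignores any hint from the server. As a rough sketch, the wait could be computed by a small helper that caps the delay and prefers a Retry-After header when it holds a whole number of seconds. The name backoff_delay and its base and cap defaults are assumptions for illustration, not part of aiohttp.

```python
def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retrying, for a 0-based attempt number.

    Prefers the server's Retry-After value when it is a whole number of
    seconds; otherwise falls back to capped exponential backoff.
    """
    if retry_after is not None:
        try:
            seconds = float(int(retry_after))
            return max(0.0, min(seconds, cap))
        except ValueError:
            pass  # e.g. an HTTP-date; this sketch ignores that form
    return min(base * 2 ** attempt, cap)


if __name__ == "__main__":
    for attempt in range(4):
        print(attempt, backoff_delay(attempt))
    print("with header:", backoff_delay(0, retry_after="5"))
```

Inside safe_fetch, the wait could then be set with backoff_delay(attempt, response.headers.get("Retry-After")) in place of 2 ** attempt.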

aiohttp vs HTTPX

Feature             aiohttp               HTTPX
Maturity            Older, battle-tested  Newer, modern API
Sync mode           No (async only)       Yes (both)
HTTP/2              No                    Yes
Connection control  Very granular         Good
Server capability   Yes                   No

Tips

  • Use a single ClientSession for all requests, because creating sessions is expensive.
  • Set limit_per_host=N on the TCPConnector to cap simultaneous connections to any one domain; it is your best tool for polite scraping.
  • For scraping sites that require JavaScript rendering or CAPTCHA solving, route aiohttp requests through ScrapingAnt.
  • Always close sessions properly using async with context managers.
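
Note that limit_per_host caps open connections, not the request rate: a reused keep-alive connection can still send requests back to back. One way to bound both is a semaphore plus a short pause per request. The sketch below assumes a hypothetical helper named polite_gather and uses a fake fetch coroutine so it runs without network access; the max_concurrent and delay values are placeholders to tune per site.

```python
import asyncio


async def polite_gather(fetch, urls, max_concurrent=5, delay=0.5):
    """Run fetch(url) for every URL, at most max_concurrent at a time.

    Each worker sleeps `delay` seconds after its request before freeing
    its slot, so roughly max_concurrent requests start per delay period.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(url):
        async with semaphore:
            try:
                return await fetch(url)
            finally:
                await asyncio.sleep(delay)

    return await asyncio.gather(
        *(worker(url) for url in urls), return_exceptions=True
    )


async def demo():
    active = 0
    peak = 0

    async def fake_fetch(url):
        # Hypothetical stand-in for a real request; tracks concurrency.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return url.upper()

    urls = [f"page-{i}" for i in range(10)]
    results = await polite_gather(fake_fetch, urls, max_concurrent=3, delay=0.01)
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(demo())
    print(f"{len(results)} results, peak concurrency {peak}")
```

With a real session, the fetch argument could be something like lambda url: safe_fetch(session, url), reusing the retry helper from the error-handling section.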

Next Steps

  • Learn to store scraped data in CSV and JSON files
  • Combine aiohttp with a database pipeline for production scrapers