Python: asyncio, httpx, aiohttp for High Throughput
The async toolkit for Python scraping at scale. When to reach for asyncio over Scrapy, and how to write a clean async scraper without footguns.
What you’ll learn
- Write a concurrent scraper with asyncio + httpx.
- Tell httpx and aiohttp apart, and know where Scrapy's async model fits.
- Avoid the three classic asyncio scraping pitfalls.
asyncio + an async HTTP client is the leanest path to fast Python scraping. Two libraries dominate: httpx and aiohttp. Both work. Choose based on the rest of your project.
When to use asyncio over Scrapy
| Use asyncio when... | Use Scrapy when... |
|---|---|
| Hammering APIs (mostly JSON) | Mixed HTML + dedup + pipelines |
| You control the request shape (custom auth, headers) | Standard crawl patterns |
| Lean script, no Twisted reactor | Multiple spiders, framework leverage |
| Integrating into an existing async app (FastAPI) | Standalone scraping project |
For "fetch 10,000 URLs and parse JSON, save to a database," asyncio is leaner. For "crawl a catalogue with pagination, dedup, and rate-limit," Scrapy wins.
httpx: modern, synchronous or async
import asyncio
import httpx

async def fetch(client, url):
    r = await client.get(url)
    return r.json()

async def main(urls):
    async with httpx.AsyncClient(
        timeout=10.0,
        limits=httpx.Limits(max_connections=50, max_keepalive_connections=20),
        headers={"User-Agent": "MyScraper/1.0"},
    ) as client:
        tasks = [fetch(client, u) for u in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

if __name__ == "__main__":
    urls = [f"https://practice.scrapingcentral.com/api/products?page={i}" for i in range(1, 11)]
    results = asyncio.run(main(urls))
    for r in results:
        if isinstance(r, Exception):
            print("err:", r)
        else:
            print(len(r), "items")
Key patterns:
- httpx.AsyncClient with limits and timeouts.
- async with ensures connections close cleanly.
- asyncio.gather runs N coroutines concurrently.
- return_exceptions=True collects failures without aborting the whole batch.
httpx supports HTTP/2 once the optional extra is installed (pip install "httpx[http2]", then pass http2=True). Multiplexing helps when many requests share one connection; it's irrelevant for a single short request.
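Enabling it looks like this:

# Requires the optional extra: pip install "httpx[http2]"
client = httpx.AsyncClient(http2=True)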
aiohttp: older, async-only, lighter
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as r:
        return await r.json()

async def main(urls):
    conn = aiohttp.TCPConnector(limit=50, limit_per_host=10)
    async with aiohttp.ClientSession(connector=conn, timeout=aiohttp.ClientTimeout(total=10)) as session:
        return await asyncio.gather(*[fetch(session, u) for u in urls], return_exceptions=True)
Functionally similar. aiohttp is older, used in many existing async codebases, and has slightly lower overhead. httpx has the friendlier API and a sync mode too.
Picking between them
- New project, mix sync and async in the same code: httpx.
- Existing aiohttp codebase: stay with aiohttp.
- HTTP/2 needed: httpx (aiohttp does not support HTTP/2).
- Lowest possible overhead: aiohttp.
Both are fine. Most projects could use either.
The three classic asyncio pitfalls
1. Spawning unbounded tasks
# BAD, 1 million tasks at once, OS gives up
tasks = [fetch(u) for u in urls] # urls is 1 million long
await asyncio.gather(*tasks)
Concurrency must be bounded. Use a semaphore (covered in §4.22) or asyncio.gather over chunks.
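A minimal chunked-gather sketch; CHUNK and fetch_all are illustrative names:

# Bound concurrency by gathering over fixed-size chunks.
CHUNK = 100  # illustrative chunk size

async def fetch_all(urls):
    results = []
    for i in range(0, len(urls), CHUNK):
        batch = urls[i:i + CHUNK]
        results += await asyncio.gather(*(fetch(u) for u in batch), return_exceptions=True)
    return results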
2. Mixing sync I/O into async code
# BAD, blocks the entire event loop
async def fetch_and_save(url):
    r = await client.get(url)
    with open("out.txt", "a") as f:  # sync I/O, blocks event loop
        f.write(r.text)
Use aiofiles for async file I/O, or shove writes into a queue consumed by a sync thread. The event loop is one thread; one blocking call halts everything.
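A sketch of the aiofiles route, assuming pip install aiofiles:

import aiofiles

async def fetch_and_save(client, url):
    r = await client.get(url)
    async with aiofiles.open("out.txt", "a") as f:
        await f.write(r.text)  # non-blocking: aiofiles delegates to a thread pool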
3. Forgetting await
r = client.get(url) # returns a coroutine, doesn't fetch
print(r.status_code) # AttributeError
Easy to do under pressure. Type checkers like mypy catch it; so does running the code, since Python logs a "coroutine ... was never awaited" RuntimeWarning.
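The fix is the missing keyword:

r = await client.get(url)  # awaited: r is a Response, not a coroutine
print(r.status_code)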
Adding politeness: semaphores and rate limits
sem = asyncio.Semaphore(10)  # max 10 concurrent

async def fetch(client, url):
    async with sem:
        await asyncio.sleep(0.1)  # 100 ms courtesy delay per request
        return await client.get(url)
The semaphore limits parallelism; the sleep adds a per-request delay. For more sophisticated rate limiting (token bucket), use aiolimiter or write your own, covered in §4.22.
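A sketch with aiolimiter, assuming pip install aiolimiter:

from aiolimiter import AsyncLimiter

limiter = AsyncLimiter(10, 1)  # at most 10 acquisitions per 1-second window

async def fetch(client, url):
    async with limiter:  # waits until the rate limiter has capacity
        return await client.get(url)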
HTML parsing in async
The HTML parsers (BeautifulSoup, lxml) are sync. That's fine: parsing is CPU-bound and usually fast. Just call them inside the async function:
from bs4 import BeautifulSoup

async def fetch_product(client, url):
    r = await client.get(url)
    soup = BeautifulSoup(r.text, "lxml")  # sync, but fast, fine
    return {
        "title": soup.select_one("h1").text,
        "price": soup.select_one(".price").text,
    }
For huge HTML where parsing takes meaningful time, push it into asyncio.to_thread():
soup = await asyncio.to_thread(BeautifulSoup, r.text, "lxml")
That offloads parsing to a thread pool, freeing the event loop.
Database writes: async or batched
Postgres via asyncpg is async-native and fast. For other DBs, a queue + sync worker pattern is fine:
queue = asyncio.Queue()

async def fetcher(url):
    data = await fetch(url)
    await queue.put(data)

async def writer():
    batch = []
    while True:
        item = await queue.get()
        batch.append(item)
        if len(batch) >= 100:
            # db_insert_batch is your sync bulk-insert function
            await asyncio.to_thread(db_insert_batch, batch)
            batch = []
Producers fetch; the consumer batches inserts. This is a common, scalable shape.
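For the Postgres path, a minimal asyncpg sketch; the DSN, table, and columns here are illustrative:

import asyncpg

async def insert_batch(rows):
    # rows: iterable of (title, price) tuples -- hypothetical schema
    conn = await asyncpg.connect("postgresql://user:pass@localhost/scrapes")
    try:
        await conn.executemany(
            "INSERT INTO products (title, price) VALUES ($1, $2)", rows
        )
    finally:
        await conn.close()

In a long-running scraper you'd hold a connection pool (asyncpg.create_pool) instead of connecting per batch.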
Hands-on lab
Against /api/products on Catalog108:
- Use httpx.AsyncClient to fetch 50 pages concurrently.
- Limit concurrency to 8 with asyncio.Semaphore.
- Parse JSON, count items per page.
- Time the run. Compare to a sync loop hitting the same 50 pages.
The async version should be ~5–10x faster wall-clock for typical API latency. That's where asyncio earns its keep.
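A sketch of the lab shape, reusing the sandbox URL from the first example; names are illustrative:

import asyncio
import time
import httpx

BASE = "https://practice.scrapingcentral.com/api/products"
sem = asyncio.Semaphore(8)  # lab constraint: at most 8 in flight

async def fetch_page(client, page):
    async with sem:
        r = await client.get(BASE, params={"page": page})
        return len(r.json())  # items on this page

async def main():
    async with httpx.AsyncClient(timeout=10.0) as client:
        return await asyncio.gather(*(fetch_page(client, p) for p in range(1, 51)))

start = time.perf_counter()
counts = asyncio.run(main())
print(f"{sum(counts)} items across 50 pages in {time.perf_counter() - start:.2f}s")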