Python: asyncio, httpx, aiohttp for High Throughput
The async toolkit for Python scraping at scale. When to reach for asyncio over Scrapy, and how to write a clean async scraper without footguns.
What you’ll learn
- Write a concurrent scraper with asyncio + httpx.
- Tell httpx and aiohttp apart, and know where Scrapy's async model fits.
- Avoid the three classic asyncio scraping pitfalls.
asyncio + an async HTTP client is the leanest path to fast Python scraping. Two libraries dominate: httpx and aiohttp. Both work. Choose based on the rest of your project.
When to use asyncio over Scrapy
| Use asyncio when... | Use Scrapy when... |
|---|---|
| Hammering APIs (mostly JSON) | Mixed HTML + dedup + pipelines |
| You control the request shape (custom auth, headers) | Standard crawl patterns |
| Lean script, no Twisted reactor | Multiple spiders, framework leverage |
| Integrating into an existing async app (FastAPI) | Standalone scraping project |
For "fetch 10,000 URLs and parse JSON, save to a database," asyncio is leaner. For "crawl a catalogue with pagination, dedup, and rate-limit," Scrapy wins.
httpx: modern, synchronous or async
import asyncio
import httpx

async def fetch(client, url):
    r = await client.get(url)
    return r.json()

async def main(urls):
    async with httpx.AsyncClient(
        timeout=10.0,
        limits=httpx.Limits(max_connections=50, max_keepalive_connections=20),
        headers={"User-Agent": "MyScraper/1.0"},
    ) as client:
        tasks = [fetch(client, u) for u in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

if __name__ == "__main__":
    urls = [f"https://practice.scrapingcentral.com/api/products?page={i}" for i in range(1, 11)]
    results = asyncio.run(main(urls))
    for r in results:
        if isinstance(r, Exception):
            print("err:", r)
        else:
            print(len(r), "items")
Key patterns:
- httpx.AsyncClient with limits and timeouts.
- async with ensures connections close cleanly.
- asyncio.gather runs N coroutines concurrently.
- return_exceptions=True collects failures without aborting the whole batch.
httpx supports HTTP/2 once the optional extra is installed (pip install "httpx[http2]", then pass http2=True). Multiplexing helps when many requests share one connection; it's irrelevant for a single short request.
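Enabling it looks like this:

# Requires the optional extra: pip install "httpx[http2]"
client = httpx.AsyncClient(http2=True)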
aiohttp: older, async-only, lighter
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as r:
        return await r.json()

async def main(urls):
    conn = aiohttp.TCPConnector(limit=50, limit_per_host=10)
    async with aiohttp.ClientSession(connector=conn, timeout=aiohttp.ClientTimeout(total=10)) as session:
        return await asyncio.gather(*[fetch(session, u) for u in urls], return_exceptions=True)
Functionally similar. aiohttp is older, used in many existing async codebases, and has slightly lower overhead. httpx has the friendlier API and a sync mode too.
Picking between them
- New project, mix sync and async in the same code: httpx.
- Existing aiohttp codebase: stay with aiohttp.
- HTTP/2 needed: httpx (aiohttp does not support HTTP/2).
- Lowest possible overhead: aiohttp.
Both are fine. Most projects could use either.
The three classic asyncio pitfalls
1. Spawning unbounded tasks
# BAD, 1 million tasks at once, OS gives up
tasks = [fetch(u) for u in urls] # urls is 1 million long
await asyncio.gather(*tasks)
Concurrency must be bounded. Use a semaphore (covered in §4.22) or asyncio.gather over chunks.
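A minimal chunked-gather sketch; CHUNK and fetch_all are illustrative names:

# Bound concurrency by gathering over fixed-size chunks.
CHUNK = 100  # illustrative chunk size

async def fetch_all(urls):
    results = []
    for i in range(0, len(urls), CHUNK):
        batch = urls[i:i + CHUNK]
        results += await asyncio.gather(*(fetch(u) for u in batch), return_exceptions=True)
    return results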
2. Mixing sync I/O into async code
# BAD, blocks the entire event loop
async def fetch_and_save(url):
    r = await client.get(url)
    with open("out.txt", "a") as f:  # sync I/O, blocks event loop
        f.write(r.text)
Use aiofiles for async file I/O, or shove writes into a queue consumed by a sync thread. The event loop is one thread; one blocking call halts everything.
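A sketch of the aiofiles route, assuming pip install aiofiles:

import aiofiles

async def fetch_and_save(client, url):
    r = await client.get(url)
    async with aiofiles.open("out.txt", "a") as f:
        await f.write(r.text)  # non-blocking: aiofiles delegates to a thread pool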
3. Forgetting await
r = client.get(url) # returns a coroutine, doesn't fetch
print(r.status_code) # AttributeError
Easy to do under pressure. Type checkers like mypy catch it; so does running the code, since Python logs a "coroutine ... was never awaited" RuntimeWarning.
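The fix is the missing keyword:

r = await client.get(url)  # awaited: r is a Response, not a coroutine
print(r.status_code)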
Adding politeness: semaphores and rate limits
sem = asyncio.Semaphore(10)  # max 10 concurrent

async def fetch(client, url):
    async with sem:
        await asyncio.sleep(0.1)  # 100 ms courtesy delay per request
        return await client.get(url)
The semaphore limits parallelism; the sleep adds a per-request delay. For more sophisticated rate limiting (token bucket), use aiolimiter or write your own, covered in §4.22.
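A sketch with aiolimiter, assuming pip install aiolimiter:

from aiolimiter import AsyncLimiter

limiter = AsyncLimiter(10, 1)  # at most 10 acquisitions per 1-second window

async def fetch(client, url):
    async with limiter:  # waits until the rate limiter has capacity
        return await client.get(url)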
HTML parsing in async
The HTML parsers (BeautifulSoup, lxml) are sync. That's fine: parsing is CPU-bound and usually fast. Just call them inside the async function:
from bs4 import BeautifulSoup

async def fetch_product(client, url):
    r = await client.get(url)
    soup = BeautifulSoup(r.text, "lxml")  # sync, but fast, fine
    return {
        "title": soup.select_one("h1").text,
        "price": soup.select_one(".price").text,
    }
For huge HTML where parsing takes meaningful time, push it into asyncio.to_thread():
soup = await asyncio.to_thread(BeautifulSoup, r.text, "lxml")
That offloads parsing to a thread pool, freeing the event loop.
Database writes: async or batched
Postgres via asyncpg is async-native and fast. For other DBs, a queue + sync worker pattern is fine:
queue = asyncio.Queue()

async def fetcher(url):
    data = await fetch(url)
    await queue.put(data)

async def writer():
    batch = []
    while True:
        item = await queue.get()
        batch.append(item)
        if len(batch) >= 100:
            # db_insert_batch is your sync bulk-insert function
            await asyncio.to_thread(db_insert_batch, batch)
            batch = []
Producers fetch; the consumer batches inserts. This is a common, scalable shape.
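For the Postgres path, a minimal asyncpg sketch; the DSN, table, and columns here are illustrative:

import asyncpg

async def insert_batch(rows):
    # rows: iterable of (title, price) tuples -- hypothetical schema
    conn = await asyncpg.connect("postgresql://user:pass@localhost/scrapes")
    try:
        await conn.executemany(
            "INSERT INTO products (title, price) VALUES ($1, $2)", rows
        )
    finally:
        await conn.close()

In a long-running scraper you'd hold a connection pool (asyncpg.create_pool) instead of connecting per batch.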
Hands-on lab
Against /api/products on Catalog108:
- Use httpx.AsyncClient to fetch 50 pages concurrently.
- Limit concurrency to 8 with asyncio.Semaphore.
- Parse JSON, count items per page.
- Time the run. Compare to a sync loop hitting the same 50 pages.
The async version should be ~5–10x faster wall-clock for typical API latency. That's where asyncio earns its keep.
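A sketch of the lab shape, reusing the sandbox URL from the first example; names are illustrative:

import asyncio
import time
import httpx

BASE = "https://practice.scrapingcentral.com/api/products"
sem = asyncio.Semaphore(8)  # lab constraint: at most 8 in flight

async def fetch_page(client, page):
    async with sem:
        r = await client.get(BASE, params={"page": page})
        return len(r.json())  # items on this page

async def main():
    async with httpx.AsyncClient(timeout=10.0) as client:
        return await asyncio.gather(*(fetch_page(client, p) for p in range(1, 51)))

start = time.perf_counter()
counts = asyncio.run(main())
print(f"{sum(counts)} items across 50 pages in {time.perf_counter() - start:.2f}s")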