Async Scraping with HTTPX and asyncio

Speed up your scrapers with async Python. Use HTTPX and asyncio to fetch many pages concurrently instead of one at a time.

Python Scraping · #7 · Intermediate · 2 min read

Synchronous scrapers wait for each request to finish before sending the next one. Async scraping keeps many requests in flight at once, so the time spent waiting on the network overlaps instead of adding up.

Why Async?

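A synchronous scraper that takes 1 second per page needs 100 seconds for 100 pages. An async scraper with 10 concurrent requests finishes in about 10 seconds (100 pages / 10 at a time × 1 second each), assuming the server responds just as quickly under the extra load.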
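
The arithmetic above can be checked without touching the network. The following is a minimal sketch that swaps the real HTTP request for asyncio.sleep; the names fake_fetch, PAGE_DELAY, NUM_PAGES and CONCURRENCY are illustrative assumptions, and the delay is shortened so the demo finishes quickly.

```python
import asyncio
import time

PAGE_DELAY = 0.05   # stand-in for the 1-second page load, shortened for the demo
NUM_PAGES = 20      # assumed page count for the demo
CONCURRENCY = 10    # pages fetched at the same time


async def fake_fetch(page):
    # Simulated network wait; no real HTTP request is made.
    await asyncio.sleep(PAGE_DELAY)
    return page


async def run_sequential():
    start = time.perf_counter()
    for page in range(NUM_PAGES):
        await fake_fetch(page)  # each wait finishes before the next starts
    return time.perf_counter() - start


async def run_concurrent():
    start = time.perf_counter()
    # Work through the pages in batches of CONCURRENCY; waits within a batch overlap.
    for i in range(0, NUM_PAGES, CONCURRENCY):
        batch = range(i, min(i + CONCURRENCY, NUM_PAGES))
        await asyncio.gather(*(fake_fetch(p) for p in batch))
    return time.perf_counter() - start


def compare():
    sequential = asyncio.run(run_sequential())
    concurrent = asyncio.run(run_concurrent())
    return sequential, concurrent


if __name__ == "__main__":
    seq, conc = compare()
    print(f"sequential: {seq:.2f}s, concurrent: {conc:.2f}s")
```

With these numbers the sequential run waits roughly 20 × 0.05 seconds while the concurrent run waits roughly 2 × 0.05 seconds. Real servers add variance, so treat the ratio as an upper bound on the speedup rather than a guarantee.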

Install HTTPX

pip install httpx beautifulsoup4

Basic Async Scraper

import asyncio
import httpx
from bs4 import BeautifulSoup


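async def scrape_page(client, url):
    # Fetch one page; other tasks run while this request is waiting.
    response = await client.get(url)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.text, "html.parser")
    quotes = []
    # Each quote on the page sits in its own div.quote block.
    for quote in soup.select("div.quote"):
        quotes.append({
            "text": quote.select_one("span.text").get_text(),
            "author": quote.select_one("small.author").get_text(),
        })
    return quotes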


async def main():
    urls = [
        f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)
    ]

    async with httpx.AsyncClient() as client:
        tasks = [scrape_page(client, url) for url in urls]
        results = await asyncio.gather(*tasks)

    all_quotes = [q for page in results for q in page]
    print(f"Scraped {len(all_quotes)} quotes from {len(urls)} pages")


asyncio.run(main())

Controlling Concurrency with Semaphores

Sending too many requests at once can overwhelm the server or get you blocked. Use a semaphore to limit concurrent requests.

import asyncio
import httpx
from bs4 import BeautifulSoup

MAX_CONCURRENT = 5


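async def scrape_page(client, url, semaphore):
    # Hold a semaphore slot only while the request is in flight.
    async with semaphore:
        response = await client.get(url)
    response.raise_for_status()  # surface 4xx/5xx responses as exceptions
    soup = BeautifulSoup(response.text, "html.parser")
    title_tag = soup.select_one("title")
    # Guard against pages that have no <title> element.
    title = title_tag.get_text() if title_tag else ""
    return {"url": url, "title": title}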


async def main():
    urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)]
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async with httpx.AsyncClient(timeout=30.0) as client:
        tasks = [scrape_page(client, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    for result in results:
        if isinstance(result, Exception):
            print(f"Error: {result}")
        else:
            print(result["title"])


asyncio.run(main())
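
To see that a semaphore really caps the number of requests in flight, the sketch below records the peak concurrency while 20 simulated requests run. It is an illustration only: fake_request, measure_peak and the stats dictionary are hypothetical names, and asyncio.sleep stands in for the HTTP call.

```python
import asyncio

MAX_CONCURRENT = 5


async def fake_request(semaphore, stats):
    async with semaphore:
        # Count how many coroutines are inside the semaphore right now.
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # simulated network wait, no real request
        stats["in_flight"] -= 1


async def measure_peak(num_tasks=20):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    stats = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(fake_request(semaphore, stats) for _ in range(num_tasks)))
    return stats["peak"]


if __name__ == "__main__":
    peak = asyncio.run(measure_peak())
    print(f"peak concurrent requests: {peak}")
```

The reported peak should never exceed MAX_CONCURRENT, no matter how many tasks are passed to asyncio.gather().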

HTTPX vs Requests

Feature             Requests    HTTPX
Sync support        Yes         Yes
Async support       No          Yes
HTTP/2              No          Yes (opt-in via httpx[http2])
API compatibility   n/a         Very similar to Requests

Tips

  • Pass return_exceptions=True to asyncio.gather() so a failed request comes back as an exception object in the results instead of raising and discarding the pages that succeeded.
  • Set explicit timeouts with httpx.AsyncClient(timeout=30.0).
  • Pair async scraping with ScraperAPI to handle proxy rotation and CAPTCHAs across your concurrent requests.
  • HTTPX supports HTTP/2. Install the extra with pip install "httpx[http2]", then enable it with httpx.AsyncClient(http2=True).
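
The first tip is easier to see in isolation. Here is a minimal sketch of how return_exceptions=True lets you separate successes from failures; fake_request and fetch_all are hypothetical helpers, the example.com URLs are placeholders, and no real HTTP request is made.

```python
import asyncio


async def fake_request(url):
    # Hypothetical stand-in for client.get(); fails on purpose for one URL.
    await asyncio.sleep(0.01)
    if url.endswith("/bad"):
        raise RuntimeError(f"request failed: {url}")
    return f"ok: {url}"


async def fetch_all(urls):
    # Exceptions are returned in place, in the same order as the inputs.
    results = await asyncio.gather(
        *(fake_request(u) for u in urls), return_exceptions=True
    )
    successes = [r for r in results if not isinstance(r, Exception)]
    failures = [r for r in results if isinstance(r, Exception)]
    return successes, failures


if __name__ == "__main__":
    urls = [
        "https://example.com/a",
        "https://example.com/bad",
        "https://example.com/c",
    ]
    ok, failed = asyncio.run(fetch_all(urls))
    print(f"{len(ok)} succeeded, {len(failed)} failed")
```

Because results keep the input order, you can zip them with the original URL list to find out which pages need another attempt.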

Next Steps

  • Explore aiohttp as an alternative async HTTP library
  • Add retry logic for failed async requests
  • Store results in a database using async database drivers
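
As a starting point for the retry item above, here is one possible sketch of a retry wrapper with exponential backoff. The helper name with_retries and its parameters are assumptions made for illustration, not part of HTTPX or asyncio.

```python
import asyncio


async def with_retries(make_request, max_attempts=3, base_delay=0.5):
    """Await make_request() up to max_attempts times, backing off between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_request()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

Inside a scraper you would call it as await with_retries(lambda: client.get(url)). Catching Exception is deliberately broad for the sketch; in practice you would likely narrow it to errors worth retrying, such as httpx.TransportError.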