Async Scraping with HTTPX and asyncio - Python Scraping

Speed up your scrapers with async Python. Use HTTPX and asyncio to make concurrent HTTP requests and scrape pages in parallel.

Synchronous scrapers wait for each request to finish before sending the next one. Async scraping lets you send many requests concurrently, dramatically reducing total scrape time.

Why Async?

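A synchronous scraper that takes 1 second per page needs 100 seconds for 100 pages. An async scraper that keeps 10 requests in flight handles the same pages in roughly 10 rounds of 1 second each, so it finishes in about 10 seconds, assuming the server responds just as quickly under concurrent load.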
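
The arithmetic above can be checked without touching a real site. The sketch below is a simulation, not a scraper: fake_fetch is a hypothetical stand-in that sleeps instead of sending a request, and the per-page delay is shortened to 0.01 seconds so the comparison runs quickly.

```python
import asyncio
import time


async def fake_fetch(delay):
    # Hypothetical stand-in for an HTTP request: just waits `delay` seconds.
    await asyncio.sleep(delay)


async def run_batch(pages, concurrency, delay):
    # Allow at most `concurrency` simulated requests in flight at once.
    semaphore = asyncio.Semaphore(concurrency)

    async def limited():
        async with semaphore:
            await fake_fetch(delay)

    start = time.perf_counter()
    await asyncio.gather(*(limited() for _ in range(pages)))
    return time.perf_counter() - start


def simulate(pages=100, concurrency=10, delay=0.01):
    # Returns the elapsed wall-clock time in seconds.
    return asyncio.run(run_batch(pages, concurrency, delay))


if __name__ == "__main__":
    sequential = simulate(concurrency=1)
    concurrent = simulate(concurrency=10)
    print(f"1 at a time: {sequential:.2f}s, 10 at a time: {concurrent:.2f}s")
```

The ratio between the two timings should land near the concurrency level. Real requests add connection setup, parsing, and server-side variation, so actual speedups are usually smaller.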

Install HTTPX

pip install httpx beautifulsoup4

Basic Async Scraper

import asyncio
import httpx
from bs4 import BeautifulSoup


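async def scrape_page(client, url):
    response = await client.get(url)
    # Raise httpx.HTTPStatusError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    quotes = []
    # Each quote on the page sits in its own <div class="quote">.
    for quote in soup.select("div.quote"):
        quotes.append({
            "text": quote.select_one("span.text").get_text(),
            "author": quote.select_one("small.author").get_text(),
        })
    return quotes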


async def main():
    urls = [
        f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)
    ]

    async with httpx.AsyncClient() as client:
        tasks = [scrape_page(client, url) for url in urls]
        results = await asyncio.gather(*tasks)

    all_quotes = [q for page in results for q in page]
    print(f"Scraped {len(all_quotes)} quotes from {len(urls)} pages")


asyncio.run(main())

Controlling Concurrency with Semaphores

Sending too many requests at once can overwhelm the server or get you blocked. Use a semaphore to limit concurrent requests.

import asyncio
import httpx
from bs4 import BeautifulSoup

MAX_CONCURRENT = 5


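async def scrape_page(client, url, semaphore):
    # Hold the semaphore only while the request is in flight.
    async with semaphore:
        response = await client.get(url)
    # Turn 4xx/5xx responses into exceptions that gather() can report.
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title_tag = soup.select_one("title")
    # Fall back to an empty string if the page has no <title>.
    title = title_tag.get_text(strip=True) if title_tag else ""
    return {"url": url, "title": title}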


async def main():
    urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)]
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async with httpx.AsyncClient(timeout=30.0) as client:
        tasks = [scrape_page(client, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    for result in results:
        if isinstance(result, Exception):
            print(f"Error: {result}")
        else:
            print(result["title"])


asyncio.run(main())
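
To see that the semaphore really caps concurrency, the following sketch counts how many simulated requests overlap. It makes no network calls: fake_request and the stats dictionary are hypothetical helpers for illustration, with asyncio.sleep standing in for client.get().

```python
import asyncio

MAX_CONCURRENT = 5


async def fake_request(semaphore, stats):
    # Hypothetical stand-in for client.get(); tracks how many calls overlap.
    async with semaphore:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # simulated network wait
        stats["in_flight"] -= 1


async def measure_peak(total=20):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    stats = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(fake_request(semaphore, stats) for _ in range(total)))
    return stats["peak"]


if __name__ == "__main__":
    print("Peak concurrent requests:", asyncio.run(measure_peak()))
```

All 20 tasks start right away, but the reported peak never exceeds MAX_CONCURRENT because the remaining tasks wait at the async with line until a slot frees up.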

HTTPX vs Requests

Feature	Requests	HTTPX
Sync support	Yes	Yes
Async support	No	Yes
HTTP/2	No	Yes
API compatibility	N/A	Very similar to Requests

Tips

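Use return_exceptions=True in asyncio.gather() so a failed request is returned as an exception object in the results list. Without it, the first exception propagates out of gather() and the results of the other requests are lost, even though those requests keep running.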
Set explicit timeouts, for example httpx.AsyncClient(timeout=30.0). HTTPX defaults to a 5 second timeout, which can be too short for slow pages.
Pair async scraping with ScraperAPI to handle proxy rotation and CAPTCHAs across your concurrent requests.
HTTPX supports HTTP/2. Enable it with httpx.AsyncClient(http2=True) after installing the optional dependency with pip install "httpx[http2]".
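
The effect of return_exceptions from the first tip is easiest to see side by side. This is a minimal sketch using only asyncio: fetch is a hypothetical stand-in for a request, hard-coded so that page 3 always fails.

```python
import asyncio


async def fetch(page):
    # Hypothetical stand-in for a request; page 3 always fails.
    await asyncio.sleep(0.01)
    if page == 3:
        raise RuntimeError(f"page {page} failed")
    return f"page {page} ok"


async def collect(return_exceptions):
    tasks = [fetch(page) for page in range(1, 6)]
    return await asyncio.gather(*tasks, return_exceptions=return_exceptions)


async def main():
    # With return_exceptions=True the failure comes back as a list item.
    results = await collect(return_exceptions=True)
    ok = [r for r in results if not isinstance(r, Exception)]
    failed = [r for r in results if isinstance(r, Exception)]
    print(f"{len(ok)} succeeded, {len(failed)} failed")

    # Without it, the first exception propagates and no results are returned.
    try:
        await collect(return_exceptions=False)
    except RuntimeError as exc:
        print(f"gather raised: {exc}")


if __name__ == "__main__":
    asyncio.run(main())
```

In the first call the four successful pages remain usable. In the second call the caller only sees the RuntimeError, so the successful results have to be recovered some other way or fetched again.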

Next Steps

Explore aiohttp as an alternative async HTTP library
Add retry logic for failed async requests
Store results in a database using async database drivers
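
As a starting point for the retry item above, here is one possible shape for a retry helper with exponential backoff. It is a sketch, not a complete solution: retry_async, make_request, and the default values are hypothetical names chosen for this example, and the helper is written against a plain zero-argument coroutine function so it can be tried without a network connection.

```python
import asyncio
import random


async def retry_async(make_request, attempts=3, base_delay=0.5):
    """Await make_request() up to `attempts` times with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_request()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            # Wait base_delay, then 2x, 4x, ... plus a little random jitter.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


async def demo():
    calls = {"count": 0}

    async def flaky():
        # Hypothetical request that fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    result = await retry_async(flaky, attempts=3, base_delay=0.01)
    return result, calls["count"]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With HTTPX, make_request could be something like lambda: client.get(url). Catching every Exception is a simplification. In a real scraper it is usually better to retry only on errors that are likely to be transient, such as subclasses of httpx.HTTPError, and to let programming errors surface immediately.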