Async Scraping with HTTPX and asyncio
Speed up your scrapers with async Python. Use HTTPX and asyncio to make concurrent HTTP requests and scrape pages in parallel.
Python Scraping · #7 · Intermediate · 2 min read
Synchronous scrapers wait for each request to finish before sending the next one. Async scraping lets you send many requests concurrently, dramatically reducing total scrape time.
Why Async?
A synchronous scraper that takes 1 second per page needs about 100 seconds for 100 pages. An async scraper that keeps 10 requests in flight at a time can finish in roughly 10 seconds, provided the server responds just as quickly under concurrent load.
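The arithmetic behind those numbers can be sketched as a back-of-the-envelope estimate. The helper name `estimated_seconds` is hypothetical, and the model is idealized: it assumes every page takes the same time and that requests complete in full waves, which real servers rarely guarantee.

```python
import math


def estimated_seconds(pages: int, seconds_per_page: float, concurrency: int = 1) -> float:
    """Idealized total time when pages are fetched in waves of `concurrency` requests."""
    waves = math.ceil(pages / concurrency)
    return waves * seconds_per_page


print(estimated_seconds(100, 1.0))                  # sequential: 100.0
print(estimated_seconds(100, 1.0, concurrency=10))  # 10 at a time: 10.0
```

Real-world gains are usually smaller, since network latency varies and servers may throttle concurrent clients.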
Install HTTPX
pip install httpx beautifulsoup4
Basic Async Scraper
import asyncio

import httpx
from bs4 import BeautifulSoup


async def scrape_page(client, url):
    """Fetch one page and return its quotes as a list of dicts."""
    response = await client.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    quotes = []
    for quote in soup.select("div.quote"):
        quotes.append({
            "text": quote.select_one("span.text").get_text(),
            "author": quote.select_one("small.author").get_text(),
        })
    return quotes


async def main():
    urls = [
        f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)
    ]
    # One shared client reuses connections across all requests.
    async with httpx.AsyncClient() as client:
        tasks = [scrape_page(client, url) for url in urls]
        # gather() runs the coroutines concurrently and returns one result per URL.
        results = await asyncio.gather(*tasks)
    # Flatten the per-page lists into a single list of quotes.
    all_quotes = [q for page in results for q in page]
    print(f"Scraped {len(all_quotes)} quotes from {len(urls)} pages")


asyncio.run(main())
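One property of `asyncio.gather()` that the scraper above relies on is that results come back in the same order as the awaitables passed in, regardless of which request finishes first. The following is a minimal sketch of that behaviour; `fake_fetch` is a hypothetical stand-in that sleeps instead of making a network request.

```python
import asyncio


async def fake_fetch(page: int) -> str:
    # Stand-in for a real request: later pages "respond" sooner.
    await asyncio.sleep(0.03 - page * 0.01)
    return f"page-{page}"


async def fetch_in_order() -> list:
    pages = [1, 2, 3]
    return await asyncio.gather(*(fake_fetch(p) for p in pages))


results = asyncio.run(fetch_in_order())
print(results)  # ['page-1', 'page-2', 'page-3']
```

This means you can safely pair results with their URLs using `zip(urls, results)`.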
Controlling Concurrency with Semaphores
Sending too many requests at once can overwhelm the server or get you blocked. Use a semaphore to limit concurrent requests.
import asyncio

import httpx
from bs4 import BeautifulSoup

MAX_CONCURRENT = 5


async def scrape_page(client, url, semaphore):
    # At most MAX_CONCURRENT coroutines can be inside this block at once.
    async with semaphore:
        response = await client.get(url)
    # Turn 4xx/5xx responses into exceptions; gather() returns them below.
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.select_one("title").get_text()
    return {"url": url, "title": title}


async def main():
    urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)]
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    async with httpx.AsyncClient(timeout=30.0) as client:
        tasks = [scrape_page(client, url, semaphore) for url in urls]
        # Failures come back as exception objects instead of being raised.
        results = await asyncio.gather(*tasks, return_exceptions=True)
    for result in results:
        if isinstance(result, Exception):
            print(f"Error: {result}")
        else:
            print(result["title"])


asyncio.run(main())
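To check that a semaphore really caps the number of in-flight requests, you can track the peak concurrency in a small simulation. This is a sketch using only the standard library: `tracked_job` and `measure_peak` are hypothetical names, and `asyncio.sleep` stands in for `client.get(url)`.

```python
import asyncio

MAX_CONCURRENT = 5


async def tracked_job(semaphore, state):
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for the HTTP request
        state["active"] -= 1


async def measure_peak(jobs: int = 20) -> int:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(tracked_job(semaphore, state) for _ in range(jobs)))
    return state["peak"]


peak = asyncio.run(measure_peak())
print(f"Peak concurrent jobs: {peak} (limit {MAX_CONCURRENT})")
```

Even though 20 jobs are scheduled at once, the peak never exceeds the semaphore's limit.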
HTTPX vs Requests
| Feature | Requests | HTTPX |
|---|---|---|
| Sync support | Yes | Yes |
| Async support | No | Yes |
| HTTP/2 | No | Yes (optional, via the `httpx[http2]` extra) |
| API compatibility | N/A | Very similar to Requests |
Tips
- Pass `return_exceptions=True` to `asyncio.gather()` so one failed request does not raise out of the whole batch and discard the other results.
- Set explicit timeouts with `httpx.AsyncClient(timeout=30.0)`.
- Pair async scraping with ScraperAPI to handle proxy rotation and CAPTCHAs across your concurrent requests.
- HTTPX supports HTTP/2. Enable it with `httpx.AsyncClient(http2=True)` after installing the extra with `pip install "httpx[http2]"`.
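As an illustration of the first tip, the sketch below separates successes from failures after a `gather()` call with `return_exceptions=True`. The `flaky` coroutine is a hypothetical stand-in for a scrape function, and it fails on page 3 purely for demonstration.

```python
import asyncio


async def flaky(page: int) -> dict:
    # Hypothetical stand-in for scrape_page: page 3 always fails.
    await asyncio.sleep(0)
    if page == 3:
        raise ValueError(f"page {page} failed")
    return {"page": page}


async def collect(pages):
    results = await asyncio.gather(
        *(flaky(p) for p in pages), return_exceptions=True
    )
    successes, failures = [], []
    # Results keep input order, so zip() pairs each one with its page.
    for page, result in zip(pages, results):
        if isinstance(result, Exception):
            failures.append((page, result))
        else:
            successes.append(result)
    return successes, failures


successes, failures = asyncio.run(collect([1, 2, 3, 4]))
print(len(successes), "ok; failed pages:", [page for page, _ in failures])
```

Keeping the failed pages in a separate list makes it easy to log them or feed them into a retry pass.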
Next Steps
- Explore aiohttp as an alternative async HTTP library
- Add retry logic for failed async requests
- Store results in a database using async database drivers
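For the retry item above, one possible starting point is a small wrapper with exponential backoff. This is a sketch, not a definitive implementation: `with_retries`, its parameters, and `unreliable_request` are all hypothetical names, and the broad `except Exception` is an assumption you would likely narrow (for example to `httpx.TransportError`) in a real scraper.

```python
import asyncio


async def with_retries(make_request, attempts: int = 3, base_delay: float = 0.5):
    """Await make_request() up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_request()
        except Exception:  # assumption: narrow this to the errors worth retrying
            if attempt == attempts:
                raise
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


# Usage with a hypothetical request that fails twice before succeeding.
calls = {"count": 0}


async def unreliable_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = asyncio.run(with_retries(unreliable_request, base_delay=0.01))
print(result, "after", calls["count"], "attempts")
```

With HTTPX, `make_request` could be a small function that calls `client.get(url)` on a shared `AsyncClient`.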