API Scraping with HTTPX (Async)
Speed up API scraping with HTTPX and Python's asyncio. Learn to make concurrent requests, handle errors, and throttle for politeness.
When you need to scrape hundreds or thousands of API endpoints, sequential requests are painfully slow. HTTPX supports async requests, letting you fetch many URLs concurrently.
Why HTTPX Over Requests?
| Feature | Requests | HTTPX |
|---|---|---|
| Sync support | Yes | Yes |
| Async support | No | Yes |
| HTTP/2 | No | Yes (optional, requires `httpx[http2]`) |
| Connection pooling | Via Session | Via `Client` / `AsyncClient` |
| Drop-in replacement | N/A | Mostly compatible API |
pip install httpx
Basic Async Scraping
import asyncio
import httpx


async def fetch_user(client, user_id):
    response = await client.get(
        f"https://jsonplaceholder.typicode.com/users/{user_id}",
        timeout=15,
    )
    response.raise_for_status()
    data = response.json()
    return {"id": data["id"], "name": data["name"], "email": data["email"]}


async def main():
    # One shared client reuses connections across all requests
    async with httpx.AsyncClient() as client:
        tasks = [fetch_user(client, uid) for uid in range(1, 11)]
        users = await asyncio.gather(*tasks)
    for user in users:
        print(f"{user['id']}: {user['name']} ({user['email']})")


asyncio.run(main())
This fetches all 10 users concurrently instead of one at a time.
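To see why concurrency helps without touching the network, the sketch below simulates the same pattern with `asyncio.sleep` standing in for the HTTP round trip. `fake_fetch`, `run_sequential`, `run_concurrent`, `timed`, and the 50 ms latency are hypothetical names and values chosen for illustration; they are not part of HTTPX.

```python
import asyncio
import time

SIMULATED_LATENCY = 0.05  # seconds; assumed stand-in for a network round trip


async def fake_fetch(user_id):
    """Hypothetical stand-in for fetch_user: wait, then return a record."""
    await asyncio.sleep(SIMULATED_LATENCY)
    return {"id": user_id}


async def run_sequential(ids):
    # Each await finishes before the next request starts
    return [await fake_fetch(i) for i in ids]


async def run_concurrent(ids):
    # All coroutines are scheduled together; gather preserves input order
    return await asyncio.gather(*(fake_fetch(i) for i in ids))


def timed(coro_fn, ids):
    """Run an async function to completion and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(ids))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    ids = range(1, 11)
    _, sequential = timed(run_sequential, ids)
    _, concurrent = timed(run_concurrent, ids)
    print(f"sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s")
```

With real requests the gap depends on server latency and connection limits, but the shape is the same: sequential time grows with the number of requests, while concurrent time stays close to the slowest single request.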
Throttled Concurrent Scraping
Unrestricted concurrency can overwhelm the server and trigger rate limits. Cap the number of in-flight requests with a semaphore:
import asyncio
import httpx


async def fetch_with_limit(client, url, semaphore):
    async with semaphore:
        response = await client.get(url, timeout=15)
        await asyncio.sleep(0.2)  # Polite delay before releasing the slot
    # Raise on 4xx/5xx so failed requests are counted as errors below
    response.raise_for_status()
    return response.json()


async def main():
    semaphore = asyncio.Semaphore(10)  # Max 10 concurrent requests
    urls = [
        f"https://jsonplaceholder.typicode.com/posts/{i}"
        for i in range(1, 101)
    ]
    async with httpx.AsyncClient() as client:
        tasks = [fetch_with_limit(client, url, semaphore) for url in urls]
        # return_exceptions=True keeps one failure from aborting the batch
        results = await asyncio.gather(*tasks, return_exceptions=True)

    # Separate successful responses from errors
    posts = [r for r in results if isinstance(r, dict)]
    errors = [r for r in results if isinstance(r, Exception)]
    print(f"Fetched {len(posts)} posts, {len(errors)} errors")


asyncio.run(main())
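If you want to confirm that a semaphore really caps concurrency, one option is to count how many tasks are inside the guarded section at once. This is a minimal sketch with simulated work instead of HTTP calls; `worker`, `measure_peak`, and the `state` dictionary are hypothetical helpers for illustration.

```python
import asyncio


async def worker(semaphore, state):
    async with semaphore:
        # Track how many tasks hold a slot right now, and the highest count seen
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        await asyncio.sleep(0.01)  # simulated request (assumption)
        state["in_flight"] -= 1


async def measure_peak(total_tasks, limit):
    semaphore = asyncio.Semaphore(limit)
    state = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(worker(semaphore, state) for _ in range(total_tasks)))
    return state["peak"]


if __name__ == "__main__":
    peak = asyncio.run(measure_peak(total_tasks=100, limit=10))
    print(f"peak concurrent tasks: {peak}")
```

The peak never exceeds the semaphore's limit, no matter how many tasks are queued behind it.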
Async with Retry Logic
import asyncio
import httpx


async def fetch_with_retry(client, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = await client.get(url, timeout=15)
            if response.status_code == 429:
                # Rate limited: back off exponentially (1s, 2s, 4s, ...)
                await asyncio.sleep(2 ** attempt)
                continue
            response.raise_for_status()
            return response.json()
        except httpx.HTTPError:
            # Re-raise on the final attempt, otherwise pause and retry
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(1)
    # Every attempt was rate limited; fail loudly instead of returning None
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")


async def main():
    urls = [f"https://jsonplaceholder.typicode.com/posts/{i}" for i in range(1, 51)]
    async with httpx.AsyncClient() as client:
        semaphore = asyncio.Semaphore(5)

        async def bounded_fetch(url):
            async with semaphore:
                return await fetch_with_retry(client, url)

        results = await asyncio.gather(*[bounded_fetch(u) for u in urls])
    print(f"Scraped {len(results)} posts")


asyncio.run(main())
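Some servers send a `Retry-After` header with a 429 response to say how long to wait. The helper below is one possible way to combine that hint with exponential backoff. `backoff_delay` and its `base` and `cap` parameters are hypothetical, and the sketch only handles the seconds form of the header, falling back to backoff when the value is an HTTP date.

```python
def backoff_delay(attempt, retry_after=None, base=1.0, cap=30.0):
    """Return seconds to wait before the next retry (attempt is 0-based).

    retry_after is the raw Retry-After header value, if the server sent one.
    """
    if retry_after is not None:
        try:
            # Seconds form, e.g. "5"; clamp to the range [0, cap]
            return max(0.0, min(float(retry_after), cap))
        except ValueError:
            pass  # HTTP-date form is not parsed here; use backoff instead
    # Exponential backoff: base, 2*base, 4*base, ... up to cap
    return min(base * 2 ** attempt, cap)


if __name__ == "__main__":
    for attempt in range(4):
        print(attempt, backoff_delay(attempt))
```

In a retry loop like `fetch_with_retry`, you could pass `response.headers.get("Retry-After")` as `retry_after` and sleep for the returned value instead of a fixed `2 ** attempt`.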
Performance Comparison
Rough estimates for 100 API calls with 200 ms server latency:
| Approach | Time |
|---|---|
| Sequential requests | ~20 seconds |
| Async HTTPX (10 concurrent) | ~2 seconds |
| Async HTTPX (50 concurrent) | ~0.5 seconds |
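The figures in the table follow from simple arithmetic: requests run in batches the size of the concurrency limit, and each batch takes about one round trip. The sketch below is an idealized model that ignores connection setup, rate limits, and any polite delays; `estimated_seconds` is a hypothetical helper, not a benchmark.

```python
import math


def estimated_seconds(total_requests, concurrency, latency_s):
    """Idealized wall time: number of batches times per-request latency."""
    batches = math.ceil(total_requests / concurrency)
    return batches * latency_s


if __name__ == "__main__":
    for concurrency in (1, 10, 50):
        seconds = estimated_seconds(100, concurrency, 0.2)
        print(f"{concurrency:>2} concurrent: ~{seconds:.1f}s")
```

For 50 concurrent requests the model gives 0.4 seconds; the table's ~0.5 seconds presumably leaves some room for overhead.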
For high-volume async scraping, pair HTTPX with ScraperAPI as a proxy to distribute requests across IPs and avoid detection.
Next Steps
- Scrape APIs that require cookies with session management
- Build a full async data pipeline
- Handle authentication in async workflows