API Scraping with HTTPX (Async)

Speed up API scraping with HTTPX and Python's asyncio. Learn to make concurrent requests, handle errors, and throttle for politeness.

When you need to scrape hundreds or thousands of API endpoints, sequential requests are painfully slow. HTTPX supports async requests, letting you fetch many URLs concurrently.

Why HTTPX Over Requests?

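Feature               Requests      HTTPX
Sync support          Yes           Yes
Async support         No            Yes
HTTP/2                No            Yes (optional extra)
Connection pooling    Via Session   Via Client
Drop-in replacement   -             Mostly compatible API

HTTP/2 is not enabled by default: install the extra with pip install "httpx[http2]" and pass http2=True when creating the client.

Install HTTPX with pip:

pip install httpx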

Basic Async Scraping

import asyncio
import httpx

async def fetch_user(client, user_id):
    response = await client.get(
        f"https://jsonplaceholder.typicode.com/users/{user_id}",
        timeout=15,
    )
    response.raise_for_status()
    data = response.json()
    return {"id": data["id"], "name": data["name"], "email": data["email"]}

async def main():
    async with httpx.AsyncClient() as client:
        tasks = [fetch_user(client, uid) for uid in range(1, 11)]
        users = await asyncio.gather(*tasks)

    for user in users:
        print(f"{user['id']}: {user['name']} ({user['email']})")

asyncio.run(main())

This fetches all 10 users concurrently instead of one at a time.
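
One detail worth knowing about asyncio.gather: results come back in the order the awaitables were passed in, not the order in which they finished. The sketch below illustrates this with simulated latency instead of real HTTP calls. The fake_fetch helper and its delays are made up for the illustration and are not part of HTTPX.

```python
import asyncio

async def fake_fetch(user_id, delay):
    # Hypothetical stand-in for client.get(); the sleep simulates latency.
    await asyncio.sleep(delay)
    return user_id

async def fetch_in_order():
    # Later IDs finish first because their simulated delay is shorter.
    delays = {1: 0.03, 2: 0.02, 3: 0.01}
    tasks = [fake_fetch(uid, delay) for uid, delay in delays.items()]
    # gather() returns results in the order the tasks were passed in.
    return await asyncio.gather(*tasks)

print(asyncio.run(fetch_in_order()))  # [1, 2, 3]
```

This is why the loop over users above can rely on the results lining up with range(1, 11).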

Throttled Concurrent Scraping

Unrestricted concurrency can trigger rate limits or overload the server. Use a semaphore to cap the number of requests in flight:

import asyncio
import httpx

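async def fetch_with_limit(client, url, semaphore):
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        response = await client.get(url, timeout=15)
        await asyncio.sleep(0.2)  # Polite delay before releasing the slot
        response.raise_for_status()  # Turn 4xx/5xx responses into exceptions
        return response.json()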

async def main():
    semaphore = asyncio.Semaphore(10)  # Max 10 concurrent requests
    urls = [
        f"https://jsonplaceholder.typicode.com/posts/{i}"
        for i in range(1, 101)
    ]

    async with httpx.AsyncClient() as client:
        tasks = [fetch_with_limit(client, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    # Filter out errors
    posts = [r for r in results if isinstance(r, dict)]
    errors = [r for r in results if isinstance(r, Exception)]
    print(f"Fetched {len(posts)} posts, {len(errors)} errors")

asyncio.run(main())
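
To see that the semaphore really does bound concurrency, the following sketch counts how many jobs are inside the guarded section at once. It uses only asyncio with a simulated request, so it runs without network access. The names tracked_job and measure_peak are illustrative, not part of any library.

```python
import asyncio

async def tracked_job(semaphore, stats):
    async with semaphore:
        # Count this job as in flight and record the highest value seen.
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # simulated request
        stats["in_flight"] -= 1

async def measure_peak(total_jobs, limit):
    semaphore = asyncio.Semaphore(limit)
    stats = {"in_flight": 0, "peak": 0}
    jobs = (tracked_job(semaphore, stats) for _ in range(total_jobs))
    await asyncio.gather(*jobs)
    return stats["peak"]

# 100 jobs are scheduled, but the peak never exceeds the limit of 10.
print(asyncio.run(measure_peak(total_jobs=100, limit=10)))
```

The same pattern applies to the HTTPX version above: all 100 tasks are created up front, but only as many as the semaphore allows are making requests at any moment.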

Async with Retry Logic

import asyncio
import httpx

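async def fetch_with_retry(client, url, max_retries=3):
    for attempt in range(max_retries):
        last_attempt = attempt == max_retries - 1
        try:
            response = await client.get(url, timeout=15)
            if response.status_code == 429 and not last_attempt:
                # Rate limited: back off exponentially (1s, 2s, 4s, ...)
                wait = 2 ** attempt
                await asyncio.sleep(wait)
                continue
            # On the last attempt a 429 raises here instead of
            # silently returning None after the loop ends.
            response.raise_for_status()
            return response.json()
        except httpx.HTTPError:
            if last_attempt:
                raise
            await asyncio.sleep(1)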

async def main():
    urls = [f"https://jsonplaceholder.typicode.com/posts/{i}" for i in range(1, 51)]

    async with httpx.AsyncClient() as client:
        semaphore = asyncio.Semaphore(5)

        async def bounded_fetch(url):
            async with semaphore:
                return await fetch_with_retry(client, url)

        results = await asyncio.gather(*[bounded_fetch(u) for u in urls])

    print(f"Scraped {len(results)} posts")

asyncio.run(main())
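
Some APIs send a Retry-After header with a 429 response, either as a number of seconds or as an HTTP date. A possible refinement, sketched below using only the standard library, is to honour that header and fall back to exponential backoff when it is missing or unparseable. The retry_after_seconds helper is hypothetical and not part of HTTPX, and whether a given API sends the header at all is an assumption to check.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value, attempt, now=None):
    """Return how many seconds to wait before the next retry."""
    fallback = float(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    if not header_value:
        return fallback
    value = header_value.strip()
    if value.isdigit():
        # Form 1: delay in seconds, e.g. "120"
        return float(value)
    try:
        # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return fallback
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means we can retry immediately.
    return max(0.0, (retry_at - now).total_seconds())

print(retry_after_seconds("120", attempt=0))  # 120.0
print(retry_after_seconds(None, attempt=2))   # 4.0
```

In the 429 branch of fetch_with_retry, the sleep duration could then come from retry_after_seconds(response.headers.get("Retry-After"), attempt) rather than a fixed 2 ** attempt.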

Performance Comparison

Rough estimates for 100 API calls at 200 ms server latency each, ignoring connection setup and any politeness delay:

Approach                      Time
Sequential requests           ~20 seconds
Async HTTPX (10 concurrent)   ~2 seconds
Async HTTPX (50 concurrent)   ~0.5 seconds
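
The figures above follow from simple arithmetic: with a fixed concurrency limit, requests complete in waves, and each wave takes roughly one round of latency. The sketch below is an idealised model under that assumption, not a benchmark, and estimate_seconds is a name invented for this illustration.

```python
import math

def estimate_seconds(total_calls, latency, concurrency=1):
    # Requests run in waves of `concurrency`; each wave costs one latency.
    waves = math.ceil(total_calls / concurrency)
    return waves * latency

for concurrency in (1, 10, 50):
    seconds = estimate_seconds(100, 0.2, concurrency)
    print(f"{concurrency} concurrent: ~{seconds:.1f} s")
```

The model gives 20 s, 2 s, and 0.4 s. The last value is a little under the table's ~0.5 seconds, which plausibly leaves room for connection setup and scheduling overhead that the model ignores.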

For high-volume async scraping, pair HTTPX with ScraperAPI as a proxy to distribute requests across IPs and avoid detection.

Next Steps

  • Scrape APIs that require cookies with session management
  • Build a full async data pipeline
  • Handle authentication in async workflows