API Scraping with HTTPX (Async)

Speed up API scraping with HTTPX and Python's asyncio. Learn to make concurrent requests, handle errors, and throttle for politeness.

When you need to scrape hundreds or thousands of API endpoints, sequential requests are painfully slow. HTTPX supports async requests, letting you fetch many URLs concurrently.

Why HTTPX Over Requests?

Feature	Requests	HTTPX
Sync support	Yes	Yes
Async support	No	Yes
HTTP/2	No	Yes (opt-in via `http2=True`; needs the `httpx[http2]` extra)
Connection pooling	Via `Session`	Via `Client` / `AsyncClient`
Drop-in replacement	N/A	Mostly compatible API

pip install httpx

Basic Async Scraping

import asyncio
import httpx

async def fetch_user(client, user_id):
    response = await client.get(
        f"https://jsonplaceholder.typicode.com/users/{user_id}",
        timeout=15,
    )
    response.raise_for_status()
    data = response.json()
    return {"id": data["id"], "name": data["name"], "email": data["email"]}

async def main():
    async with httpx.AsyncClient() as client:
        tasks = [fetch_user(client, uid) for uid in range(1, 11)]
        users = await asyncio.gather(*tasks)

    for user in users:
        print(f"{user['id']}: {user['name']} ({user['email']})")

asyncio.run(main())

This fetches all 10 users concurrently instead of one at a time.
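
One detail worth knowing: `asyncio.gather` returns results in the order the awaitables were passed in, not the order they finished, which is why `users` comes back sorted by ID. The sketch below illustrates this without any network access. The `fake_fetch` helper is hypothetical and stands in for `fetch_user`, with `asyncio.sleep` simulating server latency.

```python
import asyncio

async def fake_fetch(user_id, finished):
    # Hypothetical stand-in for fetch_user: lower IDs respond more slowly
    await asyncio.sleep(0.05 * (4 - user_id))
    finished.append(user_id)  # record the order in which requests complete
    return {"id": user_id}

async def demo_order():
    finished = []
    results = await asyncio.gather(
        *(fake_fetch(uid, finished) for uid in (1, 2, 3))
    )
    return [r["id"] for r in results], finished

result_ids, completion_order = asyncio.run(demo_order())
print(result_ids)        # matches the order the tasks were passed to gather
print(completion_order)  # order in which the simulated requests finished
```

Because the results line up with the inputs, you can safely `zip` the original IDs or URLs with the gathered results.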

Throttled Concurrent Scraping

Unrestricted concurrency can overwhelm the server and trigger rate limits. Cap the number of in-flight requests with a semaphore:

import asyncio
import httpx

async def fetch_with_limit(client, url, semaphore):
    async with semaphore:
        response = await client.get(url, timeout=15)
        await asyncio.sleep(0.2)  # Polite delay before releasing the slot
        response.raise_for_status()  # Surface 4xx/5xx so they count as errors
        return response.json()

async def main():
    semaphore = asyncio.Semaphore(10)  # Max 10 concurrent requests
    urls = [
        f"https://jsonplaceholder.typicode.com/posts/{i}"
        for i in range(1, 101)
    ]

    async with httpx.AsyncClient() as client:
        tasks = [fetch_with_limit(client, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    # Filter out errors
    posts = [r for r in results if isinstance(r, dict)]
    errors = [r for r in results if isinstance(r, Exception)]
    print(f"Fetched {len(posts)} posts, {len(errors)} errors")

asyncio.run(main())
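
To see that the semaphore really caps concurrency, you can count how many coroutines are inside the `async with` block at once. This is a minimal sketch with no network calls: `fake_request` and `measure_peak` are hypothetical helpers, and the 10 ms sleep is an assumed stand-in for response time.

```python
import asyncio

async def fake_request(semaphore, state):
    # Hypothetical stand-in for fetch_with_limit; no network involved
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulated response time
        state["active"] -= 1

async def measure_peak(total, limit):
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    await asyncio.gather(
        *(fake_request(semaphore, state) for _ in range(total))
    )
    return state["peak"]

peak = asyncio.run(measure_peak(total=100, limit=10))
print(f"Peak concurrent requests: {peak}")
```

All 100 tasks are created up front, but only `limit` of them hold a slot at any moment; the rest wait at `async with semaphore`.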

Async with Retry Logic

import asyncio
import httpx

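async def fetch_with_retry(client, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = await client.get(url, timeout=15)
            if response.status_code == 429:
                # Rate limited: back off exponentially (1s, 2s, 4s, ...)
                await asyncio.sleep(2 ** attempt)
                continue
            response.raise_for_status()
            return response.json()
        except httpx.HTTPError:
            # Out of attempts: let the caller see the error
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(1)
    # Every attempt was rate limited; fail loudly instead of returning None
    raise RuntimeError(f"Gave up after {max_retries} rate-limited attempts: {url}")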

async def main():
    urls = [f"https://jsonplaceholder.typicode.com/posts/{i}" for i in range(1, 51)]

    async with httpx.AsyncClient() as client:
        semaphore = asyncio.Semaphore(5)

        async def bounded_fetch(url):
            async with semaphore:
                return await fetch_with_retry(client, url)

        results = await asyncio.gather(*[bounded_fetch(u) for u in urls])

    print(f"Scraped {len(results)} posts")

asyncio.run(main())
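
The retry loop waits `2 ** attempt` seconds after a 429 response, that is 1, 2, then 4 seconds. Some APIs also send a `Retry-After` header telling the client how long to wait. One possible refinement is sketched below as a pure function. The name `backoff_delay`, the 30-second cap, and the choice to prefer the header over the exponential schedule are assumptions for illustration, not part of the code above.

```python
def backoff_delay(attempt, retry_after=None, cap=30.0):
    """Return the seconds to wait before the next try (attempt is 0-based)."""
    if retry_after is not None:
        try:
            # Retry-After given as a number of seconds, e.g. "5"
            return max(0.0, min(float(retry_after), cap))
        except ValueError:
            pass  # The HTTP-date form is not handled in this sketch
    # Fall back to exponential backoff: 1, 2, 4, 8, ... capped at `cap`
    return min(2.0 ** attempt, cap)

schedule = [backoff_delay(n) for n in range(4)]
print(schedule)
print(backoff_delay(0, retry_after="5"))
```

Inside `fetch_with_retry`, this could be called as `backoff_delay(attempt, response.headers.get("Retry-After"))` in place of the bare `2 ** attempt`.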

Performance Comparison

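Estimated times for 100 API calls at 200 ms server latency each (actual results vary with network conditions, connection setup, and rate limits):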

Approach	Time
Sequential `requests`	~20 seconds
Async HTTPX (10 concurrent)	~2 seconds
Async HTTPX (50 concurrent)	~0.5 seconds
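
These figures follow from a simple idealized model: with a concurrency limit of N, requests complete in roughly ceil(calls / N) batches, each taking one round trip. The sketch below encodes that arithmetic. The `estimated_seconds` helper is hypothetical, and the model deliberately ignores connection setup, client overhead, and any polite delay you add.

```python
import math

def estimated_seconds(calls, latency_ms, concurrency=1):
    # Idealized: every request takes exactly latency_ms and
    # requests run in full batches of `concurrency`
    batches = math.ceil(calls / concurrency)
    return batches * latency_ms / 1000

for limit in (1, 10, 50):
    print(limit, estimated_seconds(100, 200, concurrency=limit))
```

The model gives 20 s, 2 s, and 0.4 s for the three rows; the last is consistent with the table's ~0.5 seconds once some overhead is included.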

For high-volume async scraping, pair HTTPX with ScraperAPI as a proxy to distribute requests across IPs and avoid detection.

Next Steps

Scrape APIs that require cookies with session management
Build a full async data pipeline
Handle authentication in async workflows