Hybrid Scraping - Browser Login + HTTP Requests at Scale
Learn the hybrid scraping technique that uses browser automation for login and authentication, then switches to fast HTTP requests for data extraction.
The hybrid approach pairs browser automation, which can get through complex login flows, with plain HTTP requests, which are much faster for data extraction. It is a common pattern for scraping sites that require login.
The Concept
Browser (slow, expensive) → Login → Extract cookies/tokens
↓
HTTP Client (fast, cheap) → Use cookies → Scrape at scale
Browser automation is slow and resource-intensive. HTTP requests are fast and lightweight. The hybrid approach uses each where it excels.
Implementation
Step 1: Browser Login and Cookie Extraction
from playwright.sync_api import sync_playwright
import json


def get_session_cookies(login_url, username, password):
    """Use a browser to log in and extract session cookies."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()

        # Navigate to login page
        page.goto(login_url)

        # Fill in credentials
        page.fill('input[name="email"]', username)
        page.fill('input[name="password"]', password)
        page.click('button[type="submit"]')

        # Wait for login to complete
        page.wait_for_load_state("networkidle")

        # Extract all cookies
        cookies = context.cookies()

        # Also extract any auth tokens from localStorage
        tokens = page.evaluate("""
            () => ({
                access_token: localStorage.getItem('access_token'),
                refresh_token: localStorage.getItem('refresh_token')
            })
        """)

        browser.close()
        return cookies, tokens


cookies, tokens = get_session_cookies(
    "https://example.com/login",
    "user@example.com",
    "password123"
)
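Because the browser login is the slow part, it can help to cache its result between runs. The following is a minimal sketch, assuming a cookie-based session and the cookie dictionaries Playwright returns (each with an `expires` field). The helper names `save_session` and `load_session` and the JSON file layout are hypothetical, not part of Playwright.

```python
import json
import time
from pathlib import Path


def save_session(path, cookies, tokens):
    """Write the cookies and tokens from the browser login to a JSON file."""
    payload = {"saved_at": time.time(), "cookies": cookies, "tokens": tokens}
    Path(path).write_text(json.dumps(payload, indent=2))


def load_session(path, now=None):
    """Return (cookies, tokens) from the cache, or None if a fresh login is needed."""
    file = Path(path)
    if not file.exists():
        return None
    payload = json.loads(file.read_text())
    now = time.time() if now is None else now
    # Playwright reports `expires` as a Unix timestamp in seconds,
    # and -1 for session cookies that have no fixed expiry.
    live = [
        c for c in payload["cookies"]
        if c.get("expires", -1) == -1 or c["expires"] > now
    ]
    if not live:
        # Assumption: with no unexpired cookies left, the session is unusable.
        return None
    return live, payload.get("tokens")
```

A caller would try `load_session("session.json")` first and fall back to `get_session_cookies(...)` followed by `save_session(...)` only when it returns `None`. The file holds live credentials, so treat it like a password.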
Step 2: Transfer Session to HTTP Client
from curl_cffi import requests as cffi_requests


def create_http_session(cookies, tokens=None):
    """Create a fast HTTP session with the browser's auth cookies."""
    # Impersonate a Chrome TLS/HTTP fingerprint so requests resemble the browser that logged in
    session = cffi_requests.Session(impersonate="chrome136")

    # Transfer cookies
    for cookie in cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie["domain"],
            path=cookie["path"]
        )

    # Add auth header if token-based
    if tokens and tokens.get("access_token"):
        session.headers["Authorization"] = f"Bearer {tokens['access_token']}"

    return session


session = create_http_session(cookies, tokens)
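Called without arguments, `context.cookies()` returns every cookie in the browser context, which can include cookies set by third-party domains the login page loaded. One option is to filter the list down to the target host before transferring it. This is a sketch only: `cookie_matches_host` and `cookies_for_host` are hypothetical helpers, and the rule that a leading dot marks a cookie as valid for subdomains is an assumption about how Chromium reports cookie domains.

```python
def cookie_matches_host(cookie_domain, host):
    """Return True if a cookie with this domain would be sent to `host`."""
    domain = cookie_domain.lower()
    host = host.lower()
    if domain.startswith("."):
        # Domain cookie: applies to the bare domain and any subdomain.
        return host == domain[1:] or host.endswith(domain)
    # Host-only cookie (assumed when there is no leading dot): exact match.
    return host == domain


def cookies_for_host(cookies, host):
    """Keep only the browser cookies that apply to the host being scraped."""
    return [c for c in cookies if cookie_matches_host(c["domain"], host)]
```

For example, `create_http_session(cookies_for_host(cookies, "example.com"), tokens)` would carry over the site's own cookies and leave analytics cookies behind.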
Step 3: Scrape at Scale with HTTP
from concurrent.futures import ThreadPoolExecutor


def scrape_page(session, url):
    """Fast HTTP-based scraping with browser cookies."""
    response = session.get(url)
    if response.status_code == 200:
        # parse_data() is your own site-specific parser; it should return a list of records
        return parse_data(response.text)
    return None


# Scrape hundreds of pages quickly
urls = [f"https://example.com/data?page={i}" for i in range(1, 101)]
results = []

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(scrape_page, session, url) for url in urls]
    for future in futures:
        result = future.result()
        if result:
            results.extend(result)

print(f"Scraped {len(results)} records")
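In the loop above, `future.result()` re-raises any exception from the worker, so one timeout stops collection, and pages that return a non-200 status are dropped without a trace. A possible variation, sketched here with a hypothetical `scrape_all` helper, records the failed URLs so they can be retried later. The `fetch` argument is assumed to be any callable that takes a URL and returns a list of records or `None`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(fetch, urls, max_workers=5):
    """Run fetch(url) for every URL and return (records, failed_urls)."""
    records, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        # Handle each result as soon as its request finishes.
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                result = future.result()
            except Exception:
                # Network errors and parser bugs land here instead of
                # aborting the whole run.
                failed.append(url)
                continue
            if result:
                records.extend(result)
            else:
                # An empty or None result is treated as a failure to retry.
                failed.append(url)
    return records, failed
```

With the functions from this guide, the call would look like `records, failed = scrape_all(lambda url: scrape_page(session, url), urls)`. Results arrive in completion order, not in the order of `urls`.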
Handling Session Expiry
class HybridScraper:
    def __init__(self, login_url, username, password):
        self.login_url = login_url
        self.username = username
        self.password = password
        self.session = None
        self.refresh_session()

    def refresh_session(self):
        cookies, tokens = get_session_cookies(
            self.login_url, self.username, self.password
        )
        self.session = create_http_session(cookies, tokens)

    def scrape(self, url, retry=True):
        response = self.session.get(url)
        if response.status_code == 401 and retry:
            self.refresh_session()
            return self.scrape(url, retry=False)
        return response
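Not every site signals an expired session with a 401. Some return 403, and others redirect to the login page and answer 200. The sketch below is a heuristic for the class above, under those assumptions. `looks_logged_out` is a hypothetical helper, and treating 403 as expiry is a guess that can misfire on sites that use 403 for anti-bot blocks.

```python
from urllib.parse import urlsplit


def looks_logged_out(status_code, final_url, login_url, expired_statuses=(401, 403)):
    """Heuristic: does this response suggest the session has expired?"""
    if status_code in expired_statuses:
        return True
    # After redirects are followed, landing on the login page usually
    # means the server no longer accepts the session.
    final, login = urlsplit(final_url), urlsplit(login_url)
    same_host = final.netloc == login.netloc
    same_path = final.path.rstrip("/") == login.path.rstrip("/")
    return same_host and same_path
```

Inside `scrape`, the `response.status_code == 401` check could then become `looks_logged_out(response.status_code, response.url, self.login_url)`, assuming the HTTP client exposes the final URL as `response.url`.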
When to Use Hybrid Scraping
- Sites that require login but serve data as static HTML after authentication
- Dashboard scraping where hundreds of pages share the same session
- Sites where browser rendering is only needed for the initial auth flow
- Any scenario where you need speed at scale behind authentication
For sites that apply heavy anti-bot protection to every request (not just login), use ScraperAPI instead, as it handles both authentication and anti-bot measures per request.