Scraping APIs That Require Cookies
Learn how to handle cookie-based authentication and session management when scraping APIs that rely on browser cookies.
Many websites use cookies to track sessions, enforce authentication, and apply anti-bot measures. If you call their API without the right cookies, you typically get 401 or 403 errors, a redirect to the login page, or empty responses.
How Cookie-Based APIs Work
- You visit the website, and it sets initial cookies (session ID, CSRF token)
- You log in, and the server adds or replaces cookies that identify your authenticated session
- The browser then attaches these cookies to every API call automatically
- Your scraper must replicate this cookie flow
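The flow above can be reproduced on your own machine. The sketch below is a toy, not a real site: `DemoHandler`, the `session_id=demo123` cookie, and the `/api` path are made up for illustration. It uses only the standard library, so `http.cookiejar.CookieJar` plays the role a `requests.Session` plays in the rest of this guide.

```python
import threading
import urllib.error
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy site (hypothetical): "/" sets a session cookie, "/api" requires it."""

    def do_GET(self):
        if self.path == "/":
            self.send_response(200)
            self.send_header("Set-Cookie", "session_id=demo123; Path=/")
            self.end_headers()
            self.wfile.write(b"home")
        elif "session_id=demo123" in self.headers.get("Cookie", ""):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'{"ok": true}')
        else:
            self.send_response(403)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def status_of(opener, url):
    """Return the HTTP status code, including error statuses such as 403."""
    try:
        with opener.open(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"
no_proxy = urllib.request.ProxyHandler({})  # ignore system proxies for localhost

# Client without a cookie jar: the API call is rejected.
bare = urllib.request.build_opener(no_proxy)
status_without_cookies = status_of(bare, base + "/api")

# Client with a cookie jar: visit the homepage first, then call the API.
jar = CookieJar()
with_jar = urllib.request.build_opener(
    no_proxy, urllib.request.HTTPCookieProcessor(jar)
)
status_of(with_jar, base + "/")
status_with_cookies = status_of(with_jar, base + "/api")

server.shutdown()
server.server_close()
print(status_without_cookies, status_with_cookies)
```

The only difference between the two clients is the cookie jar: the second one stores the `Set-Cookie` value from the homepage and sends it back on the API call.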
Using Sessions to Manage Cookies
The requests.Session object automatically stores and sends cookies:
import requests
session = requests.Session()
# Step 1: Visit the homepage to get initial cookies
session.get("https://quotes.toscrape.com/", timeout=15)
print(f"Cookies after homepage: {dict(session.cookies)}")
# Step 2: Log in, session captures the auth cookies
login_response = session.post(
    "https://quotes.toscrape.com/login",
    data={"username": "admin", "password": "admin"},
    timeout=15,
)
print(f"Cookies after login: {dict(session.cookies)}")
# Step 3: Access protected pages with the session
response = session.get("https://quotes.toscrape.com/", timeout=15)
print(f"Logged in: {'Logout' in response.text}")
Extracting Cookies from Your Browser
When the login flow is hard to automate (2FA, CAPTCHA), log in manually in your browser and copy the resulting cookies into your scraper:
import requests
# Copy cookies from DevTools > Application > Cookies
cookies = {
    "session_id": "abc123def456",
    "auth_token": "eyJhbGciOi...",
    "_csrf": "x9y8z7w6",
}
session = requests.Session()
session.cookies.update(cookies)
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
})
response = session.get(
    "https://www.example.com/api/dashboard",
    timeout=15,
)
print(response.json())
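Copying cookies one by one gets tedious when a site sets many of them. An alternative is to copy the whole `Cookie` request header from the DevTools Network tab and split it in code. The helper below is a minimal sketch; `parse_cookie_header` is a hypothetical name, and the sample string reuses the placeholder values from above.

```python
def parse_cookie_header(header: str) -> dict[str, str]:
    """Turn 'a=1; b=2' (a copied Cookie request header) into a dict."""
    cookies = {}
    for pair in header.split(";"):
        # Split on the first "=" only, so values containing "=" stay intact.
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


raw = "session_id=abc123def456; _csrf=x9y8z7w6"
cookies = parse_cookie_header(raw)
print(cookies)
```

The resulting dict can be passed to `session.cookies.update(...)` exactly like the hand-written one.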
Using browser_cookie3 to Auto-Extract Cookies
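browser_cookie3 reads cookies directly from your local browser profile, so you do not have to copy them by hand. Install it first:
pip install browser-cookie3
Then load the cookies for the target domain into a session:
import browser_cookie3
import requests
# Grab cookies from Chrome for a specific domain
cookies = browser_cookie3.chrome(domain_name=".example.com")
session = requests.Session()
# Copy the cookies into the session's own jar rather than replacing the jar,
# so helpers such as dict(session.cookies) keep working
session.cookies.update(cookies)
response = session.get("https://www.example.com/api/profile", timeout=15)
print(response.json())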
Handling CSRF Tokens
Some APIs require a CSRF token from the HTML page to be sent with each request:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
# Fetch the page to get the CSRF token
page = session.get("https://www.example.com/login", timeout=15)
soup = BeautifulSoup(page.text, "html.parser")
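token_input = soup.find("input", {"name": "csrf_token"})
if token_input is None:
    # Fail with a clear message instead of a TypeError on None["value"]
    raise RuntimeError("csrf_token field not found on the login page")
csrf_token = token_input["value"]
# Include CSRF token in the login request
session.post(
    "https://www.example.com/login",
    data={
        "username": "user",
        "password": "pass",
        "csrf_token": csrf_token,
    },
    timeout=15,
)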
# Now API calls work with proper session cookies
data = session.get("https://www.example.com/api/orders", timeout=15)
print(data.json())
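Not every site keeps the token in a hidden form field named `csrf_token`; some expose it in a `<meta>` tag instead. The sketch below checks both places using only the standard library. The field name `csrf_token` comes from the example above, while the meta name `csrf-token` and the helper names are assumptions, so inspect the page source of your target site before relying on them.

```python
from html.parser import HTMLParser


class CsrfTokenFinder(HTMLParser):
    """Collect a CSRF token from a hidden <input> or a <meta> tag."""

    def __init__(self, field_names=("csrf_token",), meta_names=("csrf-token",)):
        super().__init__()
        self.field_names = field_names  # assumed form field names
        self.meta_names = meta_names    # assumed meta tag names
        self.token = None

    def handle_starttag(self, tag, attrs):
        if self.token is not None:
            return  # keep the first match
        attr = dict(attrs)
        if tag == "input" and attr.get("name") in self.field_names:
            self.token = attr.get("value")
        elif tag == "meta" and attr.get("name") in self.meta_names:
            self.token = attr.get("content")


def find_csrf_token(html: str):
    """Return the first CSRF token found in the HTML, or None."""
    finder = CsrfTokenFinder()
    finder.feed(html)
    return finder.token


form_page = '<form><input type="hidden" name="csrf_token" value="x9y8z7w6"></form>'
meta_page = '<head><meta name="csrf-token" content="tok-from-meta"></head>'
print(find_csrf_token(form_page), find_csrf_token(meta_page))
```

Returning `None` instead of raising lets the caller decide whether a missing token is an error or simply means the page does not use one.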
Cookie Troubleshooting
| Problem | Solution |
|---|---|
| 403 after login | Check if CSRF token or Referer header is missing |
| Cookies expire | Re-authenticate periodically |
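| HttpOnly cookies | HttpOnly only blocks JavaScript access in the browser; a Session object stores and sends these cookies like any other |
| SameSite cookies | SameSite is enforced by browsers, so a Session sends these cookies regardless; if requests still fail, set Referer and Origin headers that match the site's domain |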
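For the "Cookies expire" row, one option is to re-authenticate on demand rather than on a timer: retry once after logging in again whenever a request comes back with an auth failure. This is a minimal sketch, not a complete session manager. `fetch_with_reauth` is a hypothetical helper, and treating 401 and 403 as "session expired" is an assumption that will not hold for every site.

```python
from types import SimpleNamespace


def fetch_with_reauth(fetch, login, retry_statuses=(401, 403)):
    """Call fetch(); on an auth-failure status, log in again and retry once.

    fetch -- callable returning an object with a .status_code attribute
    login -- callable that refreshes the session's cookies
    """
    response = fetch()
    if response.status_code in retry_statuses:
        login()
        response = fetch()
    return response


# Stand-ins for real network calls, so the sketch runs offline.
calls = {"fetch": 0, "login": 0}


def fake_fetch():
    calls["fetch"] += 1
    # Simulate an expired session until login has been called once.
    return SimpleNamespace(status_code=403 if calls["login"] == 0 else 200)


def fake_login():
    calls["login"] += 1


result = fetch_with_reauth(fake_fetch, fake_login)
print(result.status_code, calls)
```

With a real `requests.Session`, `fetch` would wrap a `session.get(...)` call and `login` would wrap the `session.post(...)` login request, both sharing the same session so the refreshed cookies are reused.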
When dealing with complex cookie flows behind Cloudflare or similar protections, ScrapingAnt handles the full browser session including cookies, JavaScript execution, and CAPTCHA solving.
Next Steps
- Explore APIs with Postman for easier debugging
- Handle token-based auth alongside cookies
- Build a persistent session manager for long-running scrapers