Handling Cookies and Sessions
Master cookie management and persistent sessions in Python web scraping. Handle session cookies, cookie jars, and cross-request state.
Cookies are how websites remember you between requests. Understanding cookie management is essential for scraping sites that track sessions, remember preferences, or gate content behind interactions.
Why Cookies Matter for Scraping
Websites use cookies to:
- Maintain login sessions
- Track user preferences (language, region)
- Enforce rate limits per session
- Gate content behind consent banners
Without cookies, each of your requests looks like a brand-new visitor, which can trigger CAPTCHAs or block access.
Using requests.Session
A Session object stores the cookies a server sets and automatically sends them back on later requests made through the same session.
import requests
session = requests.Session()
# First request sets cookies
session.get("https://httpbin.org/cookies/set/session_id/abc123")
# Second request automatically sends those cookies
response = session.get("https://httpbin.org/cookies")
print(response.json())
# {"cookies": {"session_id": "abc123"}}
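To see the difference a session makes without relying on an external site, the sketch below starts a throwaway local server and compares stand-alone requests.get calls with a Session. The handler class CookieEchoHandler, its /set and /show paths, and the helper compare_with_and_without_session are all made up for this illustration; they are not part of requests.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class CookieEchoHandler(BaseHTTPRequestHandler):
    """Hypothetical demo server: /set issues a cookie, every path echoes the Cookie header."""

    def do_GET(self):
        body = (self.headers.get("Cookie") or "").encode()
        self.send_response(200)
        if self.path == "/set":
            self.send_header("Set-Cookie", "session_id=abc123; Path=/")
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def compare_with_and_without_session():
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), CookieEchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    try:
        # Stand-alone calls: the cookie from /set is thrown away.
        requests.get(f"{base}/set")
        without_session = requests.get(f"{base}/show").text

        # Session: the cookie from /set is stored and sent back.
        with requests.Session() as session:
            session.get(f"{base}/set")
            with_session = session.get(f"{base}/show").text
    finally:
        server.shutdown()
        server.server_close()
    return without_session, with_session


without_session, with_session = compare_with_and_without_session()
print(repr(without_session))  # ''
print(repr(with_session))     # 'session_id=abc123'
```

The with block also closes the session's pooled connections when you are done, which is a reasonable habit in longer scripts.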
Setting Custom Cookies
import requests
session = requests.Session()
# Set cookies manually
session.cookies.set("language", "en")
session.cookies.set("country", "US")
session.cookies.set("consent", "accepted")
response = session.get("https://httpbin.org/cookies")
print(response.json())
# {"cookies": {"language": "en", "country": "US", "consent": "accepted"}}
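Cookies can also be scoped to a domain and path when you set them. The sketch below passes domain and path to session.cookies.set and then uses session.prepare_request to show which Cookie header would be sent to each URL, without making any network call. The helper name cookie_header_for and the cookie names and hosts are assumptions chosen for the example.

```python
import requests

session = requests.Session()

# Scope each cookie to the site it belongs to.
session.cookies.set("language", "en", domain="example.com", path="/")
session.cookies.set("token", "xyz", domain="other.example.org", path="/")


def cookie_header_for(session, url):
    """Return the Cookie header the session would send to url (hypothetical helper)."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie")


print(cookie_header_for(session, "https://example.com/page"))    # language=en
print(cookie_header_for(session, "https://other.example.org/"))  # token=xyz
print(cookie_header_for(session, "https://unrelated.test/"))     # None
```

A cookie set without a domain is not tied to any host, so the session sends it to every site it contacts. Scoping cookies this way avoids leaking them to unrelated hosts.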
Persisting Cookies to Disk
Save cookies between script runs so you do not need to log in every time.
import requests
import json
import os
COOKIE_FILE = "session_cookies.json"
def load_session():
    session = requests.Session()
    if os.path.exists(COOKIE_FILE):
        with open(COOKIE_FILE) as f:
            cookies = json.load(f)
        session.cookies.update(cookies)
        print(f"Loaded {len(cookies)} cookies from disk.")
    return session

def save_session(session):
    # get_dict() keeps only name/value pairs; domain, path, and expiry are dropped
    cookies = session.cookies.get_dict()
    with open(COOKIE_FILE, "w") as f:
        json.dump(cookies, f)
    print(f"Saved {len(cookies)} cookies to disk.")
# Usage
session = load_session()
response = session.get("https://quotes.toscrape.com/")
save_session(session)
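The name/value approach is the simplest option, but it discards each cookie's domain, path, and expiry. If you need those to survive a restart, one possible variation is to store a small record per cookie. This is a sketch, not the only way to do it; the function names cookies_to_records, save_cookies, and load_cookies are invented for this example.

```python
import json
import tempfile
from pathlib import Path

import requests


def cookies_to_records(jar):
    """Turn a cookie jar into a JSON-friendly list of dicts."""
    return [
        {
            "name": c.name,
            "value": c.value,
            "domain": c.domain,
            "path": c.path,
            "secure": c.secure,
            "expires": c.expires,
        }
        for c in jar
    ]


def save_cookies(session, path):
    Path(path).write_text(json.dumps(cookies_to_records(session.cookies)))


def load_cookies(session, path):
    for record in json.loads(Path(path).read_text()):
        session.cookies.set(
            record["name"],
            record["value"],
            domain=record["domain"],
            path=record["path"],
            secure=record["secure"],
            expires=record["expires"],
        )
    return session


# Round trip through a temporary file; no network access needed.
original = requests.Session()
original.cookies.set("session_id", "abc123", domain="quotes.toscrape.com", path="/")

with tempfile.TemporaryDirectory() as tmp:
    cookie_path = Path(tmp) / "cookies.json"
    save_cookies(original, cookie_path)
    restored = load_cookies(requests.Session(), cookie_path)

restored_cookie = next(iter(restored.cookies))
print(restored_cookie.name, restored_cookie.domain, restored_cookie.path)
```

Treat the cookie file like a password: anyone who can read it can reuse your logged-in session.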
Handling Cookie Consent Banners
Some sites require accepting cookies before showing content.
import requests
from bs4 import BeautifulSoup
session = requests.Session()
# Fetch the page; it may show a consent banner
response = session.get("https://example.com")
# Simulate accepting cookies by posting to the site's consent endpoint
# (the URL and form fields here are placeholders; they differ per site)
session.post("https://example.com/consent", data={"accept": "all"})
# Now fetch the actual content with consent cookies set
response = session.get("https://example.com/data")
soup = BeautifulSoup(response.text, "html.parser")
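In practice the consent endpoint and its fields vary from site to site. When the banner is a plain HTML form, one way to find what to post is to read the form's action and inputs from the page itself. The sketch below does that on a made-up HTML snippet: the form id consent-form, the csrf_token field, and the helper extract_consent_form are all assumptions for illustration. Many real banners are driven by JavaScript and will not look like this.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Invented markup standing in for a page with a form-based consent banner.
SAMPLE_HTML = """
<form id="consent-form" action="/consent" method="post">
  <input type="hidden" name="csrf_token" value="t0k3n">
  <input type="hidden" name="accept" value="all">
  <button type="submit">Accept all</button>
</form>
"""


def extract_consent_form(html, page_url, form_id="consent-form"):
    """Return (absolute action URL, form data), or None if the form is missing."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.find("form", id=form_id)
    if form is None:
        return None
    data = {
        field["name"]: field.get("value", "")
        for field in form.find_all("input")
        if field.get("name")
    }
    return urljoin(page_url, form.get("action", "")), data


action_url, payload = extract_consent_form(SAMPLE_HTML, "https://example.com/")
print(action_url)  # https://example.com/consent
print(payload)     # {'csrf_token': 't0k3n', 'accept': 'all'}
```

You would then pass the result to session.post(action_url, data=payload) so the consent cookies land in the same session you use for later requests.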
Inspecting Cookies
import requests
session = requests.Session()
session.get("https://quotes.toscrape.com/")
for cookie in session.cookies:
    print(f"Name: {cookie.name}")
    print(f"Value: {cookie.value}")
    print(f"Domain: {cookie.domain}")
    print(f"Path: {cookie.path}")
    print(f"Secure: {cookie.secure}")
    print(f"Expires: {cookie.expires}")
    print("---")
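The expires attribute is a Unix timestamp in seconds, or None for a cookie that lasts only for the browser session, so the raw number is hard to read. A small helper can turn it into something more useful. The function describe_expiry and the cookie names below are hypothetical; is_expired() comes from the standard library's cookie class that requests uses.

```python
from datetime import datetime, timezone

import requests


def describe_expiry(cookie):
    """Summarise a cookie's expiry as readable text (hypothetical helper)."""
    if cookie.expires is None:
        return "session cookie (no expiry)"
    when = datetime.fromtimestamp(cookie.expires, tz=timezone.utc)
    status = "expired" if cookie.is_expired() else "valid"
    return f"{status}, expires {when.isoformat()}"


# Build two example cookies by hand so no network call is needed.
session = requests.Session()
session.cookies.set("session_id", "abc123", domain="example.com")
session.cookies.set("remember_me", "1", domain="example.com", expires=1)

summary = {cookie.name: describe_expiry(cookie) for cookie in session.cookies}
print(summary["session_id"])   # session cookie (no expiry)
print(summary["remember_me"])  # expired, expires 1970-01-01T00:00:01+00:00
```

Checking for expired cookies before a long scraping run can tell you early that you need to log in again.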
Session Best Practices
| Practice | Why |
|---|---|
| Use one session per site | Keeps cookies isolated |
| Set a User-Agent header | Some sites reject default Python UA |
| Let the session handle Set-Cookie headers | It stores and resends cookies automatically |
| Clear cookies when needed | session.cookies.clear() resets the session to a fresh-visitor state |
import requests
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
})
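One way to follow the "one session per site" practice is to keep a small registry of sessions keyed by hostname. The sketch below is only an illustration: session_for, reset_cookies, and the module-level _sessions dictionary are invented names, and the headers are the same ones used above.

```python
from urllib.parse import urlsplit

import requests

DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

_sessions = {}  # hostname -> requests.Session


def session_for(url):
    """Return the session dedicated to this URL's host, creating it on first use."""
    host = urlsplit(url).hostname
    if host not in _sessions:
        session = requests.Session()
        session.headers.update(DEFAULT_HEADERS)
        _sessions[host] = session
    return _sessions[host]


def reset_cookies(url):
    """Drop all cookies for one site so the next request looks like a new visitor."""
    session_for(url).cookies.clear()


quotes = session_for("https://quotes.toscrape.com/page/1/")
same = session_for("https://quotes.toscrape.com/page/2/")
other = session_for("https://httpbin.org/cookies")
print(quotes is same)   # True
print(quotes is other)  # False

quotes.cookies.set("consent", "accepted", domain="quotes.toscrape.com")
reset_cookies("https://quotes.toscrape.com/")
print(len(quotes.cookies))  # 0
```

Because each host gets its own cookie jar, clearing one site's cookies leaves every other site's session untouched.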
Tips
- A requests.Session reuses the underlying TCP connection for requests to the same host, making subsequent requests faster.
- If you are getting blocked despite correct cookies, a proxy service like ScraperAPI can manage sessions and cookies for you automatically.
- Some anti-bot systems fingerprint your cookie behavior; ScrapingAnt handles this by using real browser sessions.
Next Steps
- Learn to scrape dynamic content that relies on JavaScript
- Explore browser-based scraping for sites with complex cookie requirements