Scraping Behind Login/Authentication
Scrape websites that require login. Handle form-based authentication, session tokens, and authenticated API requests with Python.
Many websites require you to log in before you can access certain data. Python's requests.Session handles this by persisting cookies across requests, just like a browser.
Form-Based Login
Most login forms submit a POST request. Use your browser's DevTools (Network tab) to inspect the form action URL and field names.
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Step 1: Get the login page (for CSRF tokens)
login_url = "https://quotes.toscrape.com/login"
login_page = session.get(login_url)
soup = BeautifulSoup(login_page.text, "html.parser")

# Extract CSRF token if present
csrf_token = soup.select_one('input[name="csrf_token"]')
csrf_value = csrf_token["value"] if csrf_token else ""

# Step 2: Submit login credentials
payload = {
    "csrf_token": csrf_value,
    "username": "admin",
    "password": "admin",
}
response = session.post(login_url, data=payload)

if "Logout" in response.text:
    print("Login successful!")
else:
    print("Login failed.")

# Step 3: Scrape authenticated pages
protected_page = session.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(protected_page.text, "html.parser")
for quote in soup.select("div.quote"):
    print(quote.select_one("span.text").get_text()[:60])
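The field names that DevTools shows can also be read from the login page's HTML. The following is a minimal sketch using only the standard library's html.parser; the class name LoginFormInspector, the helper inspect_login_form, and the sample_html markup are made up for illustration, and a real login page may contain several forms or build its fields with JavaScript.

```python
from html.parser import HTMLParser


class LoginFormInspector(HTMLParser):
    """Collect the action, method, and named inputs of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Hidden inputs (such as CSRF tokens) carry a preset value.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms on the page


def inspect_login_form(html):
    """Return (action, method, fields) for the first form in `html`."""
    parser = LoginFormInspector()
    parser.feed(html)
    return parser.action, parser.method, parser.fields


# Hypothetical markup standing in for a downloaded login page.
sample_html = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Login">
</form>
"""

action, method, fields = inspect_login_form(sample_html)
print(action, method, fields)
```

The returned fields dictionary can serve as a starting point for the POST payload: keep the preset hidden values and fill in the username and password.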
Token-Based Authentication (API Keys / Bearer Tokens)
Many APIs authenticate requests with a token sent in the Authorization header, typically as a Bearer token.
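import requests

session = requests.Session()

# Send the token with every request made through this session
session.headers.update({
    "Authorization": "Bearer YOUR_API_TOKEN",  # placeholder: substitute your own token
    "Accept": "application/json",
})

response = session.get("https://api.example.com/protected/data")
data = response.json()

# Assumes the API returns a JSON object with a "results" list
for item in data["results"]:
    print(item["name"])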
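A rejected token usually shows up as an HTTP 401 or 403 status rather than as valid JSON, so it can help to check the status before parsing. The sketch below is one way to do that; fetch_json and AuthError are hypothetical names, and session is assumed to be a requests.Session (or any object with a compatible get method).

```python
class AuthError(Exception):
    """Raised when the server rejects the supplied credentials."""


def fetch_json(session, url):
    """GET `url` through an authenticated session and return the decoded JSON."""
    response = session.get(url, timeout=10)
    # 401 typically means a missing, expired, or invalid token;
    # 403 typically means the token lacks permission for this resource.
    if response.status_code in (401, 403):
        raise AuthError(f"{response.status_code} from {url}: check the token")
    response.raise_for_status()  # surface any other HTTP error
    return response.json()
```

With the session configured as above, a call might look like fetch_json(session, "https://api.example.com/protected/data").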
OAuth2 Login Flow
Some sites use OAuth2: you obtain an access token first, then send it with subsequent requests.
import requests

# Step 1: Obtain access token
token_response = requests.post(
    "https://api.example.com/oauth/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
    },
)
token_response.raise_for_status()  # stop here if the token request was rejected
access_token = token_response.json()["access_token"]

# Step 2: Use token to access protected resources
session = requests.Session()
session.headers["Authorization"] = f"Bearer {access_token}"

response = session.get("https://api.example.com/protected/resource")
print(response.json())
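Access tokens usually expire, and OAuth2 token responses often include an expires_in field giving the lifetime in seconds. The following is a minimal sketch of caching a token and refetching it shortly before expiry. TokenCache, its margin parameter, and the 300-second fallback are assumptions made for illustration; fetch_token stands for any callable that returns the token endpoint's JSON as a dict.

```python
import time


class TokenCache:
    """Cache an access token and refetch it shortly before it expires."""

    def __init__(self, fetch_token, margin=30, clock=time.monotonic):
        self._fetch_token = fetch_token  # callable returning the token response as a dict
        self._margin = margin            # seconds of safety margin before expiry
        self._clock = clock              # injectable clock, useful for testing
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a valid token, fetching a new one if needed."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            payload = self._fetch_token()
            self._token = payload["access_token"]
            # expires_in may be absent; 300 seconds is an assumed fallback.
            lifetime = payload.get("expires_in", 300)
            self._expires_at = now + max(lifetime - self._margin, 0)
        return self._token
```

In the flow above, fetch_token could wrap the requests.post call and return token_response.json(), with session.headers["Authorization"] set from cache.get() before each request.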
Handling Login with Cookies
Sometimes you need to extract and reuse specific cookies.
import json

import requests

session = requests.Session()

# Login
session.post("https://example.com/login", data={
    "username": "user",
    "password": "pass",
})

# Check what cookies we received
for cookie in session.cookies:
    print(f"{cookie.name}: {cookie.value}")

# Save cookies for later use
cookies_dict = session.cookies.get_dict()
with open("cookies.json", "w") as f:
    json.dump(cookies_dict, f)

# Restore cookies in a new session
with open("cookies.json") as f:
    saved_cookies = json.load(f)

new_session = requests.Session()
new_session.cookies.update(saved_cookies)
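Note that get_dict() keeps only cookie names and values, so the domain, path, and expiry of each cookie are lost on the round trip. If those attributes matter, one alternative is the standard library's MozillaCookieJar, which writes them to a Netscape-format text file. This is a sketch under the assumption that the cookies are http.cookiejar.Cookie objects, which is what iterating over session.cookies yields; save_cookies and load_cookies are hypothetical helper names.

```python
import http.cookiejar


def save_cookies(cookies, path):
    """Write cookies, including domain, path, and expiry, to a text file."""
    jar = http.cookiejar.MozillaCookieJar(path)
    for cookie in cookies:
        jar.set_cookie(cookie)
    # Keep session cookies (no expiry) and already-expired cookies as well.
    jar.save(ignore_discard=True, ignore_expires=True)


def load_cookies(path):
    """Read cookies back from a file written by save_cookies."""
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.load(ignore_discard=True, ignore_expires=True)
    return jar
```

A call such as save_cookies(session.cookies, "cookies.txt") would store the current session's cookies, and new_session.cookies.update(load_cookies("cookies.txt")) should copy them into a fresh session. The file contains live credentials, so treat it like a password.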
Tips
- Never hardcode credentials in your scripts; use environment variables or a .env file.
- Always check for CSRF tokens in login forms; omitting a required token will usually cause the login to fail.
- Use session.get() and session.post() (not requests.get()) to maintain cookies across requests.
- For sites with complex JavaScript-based logins, consider using a browser automation tool or a service like ScrapingAnt that can handle browser sessions.
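As a sketch of the first tip, credentials can be read from environment variables when the script starts. The variable names SCRAPER_USERNAME and SCRAPER_PASSWORD and the helper load_credentials are made up for illustration; any naming scheme works as long as the values stay out of the source file.

```python
import os


def load_credentials(environ=os.environ):
    """Read login credentials from the environment; fail loudly if one is missing."""
    names = ("SCRAPER_USERNAME", "SCRAPER_PASSWORD")  # hypothetical variable names
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "username": environ["SCRAPER_USERNAME"],
        "password": environ["SCRAPER_PASSWORD"],
    }
```

The returned dictionary can be merged into a login payload, for example payload.update(load_credentials()), after exporting both variables in the shell that runs the script.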
Next Steps
- Dive deeper into cookie and session management
- Learn to scrape dynamic content that loads via JavaScript