Scraping Behind Login/Authentication

Scrape websites that require login. Handle form-based authentication, session tokens, and authenticated API requests with Python.

Python Scraping · #12 · Intermediate · 3 min read

Many websites require you to log in before you can access certain data. Python's requests.Session handles this by persisting cookies across requests, just like a browser.

Form-Based Login

Most login forms submit a POST request. Use your browser's DevTools (Network tab) to inspect the form action URL and field names.
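
The same inspection can also be done in code. The sketch below lists the first form's action URL, method, and named input fields, so hidden fields such as CSRF tokens are not overlooked. It uses the standard library's html.parser so that it runs without extra dependencies; BeautifulSoup would work just as well. The LoginFormInspector class, the inspect_login_form helper, and the sample markup are hypothetical, and real pages may contain several forms or build fields with JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LoginFormInspector(HTMLParser):
    """Collect the action, method, and named inputs of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"   # HTML default when the attribute is absent
        self.fields = {}      # input name -> default value
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True   # ignore any later forms on the page


def inspect_login_form(html, page_url):
    """Return (absolute action URL, method, fields) for the first form."""
    parser = LoginFormInspector()
    parser.feed(html)
    return urljoin(page_url, parser.action or ""), parser.method, parser.fields


# Hypothetical markup, loosely modelled on a typical login page.
sample_html = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Login">
</form>
"""

action_url, method, fields = inspect_login_form(
    sample_html, "https://example.com/login"
)
print(action_url, method, fields)
```

The returned fields dictionary can serve as the starting point for a login payload: fill in the username and password entries and leave the hidden values as the server sent them.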

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Step 1: Get the login page (for CSRF tokens)
login_url = "https://quotes.toscrape.com/login"
login_page = session.get(login_url)
soup = BeautifulSoup(login_page.text, "html.parser")

# Extract CSRF token if present
csrf_token = soup.select_one('input[name="csrf_token"]')
csrf_value = csrf_token["value"] if csrf_token else ""

# Step 2: Submit login credentials
payload = {
    "csrf_token": csrf_value,
    "username": "admin",
    "password": "admin",
}
response = session.post(login_url, data=payload)

if "Logout" in response.text:
    print("Login successful!")
else:
    print("Login failed.")

# Step 3: Scrape authenticated pages
protected_page = session.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(protected_page.text, "html.parser")
for quote in soup.select("div.quote"):
    print(quote.select_one("span.text").get_text()[:60])

Token-Based Authentication (API Keys / Bearer Tokens)

Many APIs use token-based auth in the Authorization header.

import requests

session = requests.Session()
session.headers.update({
    "Authorization": "Bearer YOUR_API_TOKEN",
    "Accept": "application/json",
})

response = session.get("https://api.example.com/protected/data")
response.raise_for_status()  # fail early on 401/403 instead of parsing an error body
data = response.json()

for item in data["results"]:
    print(item["name"])
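
The heading mentions API keys, but the example above only shows a bearer token. Some APIs expect the key in a custom header instead. A minimal sketch of building either set of headers follows; build_auth_headers is a hypothetical helper, and X-API-Key is only a common convention, so check the API's documentation for the header name it actually expects.

```python
def build_auth_headers(token, scheme="bearer", api_key_header="X-API-Key"):
    """Return request headers for bearer-token or API-key authentication."""
    if not token:
        raise ValueError("token must be a non-empty string")
    headers = {"Accept": "application/json"}
    if scheme == "bearer":
        headers["Authorization"] = f"Bearer {token}"
    elif scheme == "api_key":
        # Assumed header name; many APIs use a different one.
        headers[api_key_header] = token
    else:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    return headers


bearer_headers = build_auth_headers("YOUR_API_TOKEN")
api_key_headers = build_auth_headers("YOUR_API_KEY", scheme="api_key")
print(bearer_headers)
print(api_key_headers)
```

Either result can be passed to session.headers.update() so that every request made through the session carries it.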

OAuth2 Login Flow

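Some sites use OAuth2: you obtain an access token first, then send it with subsequent requests. The example below uses the client credentials grant, which suits server-to-server access rather than an interactive user login.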

import requests

# Step 1: Obtain access token
token_response = requests.post(
    "https://api.example.com/oauth/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
    },
)
token_response.raise_for_status()  # stop here if the token request was rejected
access_token = token_response.json()["access_token"]

# Step 2: Use token to access protected resources
session = requests.Session()
session.headers["Authorization"] = f"Bearer {access_token}"

response = session.get("https://api.example.com/protected/resource")
print(response.json())
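
Access tokens usually expire. The OAuth 2.0 specification lets the token response carry an expires_in field (lifetime in seconds), though whether a given API returns it is an assumption here. The sketch below caches a token and fetches a new one shortly before expiry. TokenCache and fake_fetch_token are hypothetical names; in real use the fetch function would wrap the requests.post call above and return its parsed JSON.

```python
import time


class TokenCache:
    """Cache an access token and fetch a new one shortly before it expires."""

    def __init__(self, fetch_token, clock=time.monotonic,
                 leeway=30, default_lifetime=300):
        self._fetch_token = fetch_token    # callable returning the token response dict
        self._clock = clock
        self._leeway = leeway              # refresh this many seconds early
        self._default_lifetime = default_lifetime  # assumed if expires_in is missing
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._leeway:
            payload = self._fetch_token()
            self._token = payload["access_token"]
            lifetime = payload.get("expires_in", self._default_lifetime)
            self._expires_at = now + float(lifetime)
        return self._token


# Stand-in for the real token request, so the sketch runs offline.
calls = []

def fake_fetch_token():
    calls.append(1)
    return {"access_token": f"token-{len(calls)}", "expires_in": 3600}


fake_now = [0.0]
cache = TokenCache(fake_fetch_token, clock=lambda: fake_now[0])

first = cache.get()     # no token yet, so one is fetched
second = cache.get()    # still valid, served from the cache
fake_now[0] = 3590.0    # inside the 30-second leeway window
third = cache.get()     # close to expiry, so a new token is fetched
print(first, second, third)
```

Injecting the clock and the fetch function keeps the caching logic separate from the network call, which also makes it easy to exercise without a live token endpoint.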

Handling Login with Cookies

Sometimes you need to extract and reuse specific cookies, for example to resume a session in a later run without logging in again.

import json

import requests

session = requests.Session()

# Login
session.post("https://example.com/login", data={
    "username": "user",
    "password": "pass",
})

# Check what cookies we received
for cookie in session.cookies:
    print(f"{cookie.name}: {cookie.value}")

# Save cookies for later use
cookies_dict = session.cookies.get_dict()
with open("cookies.json", "w") as f:
    json.dump(cookies_dict, f)

# Restore cookies in a new session
with open("cookies.json") as f:
    saved_cookies = json.load(f)

new_session = requests.Session()
new_session.cookies.update(saved_cookies)
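
Note that get_dict() keeps only cookie names and values; domain, path, and expiry are dropped. If that scope matters, one option is to serialize those attributes as well. The sketch below assumes the source jar yields cookie objects with name, value, domain, path, expires, and secure attributes (as session.cookies does) and that the target jar has a set() method accepting those keyword arguments (as the requests cookie jar does). The helper names and the RecordingJar stand-in are hypothetical and exist only so the example runs offline.

```python
import json
import time
from types import SimpleNamespace


def cookies_to_records(cookie_jar):
    """Turn each cookie into a JSON-friendly dict that keeps its scope."""
    return [
        {"name": c.name, "value": c.value, "domain": c.domain,
         "path": c.path, "expires": c.expires, "secure": c.secure}
        for c in cookie_jar
    ]


def restore_cookies(cookie_jar, records, now=None):
    """Re-add saved cookies, skipping expired ones. Returns how many were added."""
    now = time.time() if now is None else now
    restored = 0
    for r in records:
        # expires is None for session cookies; those are restored as-is.
        if r["expires"] is not None and r["expires"] <= now:
            continue
        cookie_jar.set(r["name"], r["value"], domain=r["domain"],
                       path=r["path"], expires=r["expires"], secure=r["secure"])
        restored += 1
    return restored


class RecordingJar:
    """Stand-in for a cookie jar; it only records set() calls."""

    def __init__(self):
        self.calls = []

    def set(self, name, value, **kwargs):
        self.calls.append((name, value, kwargs))


# Made-up cookies standing in for the contents of session.cookies.
saved = [
    SimpleNamespace(name="sessionid", value="abc", domain="example.com",
                    path="/", expires=2000, secure=True),
    SimpleNamespace(name="old", value="x", domain="example.com",
                    path="/", expires=500, secure=False),
    SimpleNamespace(name="temp", value="y", domain="example.com",
                    path="/", expires=None, secure=False),
]
records = json.loads(json.dumps(cookies_to_records(saved)))  # JSON round trip
jar = RecordingJar()
restored = restore_cookies(jar, records, now=1000)
print(restored, [call[0] for call in jar.calls])
```

With a real session, cookies_to_records(session.cookies) would be written to the JSON file and restore_cookies(new_session.cookies, records) would load it back.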

Tips

  • Never hardcode credentials in your scripts; use environment variables or a .env file.
  • Always check for CSRF tokens in login forms; omitting a required token will usually cause the login to fail.
  • Use session.get() and session.post() (not requests.get()) to maintain cookies across requests.
  • For sites with complex JavaScript-based logins, consider using a browser automation tool or a service like ScrapingAnt that can handle browser sessions.
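
The first tip can be sketched as follows. The variable names SCRAPER_USERNAME and SCRAPER_PASSWORD and the load_credentials helper are hypothetical; any names work as long as the script and the environment agree on them.

```python
import os


def load_credentials(env=None):
    """Read login credentials from environment variables."""
    env = os.environ if env is None else env
    required = ("SCRAPER_USERNAME", "SCRAPER_PASSWORD")  # assumed variable names
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "username": env["SCRAPER_USERNAME"],
        "password": env["SCRAPER_PASSWORD"],
    }


# Demo with an explicit mapping; in real use, call load_credentials()
# with no arguments so the values come from the process environment.
payload = load_credentials(
    {"SCRAPER_USERNAME": "admin", "SCRAPER_PASSWORD": "admin"}
)
print(sorted(payload))
```

The returned dictionary has the same keys as the login payloads above, so it can be merged with a CSRF token and passed to session.post().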

Next Steps

  • Dive deeper into cookie and session management
  • Learn to scrape dynamic content that loads via JavaScript