How to Scrape Websites Behind Login Pages
Learn techniques for scraping websites that require authentication. Covers session handling, form login, OAuth, and cookie management.
Many valuable data sources require authentication. Here is how to scrape content behind login pages while maintaining valid sessions.
Method 1: Form-Based Login
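The most common approach for traditional websites: fetch the login form, submit your credentials along with any CSRF token, then reuse the session's cookies for later requests.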
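```python
import requests
from bs4 import BeautifulSoup

# A Session keeps cookies between requests, so the login persists
session = requests.Session()

# Step 1: Get the login page (grab CSRF token if needed)
login_page = session.get("https://example.com/login")
login_page.raise_for_status()
soup = BeautifulSoup(login_page.text, "html.parser")
# The field name "csrf_token" varies by site; inspect the form to confirm it
token_input = soup.find("input", {"name": "csrf_token"})
if token_input is None:
    raise RuntimeError("CSRF field not found; check the login form's field names")
csrf_token = token_input["value"]

# Step 2: Submit login credentials
login_data = {
    "username": "your_username",
    "password": "your_password",
    "csrf_token": csrf_token,
}
resp = session.post("https://example.com/login", data=login_data)
resp.raise_for_status()  # many sites return 200 even on a failed login, so also check the page

# Step 3: Access protected pages (session cookies are maintained)
protected_page = session.get("https://example.com/dashboard")
```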
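Login forms often carry more than one hidden field, and the CSRF field name differs from site to site. As a minimal sketch, the helper below collects every hidden input on a page so they can all be posted back with the credentials. The names HiddenFieldParser and hidden_fields and the sample markup are illustrative assumptions, and the sketch uses the standard library's html.parser instead of BeautifulSoup only to stay dependency-free.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attribute values can be None for valueless attributes
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden input fields found in the HTML."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Hypothetical login form markup, for illustration only
sample = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="hidden" name="next" value="/dashboard">
  <input type="text" name="username">
</form>
"""
login_data = hidden_fields(sample)
login_data.update({"username": "your_username", "password": "your_password"})
```

The resulting login_data can be passed to session.post(..., data=login_data) in place of the hand-built dictionary in Step 2. Note that the helper reads hidden inputs from the whole page, so a page with several forms may need extra filtering.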
Method 2: Browser Cookie Export
Log in manually, then export your session cookies.
- Log in to the website in your browser
- Export the cookies using DevTools or a browser extension such as "EditThisCookie"
- Load those cookies into your scraper
```python
import requests

session = requests.Session()
# Cookie names and values come from your browser export; these are placeholders
session.cookies.set("session_id", "your_session_cookie_value", domain="example.com")
session.cookies.set("auth_token", "your_auth_token", domain="example.com")

resp = session.get("https://example.com/protected-data")
```
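Copying cookie values by hand gets tedious when a site sets many of them. Export tools typically produce a JSON array of cookie objects; the sketch below assumes each object has name and value keys plus optional domain and path keys (check your own export, since layouts differ between tools). The function name cookies_from_export and the sample data are hypothetical.

```python
import json


def cookies_from_export(json_text, domain=None):
    """Parse an exported cookie list into kwargs for session.cookies.set().

    Assumes a JSON array of objects with "name" and "value" keys and
    optional "domain" and "path" keys. When domain is given, cookies
    that do not belong to it (or one of its subdomains) are skipped.
    """
    result = []
    for item in json.loads(json_text):
        cookie_domain = item.get("domain", "")
        host = cookie_domain.lstrip(".")
        if domain and not (host == domain or host.endswith("." + domain)):
            continue
        result.append({
            "name": item["name"],
            "value": item["value"],
            "domain": cookie_domain,
            "path": item.get("path", "/"),
        })
    return result


# Hypothetical export, for illustration only
exported = """[
  {"name": "session_id", "value": "abc", "domain": ".example.com", "path": "/"},
  {"name": "tracker", "value": "xyz", "domain": "ads.other.net", "path": "/"}
]"""
cookie_kwargs = cookies_from_export(exported, domain="example.com")
```

Each entry can then be applied to the session from the example above with session.cookies.set(**entry).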
Method 3: Browser Automation
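For complex login flows (2FA, OAuth, CAPTCHAs), drive a real browser with Playwright. Launching with headless=False opens a visible window, so you can complete manual steps such as entering a 2FA code.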
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com/login")
    page.fill("#username", "your_username")
    page.fill("#password", "your_password")
    page.click("button[type='submit']")

    # Wait for login to complete
    page.wait_for_url("**/dashboard")

    # Now scrape authenticated content
    page.goto("https://example.com/data")
    content = page.content()
    browser.close()
```
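Once the browser has logged in, its cookies can be handed to a lighter HTTP client for the bulk of the scraping. Playwright's context.cookies() returns a list of dictionaries with keys such as name, value, and domain; the sketch below filters such a list down to a name-to-value mapping for one host. The helper name cookies_for_host and the sample list are assumptions for illustration.

```python
def cookies_for_host(browser_cookies, host):
    """Map cookie names to values for cookies that apply to the given host.

    browser_cookies is a list of dicts shaped like the output of
    Playwright's context.cookies(): each has "name", "value", "domain".
    """
    selected = {}
    for cookie in browser_cookies:
        domain = cookie.get("domain", "").lstrip(".")
        # A cookie applies to its own domain and to any subdomain of it
        if host == domain or host.endswith("." + domain):
            selected[cookie["name"]] = cookie["value"]
    return selected


# Hypothetical cookie list, shaped like context.cookies() output
sample_cookies = [
    {"name": "session_id", "value": "abc", "domain": ".example.com", "path": "/"},
    {"name": "other", "value": "zzz", "domain": "cdn.elsewhere.net", "path": "/"},
]
jar = cookies_for_host(sample_cookies, "www.example.com")
```

In the Playwright script above, calling cookies_for_host(page.context.cookies(), "example.com") before browser.close() would produce a dictionary that can be passed to requests.get(url, cookies=...).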
Using ScraperAPI with Sessions
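ScraperAPI supports session persistence through its session_number parameter: requests that share a session number are routed through the same proxy, which helps keep login state consistent across requests.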
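```python
import requests

API_KEY = "YOUR_SCRAPERAPI_KEY"
SESSION = 123  # session_number is an integer; reuse it for every request in the session

# Login request; passing params lets requests URL-encode the target URL
resp = requests.get(
    "https://api.scraperapi.com/",
    params={"api_key": API_KEY, "url": "https://example.com/login",
            "session_number": SESSION, "render": "true"},
)
```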
Security Considerations
| Practice | Do | Don't |
|---|---|---|
| Credentials | Use environment variables | Hardcode in scripts |
| Sessions | Reuse and refresh | Create new ones per request |
| 2FA | Use app-based tokens | Share SMS codes |
| Tokens | Rotate regularly | Use indefinitely |
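As a minimal sketch of the credentials row, the helper below reads the username and password from environment variables and fails loudly when either is missing. The variable names SCRAPER_USERNAME and SCRAPER_PASSWORD and the function name load_credentials are hypothetical; pick names that suit your project.

```python
import os


def load_credentials(env=None):
    """Read login credentials from environment variables.

    SCRAPER_USERNAME and SCRAPER_PASSWORD are hypothetical variable names.
    Pass a dict as env to override os.environ (useful in tests).
    """
    env = os.environ if env is None else env
    required = ("SCRAPER_USERNAME", "SCRAPER_PASSWORD")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "username": env["SCRAPER_USERNAME"],
        "password": env["SCRAPER_PASSWORD"],
    }
```

The returned dictionary can be merged into the form data before posting, so no password ever appears in the script itself.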
Best Practices
- Use requests.Session(): it handles cookies automatically
- Store credentials in environment variables: never commit passwords to code
- Handle CSRF tokens: many login forms require them
- Use ScrapingAnt for complex auth flows: its browser rendering handles OAuth redirects
- Refresh sessions before they expire
- Only scrape data you are authorized to access: having login credentials does not mean all data is fair game
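Refreshing before expiry can be sketched as a small wrapper that logs in again once the current session passes a chosen age. The class name RefreshingLogin and the max_age value are assumptions; the real lifetime is set by the site, so choose a value comfortably below it.

```python
import time


class RefreshingLogin:
    """Re-runs a login callable once the session is older than max_age seconds."""

    def __init__(self, login, max_age, clock=time.monotonic):
        self._login = login      # callable that logs in and returns a session object
        self._max_age = max_age  # seconds a session is trusted before re-login
        self._clock = clock      # injectable clock, which makes the class easy to test
        self._session = None
        self._created = None

    def get(self):
        """Return a session, logging in again if the current one is too old."""
        now = self._clock()
        if self._session is None or now - self._created >= self._max_age:
            self._session = self._login()
            self._created = now
        return self._session
```

A scraper would call get() before each batch of requests, passing a login function such as one that performs the form-based login and returns the requests.Session.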