Persistent Contexts and Browser Profiles
Save a logged-in session once, replay it forever. The pattern that turns five-minute auth flows into 50-millisecond cookie injections.
What you’ll learn
- Distinguish `storage_state` (serialised cookies/storage) from persistent contexts (full profile on disk).
- Capture a session from an interactive login and reuse it in headless production runs.
- Bootstrap a session via Playwright, then transfer its cookies to plain `requests`.
- Choose the right strategy for short-lived vs long-lived auth tokens.
Authentication is expensive. A login flow involves typing credentials, sometimes a 2FA prompt, sometimes a CAPTCHA. Doing it on every scraper run wastes minutes and can get you flagged for too-frequent logins. The right move is to authenticate once, persist the session, and replay it until it expires. Playwright gives you two tools.
Two persistence shapes
| | `storage_state` | Persistent context |
|---|---|---|
| What it stores | Cookies + localStorage as JSON | Entire browser profile on disk (history, cache, extensions) |
| Lifecycle | Saved/loaded explicitly | Persists automatically between runs |
| Size | Tens of KB | Hundreds of MB |
| Use case | Most scraping: small, portable, version-controllable | Heavy automation needing a full profile |
| Sharing | Easy: a JSON file | Hard: tied to a disk path |
`storage_state` is what you want for production scraping. Persistent contexts are heavier; they're useful for browser extensions, complex login state, or specific anti-bot evasion (some systems trust profiles that have history).
Saving storage_state
Run a one-time interactive script:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://practice.scrapingcentral.com/account/login")

    # Fill credentials manually OR programmatically
    page.locator("#email").fill("demo@example.com")
    page.locator("#password").fill("password")
    page.locator("button[type=submit]").click()
    page.wait_for_url("**/account/dashboard")

    # Save the state
    context.storage_state(path="auth.json")
    browser.close()
```
`auth.json` now contains every cookie and localStorage entry for the session, typically 5-50 KB of JSON. Check it into a private repo or store it in a secrets manager.
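For orientation, here's a quick way to peek inside the file. A minimal sketch, assuming `auth.json` was saved by the script above; the top-level keys are Playwright's standard storage-state shape:

```python
import json

# Inspect the saved storage state: a "cookies" list and an "origins"
# list (the latter carries localStorage entries per origin).
with open("auth.json") as f:
    state = json.load(f)

print(state.keys())  # expected: dict_keys(['cookies', 'origins'])
for c in state["cookies"]:
    print(c["name"], c["domain"], c.get("httpOnly"))
```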
Loading storage_state
In every production scraper:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context(storage_state="auth.json")
    page = context.new_page()
    page.goto("https://practice.scrapingcentral.com/account/dashboard")

    # Already logged in, no auth flow needed.
    print(page.locator("h1").inner_text())
    browser.close()
```
`storage_state` is passed to `new_context()`. The context starts with the cookies pre-loaded; the first request to the site is already authenticated. No login flow, no credentials in your scraper code.
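In practice you often combine the two scripts: reuse `auth.json` when it exists, fall back to a live login when it doesn't. A minimal sketch, reusing the demo selectors and credentials from the capture script above:

```python
import os
from playwright.sync_api import sync_playwright

STATE_PATH = "auth.json"  # same file the capture script writes

with sync_playwright() as p:
    browser = p.chromium.launch()
    if os.path.exists(STATE_PATH):
        # Fast path: start pre-authenticated from the saved state.
        context = browser.new_context(storage_state=STATE_PATH)
        page = context.new_page()
    else:
        # Slow path: log in once, then save the state for next time.
        context = browser.new_context()
        page = context.new_page()
        page.goto("https://practice.scrapingcentral.com/account/login")
        page.locator("#email").fill("demo@example.com")
        page.locator("#password").fill("password")
        page.locator("button[type=submit]").click()
        page.wait_for_url("**/account/dashboard")
        context.storage_state(path=STATE_PATH)

    page.goto("https://practice.scrapingcentral.com/account/dashboard")
    print(page.locator("h1").inner_text())
    browser.close()
```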
Persistent contexts
When you need more than just cookies (extension state, full history, persisted custom Chrome flags), use `launch_persistent_context`:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    context = p.chromium.launch_persistent_context(
        user_data_dir="./chrome-profile",
        headless=False,
    )
    page = context.new_page()
    page.goto("https://practice.scrapingcentral.com/account/dashboard")
    # If you logged in once in this profile, you're still logged in.
    context.close()
```
The profile directory accumulates everything Chrome would normally store in `~/.config/google-chrome/Default/`. Subsequent launches against the same `user_data_dir` pick up exactly where you left off.
Caveat: a persistent context replaces the browser launch; there's no separate `browser` object. You get one context (multiple pages allowed) tied to the profile.
Transferring cookies to plain requests
The hybrid pattern from Lesson 2.3: log in with the browser, scrape the bulk with requests.
```python
from playwright.sync_api import sync_playwright
import requests

# Step 1: get an authenticated context
with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context(storage_state="auth.json")
    cookies = context.cookies()
    browser.close()

# Step 2: feed the cookies into a requests Session
session = requests.Session()
for c in cookies:
    session.cookies.set(c["name"], c["value"], domain=c["domain"], path=c["path"])

# Step 3: scrape at HTTP speed
r = session.get("https://practice.scrapingcentral.com/api/account/orders")
orders = r.json()
print(orders)
```
The one-time browser session sets up the auth state in `auth.json`; the cookies transfer cleanly to `requests`. From here, your scraper runs 10-50× faster than the equivalent browser-only version.
Caveats:

- `HttpOnly` cookies are usable here because we go through `context.cookies()`, not through JS.
- Some APIs check `Origin`/`Referer`/`User-Agent` headers. Set them on the `requests.Session` to match the browser:

```python
session.headers.update({
    "User-Agent": "Mozilla/5.0 ...",
    "Origin": "https://practice.scrapingcentral.com",
    "Referer": "https://practice.scrapingcentral.com/account/dashboard",
})
```
Detecting an expired session
Sessions expire. Your scraper needs to detect this and re-authenticate:
```python
def scrape_with_retry(url):
    r = session.get(url)
    if r.status_code in {401, 403} or "Sign in" in r.text:
        print("Session expired, re-authenticating...")
        refresh_auth()  # must also reload the fresh cookies into `session`
        r = session.get(url)
    return r
```
`refresh_auth()` re-runs the interactive flow (or, in a CI context, runs a headless login with credentials from a secrets manager) and updates `auth.json`. The next scraper run starts fresh.
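A minimal headless sketch of that helper; the selectors carry over from the capture script above, and the env-var names are assumptions:

```python
import os
from playwright.sync_api import sync_playwright

def refresh_auth():
    """Re-run the login headlessly and overwrite auth.json.

    Assumes SCRAPER_EMAIL / SCRAPER_PASSWORD are injected by your
    CI secrets manager (hypothetical names).
    """
    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        context = browser.new_context()
        page = context.new_page()
        page.goto("https://practice.scrapingcentral.com/account/login")
        page.locator("#email").fill(os.environ["SCRAPER_EMAIL"])
        page.locator("#password").fill(os.environ["SCRAPER_PASSWORD"])
        page.locator("button[type=submit]").click()
        page.wait_for_url("**/account/dashboard")
        context.storage_state(path="auth.json")
        browser.close()
```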
For very long-running scrapers, rotate proactively: re-authenticate every N hours regardless of whether the current session still works.
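One simple way to implement that rotation is to key off the age of the state file. A sketch, assuming the `refresh_auth()` helper above and a hypothetical 12-hour budget:

```python
import os
import time

MAX_AGE_HOURS = 12  # assumption: rotate twice a day

def auth_is_stale(path="auth.json"):
    # Rotate based on file age, regardless of whether the
    # current session would still be accepted by the site.
    if not os.path.exists(path):
        return True
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds > MAX_AGE_HOURS * 3600

if auth_is_stale():
    refresh_auth()
```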
Storage state for multiple accounts
Persist one state file per account:
```python
# Inside a running sync_playwright session with `browser` already launched:
for account in ["account_a.json", "account_b.json", "account_c.json"]:
    context = browser.new_context(storage_state=account)
    page = context.new_page()
    # ... scrape as this account ...
    context.close()
```
Each context is isolated. You can also run them in parallel (Lesson 2.26).
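To produce those per-account state files in the first place, here's a minimal capture sketch; the credential list is illustrative, and in practice you'd pull it from a secrets manager:

```python
from playwright.sync_api import sync_playwright

# Illustrative credentials only; never hardcode real ones.
ACCOUNTS = [
    ("account_a.json", "a@example.com", "password-a"),
    ("account_b.json", "b@example.com", "password-b"),
    ("account_c.json", "c@example.com", "password-c"),
]

with sync_playwright() as p:
    browser = p.chromium.launch()
    for state_file, email, password in ACCOUNTS:
        context = browser.new_context()
        page = context.new_page()
        page.goto("https://practice.scrapingcentral.com/account/login")
        page.locator("#email").fill(email)
        page.locator("#password").fill(password)
        page.locator("button[type=submit]").click()
        page.wait_for_url("**/account/dashboard")
        context.storage_state(path=state_file)  # one file per account
        context.close()
    browser.close()
```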
Security note
`auth.json` is the equivalent of a username and password. Treat it like a secret:
- Never commit to a public repo.
- Encrypt at rest (e.g., via SOPS, AWS KMS, or environment-injected at runtime).
- Rotate when leaked.
For team-shared scrapers, store the file in a secrets manager (AWS Secrets Manager, Vault, 1Password) and pull at runtime.
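A sketch of the pull-at-runtime step, assuming AWS Secrets Manager via boto3; the secret name is hypothetical:

```python
import boto3

def fetch_auth_state(path="auth.json"):
    # Download the stored storage_state JSON and write it where
    # new_context(storage_state=...) expects to find it.
    client = boto3.client("secretsmanager")
    secret = client.get_secret_value(SecretId="scraper/auth-state")
    with open(path, "w") as f:
        f.write(secret["SecretString"])
    return path
```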
When NOT to persist
A few scenarios where fresh sessions are better:
- Sites that fingerprint session age. A "logged in 6 hours ago" cookie can be suspicious if your scraper bursts a thousand requests in two minutes.
- Rotating proxies with session-tied auth. If your auth is IP-bound (rare but exists), persisted sessions break when the proxy IP changes.
- A/B test buckets. Some sites assign you to a bucket on first visit; reusing that across runs may bias your scrape.
For most cases, persistence is a clear win. The exceptions are narrow.
Hands-on lab
Open `/account/login`, log in manually with `headless=False`, and save `auth.json`. Quit. Then run a separate script that uses `storage_state="auth.json"` to visit `/account/dashboard`; it should land already logged in. Finally, extract the cookies and use `requests` to hit `/account/orders` directly. Note the speed difference: browser-driven vs HTTP-with-stolen-cookies.