# Cookies, Sessions, and Authentication Basics

How servers remember who you are between requests, and how scrapers persist that state correctly without re-logging-in every time.
## What you’ll learn
- Explain what a cookie is at the protocol level (Set-Cookie / Cookie headers).
- Distinguish session cookies, persistent cookies, and the Secure / HttpOnly / SameSite flags.
- Replay a logged-in session in a scraper given only the cookies a browser collected.
- Pick the right session-persistence approach for your scraper: in-memory, on-disk, or per-worker.
HTTP is stateless: each request is independent, and the server has no built-in memory of who you are. Cookies are the duct-tape fix that makes the modern web work, and they're the single biggest source of "my scraper works in curl but not in Python" bugs.
## What a cookie actually is
A cookie is a name/value pair the server asks the client to remember and send back on every subsequent request to the same site.
**Server says: remember this**

```http
HTTP/1.1 200 OK
Content-Type: text/html
Set-Cookie: session=abc123; Path=/; HttpOnly; Secure; SameSite=Lax
Set-Cookie: prefs=dark; Path=/; Max-Age=2592000
```

**Client says: here you go**

```http
GET /dashboard HTTP/1.1
Host: practice.scrapingcentral.com
Cookie: session=abc123; prefs=dark
```
That's it. Cookies are just a header (or several) on each side. Every cookie has:
| Attribute | What it does |
|---|---|
| `Name=Value` | The data itself |
| `Domain` | Which hosts the cookie applies to (e.g. `.scrapingcentral.com` matches all subdomains) |
| `Path` | Which paths under the domain (default `/`) |
| `Expires` / `Max-Age` | When to forget it. No expiry = session cookie, deleted on browser close |
| `Secure` | Only send over HTTPS |
| `HttpOnly` | JavaScript can't read it (security feature; irrelevant to a server-side scraper) |
| `SameSite` | `Strict` / `Lax` / `None`; controls when the cookie is sent in cross-site requests |
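Python's standard library can split a `Set-Cookie` line into exactly these attributes. A quick sketch using `http.cookies.SimpleCookie` on the example header from the response above:

```python
from http.cookies import SimpleCookie

# Parse the Set-Cookie line from the example response.
cookie = SimpleCookie()
cookie.load("session=abc123; Path=/; HttpOnly; Secure; SameSite=Lax")

morsel = cookie["session"]
print(morsel.value)               # abc123
print(morsel["path"])             # /
print(morsel["samesite"])         # Lax
print(bool(morsel["httponly"]))   # True -- flag attribute, carries no value
```

`SimpleCookie` is handy for inspecting headers by hand; for actually sending cookies back, let your HTTP client's cookie jar do the bookkeeping.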
## Session cookies vs. persistent cookies
The distinction trips up scrapers constantly:
- **Session cookie** = no `Expires` and no `Max-Age`. The browser deletes it when the browser process closes.
- **Persistent cookie** = explicit `Expires` or `Max-Age`. Survives browser restarts.
For a scraper, "session" is whatever lives in your in-memory cookie jar for the run. The server doesn't actually know if you closed your "browser", it just knows whether you sent the cookie next time.
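You can see the distinction directly by parsing the two `Set-Cookie` lines from the earlier example response; the one with neither `Expires` nor `Max-Age` is the session cookie:

```python
from http.cookies import SimpleCookie

jar = SimpleCookie()
jar.load("session=abc123; Path=/; HttpOnly")      # no expiry: session cookie
jar.load("prefs=dark; Path=/; Max-Age=2592000")   # explicit Max-Age: persistent

for name, morsel in jar.items():
    # Unset attributes parse as empty strings, so truthiness is the test.
    persistent = bool(morsel["expires"] or morsel["max-age"])
    print(name, "->", "persistent" if persistent else "session")
# session -> session
# prefs -> persistent
```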
## Sessions on the server side
What's in `session=abc123`? Usually one of two things:

- **Opaque session ID.** A random token. The server keeps a lookup store (in Redis, in memory, or in a DB table) mapping `abc123` → `{user_id: 42, csrf_token: ......}`. The cookie is meaningless to the client.
- **Self-contained token.** A JWT or signed cookie. The cookie contains the data (user id, expiry, signature). The server verifies the signature with no DB lookup. Common for stateless APIs.
You don't usually need to know which one; you just need to send the cookie back. But the difference matters for session expiry: opaque IDs can be revoked server-side, while signed tokens expire on a hard timestamp and can't easily be revoked.
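For self-contained tokens you can often read the expiry yourself: a JWT's payload section is just unpadded base64url-encoded JSON, readable without any secret (only *verifying* the signature needs one). A sketch with a made-up token:

```python
import base64
import json

# A made-up JWT-shaped token: header.payload.signature (signature is fake).
token = ("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9."
         "eyJ1c2VyX2lkIjogNDIsICJleHAiOiAxNzAwMDAwMDAwfQ."
         "fake-signature")

def jwt_payload(tok: str) -> dict:
    payload_b64 = tok.split(".")[1]
    # JWTs use unpadded base64url; restore the padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

print(jwt_payload(token))  # {'user_id': 42, 'exp': 1700000000}
```

Checking the `exp` claim up front tells a scraper when a captured token will stop working, before the server starts returning 401s.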
## Replaying a logged-in session
The scraper recipe is:
1. Open the site in a real browser. Log in normally.
2. Open DevTools → Application tab → Cookies. Find the auth cookie(s).
3. Copy the name/value pairs into your scraper's session.
4. Hit authenticated URLs directly.
In Python:

```python
import requests

session = requests.Session()
session.cookies.update({
    "session": "abc123",
    "csrf": "xyz789",
})

r = session.get("https://practice.scrapingcentral.com/account/dashboard")
print(r.status_code, r.text[:200])
```
In PHP with Guzzle:

```php
$jar = \GuzzleHttp\Cookie\CookieJar::fromArray([
    'session' => 'abc123',
    'csrf' => 'xyz789',
], 'practice.scrapingcentral.com');

$client = new \GuzzleHttp\Client(['cookies' => $jar]);
$res = $client->get('https://practice.scrapingcentral.com/account/dashboard');
```
This works until the cookie expires or the server invalidates the session (typical lifetime: hours to days). For longer scrapes you script the login flow itself; that's covered in Sub-Path 1.
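If you want cookies to survive between runs without re-logging-in, persist the jar to disk. A minimal stdlib sketch using `MozillaCookieJar`, which reads and writes the same Netscape `cookies.txt` format curl uses; the hand-built `Cookie` below just simulates what an HTTP client would normally capture from a response:

```python
import os
import time
from http.cookiejar import Cookie, MozillaCookieJar

COOKIE_FILE = "cookies.txt"  # Netscape format: the same file curl -c/-b uses

jar = MozillaCookieJar(COOKIE_FILE)
if os.path.exists(COOKIE_FILE):
    # ignore_discard=True keeps session cookies (no expiry) across runs too.
    jar.load(ignore_discard=True)

# Simulate a cookie picked up during a run (normally the client fills the jar).
jar.set_cookie(Cookie(
    version=0, name="session", value="abc123",
    port=None, port_specified=False,
    domain="practice.scrapingcentral.com", domain_specified=True,
    domain_initial_dot=False,
    path="/", path_specified=True,
    secure=True, expires=int(time.time()) + 86400,
    discard=False, comment=None, comment_url=None, rest={},
))

jar.save(ignore_discard=True)  # persist for the next run
```

With requests you can attach such a jar via `session.cookies = jar`; requests' cookie handling is built on `http.cookiejar`, so any compatible jar should work.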
## When manual cookie replay fails
Three common gotchas:
1. **Cookies are domain-scoped.** A cookie set on `practice.scrapingcentral.com` won't be sent to `api.scrapingcentral.com` unless the `Domain` attribute is `.scrapingcentral.com`. Inspect the actual `Domain` value, not just the cookie name.
2. **HttpOnly is irrelevant to scrapers but causes confusion.** It only stops browser JavaScript from reading the cookie. The cookie still flows over HTTP, and a scraper sees it fine, as long as it captured it from the response rather than from `document.cookie`.
3. **First-visit cookies.** Some sites set cookies on the first page load and require them on the next request. A fresh scraper that hits the protected URL directly gets locked out. Visit the landing page first, accept the cookies, then proceed. The Catalog108 challenge `/challenges/static/cookies/set-on-visit` is built exactly for this pattern.
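The first-visit pattern is easy to reproduce end to end. The sketch below spins up a throwaway in-process server (a stand-in for the real site) that sets a cookie on `/` and rejects `/protected` without it, then shows a cookie-aware client failing cold and succeeding after visiting the landing page first:

```python
import threading
import urllib.error
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    """Stand-in site: "/" sets a cookie, "/protected" requires it."""
    def do_GET(self):
        if self.path == "/protected":
            if "token=letmein" in self.headers.get("Cookie", ""):
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"secret")
            else:
                self.send_response(403)
                self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Set-Cookie", "token=letmein; Path=/")
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# An opener with a cookie jar plays the browser's role.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Cold hit on the protected page fails...
try:
    opener.open(base + "/protected")
    cold = 200
except urllib.error.HTTPError as e:
    cold = e.code
print("cold:", cold)            # 403

# ...but visiting the landing page first captures the cookie.
opener.open(base + "/")
body = opener.open(base + "/protected").read()
print("warm:", body)            # b'secret'
server.shutdown()
```

The same two-request sequence with `requests.Session()` (or a Guzzle cookie jar) is all the `/challenges/static/cookies/set-on-visit` challenge asks for.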
## Authentication flavours you'll meet
| Style | What's in the request | Sub-Path covered |
|---|---|---|
| Cookie session | `Cookie: session=...` | 1 (this lesson) |
| HTTP Basic | `Authorization: Basic base64(user:pass)` | 1 |
| Bearer token | `Authorization: Bearer eyJ...` (often JWT) | 3 |
| OAuth 2.0 | A multi-step dance to obtain a bearer token | 3 |
| HMAC-signed | `Signature` header derived from request body + secret | 3 |
The vast majority of public web scraping uses the first style; the API world uses the rest. Catalog108 has labs for all five.
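For the next two rows of the table, the only moving part is the `Authorization` header. A small stdlib-only sketch (the token values are placeholders):

```python
import base64

# HTTP Basic: base64("user:pass") -- an encoding, not encryption,
# so it is only safe over HTTPS.
creds = base64.b64encode(b"user:pass").decode()
basic = {"Authorization": f"Basic {creds}"}
print(basic["Authorization"])  # Basic dXNlcjpwYXNz

# Bearer: the token (often a JWT) goes in verbatim.
bearer = {"Authorization": "Bearer eyJ..."}  # placeholder token
```

With requests you rarely build the Basic header by hand: passing `auth=("user", "pass")` to a request does the same encoding for you.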
## Hands-on lab

The challenge at `/challenges/static/cookies/set-on-visit` requires you to visit the landing page (where a cookie is set), then send a follow-up request with that cookie to see the protected content. Try it with curl first, using `-c cookies.txt -b cookies.txt` to persist cookies across calls, then replicate the behaviour with Python's `requests.Session()` and with Guzzle. Same recipe, three implementations.