# Cookies, Sessions, and Authentication Basics

How servers remember who you are between requests, and how scrapers persist that state correctly without re-logging-in every time.
## What you’ll learn
- Explain what a cookie is at the protocol level (Set-Cookie / Cookie headers).
- Distinguish session cookies, persistent cookies, and the Secure / HttpOnly / SameSite flags.
- Replay a logged-in session in a scraper given only the cookies a browser collected.
- Pick the right session-persistence approach for your scraper: in-memory, on-disk, or per-worker.
HTTP is stateless: each request is independent, and the server has no built-in memory of who you are. Cookies are the duct-tape fix that makes the modern web work, and they're the single biggest source of "my scraper works in curl but not in Python" bugs.
## What a cookie actually is
A cookie is a name/value pair the server asks the client to remember and send back on every subsequent request to the same site.
**Server says: remember this**

```http
HTTP/1.1 200 OK
Content-Type: text/html
Set-Cookie: session=abc123; Path=/; HttpOnly; Secure; SameSite=Lax
Set-Cookie: prefs=dark; Path=/; Max-Age=2592000
```

**Client says: here you go**

```http
GET /dashboard HTTP/1.1
Host: practice.scrapingcentral.com
Cookie: session=abc123; prefs=dark
```
That's it. Cookies are just a header (or several) on each side. Every cookie has:
| Attribute | What it does |
|---|---|
| `Name=Value` | The data itself |
| `Domain` | Which hosts the cookie applies to (e.g. `.scrapingcentral.com` matches all subdomains) |
| `Path` | Which paths under the domain (default `/`) |
| `Expires` / `Max-Age` | When to forget it. No expiry = session cookie, deleted on browser close |
| `Secure` | Only send over HTTPS |
| `HttpOnly` | JavaScript can't read it (security feature; irrelevant to a server-side scraper) |
| `SameSite` | `Strict` / `Lax` / `None`; controls when the cookie is sent in cross-site requests |
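Python's standard library can split a `Set-Cookie` line into exactly these attributes. A quick sketch using `http.cookies.SimpleCookie` on the example header from the response above:

```python
from http.cookies import SimpleCookie

# Parse the Set-Cookie line from the example response.
cookie = SimpleCookie()
cookie.load("session=abc123; Path=/; HttpOnly; Secure; SameSite=Lax")

morsel = cookie["session"]
print(morsel.value)               # abc123
print(morsel["path"])             # /
print(morsel["samesite"])         # Lax
print(bool(morsel["httponly"]))   # True -- flag attribute, carries no value
```

`SimpleCookie` is handy for inspecting headers by hand; for actually sending cookies back, let your HTTP client's cookie jar do the bookkeeping.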
## Session cookies vs. persistent cookies
The distinction trips up scrapers constantly:
- **Session cookie** = no `Expires` and no `Max-Age`. The browser deletes it when the browser process closes.
- **Persistent cookie** = explicit `Expires` or `Max-Age`. Survives browser restarts.
For a scraper, "session" is whatever lives in your in-memory cookie jar for the run. The server doesn't actually know if you closed your "browser", it just knows whether you sent the cookie next time.
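You can see the distinction directly by parsing the two `Set-Cookie` lines from the earlier example response; the one with neither `Expires` nor `Max-Age` is the session cookie:

```python
from http.cookies import SimpleCookie

jar = SimpleCookie()
jar.load("session=abc123; Path=/; HttpOnly")      # no expiry: session cookie
jar.load("prefs=dark; Path=/; Max-Age=2592000")   # explicit Max-Age: persistent

for name, morsel in jar.items():
    # Unset attributes parse as empty strings, so truthiness is the test.
    persistent = bool(morsel["expires"] or morsel["max-age"])
    print(name, "->", "persistent" if persistent else "session")
# session -> session
# prefs -> persistent
```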
## Sessions on the server side
What's in `session=abc123`? Usually one of two things:

- **Opaque session ID.** A random token. The server keeps a lookup store (in Redis, in memory, or in a DB table) mapping `abc123` → `{user_id: 42, csrf_token: ......}`. The cookie is meaningless to the client.
- **Self-contained token.** A JWT or signed cookie. The cookie contains the data (user id, expiry, signature). The server verifies the signature with no DB lookup. Common for stateless APIs.
You don't usually need to know which one; you just need to send the cookie back. But the difference matters for session expiry: opaque IDs can be revoked server-side, while signed tokens expire on a hard timestamp and can't easily be revoked.
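For self-contained tokens you can often read the expiry yourself: a JWT's payload section is just unpadded base64url-encoded JSON, readable without any secret (only *verifying* the signature needs one). A sketch with a made-up token:

```python
import base64
import json

# A made-up JWT-shaped token: header.payload.signature (signature is fake).
token = ("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9."
         "eyJ1c2VyX2lkIjogNDIsICJleHAiOiAxNzAwMDAwMDAwfQ."
         "fake-signature")

def jwt_payload(tok: str) -> dict:
    payload_b64 = tok.split(".")[1]
    # JWTs use unpadded base64url; restore the padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

print(jwt_payload(token))  # {'user_id': 42, 'exp': 1700000000}
```

Checking the `exp` claim up front tells a scraper when a captured token will stop working, before the server starts returning 401s.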
## Replaying a logged-in session
The scraper recipe is:
1. Open the site in a real browser. Log in normally.
2. Open DevTools → Application tab → Cookies. Find the auth cookie(s).
3. Copy the name/value pairs into your scraper's session.
4. Hit authenticated URLs directly.
In Python:

```python
import requests

session = requests.Session()
session.cookies.update({
    "session": "abc123",
    "csrf": "xyz789",
})

r = session.get("https://practice.scrapingcentral.com/account/dashboard")
print(r.status_code, r.text[:200])
```
In PHP with Guzzle:

```php
$jar = \GuzzleHttp\Cookie\CookieJar::fromArray([
    'session' => 'abc123',
    'csrf' => 'xyz789',
], 'practice.scrapingcentral.com');

$client = new \GuzzleHttp\Client(['cookies' => $jar]);
$res = $client->get('https://practice.scrapingcentral.com/account/dashboard');
```
This works until the cookie expires or the server invalidates the session (typical lifetime: hours to days). For longer scrapes you script the login flow itself; that's covered in Sub-Path 1.
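If you want cookies to survive between runs without re-logging-in, persist the jar to disk. A minimal stdlib sketch using `MozillaCookieJar`, which reads and writes the same Netscape `cookies.txt` format curl uses; the hand-built `Cookie` below just simulates what an HTTP client would normally capture from a response:

```python
import os
import time
from http.cookiejar import Cookie, MozillaCookieJar

COOKIE_FILE = "cookies.txt"  # Netscape format: the same file curl -c/-b uses

jar = MozillaCookieJar(COOKIE_FILE)
if os.path.exists(COOKIE_FILE):
    # ignore_discard=True keeps session cookies (no expiry) across runs too.
    jar.load(ignore_discard=True)

# Simulate a cookie picked up during a run (normally the client fills the jar).
jar.set_cookie(Cookie(
    version=0, name="session", value="abc123",
    port=None, port_specified=False,
    domain="practice.scrapingcentral.com", domain_specified=True,
    domain_initial_dot=False,
    path="/", path_specified=True,
    secure=True, expires=int(time.time()) + 86400,
    discard=False, comment=None, comment_url=None, rest={},
))

jar.save(ignore_discard=True)  # persist for the next run
```

With requests you can attach such a jar via `session.cookies = jar`; requests' cookie handling is built on `http.cookiejar`, so any compatible jar should work.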
## When manual cookie replay fails
Three common gotchas:
1. **Cookies are domain-scoped.** A cookie set on `practice.scrapingcentral.com` won't be sent to `api.scrapingcentral.com` unless the `Domain` attribute is `.scrapingcentral.com`. Inspect the actual `Domain` value, not just the cookie name.
2. **HttpOnly is irrelevant to scrapers but causes confusion.** It only stops browser JavaScript from reading the cookie. The cookie still flows over HTTP, and a scraper sees it fine, as long as it captured it from the response rather than from `document.cookie`.
3. **First-visit cookies.** Some sites set cookies on the first page load and require them on the next request. A fresh scraper that hits the protected URL directly gets locked out. Visit the landing page first, accept the cookies, then proceed. The Catalog108 challenge `/challenges/static/cookies/set-on-visit` is built exactly for this pattern.
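The first-visit pattern is easy to reproduce end to end. The sketch below spins up a throwaway in-process server (a stand-in for the real site) that sets a cookie on `/` and rejects `/protected` without it, then shows a cookie-aware client failing cold and succeeding after visiting the landing page first:

```python
import threading
import urllib.error
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    """Stand-in site: "/" sets a cookie, "/protected" requires it."""
    def do_GET(self):
        if self.path == "/protected":
            if "token=letmein" in self.headers.get("Cookie", ""):
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"secret")
            else:
                self.send_response(403)
                self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Set-Cookie", "token=letmein; Path=/")
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# An opener with a cookie jar plays the browser's role.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Cold hit on the protected page fails...
try:
    opener.open(base + "/protected")
    cold = 200
except urllib.error.HTTPError as e:
    cold = e.code
print("cold:", cold)            # 403

# ...but visiting the landing page first captures the cookie.
opener.open(base + "/")
body = opener.open(base + "/protected").read()
print("warm:", body)            # b'secret'
server.shutdown()
```

The same two-request sequence with `requests.Session()` (or a Guzzle cookie jar) is all the `/challenges/static/cookies/set-on-visit` challenge asks for.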
## Authentication flavours you'll meet
| Style | What's in the request | Sub-Path covered |
|---|---|---|
| Cookie session | `Cookie: session=...` | 1 (this lesson) |
| HTTP Basic | `Authorization: Basic base64(user:pass)` | 1 |
| Bearer token | `Authorization: Bearer eyJ...` (often JWT) | 3 |
| OAuth 2.0 | A multi-step dance to obtain a bearer token | 3 |
| HMAC-signed | `Signature` header derived from request body + secret | 3 |
The vast majority of public web scraping uses the first style; the API world uses the rest. Catalog108 has labs for all five.
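For the next two rows of the table, the only moving part is the `Authorization` header. A small stdlib-only sketch (the token values are placeholders):

```python
import base64

# HTTP Basic: base64("user:pass") -- an encoding, not encryption,
# so it is only safe over HTTPS.
creds = base64.b64encode(b"user:pass").decode()
basic = {"Authorization": f"Basic {creds}"}
print(basic["Authorization"])  # Basic dXNlcjpwYXNz

# Bearer: the token (often a JWT) goes in verbatim.
bearer = {"Authorization": "Bearer eyJ..."}  # placeholder token
```

With requests you rarely build the Basic header by hand: passing `auth=("user", "pass")` to a request does the same encoding for you.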
## Hands-on lab

The challenge at `/challenges/static/cookies/set-on-visit` requires you to visit the landing page (where a cookie is set), then send a follow-up request with that cookie to see the protected content. Try it with curl first, using `-c cookies.txt -b cookies.txt` to persist cookies across calls, then replicate the behaviour with Python's `requests.Session()` and with Guzzle. Same recipe, three implementations.