
3.17 · intermediate · 5 min read

JWT Tokens: Structure, Capture, Refresh

The modern API-auth standard. Three dot-separated base64 chunks, an access token, a refresh token, a 15-minute expiry. Here's how scrapers handle all of it.

What you’ll learn

  • Decode a JWT's three sections (header, payload, signature).
  • Capture access + refresh tokens from a login flow.
  • Refresh expired access tokens without re-logging in.
  • Detect expiry locally to avoid 401 round-trips.

JWT (JSON Web Token) is the dominant token format for modern APIs. If a site's /login returns {"access_token": "eyJ...", "refresh_token": "..."}, you're almost certainly scraping a JWT-protected API: eyJ is what the opening {" of a JSON header looks like once it is base64-encoded.

This lesson covers the structure, the lifecycle, and how scrapers refresh without re-logging in.
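The eyJ prefix is only a hint, so a slightly stronger check can help when you are sifting through login responses. The sketch below is a heuristic, not a validator: the function name looks_like_jwt is made up for this example, and it only checks that the string has three dot-separated parts and that the first one decodes to a JSON object with an "alg" field.

```python
import base64
import json


def looks_like_jwt(value: str) -> bool:
    """Heuristic: three dot-separated parts, first part is a JSON object with 'alg'."""
    parts = value.split(".")
    if len(parts) != 3:
        return False
    header_b64 = parts[0]
    try:
        # Restore the padding that JWTs strip, then decode the header.
        padded = header_b64 + "=" * (-len(header_b64) % 4)
        header = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
        return False
    return isinstance(header, dict) and "alg" in header


print(looks_like_jwt("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.sig"))  # True
print(looks_like_jwt("opaque-session-id"))                             # False
```

A string that passes this check still has to be accepted by the server; the check says nothing about the signature.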

Anatomy

A JWT is three URL-safe base64 chunks separated by dots:

eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxIiwiZW1haWwiOiJzdHVkZW50QHByYWN0aWNlLnNjcmFwaW5nY2VudHJhbC5jb20iLCJyb2xlIjoic3R1ZGVudCIsImV4cCI6MTczMjU2NzgwMH0.kY2hZWkOdjP9WqI8Y_xT5j8GpW...
└─────── header ─────────┘.└────────── payload ──────────┘.└────── signature ──────┘

Decode the three parts:

import base64, json

def decode_jwt(token: str) -> dict:
  header_b64, payload_b64, sig_b64 = token.split(".")
  # Add padding (base64 needs len % 4 == 0)
  def pad(b): return b + "=" * (-len(b) % 4)
  header = json.loads(base64.urlsafe_b64decode(pad(header_b64)))
  payload = json.loads(base64.urlsafe_b64decode(pad(payload_b64)))
  return {"header": header, "payload": payload, "signature": sig_b64}

token = "eyJhbGc..."
decoded = decode_jwt(token)
print(decoded["payload"])
# → {'sub': '1', 'email': '...', 'role': 'student', 'exp': 1732567800}

PHP version:

function decode_jwt(string $token): array {
  [$h, $p, $s] = explode('.', $token);
  $pad = fn($x) => $x . str_repeat('=', (4 - strlen($x) % 4) % 4);
  return [
    'header'    => json_decode(base64_decode(strtr($pad($h), '-_', '+/')), true),
    'payload'   => json_decode(base64_decode(strtr($pad($p), '-_', '+/')), true),
    'signature' => $s,
  ];
}

What you'll find in the payload:

  • sub: subject (user ID).
  • email, role, etc.: claims about the user.
  • iat: issued-at (Unix timestamp).
  • exp: expiry (Unix timestamp). The most useful field for scrapers.
  • iss, aud: issuer and audience. Usually fixed.
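Raw Unix timestamps are hard to read at a glance, so it can help to convert them while exploring a token. This is a minimal sketch: describe_lifetime is a name invented here, the exp value comes from the sample payload above, and the iat value is hypothetical (the sample token does not carry one).

```python
from datetime import datetime, timezone


def describe_lifetime(payload: dict) -> dict:
    """Turn the iat/exp Unix timestamps into UTC datetimes and a lifetime in seconds."""
    exp = payload.get("exp")
    iat = payload.get("iat")
    return {
        "expires_at": datetime.fromtimestamp(exp, tz=timezone.utc) if exp is not None else None,
        "issued_at": datetime.fromtimestamp(iat, tz=timezone.utc) if iat is not None else None,
        "lifetime_s": exp - iat if exp is not None and iat is not None else None,
    }


# exp is from the sample payload; iat is a hypothetical value 900 s earlier.
info = describe_lifetime({"sub": "1", "iat": 1732566900, "exp": 1732567800})
print(info["expires_at"].isoformat(), info["lifetime_s"])
# → 2024-11-25T20:50:00+00:00 900
```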

Don't try to forge them

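The signature proves the token was issued by someone holding the server's signing key: a shared secret for HMAC algorithms such as HS256 (the algorithm in the sample header above), or a private key for RS256/ES256. You can't generate a valid token without that key, and it never leaves the server. Scrapers always capture tokens via the login flow; they never forge them.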
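To see why tampering fails, here is a small sketch of HS256 signing and verification using only the standard library. It assumes the server uses HS256, as the sample header's alg field suggests; SERVER_SECRET and the helper names are hypothetical, and a real server's checks may include more (expiry, audience, and so on).

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """URL-safe base64 without padding, as JWTs use."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_hs256(payload: dict, secret: bytes) -> str:
    """Build an HS256 token: HMAC-SHA256 over 'header.payload'."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(payload).encode())
    sig = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


def verify_hs256(token: str, secret: bytes) -> bool:
    """Recompute the signature over 'header.payload' and compare in constant time."""
    signing_input, _, sig_b64 = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), sig_b64)


SERVER_SECRET = b"demo-secret-only-the-server-knows"  # hypothetical; never client-side
token = sign_hs256({"sub": "1", "role": "student"}, SERVER_SECRET)

# Tamper: swap in a payload claiming a different role, keep the old signature.
head, _, sig = token.split(".")
forged = ".".join([head, b64url(json.dumps({"sub": "1", "role": "admin"}).encode()), sig])

print(verify_hs256(token, SERVER_SECRET))   # True
print(verify_hs256(forged, SERVER_SECRET))  # False
```

Changing even one character of the payload changes the HMAC, so the old signature no longer matches and the server rejects the token.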

The two-token pattern

A modern auth flow returns two tokens:

{
  "access_token": "eyJ...",  // short-lived (5-15 min), used on every API call
  "refresh_token": "...",  // long-lived (days/weeks), only used to get new access tokens
  "expires_in": 900  // seconds until access_token expiry
}
  • Access token: sent on every protected call as Authorization: Bearer <access_token>.
  • Refresh token: sent only to /api/auth/refresh to obtain a new access token. Never sent elsewhere; treat it as a secret.

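Why two tokens? The access token travels with every request, so it is the one most likely to leak (logs, browser DevTools, proxies), but it expires within minutes. The refresh token lives much longer, but it is sent to a single endpoint and is easier to keep protected. Either way, the damage from a leak is bounded.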
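One way to keep the pair together in a scraper is a small container that also records when the access token runs out. This is a sketch under the assumption that the login response has the three fields shown in the JSON above; TokenPair and its now parameter (there only to make the arithmetic testable) are names introduced for this example.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenPair:
    access_token: str
    refresh_token: str
    expires_at: float  # absolute Unix time at which the access token expires

    @classmethod
    def from_response(cls, data: dict, now: Optional[float] = None) -> "TokenPair":
        """Build from a login response; expires_in is relative, so anchor it to 'now'."""
        now = time.time() if now is None else now
        return cls(data["access_token"], data["refresh_token"], now + data["expires_in"])

    def auth_header(self) -> dict:
        """Header to attach to every protected call."""
        return {"Authorization": f"Bearer {self.access_token}"}


pair = TokenPair.from_response(
    {"access_token": "eyJ...", "refresh_token": "...", "expires_in": 900},
    now=1000.0,
)
print(pair.expires_at)     # 1900.0
print(pair.auth_header())  # {'Authorization': 'Bearer eyJ...'}
```

Because expires_in counts from the moment the response was issued, convert it to an absolute time right away rather than storing the raw number.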

Capturing both tokens

import requests

def login(email, password):
  r = requests.post(
    "https://practice.scrapingcentral.com/api/auth/login",
    json={"email": email, "password": password},
  )
  r.raise_for_status()
  data = r.json()
  return data["access_token"], data["refresh_token"]

access, refresh = login("student@practice.scrapingcentral.com", "practice123")

Refreshing without re-logging in

When the access token expires (or is about to), call the refresh endpoint:

def refresh_access(refresh_token):
  r = requests.post(
    "https://practice.scrapingcentral.com/api/auth/refresh",
    json={"refresh_token": refresh_token},
  )
  r.raise_for_status()
  data = r.json()
  # The refresh token may rotate too; if the server doesn't return one, keep the old one.
  return data["access_token"], data.get("refresh_token", refresh_token)

Some servers issue a new refresh token on each refresh ("refresh token rotation"). Always store whatever the response gives you back.

Local expiry detection

Don't wait for a 401; check expiry locally before each call:

import time

def is_expired(token: str, leeway: int = 30) -> bool:
  payload = decode_jwt(token)["payload"]
  return time.time() + leeway >= payload.get("exp", 0)

The 30-second leeway means "refresh if it expires within 30 seconds". It avoids the race where the token is still valid when you check but has expired by the time the request lands.
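The effect of the leeway is easy to check offline with a fake token. In this sketch, make_unsigned_token is a hypothetical test helper (the signature is a placeholder, so no server would accept the token), and is_expired takes an extra now argument, added here only so the example is deterministic.

```python
import base64
import json
import time


def make_unsigned_token(exp: float) -> str:
    """Test helper: a JWT-shaped string with a placeholder signature."""
    def enc(obj: dict) -> str:
        return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()
    return ".".join([enc({"alg": "none"}), enc({"exp": exp}), "fake"])


def is_expired(token: str, leeway: int = 30, now=None) -> bool:
    """Same check as above, with an injectable clock for testing."""
    payload_b64 = token.split(".")[1]
    payload = json.loads(base64.urlsafe_b64decode(payload_b64 + "=" * (-len(payload_b64) % 4)))
    now = time.time() if now is None else now
    return now + leeway >= payload.get("exp", 0)


now = 1_000_000
token = make_unsigned_token(exp=now + 10)      # expires 10 seconds from 'now'
print(is_expired(token, leeway=0, now=now))    # False: still valid right now
print(is_expired(token, leeway=30, now=now))   # True: expires within the 30 s window
```

Note that a token with no exp claim is treated as expired here, because the default of 0 is always in the past.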

Putting it all together: a JWT-aware client

import requests, time

# Relies on decode_jwt() from the Anatomy section.
class JwtClient:
  BASE = "https://practice.scrapingcentral.com"

  def __init__(self, email, password):
    self.email, self.password = email, password
    self.s = requests.Session()
    self.access = self.refresh = None
    self._login()

  def _login(self):
    r = self.s.post(f"{self.BASE}/api/auth/login",
                    json={"email": self.email, "password": self.password})
    r.raise_for_status()
    d = r.json()
    self.access, self.refresh = d["access_token"], d["refresh_token"]

  def _maybe_refresh(self):
    if not self.access or self._is_expired(self.access):
      try:
        r = self.s.post(f"{self.BASE}/api/auth/refresh",
                        json={"refresh_token": self.refresh})
        r.raise_for_status()
        d = r.json()
        self.access = d["access_token"]
        # keep the old refresh token if the server doesn't rotate it
        self.refresh = d.get("refresh_token", self.refresh)
      except requests.HTTPError:
        self._login()  # refresh expired too; fall back to full login

  @staticmethod
  def _is_expired(token: str, leeway: int = 30) -> bool:
    payload = decode_jwt(token)["payload"]
    return time.time() + leeway >= payload.get("exp", 0)

  def call(self, method, path, **kw):
    self._maybe_refresh()
    kw.setdefault("headers", {})["Authorization"] = f"Bearer {self.access}"
    r = self.s.request(method, f"{self.BASE}{path}", **kw)
    if r.status_code == 401:
      # token rejected unexpectedly: full re-login, then retry once
      self._login()
      kw["headers"]["Authorization"] = f"Bearer {self.access}"
      r = self.s.request(method, f"{self.BASE}{path}", **kw)
    r.raise_for_status()
    return r.json() if r.content else None

PHP version

class JwtClient {
  private string $access = '';
  private string $refresh = '';

  public function __construct(private string $email, private string $password) {
    $this->login();
  }

  private function login(): void {
    $res = json_decode(
      file_get_contents('https://practice.scrapingcentral.com/api/auth/login',
        false,
        stream_context_create(['http' => [
          'method' => 'POST',
          'header' => "Content-Type: application/json\r\n",
          'content' => json_encode(['email' => $this->email, 'password' => $this->password]),
        ]])
      ), true);
    $this->access = $res['access_token'];
    $this->refresh = $res['refresh_token'];
  }
  // ...same shape: maybeRefresh, call, isExpired, etc.
}

(In practice, use Guzzle rather than file_get_contents; it is shown here only for illustration.)

Storage caveats

  • The refresh token is a long-lived credential. Store it like a password: env vars, a vault, an encrypted file. Not in your git repo.
  • Access tokens are short-lived but still sensitive. Don't log them.
  • Some servers tie refresh tokens to the original device/IP. If you scrape from a different IP, the refresh may be rejected.
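As one possible way to persist tokens between runs, the sketch below writes them to a JSON file that only the current user can read. The function names and file layout are assumptions made for this example. Restricting permissions is not encryption, so a vault or an encrypted store is still the stronger option for the refresh token.

```python
import json
import os


def save_tokens(path: str, access: str, refresh: str) -> None:
    """Write tokens to a file created with mode 0600 (owner read/write only).

    The mode applies when the file is created; an existing file keeps its permissions.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump({"access_token": access, "refresh_token": refresh}, f)


def load_tokens(path: str):
    """Return (access, refresh), or None if nothing has been saved yet."""
    try:
        with open(path) as f:
            d = json.load(f)
    except FileNotFoundError:
        return None
    return d["access_token"], d["refresh_token"]
```

Keep the file outside the repository (or list it in .gitignore), and call save_tokens again after every refresh so a rotated refresh token is not lost.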

Hands-on lab

Hit Catalog108's /challenges/api/auth/jwt-with-refresh lab. Log in, decode the access_token, and find the exp claim. Sleep until the token expires, then call a protected endpoint and observe the 401. Use the refresh endpoint to get a new access token without re-logging in. Finally, wrap the whole thing in the JwtClient class above so your scraper can keep running without manual re-auth.

Practice this lesson on Catalog108, our first-party scraping sandbox.
