
Lesson 1.26 · intermediate · 5 min read

Multi-Step Login Flows

Beyond a single login form: multi-step wizards, MFA prompts, captchas, and the patterns to handle each from a scraper.

What you’ll learn

  • Plan a multi-step authentication flow from observed browser behaviour.
  • Replicate username-first-then-password (account-confirmation) flows.
  • Handle MFA via TOTP, SMS-stub, and email-stub for testing.
  • Persist post-login state across runs.

Single-step login is easy. Real-world login is rarely single-step anymore: enter an email, get sent to a password page, maybe an MFA code, sometimes a "review your info" step. Each step issues a new CSRF token, may rotate cookies, and refuses to let you skip ahead. This lesson gives you a systematic approach.

Map the flow first

Before writing code, walk the flow manually in your browser with DevTools' Network tab open:

  1. Click the login button.
  2. Note every URL requested.
  3. Note every form POSTed and its fields.
  4. Note every cookie set or changed.
  5. Note any redirects.

You'll typically see something like:

GET  /login  → 200, set cookie=session_v1
POST /login/identify  → 302, redirect to /login/password
GET  /login/password  → 200, NEW csrf token in form
POST /login/password  → 302, redirect to /dashboard
GET  /dashboard  → 200, logged in

Each arrow is a request your scraper must make in order, with cookies and tokens preserved.
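The hop list above can be encoded as data your scraper asserts against, so a change in the flow fails loudly at the first wrong step. A minimal sketch (the paths and statuses are the ones from the example trace, not anything a given site guarantees):

```python
# The mapped flow, one (method, path, status) tuple per hop.
EXPECTED_HOPS = [
  ("GET", "/login", 200),
  ("POST", "/login/identify", 302),
  ("GET", "/login/password", 200),
  ("POST", "/login/password", 302),
  ("GET", "/dashboard", 200),
]

def check_hop(step, method, path, status):
  """Fail fast when an observed request deviates from the mapped flow,
  instead of continuing a login that has already gone off the rails."""
  want = EXPECTED_HOPS[step]
  got = (method, path, status)
  if got != want:
    raise RuntimeError(f"step {step}: expected {want}, got {got}")

check_hop(1, "POST", "/login/identify", 302)  # matches the trace, passes silently
```

Call check_hop after each request with the response's status (use allow_redirects=False if you want to see each 302 individually rather than the final page).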

Username-first / account-confirmation pattern

Common in modern SSO-style flows (Google, GitHub, Okta-protected apps):

import requests
from bs4 import BeautifulSoup

s = requests.Session()
s.headers["User-Agent"] = "Mozilla/5.0 ..."

# Step 1: identify
r = s.get("https://practice.scrapingcentral.com/challenges/static/forms/multi-step")
soup = BeautifulSoup(r.content, "lxml")
token1 = soup.select_one('input[name="csrf_token"]')["value"]

r = s.post(
  "https://practice.scrapingcentral.com/challenges/static/forms/multi-step/identify",
  data={"csrf_token": token1, "username": "student@practice.scrapingcentral.com"},
)

# Step 2: password
soup = BeautifulSoup(r.content, "lxml")  # response is the password page
token2 = soup.select_one('input[name="csrf_token"]')["value"]

r = s.post(
  "https://practice.scrapingcentral.com/challenges/static/forms/multi-step/password",
  data={"csrf_token": token2, "password": "practice123"},
)

# Step 3: confirm we landed somewhere logged-in
print(r.status_code, r.url)

Three things that matter:

  1. Use the same Session throughout. Cookies persist between steps.
  2. Re-extract the CSRF token from each response. Tokens almost always rotate per step.
  3. Use the response of the previous POST as the source of the next form. Sometimes the server returns the password form directly in the POST response (no separate GET needed); inspect to confirm.
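The per-step token extraction is worth wrapping in a helper, because `soup.select_one(...)["value"]` dies with a cryptic TypeError when the field is missing or renamed. A stdlib-only sketch (use the BeautifulSoup one-liner from the example above if you already depend on it):

```python
from html.parser import HTMLParser

class CSRFExtractor(HTMLParser):
  """Collect the value of a named hidden input using only the stdlib."""
  def __init__(self, field="csrf_token"):
    super().__init__()
    self.field = field
    self.value = None

  def handle_starttag(self, tag, attrs):
    a = dict(attrs)
    if tag == "input" and a.get("name") == self.field:
      self.value = a.get("value")

def extract_csrf(html, field="csrf_token"):
  """Return the CSRF token, or raise loudly if the flow changed."""
  p = CSRFExtractor(field)
  p.feed(html)
  if p.value is None:
    raise ValueError(f"no {field} input found; did the flow change?")
  return p.value

page = '<form><input type="hidden" name="csrf_token" value="abc123"></form>'
print(extract_csrf(page))  # abc123
```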

TOTP-based MFA

Many sites prompt for a 6-digit code from an authenticator app. If you have the original TOTP secret (the QR code's seed value), you can generate codes programmatically:

import pyotp
totp = pyotp.TOTP("JBSWY3DPEHPK3PXP")  # your seed
code = totp.now()
print(code)  # e.g. '847502'

Submit this as the MFA field. Same Session, same flow as a regular form step.

pyotp and PHP's various OTP libraries (otphp/otphp, phpgangsta/googleauthenticator) do the math. The hard part is getting the seed once during setup; after that, it's automation-friendly.
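If you want to see exactly what those libraries compute, TOTP is small enough to write out by hand. A stdlib-only sketch of the RFC 6238 algorithm (pyotp remains the better choice in real code):

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32, at=None, digits=6, period=30):
  """RFC 6238 TOTP: HMAC-SHA1 over the 30-second counter, dynamically
  truncated to a 6-digit code. Equivalent to pyotp.TOTP(secret).now()."""
  key = base64.b32decode(secret_b32.upper())
  counter = int((time.time() if at is None else at) // period)
  digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
  offset = digest[-1] & 0x0F
  code = (struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF) % (10 ** digits)
  return str(code).zfill(digits)

# RFC 4226's published test secret; counter 0 must yield 755224.
print(totp("GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ", at=0))  # 755224
```

The test vector comes from the HOTP spec (RFC 4226, Appendix D), which TOTP reuses with the time-step counter.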

SMS- and email-based MFA

If the second factor is delivered via SMS or email, you have three options for automated testing:

  1. Use a mock provider: services like Twilio's test mode, or self-hosted mail catchers (MailHog, Mailpit). Your scraper polls the mailbox for the latest code.
  2. Use a disposable-inbox service: email-testing APIs that expose inboxes over HTTP.
  3. Disable MFA on the test account, when you control it.
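For option 1, the polling side is a plain HTTP call plus a regex. A sketch assuming a local MailHog instance; the response shape ({"items": [{"Content": {"Body": ...}}]}) is MailHog's v2 API, and Mailpit exposes a similar but differently-shaped API:

```python
import json
import re
from urllib.request import urlopen

MAILHOG_API = "http://localhost:8025/api/v2/messages"  # assumed local mail catcher

def extract_code(body, digits=6):
  """Pull the first standalone 6-digit code out of a message body."""
  m = re.search(rf"\b\d{{{digits}}}\b", body)
  return m.group(0) if m else None

def latest_code():
  """Fetch the newest caught message from MailHog and extract its code."""
  with urlopen(MAILHOG_API) as resp:
    data = json.load(resp)
  if not data.get("items"):
    return None  # nothing delivered yet; caller should retry after a delay
  return extract_code(data["items"][0]["Content"]["Body"])

print(extract_code("Your verification code is 847502. It expires in 10 minutes."))  # 847502
```

In practice you poll latest_code() in a short sleep loop, then submit the result as the MFA form field in the same Session.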

For real production scrapes against sites you don't control, MFA scraping is usually NOT what you want: the legal and ethical risk is too high. MFA exists to ensure a human is present.

Persisting login across runs

Logging in on every scraper run is wasteful and may trip "suspicious login" alerts. Two patterns:

Save and reload cookies

import pickle

# After login, save the session cookies
with open("session.pkl", "wb") as f:
  pickle.dump(s.cookies, f)

# Next run, load and skip login
s = requests.Session()
with open("session.pkl", "rb") as f:
  s.cookies.update(pickle.load(f))

# Test if session is still valid (requests needs an absolute URL)
r = s.get("https://practice.scrapingcentral.com/dashboard")
if "Login" in r.text:
  # session expired, re-do full login
  ...
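The save/load/validate steps above combine naturally into one helper that either reuses cached cookies or falls back to a full login. A sketch, assuming the same "Login" text check and a hypothetical do_login callback that performs the multi-step flow:

```python
import os
import pickle

import requests

DASHBOARD = "https://practice.scrapingcentral.com/dashboard"  # assumed validity-check URL

def get_session(do_login, session_file="session.pkl"):
  """Return a logged-in Session: reuse pickled cookies when they still
  work, otherwise run do_login(s) and cache the fresh cookies."""
  s = requests.Session()
  if os.path.exists(session_file):
    with open(session_file, "rb") as f:
      s.cookies.update(pickle.load(f))
    r = s.get(DASHBOARD)
    if "Login" not in r.text:
      return s                # cached session still valid
    s = requests.Session()    # expired: start over with a clean jar
  do_login(s)
  with open(session_file, "wb") as f:
    pickle.dump(s.cookies, f)
  return s
```

Each scraper run then starts with `s = get_session(perform_full_login)` and never logs in more often than necessary.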

Use API tokens / refresh tokens

Many modern sites support OAuth or personal access tokens that are stable for weeks/months and don't require re-login. For OAuth flows specifically, see the API Scraping sub-path.
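With a token in hand, the entire multi-step dance collapses to a single header. A sketch (the Bearer scheme and token value are placeholders; check what the target API actually expects):

```python
import requests

s = requests.Session()
# One static header replaces the whole login flow; every request on this
# session carries it automatically. Rotate the token out-of-band when it
# expires instead of re-running login steps.
s.headers["Authorization"] = "Bearer YOUR_PERSONAL_ACCESS_TOKEN"
```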

PHP: BrowserKit for the same flow

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

$browser = new HttpBrowser(HttpClient::create());

// Step 1
$crawler = $browser->request('GET', 'https://practice.scrapingcentral.com/challenges/static/forms/multi-step');
$form = $crawler->selectButton('Continue')->form(['username' => '...']);
$crawler = $browser->submit($form);

// Step 2: the Crawler now wraps the response of step 1 (the password page)
$form = $crawler->selectButton('Login')->form(['password' => '...']);
$crawler = $browser->submit($form);

// Step 3, confirm
echo $browser->getResponse()->getStatusCode();

BrowserKit's automatic cookie and redirect handling makes this trivial: each $browser->submit() returns a Crawler over the next page, ready for extracting the next form.

Detecting login success

Three signals to check:

  1. URL after the final POST. A redirect to /dashboard (or wherever) is a positive signal; staying on /login is failure.
  2. Status code. 200 OK after redirect or a 302 to an internal URL.
  3. Page content. Look for "Welcome, [user]" or absence of "Sign in" links.

Combine all three: sites sometimes return 200 with an inline error ("Wrong password") and no redirect.

def is_logged_in(r):
  if "/login" in r.url:
    return False
  if "Sign in" in r.text or "Login failed" in r.text:
    return False
  if "Welcome" in r.text or "Logout" in r.text:
    return True
  return None  # ambiguous, investigate manually

Handling rate limits and lockouts

Wrong credentials usually return a generic "invalid login", but repeated failures may lock the account. Three safety practices:

  1. Cache successful sessions. Don't re-login on every run.
  2. Don't retry on a failed login. A 401/403 means "stop," not "try again."
  3. Limit login attempts in dev. Hit the lab with known-good creds. Brute-forcing is its own ethical/legal category.
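Those rules can be enforced in code rather than by discipline. A sketch of a login guard (the status-code convention is an assumption; adapt it to how your do_login reports failure):

```python
class LoginFailed(Exception):
  """Raised on a definitive rejection; callers must NOT retry."""

def attempt_login(do_login, max_attempts=1):
  """Run the login flow at most max_attempts times. A 401/403-style
  rejection raises immediately instead of looping, so a bad credential
  can't snowball into an account lockout."""
  for attempt in range(max_attempts):
    status = do_login()
    if status in (401, 403):
      raise LoginFailed(f"rejected with {status}; fix credentials, don't retry")
    if status == 200:
      return True
  raise LoginFailed(f"no success after {max_attempts} attempt(s)")

print(attempt_login(lambda: 200))  # True
```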

What this won't help with

  • Captcha-on-login. That's an explicit anti-bot challenge. Sub-Path 5 (Production) covers captcha-solving services.
  • Behavioural fingerprinting. Some sites (banks, social networks) profile mouse movement and keystroke timing. Static scrapers can't pass; you'd need browser automation (Dynamic Web sub-path) plus behaviour simulation.

Hands-on lab

Work through /challenges/static/forms/multi-step. Identify, by inspecting in your browser, whether it's a two-step username-then-password flow or has more steps. Implement the scraper. Confirm you land on a "logged in" page. Then deliberately submit the steps in the wrong order and confirm the server rejects them.

Practice this lesson on Catalog108, our first-party scraping sandbox.

Quiz: check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.


Why must you re-extract the CSRF token from each step's response in a multi-step login?
