HTTP Protocol: Methods, Status Codes, Headers
The actual on-the-wire format your scraper speaks. Methods you'll use, status codes you'll interpret, headers that change behaviour.
What you’ll learn
- Read and write a raw HTTP request and response by hand.
- Pick the right method (GET, POST, PUT, PATCH, DELETE, HEAD, OPTIONS) for a task.
- Map every status code class (1xx–5xx) to a scraping response strategy.
- Identify the headers a scraper actually controls vs. ones the OS/library sets for you.
HTTP is the language your scraper speaks. You can scrape without understanding it, but you'll be guessing every time a request behaves unexpectedly. Spend an hour here and you'll debug ten times faster for the rest of your career.
The shape of a request
An HTTP request is plain text. Three parts:
```
GET /products?page=2 HTTP/1.1             ← request line
Host: practice.scrapingcentral.com        ← headers (one per line)
User-Agent: my-scraper/1.0
Accept: text/html,application/xhtml+xml
Cookie: session=abc123
                                          ← blank line separating headers from body
<request body here, if any>
```
For a GET, the body is empty. For POST / PUT / PATCH, the body carries form data, JSON, or whatever the server expects (and you set Content-Type accordingly).
You can send a raw request by hand with `nc`:

```
printf 'GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n' \
  | nc example.com 80
```
Every scraping library is, at root, doing exactly that for you.
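The same exchange one level up, in Python, using only the standard library (a sketch; example.com and port 80 are just the values from the nc example above):

```python
import socket

# Write the request bytes by hand over a TCP connection.
# This is (roughly) what every HTTP library does under the hood.
request = (
    "GET / HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Connection: close\r\n"
    "\r\n"
)

with socket.create_connection(("example.com", 80)) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):   # read until the server closes
        response += chunk

# Status line and headers arrive before the blank line; body after it.
head, _, body = response.partition(b"\r\n\r\n")
print(head.decode("ascii", errors="replace"))
```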
The methods you'll use
| Method | When | Body? | Typical use |
|---|---|---|---|
| GET | Fetch a resource | No | Listings, search, RSS, JSON APIs |
| POST | Submit data / create | Yes | Login, comments, search forms that don't fit in a URL |
| PUT | Replace a resource | Yes | Rare in scraping; common when interacting with REST APIs |
| PATCH | Partial update | Yes | Same, for "edit this one field" calls |
| DELETE | Remove a resource | No (usually) | Almost never used by scrapers |
| HEAD | Like GET but without the body | No | Cheap existence/freshness check, server returns headers only |
| OPTIONS | "What can I do at this URL?" | No | CORS preflight; rarely needed by scrapers |
90% of scraping is GET and the other 10% is POST. Treat the rest as "exist for completeness."
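Here's what those first three look like with Python's requests (a sketch; the /products and /login paths on the practice site are assumptions for illustration):

```python
import requests

# GET: parameters travel in the URL's query string.
r = requests.get("https://practice.scrapingcentral.com/products",
                 params={"page": 2})
print(r.status_code, r.headers.get("Content-Type"))

# POST: the body carries the data. data= sends a form-encoded body
# (Content-Type: application/x-www-form-urlencoded); json= would send
# JSON and set Content-Type: application/json instead.
r = requests.post("https://practice.scrapingcentral.com/login",  # hypothetical endpoint
                  data={"user": "demo", "password": "demo"})
print(r.status_code)

# HEAD: same headers as the GET would return, but no body.
r = requests.head("https://practice.scrapingcentral.com/products")
print(r.headers.get("Content-Length"))
```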
Status codes, the response signal
Every response starts with a code that tells you the broad outcome. Memorize the buckets, not every individual code.
| Class | Meaning | Examples | Scraper response |
|---|---|---|---|
| 1xx | Informational | 100 Continue | Ignore, library handles it |
| 2xx | Success | 200 OK, 201 Created, 204 No Content | Parse the body |
| 3xx | Redirect | 301 Moved Permanently, 302 Found, 304 Not Modified | Follow (or don't, if you want to capture the chain) |
| 4xx | You messed up | 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 429 Too Many Requests | Fix the request, don't retry blindly |
| 5xx | Server messed up | 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout | Retry with backoff, it's transient |
The two that confuse beginners most:
403 vs 401. 401 means "you didn't authenticate": send credentials and try again. 403 means "you authenticated fine, but you're not allowed": credentials won't help. In scraping, a 403 is often the anti-bot system saying it sniffed you out; sometimes it's geographic; it's rarely about authentication at all.
429. "Too Many Requests." Servers often send this with a Retry-After header telling you how long to wait. Respect it: aggressive retrying on 429 is the fastest way to get your IP banned.
Headers, where the action is
Headers are the metadata of HTTP. There are dozens. Scrapers care about a handful:
Request headers you should set
| Header | Purpose |
|---|---|
| Host | Which virtual host on the server you want (set automatically by HTTP libraries) |
| User-Agent | Identifies your client. The default `python-requests/2.x` is a giveaway; set a real browser UA to look normal |
| Accept | What content types you can parse; `text/html,application/json,*/*` is fine |
| Accept-Language | `en-US,en;q=0.9`, or whatever region you want responses in |
| Accept-Encoding | `gzip, deflate, br`; the server compresses, your library decompresses for you |
| Cookie | Session state, auth tokens, consent flags |
| Referer | Where you "came from"; some sites gate access on a valid Referer |
| Authorization | Bearer tokens, Basic auth |
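Setting them with requests looks like this (a sketch; the User-Agent string is an illustrative example, not a recommendation for one specific UA):

```python
import requests

# Headers a scraper typically controls. Host, Content-Length, and
# Accept-Encoding are set by the library; override only if you must.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"          # example UA string
    ),
    "Accept": "text/html,application/xhtml+xml,*/*",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://practice.scrapingcentral.com/",
}

r = requests.get("https://practice.scrapingcentral.com/products",
                 headers=headers)
print(r.request.headers)   # see exactly what went out on the wire
```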
Response headers you should read
| Header | What it tells you |
|---|---|
| Content-Type | `text/html`, `application/json`, etc.; parse accordingly |
| Content-Length | Body size in bytes |
| Content-Encoding | `gzip` / `br`; your library handles this transparently |
| Set-Cookie | New cookies to save and send back next time |
| Location | Where a 3xx redirect points |
| Retry-After | Often sent with 429 and 503; wait this many seconds before retrying |
| Link | RFC 5988 link headers; `<...>; rel="next"` is a common pagination pattern |
| X-RateLimit-Remaining | Non-standard but ubiquitous; how many requests you have left |
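Reading them back is just a dict lookup. requests also pre-parses the Link header into `r.links`, which makes the `rel="next"` pagination pattern a two-liner (a sketch; assumes the practice site sends Link headers):

```python
import requests

r = requests.get("https://practice.scrapingcentral.com/products")

print(r.headers["Content-Type"])               # e.g. text/html; charset=utf-8
print(r.headers.get("X-RateLimit-Remaining"))  # None if the site doesn't send it

# requests parses the Link header into r.links, keyed by rel.
# Walk rel="next" links to paginate.
while "next" in r.links:
    r = requests.get(r.links["next"]["url"])
    print("fetched", r.url)
```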
Connection: keep-alive
The bit nobody teaches: HTTP/1.1 keeps the TCP connection open by default. That means your second request to the same host doesn't pay the cost of a new TCP + TLS handshake. Use a session/client object (requests.Session, Guzzle's persistent client, Playwright's reused context) to keep connection reuse working. Switching to fresh connections on every request makes a scraper 3–5x slower and louder.
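In requests that's one object (a minimal sketch):

```python
import requests

session = requests.Session()          # one pool of keep-alive connections
session.headers["User-Agent"] = "my-scraper/1.0"

# All three requests reuse the same TCP+TLS connection to the host,
# instead of paying a fresh handshake each time.
for page in range(1, 4):
    r = session.get("https://practice.scrapingcentral.com/products",
                    params={"page": page})
    print(page, r.status_code)
```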
Hands-on lab
Use `curl -v` against practice.scrapingcentral.com/ (the Catalog108 homepage) and identify every header in both the request and the response. Then try `curl -I`; what's different? Finally, send a POST to any URL on the practice site and read the status code you get back. The point isn't to extract data; it's to internalize the exact lines flowing back and forth on the wire.
Practice this lesson on Catalog108, our first-party scraping sandbox.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.