HTTP Protocol: Methods, Status Codes, Headers
The actual on-the-wire format your scraper speaks. Methods you'll use, status codes you'll interpret, headers that change behaviour.
What you’ll learn
- Read and write a raw HTTP request and response by hand.
- Pick the right method (GET, POST, PUT, PATCH, DELETE, HEAD, OPTIONS) for a task.
- Map every status code class (1xx–5xx) to a scraping response strategy.
- Identify the headers a scraper actually controls vs. ones the OS/library sets for you.
HTTP is the language your scraper speaks. You can scrape without understanding it, but you'll be guessing every time a request behaves unexpectedly. Spend an hour here and you'll debug ten times faster for the rest of your career.
The shape of a request
An HTTP request is plain text. Three parts:
```
GET /products?page=2 HTTP/1.1             ← request line
Host: practice.scrapingcentral.com        ← headers (one per line)
User-Agent: my-scraper/1.0
Accept: text/html,application/xhtml+xml
Cookie: session=abc123
                                          ← blank line separating headers from body
<request body here, if any>
```
For a GET, the body is empty. For POST / PUT / PATCH, the body carries form data, JSON, or whatever the server expects (and you set Content-Type accordingly).
You can send a raw request by hand with `nc`:

```
printf 'GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n' \
  | nc example.com 80
```
Every scraping library is, at root, doing exactly that for you.
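The same exchange one level up, in Python, using only the standard library (a sketch; example.com and port 80 are just the values from the nc example above):

```python
import socket

# Write the request bytes by hand over a TCP connection.
# This is (roughly) what every HTTP library does under the hood.
request = (
    "GET / HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Connection: close\r\n"
    "\r\n"
)

with socket.create_connection(("example.com", 80)) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):   # read until the server closes
        response += chunk

# Status line and headers arrive before the blank line; body after it.
head, _, body = response.partition(b"\r\n\r\n")
print(head.decode("ascii", errors="replace"))
```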
The methods you'll use
| Method | When | Body? | Typical use |
|---|---|---|---|
| GET | Fetch a resource | No | Listings, search, RSS, JSON APIs |
| POST | Submit data / create | Yes | Login, comments, search forms that don't fit in a URL |
| PUT | Replace a resource | Yes | Rare in scraping; common when interacting with REST APIs |
| PATCH | Partial update | Yes | Same, for "edit this one field" calls |
| DELETE | Remove a resource | No (usually) | Almost never used by scrapers |
| HEAD | Like GET but without the body | No | Cheap existence/freshness check, server returns headers only |
| OPTIONS | "What can I do at this URL?" | No | CORS preflight; rarely needed by scrapers |
90% of scraping is GET and the other 10% is POST. Treat the rest as "exist for completeness."
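Here's what those first three look like with Python's requests (a sketch; the /products and /login paths on the practice site are assumptions for illustration):

```python
import requests

# GET: parameters travel in the URL's query string.
r = requests.get("https://practice.scrapingcentral.com/products",
                 params={"page": 2})
print(r.status_code, r.headers.get("Content-Type"))

# POST: the body carries the data. data= sends a form-encoded body
# (Content-Type: application/x-www-form-urlencoded); json= would send
# JSON and set Content-Type: application/json instead.
r = requests.post("https://practice.scrapingcentral.com/login",  # hypothetical endpoint
                  data={"user": "demo", "password": "demo"})
print(r.status_code)

# HEAD: same headers as the GET would return, but no body.
r = requests.head("https://practice.scrapingcentral.com/products")
print(r.headers.get("Content-Length"))
```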
Status codes, the response signal
Every response starts with a code that tells you the broad outcome. Memorize the buckets, not every individual code.
| Class | Meaning | Examples | Scraper response |
|---|---|---|---|
| 1xx | Informational | 100 Continue | Ignore, library handles it |
| 2xx | Success | 200 OK, 201 Created, 204 No Content | Parse the body |
| 3xx | Redirect | 301 Moved Permanently, 302 Found, 304 Not Modified | Follow (or don't, if you want to capture the chain) |
| 4xx | You messed up | 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 429 Too Many Requests | Fix the request, don't retry blindly |
| 5xx | Server messed up | 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout | Retry with backoff, it's transient |
The two that confuse beginners most:
403 vs 401. 401 means "you didn't authenticate": send credentials and try again. 403 means "you authenticated fine, but you're not allowed": credentials won't help. In scraping, a 403 is often the anti-bot system saying it sniffed you out; sometimes it's geographic; it's rarely about authentication at all.
429. "Too Many Requests." Servers often send this with a Retry-After header telling you how long to wait. Respect it: aggressive retrying on 429 is the fastest way to get your IP banned.
Headers, where the action is
Headers are the metadata of HTTP. There are dozens. Scrapers care about a handful:
Request headers you should set
| Header | Purpose |
|---|---|
| Host | Which virtual host on the server you want (set automatically by HTTP libraries) |
| User-Agent | Identifies your client. The default `python-requests/2.x` is a giveaway; set a real browser UA to look normal |
| Accept | What content types you can parse; `text/html,application/json,*/*` is fine |
| Accept-Language | `en-US,en;q=0.9`, or whatever region you want responses in |
| Accept-Encoding | `gzip, deflate, br`; the server compresses, your library decompresses for you |
| Cookie | Session state, auth tokens, consent flags |
| Referer | Where you "came from"; some sites gate access on a valid Referer |
| Authorization | Bearer tokens, Basic auth |
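Setting them with requests looks like this (a sketch; the User-Agent string is an illustrative example, not a recommendation for one specific UA):

```python
import requests

# Headers a scraper typically controls. Host, Content-Length, and
# Accept-Encoding are set by the library; override only if you must.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"          # example UA string
    ),
    "Accept": "text/html,application/xhtml+xml,*/*",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://practice.scrapingcentral.com/",
}

r = requests.get("https://practice.scrapingcentral.com/products",
                 headers=headers)
print(r.request.headers)   # see exactly what went out on the wire
```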
Response headers you should read
| Header | What it tells you |
|---|---|
| Content-Type | `text/html`, `application/json`, etc.; parse accordingly |
| Content-Length | Body size in bytes |
| Content-Encoding | `gzip` / `br`; your library handles this transparently |
| Set-Cookie | New cookies to save and send back next time |
| Location | Where a 3xx redirect points |
| Retry-After | Often sent with 429 and 503; wait this many seconds before retrying |
| Link | RFC 5988 link headers; `<...>; rel="next"` is a common pagination pattern |
| X-RateLimit-Remaining | Non-standard but ubiquitous; how many requests you have left |
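Reading them back is just a dict lookup. requests also pre-parses the Link header into `r.links`, which makes the `rel="next"` pagination pattern a two-liner (a sketch; assumes the practice site sends Link headers):

```python
import requests

r = requests.get("https://practice.scrapingcentral.com/products")

print(r.headers["Content-Type"])               # e.g. text/html; charset=utf-8
print(r.headers.get("X-RateLimit-Remaining"))  # None if the site doesn't send it

# requests parses the Link header into r.links, keyed by rel.
# Walk rel="next" links to paginate.
while "next" in r.links:
    r = requests.get(r.links["next"]["url"])
    print("fetched", r.url)
```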
Connection: keep-alive
The bit nobody teaches: HTTP/1.1 keeps the TCP connection open by default. That means your second request to the same host doesn't pay the cost of a new TCP + TLS handshake. Use a session/client object (requests.Session, Guzzle's persistent client, Playwright's reused context) to keep connection reuse working. Switching to fresh connections on every request makes a scraper 3–5x slower and louder.
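In requests that's one object (a minimal sketch):

```python
import requests

session = requests.Session()          # one pool of keep-alive connections
session.headers["User-Agent"] = "my-scraper/1.0"

# All three requests reuse the same TCP+TLS connection to the host,
# instead of paying a fresh handshake each time.
for page in range(1, 4):
    r = session.get("https://practice.scrapingcentral.com/products",
                    params={"page": page})
    print(page, r.status_code)
```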
Hands-on lab
Use `curl -v` against practice.scrapingcentral.com/ (the Catalog108 homepage) and identify every header in both the request and the response. Then try `curl -I`; what's different? Finally, send a POST to any URL on the practice site and read the status code you get back. The point isn't to extract data; it's to internalize the exact lines flowing back and forth on the wire.
Practice this lesson on Catalog108, our first-party scraping sandbox.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.