
F2 · Beginner · 5 min read

HTTP Protocol: Methods, Status Codes, Headers

The actual on-the-wire format your scraper speaks. Methods you'll use, status codes you'll interpret, headers that change behaviour.

What you’ll learn

  • Read and write a raw HTTP request and response by hand.
  • Pick the right method (GET, POST, PUT, PATCH, DELETE, HEAD, OPTIONS) for a task.
  • Map every status code class (1xx–5xx) to a scraping response strategy.
  • Identify the headers a scraper actually controls vs. ones the OS/library sets for you.

HTTP is the language your scraper speaks. You can scrape without understanding it, but you'll be guessing every time a request behaves unexpectedly. Spend an hour here and you'll debug ten times faster for the rest of your career.

The shape of a request

An HTTP request is plain text. Three parts:

GET /products?page=2 HTTP/1.1  ← request line
Host: practice.scrapingcentral.com  ← headers (one per line)
User-Agent: my-scraper/1.0
Accept: text/html,application/xhtml+xml
Cookie: session=abc123

  ← blank line separating headers from body
<request body here, if any>

For a GET, the body is empty. For POST / PUT / PATCH, the body carries form data, JSON, or whatever the server expects (and you set Content-Type accordingly).
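To make that concrete, here is a sketch that assembles the raw bytes of a JSON POST by hand. The `/login` path and the credentials are made up for illustration; the point is the relationship between the body, Content-Type, and Content-Length.

```python
import json

# Assemble the on-the-wire bytes of a JSON POST. Endpoint and payload
# are hypothetical.
payload = {"user": "demo", "password": "hunter2"}
body = json.dumps(payload).encode("utf-8")

request_bytes = (
    "POST /login HTTP/1.1\r\n"
    "Host: practice.scrapingcentral.com\r\n"
    "Content-Type: application/json\r\n"
    f"Content-Length: {len(body)}\r\n"   # must match the body's byte length
    "Connection: close\r\n"
    "\r\n"                               # blank line ends the headers
).encode("ascii") + body

print(request_bytes.decode())
```

Note that Content-Length counts bytes, not characters, which is why the body is encoded before its length is measured.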

You can send a raw request by hand with nc:

printf 'GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n' \
  | nc example.com 80

Every scraping library is, at root, doing exactly that for you.
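You can do the same thing from Python with nothing but the standard library's `socket` module. This is a sketch, not production code: the send is wrapped in try/except so it degrades gracefully where outbound network access is unavailable.

```python
import socket

# The same raw request the nc one-liner sends.
request = (
    b"GET / HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Connection: close\r\n"
    b"\r\n"
)

try:
    with socket.create_connection(("example.com", 80), timeout=5) as sock:
        sock.sendall(request)
        response = b""
        while chunk := sock.recv(4096):
            response += chunk
        # The first line of the response is the status line, e.g. "HTTP/1.1 200 OK"
        print(response.split(b"\r\n", 1)[0].decode())
except OSError as exc:
    print(f"network unavailable: {exc}")
```

Everything a scraping library adds (connection pooling, TLS, decompression, redirects) is layered on top of this exchange.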

The methods you'll use

| Method | When | Body? | Typical use |
| --- | --- | --- | --- |
| GET | Fetch a resource | No | Listings, search, RSS, JSON APIs |
| POST | Submit data / create | Yes | Login, comments, search forms that don't fit in a URL |
| PUT | Replace a resource | Yes | Rare in scraping; common when interacting with REST APIs |
| PATCH | Partial update | Yes | Same, for "edit this one field" calls |
| DELETE | Remove a resource | No (usually) | Almost never used by scrapers |
| HEAD | Like GET but without the body | No | Cheap existence/freshness check; the server returns headers only |
| OPTIONS | "What can I do at this URL?" | No | CORS preflight; rarely needed by scrapers |

90% of scraping is GET and the other 10% is POST. Treat the rest as "exist for completeness."

Status codes, the response signal

Every response starts with a code that tells you the broad outcome. Memorize the buckets, not every individual code.

| Class | Meaning | Examples | Scraper response |
| --- | --- | --- | --- |
| 1xx | Informational | 100 Continue | Ignore; the library handles it |
| 2xx | Success | 200 OK, 201 Created, 204 No Content | Parse the body |
| 3xx | Redirect | 301 Moved Permanently, 302 Found, 304 Not Modified | Follow (or don't, if you want to capture the chain) |
| 4xx | You messed up | 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 429 Too Many Requests | Fix the request; don't retry blindly |
| 5xx | Server messed up | 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout | Retry with backoff; it's transient |
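The bucket logic above collapses naturally into a small dispatch function. A sketch; the action labels are made-up names, not anything from a real library:

```python
def scraper_action(status: int) -> str:
    """Map an HTTP status code to a coarse handling strategy."""
    if status == 429:
        return "back-off-and-retry"   # special-cased: honour Retry-After
    if 100 <= status < 200:
        return "ignore"               # informational; the library handles it
    if 200 <= status < 300:
        return "parse-body"
    if 300 <= status < 400:
        return "follow-redirect"
    if 400 <= status < 500:
        return "fix-request"          # client error: retrying unchanged won't help
    if 500 <= status < 600:
        return "retry-with-backoff"   # server error: usually transient
    raise ValueError(f"not an HTTP status code: {status}")

print(scraper_action(200), scraper_action(404), scraper_action(503))
```

Note that 429 is checked first: it lives in the 4xx class but is the one client error where retrying (after waiting) is exactly the right move.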

The two that confuse beginners most:

403 vs 401. 401 means "you didn't authenticate": send credentials and try again. 403 means "you authenticated fine, but you're not allowed": credentials won't help. In scraping, a 403 is often the anti-bot system saying it sniffed you out; sometimes it's geographic blocking; it's rarely actually about authentication.

429. "Too Many Requests." Servers often send this with a Retry-After header telling you how long to wait. Respect it. Aggressively retrying on 429 is the fastest way to get your IP banned.
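Retry-After comes in two flavours: a delta in seconds, or an HTTP-date. A sketch of a helper that honours it when present and falls back to exponential backoff otherwise (the function name and fallback policy are my own, not from any library):

```python
import email.utils
import time
from typing import Optional

def retry_delay(retry_after: Optional[str], attempt: int) -> float:
    """Seconds to sleep before retrying a 429/503 response.

    Honours Retry-After if present (delta-seconds or an HTTP-date);
    otherwise falls back to exponential backoff: 1s, 2s, 4s, ...
    """
    if retry_after:
        if retry_after.isdigit():
            return float(retry_after)
        # HTTP-date form, e.g. "Wed, 21 Oct 2025 07:28:00 GMT"
        when = email.utils.parsedate_to_datetime(retry_after)
        return max(0.0, when.timestamp() - time.time())
    return float(2 ** attempt)

print(retry_delay("120", attempt=0))  # → 120.0
print(retry_delay(None, attempt=3))   # → 8.0
```

The same header (and the same logic) applies to 503 responses, which sometimes carry a Retry-After during maintenance windows.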

Headers, where the action is

Headers are the metadata of HTTP. There are dozens. Scrapers care about a handful:

Request headers you should set

| Header | Purpose |
| --- | --- |
| Host | Which virtual host on the server you want (set automatically by HTTP libraries) |
| User-Agent | Identifies your client. The default `python-requests/2.x` is a giveaway; set a real browser UA to look normal |
| Accept | What content types you can parse; `text/html,application/json,*/*` is fine |
| Accept-Language | `en-US,en;q=0.9`, or whatever region you want responses in |
| Accept-Encoding | `gzip, deflate, br`; the server compresses, your library decompresses for you |
| Cookie | Session state, auth tokens, consent flags |
| Referer | Where you "came from"; some sites gate access on a valid Referer |
| Authorization | Bearer tokens, Basic auth |
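Pulled together, a browser-like header set looks something like the sketch below. Every value is illustrative; in practice, copy a current UA string from your own browser's dev tools.

```python
# A browser-like header set assembled from the table above. Values are
# illustrative, not canonical.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/json;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://practice.scrapingcentral.com/",
}

# Host, Content-Length, and Connection are deliberately absent:
# the HTTP library sets those for you.
print(sorted(headers))
```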

Response headers you should read

| Header | What it tells you |
| --- | --- |
| Content-Type | `text/html`, `application/json`, etc.; parse accordingly |
| Content-Length | Body size in bytes |
| Content-Encoding | `gzip` / `br`; your library handles this transparently |
| Set-Cookie | New cookies to save and send back next time |
| Location | Where a 3xx redirect points |
| Retry-After | Sent with 429 and 503 responses; wait this many seconds (or until the given HTTP-date) |
| Link | RFC 8288 (formerly RFC 5988) link headers; `<...>; rel="next"` is a common pagination pattern |
| X-RateLimit-Remaining | Non-standard but ubiquitous; how many requests you have left |
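The Link header is worth knowing how to pick apart, since `rel="next"` drives pagination on many APIs. A simplified parser sketch (it ignores edge cases like commas inside URLs; the example URLs are made up):

```python
import re

def parse_link_header(value: str) -> dict:
    """Parse a Link header into {rel: url}. Simplified: splits on commas."""
    links = {}
    for part in value.split(","):
        match = re.search(r'<([^>]+)>\s*;\s*rel="?([^";]+)"?', part)
        if match:
            url, rel = match.group(1), match.group(2)
            links[rel] = url
    return links

header = (
    '<https://api.example.com/items?page=3>; rel="next", '
    '<https://api.example.com/items?page=1>; rel="first"'
)
print(parse_link_header(header)["next"])
```

A crawl loop then just follows `links["next"]` until it disappears.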

Connection: keep-alive

The bit nobody teaches: HTTP/1.1 keeps the TCP connection open by default. That means your second request to the same host doesn't pay the cost of a new TCP + TLS handshake. Use a session/client object (requests.Session, Guzzle's persistent client, Playwright's reused context) so connection reuse keeps working. Opening a fresh connection for every request makes a scraper 3–5x slower and louder.
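In Python that means creating one Session and routing every request through it. A minimal sketch assuming the third-party `requests` package is installed; the crawl loop is commented out because the URL is our practice-site placeholder and nothing needs to be fetched to make the point:

```python
import requests  # third-party: pip install requests

# One Session = one connection pool. Requests made through it to the same
# host reuse the open TCP/TLS connection instead of handshaking each time,
# and headers/cookies set here persist across every call.
session = requests.Session()
session.headers.update({"User-Agent": "my-scraper/1.0"})

# Hypothetical crawl loop: each iteration reuses the pooled connection.
# for page in range(1, 6):
#     resp = session.get("https://practice.scrapingcentral.com/products",
#                        params={"page": page})
#     resp.raise_for_status()

print(session.headers["User-Agent"])
```

The anti-pattern is calling `requests.get(...)` in a loop: each call builds and tears down its own connection, which is exactly the cost the session exists to avoid.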

Hands-on lab

Use curl -v against practice.scrapingcentral.com/ (the Catalog108 homepage) and identify every header in both the request and the response. Then try curl -I: what's different? Then send a POST to any URL on the practice site and read the status code you get back. The point isn't to extract data; it's to internalize the exact lines flowing back and forth on the wire.

Quiz, check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.
