HTTPS, TLS, and Why It Matters for Scraping
What TLS actually does, the certificate verification you'll be tempted to disable but shouldn't, and the fingerprint that gets your scraper blocked before you ever send a request.
What you’ll learn
- Explain in one paragraph what TLS provides on top of TCP.
- Describe the TLS handshake and what data leaks during it.
- Decide when (rarely) it's acceptable to disable certificate verification.
- Recognise that the TLS fingerprint of your scraper is visible to servers before any HTTP.
HTTPS is just HTTP wrapped in TLS. That wrapper does three things:
- Confidentiality: encrypts the request/response so parties in the middle can't read them.
- Integrity: guarantees the bytes you receive are the bytes the server sent.
- Authentication: proves the server is who it claims to be (via a certificate signed by a trusted CA).
For a scraper, all three matter, but the third, authentication, is where most beginners trip up.
The handshake (slightly simplified)
Client                                          Server
  │                                                │
  │  ClientHello (TLS versions, ciphers, SNI)      │
  │ ─────────────────────────────────────────────► │
  │                                                │
  │  ServerHello (chosen cipher, certificate)      │
  │ ◄───────────────────────────────────────────── │
  │                                                │
  │  Key exchange                                  │
  │ ─────────────────────────────────────────────► │
  │  Key exchange                                  │
  │ ◄───────────────────────────────────────────── │
  │                                                │
  │  Finished (encrypted from here on)             │
  │ ◄════════════════════════════════════════════► │
  │                                                │
  │  HTTP request / response (encrypted)           │
The important detail: the ClientHello is sent before any encryption. Anyone watching the wire (and the server) sees:
- Which TLS versions your client supports
- Which cipher suites, in what order
- Which extensions you advertise
- The hostname you want (in SNI, Server Name Indication, also plaintext)
That fingerprint is unique enough to tell python-requests apart from Chrome before a single HTTP byte is exchanged. We'll come back to this in Sub-Path 3 when we cover JA3/JA4 fingerprinting; for now, just know it exists.
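You can peek at part of your own client's ClientHello material with the stdlib `ssl` module: the TLS version range and the ordered cipher list it would offer are among the fields a JA3-style fingerprint hashes. A minimal sketch (exact output varies by Python and OpenSSL build):

```python
import ssl

# Build the default client context, then list what it would advertise in a
# ClientHello. The cipher list and its ORDER feed JA3-style fingerprints.
ctx = ssl.create_default_context()
ciphers = [c["name"] for c in ctx.get_ciphers()]

print("TLS versions:", ctx.minimum_version.name, "..", ctx.maximum_version.name)
print(len(ciphers), "cipher suites offered, starting with", ciphers[:3])
```

Run the same idea in Node or curl and you get a different list in a different order, which is exactly why the two are distinguishable on the wire.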
SNI: why one IP can host many HTTPS sites
In the old days, one IP = one HTTPS site, because the certificate had to be presented before the server knew which hostname was wanted. SNI fixed that: the hostname is sent in the (plaintext) ClientHello, so the server can pick which certificate to send. This is also why scraping by raw IP rarely works for HTTPS: without SNI, the server doesn't know which virtual host you wanted.
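A sketch of the mechanics with the stdlib `ssl` module: `server_hostname` is what fills the plaintext SNI field of the ClientHello (and doubles as the name used for certificate verification). Passing `None` reproduces the raw-IP case, where the server can only fall back to a default virtual host.

```python
import socket
import ssl
from typing import Optional

def tls_connect(host: str, sni: Optional[str] = None, port: int = 443) -> ssl.SSLSocket:
    """Open a TLS connection; `sni` is the hostname placed (in plaintext)
    in the ClientHello. sni=None mimics scraping by raw IP: no SNI field,
    so the server serves its default certificate, if it has one at all."""
    ctx = ssl.create_default_context()
    if sni is None:
        # Without SNI there is no hostname to verify the certificate against.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    raw = socket.create_connection((host, port), timeout=10)
    return ctx.wrap_socket(raw, server_hostname=sni)

# tls_connect("example.com", sni="example.com")  # normal HTTPS client behaviour
```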
Certificate verification (don't turn it off)
When the server sends its certificate, your client checks:
- The certificate's hostname matches the URL hostname.
- The certificate is signed by a trusted Certificate Authority.
- The certificate hasn't expired.
- The signing chain reaches a root CA in your system's trust store.
If any check fails, your library throws. The temptation is to disable verification:
# DON'T do this in production
import requests
requests.get("https://example.com", verify=False)
You will see this in StackOverflow answers constantly. Resist. Here's why:
- A failed cert check usually means something is actually wrong (expired cert, MITM proxy, wrong hostname). Bypassing it loses the protection TLS provides.
- A correctly-configured scraper on the public internet should almost never see cert errors. If you're seeing them often, fix the root cause (update certifi, point at a real cert bundle, fix DNS).
The legitimate exceptions:
- Scraping behind a corporate MITM proxy (point your client at the corporate CA bundle, don't disable).
- Internal development against self-signed certs (only for localhost / private IPs).
What you actually control vs. what's automatic
| Layer | Set by | Notes |
|---|---|---|
| TCP connection | OS | You don't touch this directly |
| TLS version + ciphers | TLS library bundled with your runtime | Differs across Python requests, curl, Node, and Chrome |
| Certificate trust store | OS / certifi package | Keep certifi up to date |
| SNI hostname | Your HTTP library | Set automatically from the URL |
| HTTP version (1.1, 2, 3) | ALPN negotiation inside the TLS handshake | Support varies by library; requests speaks only HTTP/1.1 |
The TLS layer is where scrapers leak the most information about themselves, and where they have the least control without specialised libraries (curl-cffi, tls-client).
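The HTTP-version row deserves a quick illustration: the version is negotiated inside the TLS handshake via the ALPN extension, where the client lists the protocols it speaks and the server picks one. A minimal sketch of the client side:

```python
import ssl

# Offer HTTP/2 first, falling back to HTTP/1.1. The server's choice comes
# back during the handshake; after connecting, selected_alpn_protocol() on
# the wrapped socket returns "h2" or "http/1.1" (or None if ALPN failed).
ctx = ssl.create_default_context()
ctx.set_alpn_protocols(["h2", "http/1.1"])

# with ctx.wrap_socket(sock, server_hostname=host) as tls:
#     print(tls.selected_alpn_protocol())
```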
Why this matters for scraping
Three concrete consequences:
- Your scraper has a TLS fingerprint. Default Python requests is JA3 hash X; default Node.js https is hash Y. Both are well-known signatures that anti-bot systems block on sight. Sub-Path 3 covers how to spoof Chrome's fingerprint.
- SNI is plaintext. Servers know which hostname you wanted even if you tried to hide it behind raw IPs.
- Cert pinning happens. Mobile apps often pin specific certificates, refusing to trust the system store. If you intercept a mobile app's traffic with mitmproxy, certificate pinning is what stops you; covered in Sub-Path 3.
Hands-on lab
Use openssl s_client to look at Catalog108's certificate without making any HTTP request at all:
echo | openssl s_client -connect practice.scrapingcentral.com:443 -servername practice.scrapingcentral.com 2>/dev/null | openssl x509 -noout -subject -issuer -dates
You'll see the certificate subject, the CA that issued it, and validity dates, all before HTTP comes into the picture. This is what your browser and your scraper both check on every request.
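The same check from Python, split so the parsing half is testable on its own: connect with the stdlib `ssl` module (no HTTP involved) and summarise the certificate fields the openssl one-liner prints. The lab hostname is the one from the command above.

```python
import socket
import ssl

def summarise(cert: dict) -> dict:
    """Flatten ssl.getpeercert()'s nested ((key, value),) tuples into the
    subject / issuer / validity fields the openssl one-liner shows."""
    return {
        "subject": dict(pair for rdn in cert["subject"] for pair in rdn),
        "issuer": dict(pair for rdn in cert["issuer"] for pair in rdn),
        "not_before": cert["notBefore"],
        "not_after": cert["notAfter"],
    }

def cert_summary(host: str, port: int = 443) -> dict:
    """TLS-connect (no HTTP request at all) and summarise the server cert."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return summarise(tls.getpeercert())

# cert_summary("practice.scrapingcentral.com")  # run against the lab target
```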
Hands-on lab
Practice this lesson on Catalog108, our first-party scraping sandbox.
Open lab target →
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.