HTTPS, TLS, and Why It Matters for Scraping
What TLS actually does, the certificate verification you'll be tempted to disable but shouldn't, and the fingerprint that gets your scraper blocked before you ever send a request.
What you’ll learn
- Explain in one paragraph what TLS provides on top of TCP.
- Describe the TLS handshake and what data leaks during it.
- Decide when (rarely) it's acceptable to disable certificate verification.
- Recognise that the TLS fingerprint of your scraper is visible to servers before any HTTP.
HTTPS is just HTTP wrapped in TLS. That wrapper does three things:
- Confidentiality: encrypts the request/response so parties in the middle can't read them.
- Integrity: guarantees the bytes you receive are the bytes the server sent.
- Authentication: proves the server is who it claims to be (via a certificate signed by a trusted CA).
For a scraper, all three matter, but the third, authentication, is where most beginners trip up.
The handshake (slightly simplified)
Client                                          Server
  │                                                │
  │  ClientHello (TLS versions, ciphers, SNI)      │
  │ ─────────────────────────────────────────────► │
  │                                                │
  │  ServerHello (chosen cipher, certificate)      │
  │ ◄───────────────────────────────────────────── │
  │                                                │
  │  Key exchange                                  │
  │ ─────────────────────────────────────────────► │
  │  Key exchange                                  │
  │ ◄───────────────────────────────────────────── │
  │                                                │
  │  Finished (encrypted from here on)             │
  │ ◄════════════════════════════════════════════► │
  │                                                │
  │  HTTP request / response (encrypted)           │
The important detail: the ClientHello is sent before any encryption. Anyone watching the wire (and the server) sees:
- Which TLS versions your client supports
- Which cipher suites, in what order
- Which extensions you advertise
- The hostname you want (in SNI, Server Name Indication, also plaintext)
That fingerprint is unique enough to tell python-requests apart from Chrome before a single HTTP byte is exchanged. We'll come back to this in Sub-Path 3 when we cover JA3/JA4 fingerprinting; for now, just know it exists.
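You can peek at part of your own client's ClientHello material with the stdlib `ssl` module: the TLS version range and the ordered cipher list it would offer are among the fields a JA3-style fingerprint hashes. A minimal sketch (exact output varies by Python and OpenSSL build):

```python
import ssl

# Build the default client context, then list what it would advertise in a
# ClientHello. The cipher list and its ORDER feed JA3-style fingerprints.
ctx = ssl.create_default_context()
ciphers = [c["name"] for c in ctx.get_ciphers()]

print("TLS versions:", ctx.minimum_version.name, "..", ctx.maximum_version.name)
print(len(ciphers), "cipher suites offered, starting with", ciphers[:3])
```

Run the same idea in Node or curl and you get a different list in a different order, which is exactly why the two are distinguishable on the wire.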
SNI: why one IP can host many HTTPS sites
In the old days, one IP = one HTTPS site, because the certificate had to be presented before the server knew which hostname was wanted. SNI fixed that: the hostname is sent in the (plaintext) ClientHello, so the server can pick which certificate to send. This is also why scraping by raw IP rarely works for HTTPS: without SNI, the server doesn't know which virtual host you wanted.
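A sketch of the mechanics with the stdlib `ssl` module: `server_hostname` is what fills the plaintext SNI field of the ClientHello (and doubles as the name used for certificate verification). Passing `None` reproduces the raw-IP case, where the server can only fall back to a default virtual host.

```python
import socket
import ssl
from typing import Optional

def tls_connect(host: str, sni: Optional[str] = None, port: int = 443) -> ssl.SSLSocket:
    """Open a TLS connection; `sni` is the hostname placed (in plaintext)
    in the ClientHello. sni=None mimics scraping by raw IP: no SNI field,
    so the server serves its default certificate, if it has one at all."""
    ctx = ssl.create_default_context()
    if sni is None:
        # Without SNI there is no hostname to verify the certificate against.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    raw = socket.create_connection((host, port), timeout=10)
    return ctx.wrap_socket(raw, server_hostname=sni)

# tls_connect("example.com", sni="example.com")  # normal HTTPS client behaviour
```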
Certificate verification (don't turn it off)
When the server sends its certificate, your client checks:
- The certificate's hostname matches the URL hostname.
- The certificate is signed by a trusted Certificate Authority.
- The certificate hasn't expired.
- The signing chain reaches a root CA in your system's trust store.
If any check fails, your library throws. The temptation is to disable verification:
# DON'T do this in production
import requests
requests.get("https://example.com", verify=False)
You will see this in StackOverflow answers constantly. Resist. Here's why:
- A failed cert check usually means something is actually wrong (expired cert, MITM proxy, wrong hostname). Bypassing it loses the protection TLS provides.
- A correctly-configured scraper on the public internet should almost never see cert errors. If you're seeing them often, fix the root cause (update certifi, point at a real cert bundle, fix DNS).
The legitimate exceptions:
- Scraping behind a corporate MITM proxy (point your client at the corporate CA bundle, don't disable).
- Internal development against self-signed certs (only for localhost / private IPs).
What you actually control vs. what's automatic
| Layer | Set by | Notes |
|---|---|---|
| TCP connection | OS | You don't touch this directly |
| TLS version + ciphers | TLS library bundled with your runtime | Differs across Python requests, curl, Node, and Chrome |
| Certificate trust store | OS / certifi package | Keep certifi up to date |
| SNI hostname | Your HTTP library | Set automatically from the URL |
| HTTP version (1.1, 2, 3) | ALPN negotiation inside the TLS handshake | Support varies by library; requests speaks only HTTP/1.1 |
The TLS layer is where scrapers leak the most information about themselves, and where they have the least control without specialised libraries (curl-cffi, tls-client).
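The HTTP-version row deserves a quick illustration: the version is negotiated inside the TLS handshake via the ALPN extension, where the client lists the protocols it speaks and the server picks one. A minimal sketch of the client side:

```python
import ssl

# Offer HTTP/2 first, falling back to HTTP/1.1. The server's choice comes
# back during the handshake; after connecting, selected_alpn_protocol() on
# the wrapped socket returns "h2" or "http/1.1" (or None if ALPN failed).
ctx = ssl.create_default_context()
ctx.set_alpn_protocols(["h2", "http/1.1"])

# with ctx.wrap_socket(sock, server_hostname=host) as tls:
#     print(tls.selected_alpn_protocol())
```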
Why this matters for scraping
Three concrete consequences:
- Your scraper has a TLS fingerprint. Default Python requests is JA3 hash X; default Node.js https is hash Y. Both are well-known signatures that anti-bot systems block on sight. Sub-Path 3 covers how to spoof Chrome's fingerprint.
- SNI is plaintext. Servers know which hostname you wanted even if you tried to hide it behind raw IPs.
- Cert pinning happens. Mobile apps often pin specific certificates, refusing to trust the system store. If you intercept a mobile app's traffic with mitmproxy, certificate pinning is what stops you; covered in Sub-Path 3.
Hands-on lab
Use openssl s_client to look at Catalog108's certificate without making any HTTP request at all:
echo | openssl s_client -connect practice.scrapingcentral.com:443 -servername practice.scrapingcentral.com 2>/dev/null | openssl x509 -noout -subject -issuer -dates
You'll see the certificate subject, the CA that issued it, and validity dates, all before HTTP comes into the picture. This is what your browser and your scraper both check on every request.
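The same check from Python, split so the parsing half is testable on its own: connect with the stdlib `ssl` module (no HTTP involved) and summarise the certificate fields the openssl one-liner prints. The lab hostname is the one from the command above.

```python
import socket
import ssl

def summarise(cert: dict) -> dict:
    """Flatten ssl.getpeercert()'s nested ((key, value),) tuples into the
    subject / issuer / validity fields the openssl one-liner shows."""
    return {
        "subject": dict(pair for rdn in cert["subject"] for pair in rdn),
        "issuer": dict(pair for rdn in cert["issuer"] for pair in rdn),
        "not_before": cert["notBefore"],
        "not_after": cert["notAfter"],
    }

def cert_summary(host: str, port: int = 443) -> dict:
    """TLS-connect (no HTTP request at all) and summarise the server cert."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return summarise(tls.getpeercert())

# cert_summary("practice.scrapingcentral.com")  # run against the lab target
```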
Hands-on lab
Practice this lesson on Catalog108, our first-party scraping sandbox.
Open lab target →
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.