
F3 · beginner · 4 min read

HTTPS, TLS, and Why It Matters for Scraping

What TLS actually does, the certificate verification you'll be tempted to disable but shouldn't, and the fingerprint that gets your scraper blocked before you ever send a request.

What you’ll learn

  • Explain in one paragraph what TLS provides on top of TCP.
  • Describe the TLS handshake and what data leaks during it.
  • Decide when (rarely) it's acceptable to disable certificate verification.
  • Recognise that the TLS fingerprint of your scraper is visible to servers before any HTTP is exchanged.

HTTPS is just HTTP wrapped in TLS. That wrapper does three things:

  1. Confidentiality: encrypts the request/response so people in the middle can't read them.
  2. Integrity: guarantees the bytes you receive are the bytes the server sent.
  3. Authentication: proves the server is who it claims to be (via a certificate signed by a trusted CA).

For a scraper, all three matter, but the third one is where most beginners trip.

The handshake (slightly simplified)

Client                                        Server
  │                                              │
  │  ClientHello (TLS versions, ciphers, SNI)    │
  │ ───────────────────────────────────────────► │
  │                                              │
  │  ServerHello (chosen cipher, certificate)    │
  │ ◄─────────────────────────────────────────── │
  │                                              │
  │  Key exchange                                │
  │ ───────────────────────────────────────────► │
  │  Key exchange                                │
  │ ◄─────────────────────────────────────────── │
  │                                              │
  │  Finished (encrypted from here on)           │
  │ ◄══════════════════════════════════════════► │
  │                                              │
  │  HTTP request / response (encrypted)         │

The important detail: the ClientHello is sent before any encryption. Anyone watching the wire (and the server) sees:

  • Which TLS versions your client supports
  • Which cipher suites, in what order
  • Which extensions you advertise
  • The hostname you want (in SNI, Server Name Indication, also plaintext)

Taken together, those fields form a fingerprint distinctive enough to tell python-requests apart from Chrome before a single HTTP byte is exchanged. We'll come back to this in Sub-Path 3 when we cover JA3/JA4 fingerprinting; for now, just know it exists.
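As a preview of how such a fingerprint is derived, here is a minimal sketch of a JA3-style hash. JA3 joins five ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, point formats) as decimal numbers and takes the MD5 of the result. The function names and the example values below are made up for illustration; real tools also drop GREASE values first, which this sketch does not do.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the comma-separated JA3 string; list fields are dash-joined."""
    fields = [
        str(version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(g) for g in curves),
        "-".join(str(p) for p in point_formats),
    ]
    return ",".join(fields)


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string (a fingerprint, not a security use)."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii"), usedforsecurity=False).hexdigest()


# Hypothetical ClientHello values, NOT taken from any real client.
example = dict(
    version=771,                  # 0x0303, the TLS 1.2 wire value
    ciphers=[4865, 4866, 49195],
    extensions=[0, 10, 11, 13],
    curves=[29, 23],
    point_formats=[0],
)

print(ja3_string(**example))
print(ja3_hash(**example))
```

Because the lists are hashed in the order the client sent them, two clients that support the same ciphers in a different order produce different hashes.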

SNI: why one IP can host many HTTPS sites

In the old days, one IP meant one HTTPS site, because the server had to present its certificate before it knew which hostname the client wanted. SNI fixed that: the hostname is sent in the (plaintext) ClientHello, so the server can pick which certificate to send. This is why scraping by raw IP rarely works for HTTPS: without SNI, the server doesn't know which virtual host you wanted.
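You can confirm that the hostname travels in the clear without touching the network. The sketch below uses Python's standard ssl module with in-memory buffers to produce a ClientHello and then searches the raw bytes for the hostname. The helper name client_hello_bytes is made up for this example, and it assumes a runtime that does not encrypt the ClientHello (the default for CPython's ssl module today).

```python
import ssl


def client_hello_bytes(hostname: str) -> bytes:
    """Start a TLS handshake in memory and return the first bytes the client would send."""
    ctx = ssl.create_default_context()
    incoming = ssl.MemoryBIO()   # bytes "from the server" (stays empty here)
    outgoing = ssl.MemoryBIO()   # bytes the client wants to put on the wire
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # expected: the client has sent its hello and now waits for a reply
    return outgoing.read()


host = "practice.scrapingcentral.com"
hello = client_hello_bytes(host)

print("record type is handshake:", hello[0] == 0x16)
print("message is ClientHello:", hello[5] == 0x01)
print("hostname visible in plaintext:", host.encode("ascii") in hello)
```

No socket is opened, so nothing is sent anywhere; the bytes are exactly what an observer on the wire would see first.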

Certificate verification (don't turn it off)

When the server sends its certificate, your client checks:

  1. The certificate's hostname matches the URL hostname.
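  2. The certificate is signed by a Certificate Authority, directly or through intermediate certificates.
  3. The certificate is within its validity period (it hasn't expired).
  4. The signing chain reaches a root CA in your trust store (the OS store, or the certifi bundle when you use requests).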

If any check fails, your library raises an error. The temptation is to disable verification:

# DON'T do this in production: verify=False skips every certificate check
import requests

requests.get("https://example.com", verify=False)

You will see this in Stack Overflow answers constantly. Resist. Here's why:

  • A failed cert check usually means something is actually wrong (expired cert, MITM proxy, wrong hostname). Bypassing it throws away the protection TLS provides.
  • A correctly configured scraper on the public internet should almost never see cert errors. If you're seeing them often, fix the root cause (update certifi, point at a real cert bundle, fix DNS).
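To find the root cause instead of bypassing the check, it helps to see exactly which check failed. The following is a minimal diagnostic sketch using only the standard library; the function name diagnose_certificate is made up for this example, and network errors other than certificate failures are deliberately left to propagate.

```python
import socket
import ssl


def diagnose_certificate(host: str, port: int = 443, timeout: float = 5.0):
    """Return (True, tls_version) on success or (False, reason) on a cert failure."""
    # The default context requires a valid chain and a matching hostname.
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except ssl.SSLCertVerificationError as exc:
        # verify_message is OpenSSL's description of the failed check.
        return False, exc.verify_message


# Example call (needs network access, so it is left commented out):
# print(diagnose_certificate("practice.scrapingcentral.com"))
```

The reason string typically distinguishes an expired certificate from a hostname mismatch or an untrusted issuer, which tells you whether to update your CA bundle, fix the URL, or investigate a proxy.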

The legitimate exceptions:

  • Scraping behind a corporate MITM proxy (point your client at the corporate CA bundle, don't disable).
  • Internal development against self-signed certs (only for localhost / private IPs).
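For the corporate-proxy case, the fix is to add the proxy's CA to what you trust while leaving verification on. With requests, that means passing a file path instead of False, as in verify="/path/to/bundle.pem", or setting the REQUESTS_CA_BUNDLE environment variable. The sketch below shows the same idea with the standard library; the function name and the example path are placeholders, not real locations.

```python
import ssl


def context_with_extra_ca(ca_bundle_path=None):
    """Build a verifying TLS context, optionally trusting an extra CA bundle."""
    ctx = ssl.create_default_context()  # verification stays ON
    if ca_bundle_path is not None:
        # Adds the given CA certificates on top of the default trust store.
        ctx.load_verify_locations(cafile=ca_bundle_path)
    return ctx


# Hypothetical usage; the path is a placeholder for your organisation's bundle:
# ctx = context_with_extra_ca("/etc/ssl/certs/corporate-ca.pem")
# urllib.request.urlopen("https://example.com", context=ctx)

ctx = context_with_extra_ca()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

Hostname and expiry checks still apply to every connection; only the set of trusted issuers grows.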

What you actually control vs. what's automatic

Layer                    | Set by                                     | Notes
TCP connection           | OS                                         | You don't touch this directly
TLS version + ciphers    | TLS library bundled with your runtime      | Different across Python requests, curl, Node, Chrome
Certificate trust store  | OS / certifi package                       | Keep certifi up to date
SNI hostname             | Your HTTP library                          | Set automatically from the URL
HTTP version (1.1, 2, 3) | Negotiated during the TLS handshake (ALPN) | Depends on the library: requests speaks only HTTP/1.1; others can negotiate HTTP/2

The TLS layer is where scrapers leak the most information about themselves, and where you have the least control without specialised libraries (curl-cffi, tls-client).
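To see what your own Python runtime brings to that layer, you can ask the standard ssl module. This is a small sketch; the function name describe_tls_stack is made up here, and the output differs between machines because it depends on the OpenSSL build your interpreter links against.

```python
import ssl


def describe_tls_stack():
    """Summarise the TLS library and default client settings of this runtime."""
    ctx = ssl.create_default_context()
    return {
        "library": ssl.OPENSSL_VERSION,
        "minimum_version": ctx.minimum_version.name,
        # Order matters: this is the preference order offered in the ClientHello.
        "cipher_names": [cipher["name"] for cipher in ctx.get_ciphers()],
    }


info = describe_tls_stack()
print(info["library"])
print(info["minimum_version"])
print(len(info["cipher_names"]), "cipher suites offered by default")
```

Comparing this output across two machines, or across Python versions, is a quick way to see why the same scraper code can present different fingerprints.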

Why this matters for scraping

Three concrete consequences:

  1. Your scraper has a TLS fingerprint. Default Python requests produces one JA3 hash and default Node.js https produces another. Both are well-known signatures that anti-bot systems can block on sight. Sub-Path 3 covers how to spoof Chrome's fingerprint.

  2. SNI is plaintext. The server, and anyone watching the connection, can see which hostname you asked for, even if you tried to hide it by connecting to an IP address.

  3. Cert pinning happens. Mobile apps often pin specific certificates and refuse to trust the system store. If you try to intercept a mobile app's traffic with mitmproxy, certificate pinning is what stops you. This is covered in Sub-Path 3.
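To make the pinning idea concrete, here is a minimal sketch of the comparison a pinning client performs: hash the certificate the server presented and compare it with a value shipped inside the app. The function names are made up for illustration, and the bytes used below are a stand-in, not a real certificate. Real apps often pin the public key rather than the whole certificate; this sketch pins the whole certificate for simplicity.

```python
import hashlib
import hmac


def cert_fingerprint(der_cert: bytes) -> str:
    """SHA-256 hex digest of a DER-encoded certificate."""
    return hashlib.sha256(der_cert).hexdigest()


def matches_pin(der_cert: bytes, pinned_hex: str) -> bool:
    """True only if the presented certificate hashes to the pinned value."""
    return hmac.compare_digest(cert_fingerprint(der_cert), pinned_hex.lower())


# Stand-in bytes for illustration; a real client would use
# tls_socket.getpeercert(binary_form=True) to get the DER certificate.
genuine = b"placeholder bytes standing in for the real certificate"
intercepted = b"placeholder bytes standing in for a proxy's certificate"

pin = cert_fingerprint(genuine)
print(matches_pin(genuine, pin))
print(matches_pin(intercepted, pin))
```

An intercepting proxy presents its own certificate, so the hash no longer matches the pin and the app refuses the connection even though the proxy's CA may be in the system store.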

Hands-on lab

Use openssl s_client to look at Catalog108's certificate without making any HTTP request at all:

echo | openssl s_client -connect practice.scrapingcentral.com:443 -servername practice.scrapingcentral.com 2>/dev/null | openssl x509 -noout -subject -issuer -dates

You'll see the certificate subject, the CA that issued it, and the validity dates, all before HTTP comes into the picture. This is what your browser and your scraper both check on every new connection.
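The same inspection can be sketched in Python with the standard library. The function names fetch_cert and summarize_cert are made up for this example; fetch_cert needs network access, while summarize_cert only reshapes the dictionary that getpeercert() returns.

```python
import socket
import ssl


def fetch_cert(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Complete a verified TLS handshake and return the server certificate as a dict."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def summarize_cert(cert: dict) -> dict:
    """Flatten the nested subject/issuer tuples and keep the validity dates."""
    def flatten(rdns):
        return {key: value for rdn in rdns for key, value in rdn}

    return {
        "subject": flatten(cert.get("subject", ())),
        "issuer": flatten(cert.get("issuer", ())),
        "not_before": cert.get("notBefore"),
        "not_after": cert.get("notAfter"),
    }


# Example call (needs network access, so it is left commented out):
# print(summarize_cert(fetch_cert("practice.scrapingcentral.com")))
```

As with the openssl command, no HTTP request is made; the connection is closed as soon as the handshake finishes.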

Practice this lesson on Catalog108, our first-party scraping sandbox.


Quiz: check your understanding


Which of the following is sent in PLAINTEXT during an HTTPS connection, before any encryption is established?
