Handling Encoding and Broken HTML
Real-world HTML is messy: mixed encodings, malformed tags, garbage characters. This lesson covers how to detect, decode, and parse it without losing data.
What you’ll learn
- Understand the difference between `response.content`, `response.text`, and `response.encoding`.
- Detect a page's actual encoding from headers, BOM, and meta tags.
- Use `chardet` / `charset-normalizer` for tricky cases.
- Choose a forgiving parser for broken HTML.
You'll meet pages that look fine in a browser but parse to garbage in your scraper. Almost always: an encoding mismatch. This lesson is the systematic fix.
The three layers of encoding
When a page leaves a server and arrives at your scraper, encoding can be declared in three places:
- HTTP response header: `Content-Type: text/html; charset=UTF-8`
- HTML `<meta>` tag: `<meta charset="UTF-8">` or `<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">`
- Document BOM (byte-order mark): the first 2-4 bytes of the file indicating UTF-8/16/32.
If they disagree, decoders can pick the wrong one and you get mojibake (Año → AÃ±o).
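To see the failure mode concretely, here's a minimal round trip that manufactures the mojibake by hand:

```python
# UTF-8 bytes read with the wrong codec produce the classic mojibake.
raw = "Año".encode("utf-8")    # b'A\xc3\xb1o'
print(raw.decode("latin-1"))   # AÃ±o  (wrong codec)
print(raw.decode("utf-8"))     # Año   (correct)
```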
How requests decides
```python
import requests

r = requests.get(url)
print(r.encoding)           # what requests is using to decode .text
print(r.apparent_encoding)  # what chardet/charset-normalizer thinks
```
requests looks at the charset parameter of the HTTP `Content-Type` header. If present, that's what it uses. If absent, it falls back to ISO-8859-1 (a legacy default, often wrong for real-world UTF-8 pages served without an explicit header).
r.apparent_encoding runs an encoding detector on the body bytes and returns its best guess. Use it when r.encoding looks suspicious:
```python
if r.encoding == "ISO-8859-1" and r.apparent_encoding != "ISO-8859-1":
    r.encoding = r.apparent_encoding
text = r.text
```
That single check fixes most "weird characters" bugs.
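The header and meta layers are handled by requests and the parser; the BOM layer you can sniff yourself from the first bytes of the body. A minimal sketch (the signature table is standard Unicode, not tied to any library):

```python
# Order matters: UTF-32 BOMs begin with the same bytes as UTF-16 BOMs,
# so test the longer signatures first.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

def bom_encoding(body: bytes):
    """Return the encoding implied by a BOM, or None if there isn't one."""
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    return None
```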
Use bytes, then decode explicitly
The cleanest pattern for tricky pages: skip r.text entirely, work with r.content bytes, and let the parser detect the encoding from the HTML's own declarations:
```python
import lxml.html

tree = lxml.html.fromstring(r.content)
```
lxml respects <meta charset> in the document. So does BeautifulSoup with the lxml parser. This is the reason production scrapers prefer r.content + parser over r.text.
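For comparison, the same pattern with BeautifulSoup; passing bytes lets its internal detector (Unicode, Dammit) resolve the encoding:

```python
from bs4 import BeautifulSoup

# Pass raw bytes; BeautifulSoup picks the encoding from the document itself.
soup = BeautifulSoup(r.content, "lxml")
print(soup.original_encoding)  # the encoding it settled on
```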
When even the page lies
Sometimes the <meta charset> is wrong. Real example: a page declares UTF-8 but is actually Windows-1252. Tell the parser explicitly:
```python
import lxml.html

tree = lxml.html.fromstring(r.content.decode("windows-1252"))
```
Or use a detector:
```python
import charset_normalizer

detected = charset_normalizer.from_bytes(r.content).best()
text = str(detected)
print("Detected encoding:", detected.encoding)
```
charset-normalizer ships with modern requests and is what powers r.apparent_encoding. The older chardet library does the same job, slightly differently.
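If you're on chardet instead (it must be installed separately), the equivalent call looks like this:

```python
import chardet

# detect() returns a dict with the guess and a confidence score.
guess = chardet.detect(r.content)
print(guess)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
if guess["encoding"]:
    text = r.content.decode(guess["encoding"])
```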
The three labs at Catalog108
- `/challenges/static/encoding/utf8`: clean UTF-8, declared correctly. Should work out of the box.
- `/challenges/static/encoding/latin1`: ISO-8859-1 page with the correct `Content-Type`. `r.text` works, but the parser must be told.
- `/challenges/static/encoding/broken`: the page declares one charset but is encoded in another. Tests your detection fallback.
Run your scraper against all three; that's the lab.
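A quick way to eyeball all three at once; the host below is a placeholder, substitute wherever your Catalog108 instance lives:

```python
import requests

BASE = "https://catalog108.example"  # placeholder host, not the real one
for path in ("utf8", "latin1", "broken"):
    r = requests.get(f"{BASE}/challenges/static/encoding/{path}", timeout=10)
    print(path, "| header:", r.encoding, "| detector:", r.apparent_encoding)
```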
Broken HTML: the other half of the problem
HTML in the wild violates the spec constantly:
```html
<table>
  <tr><td>cell 1
  <tr><td>cell 2
</table>
```
Missing closing tags, no quotes on attributes, mixed case, content outside table cells, scripts injected mid-DOM. Modern HTML parsers are forgiving:
| Parser | Tolerance | Speed |
|---|---|---|
| `html.parser` (stdlib) | Tolerant of most quirks | Slow |
| `lxml` HTML mode | Very tolerant | Fast |
| `lxml` XML/etree | Strict, refuses broken markup | Fast |
| `html5lib` | The most spec-correct | Slowest |
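As a quick sanity check of the table's "very tolerant" row, lxml's HTML mode closes the dropped tags in the snippet above for you:

```python
import lxml.html

broken = "<table><tr><td>cell 1<tr><td>cell 2</table>"
tree = lxml.html.fromstring(broken)
print([td.text for td in tree.findall(".//td")])  # ['cell 1', 'cell 2']
```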
For HTML scraping, use lxml first. If a particularly broken page mis-parses, try html5lib as a fallback; it's slower but mimics what a browser does:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html5lib")
```
For sitemaps and RSS feeds (which ARE XML), use the strict XML parser; broken XML there usually means the feed is genuinely broken, and you want to know.
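A sketch of the strict route; `feed_bytes` stands in for the raw sitemap or RSS body you fetched:

```python
from lxml import etree

try:
    root = etree.fromstring(feed_bytes)  # feed_bytes: raw XML bytes (assumed fetched earlier)
except etree.XMLSyntaxError as err:
    # With XML, a parse failure is a signal worth surfacing, not hiding.
    print("Feed is genuinely broken:", err)
```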
Garbage characters: trace and fix
If your output has Â, é, ’, etc., you're almost certainly looking at UTF-8 bytes interpreted as Latin-1 / Windows-1252. The fix is to decode correctly at the source; re-decoding the already-broken string usually fails.
Workflow:
```python
import re

# 1. Locate the suspect character's raw bytes and print hex context.
#    Encoding "Â" with the codec requests used recovers the byte it came
#    from (this assumes the bad decode used a single-byte codec, which is
#    the scenario described above).
i = r.content.find("Â".encode(r.encoding or "latin-1"))
print(r.content[max(i - 5, 0):i + 5].hex())

# 2. Cross-check: what's the actual encoding?
print("requests says:", r.encoding)
print("apparent:", r.apparent_encoding)
m = re.search(r'charset=([\w-]+)', r.text)
print("meta charset:", m[1] if m else "none declared")
```
If the three values disagree, the page itself is lying. Force the right one.
Pragmatic boilerplate
```python
import requests
import lxml.html

def fetch_and_parse(url):
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    # If requests fell back to ISO-8859-1 but the detector says otherwise, override
    if r.encoding == "ISO-8859-1" and r.apparent_encoding:
        r.encoding = r.apparent_encoding
    # Prefer bytes + parser-level detection
    try:
        return lxml.html.fromstring(r.content)
    except (UnicodeDecodeError, ValueError):
        return lxml.html.fromstring(r.text)
```
This handles 95% of real-world encoding issues without intervention.
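Usage against one of the lab pages might look like this (the host is again a placeholder):

```python
tree = fetch_and_parse("https://catalog108.example/challenges/static/encoding/broken")
print(tree.findtext(".//title"))
```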
Don't strip or escape: write UTF-8
When saving scraped data, encode it as UTF-8 explicitly so downstream consumers don't have to guess:
```python
import json

with open("output.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
```
`ensure_ascii=False` keeps non-ASCII characters as themselves (Año, not `A\u00f1o`). Otherwise JSON escapes them, which is technically correct but harder to read.
Hands-on lab
Hit all three encoding challenges. For /challenges/static/encoding/broken, your first naive attempt should produce mojibake. Use r.apparent_encoding to detect the real encoding, override r.encoding, and confirm clean output. Then try the bytes + lxml approach without overriding r.encoding and compare results.
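One possible shape for that comparison; `.text` re-decodes from the raw bytes each time it's read, so changing `r.encoding` between reads is enough (host is a placeholder, as before):

```python
import requests

BASE = "https://catalog108.example"  # placeholder host
r = requests.get(f"{BASE}/challenges/static/encoding/broken", timeout=10)
naive = r.text                    # decoded with the (lying) declared charset
r.encoding = r.apparent_encoding  # trust the detector instead
fixed = r.text                    # re-decoded with the detected charset
print("naive:", naive[:80])
print("fixed:", fixed[:80])
```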
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.