
Lesson 1.16 · Intermediate · 4 min read

Handling Encoding and Broken HTML

Real-world HTML is messy: mixed encodings, malformed tags, garbage characters. How to detect, decode, and parse it without losing data.

What you’ll learn

  • Understand the difference between `response.content`, `response.text`, and `response.encoding`.
  • Detect a page's actual encoding from headers, BOM, and meta tags.
  • Use `chardet` / `charset-normalizer` for tricky cases.
  • Choose a forgiving parser for broken HTML.

You'll meet pages that look fine in a browser but parse to garbage in your scraper. Almost always: an encoding mismatch. This lesson is the systematic fix.

The three layers of encoding

When a page leaves a server and arrives at your scraper, encoding can be declared in three places:

  1. HTTP response header: Content-Type: text/html; charset=UTF-8
  2. HTML <meta> tag: <meta charset="UTF-8"> or <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  3. Document BOM (byte-order mark): the first 2-4 bytes of the file, indicating UTF-8/16/32.
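As a sketch of layer 3, a BOM can be sniffed from the first few raw bytes using the stdlib codecs constants (the helper name is mine, not from any library):

```python
import codecs

def sniff_bom(raw):
    # Check the longest BOMs first so UTF-32 isn't mistaken for UTF-16
    # (the UTF-32-LE BOM starts with the UTF-16-LE BOM bytes)
    for bom, name in [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]:
        if raw.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xef\xbb\xbfhello"))  # utf-8-sig
print(sniff_bom(b"plain bytes"))        # None
```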

If they disagree, decoders can pick the wrong one and you get mojibake (e.g. "Año" rendered as "AÃ±o").
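You can reproduce the mismatch in one line: take correct UTF-8 bytes and decode them with the wrong codec.

```python
# "Año" as UTF-8 bytes, wrongly decoded as Latin-1: the two-byte
# UTF-8 sequence for "ñ" explodes into two Latin-1 characters
mangled = "Año".encode("utf-8").decode("latin-1")
print(mangled)  # AÃ±o
```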

How requests decides

import requests
r = requests.get(url)  # url: the page you're inspecting
print(r.encoding)           # what requests is using to decode .text
print(r.apparent_encoding)  # what chardet/charset-normalizer thinks

requests looks at the charset=... parameter of the HTTP Content-Type header. If present, that's what it uses. If absent, it falls back to ISO-8859-1 (a legacy default, often wrong for real-world UTF-8 pages without an explicit header).

r.apparent_encoding runs an encoding detector on the body bytes and returns its best guess. Use it when r.encoding looks suspicious:

if r.encoding == "ISO-8859-1" and r.apparent_encoding != "ISO-8859-1":
  r.encoding = r.apparent_encoding
text = r.text

That single check fixes most "weird characters" bugs.
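The check can be wrapped as a small reusable helper (a sketch; the function name is mine, not part of requests):

```python
def pick_encoding(declared, apparent):
    """Prefer the detector's guess when requests fell back to the legacy
    ISO-8859-1 default, i.e. no charset in the Content-Type header."""
    if declared in (None, "ISO-8859-1") and apparent:
        return apparent
    return declared
```

Usage: `r.encoding = pick_encoding(r.encoding, r.apparent_encoding)` before touching `r.text`.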

Use bytes, then decode explicitly

The cleanest pattern for tricky pages: skip r.text entirely, work with r.content bytes, and let the parser detect the encoding from the HTML's own declarations:

import lxml.html
tree = lxml.html.fromstring(r.content)

lxml respects <meta charset> in the document. So does BeautifulSoup with the lxml parser. This is the reason production scrapers prefer r.content + parser over r.text.

When even the page lies

Sometimes the <meta charset> is wrong. Real example: a page declares UTF-8 but is actually Windows-1252. Tell the parser explicitly:

import lxml.html
tree = lxml.html.fromstring(r.content.decode("windows-1252"))

Or use a detector:

import charset_normalizer
detected = charset_normalizer.from_bytes(r.content).best()
text = str(detected)
print("Detected encoding:", detected.encoding)

charset-normalizer ships with modern requests and is what powers r.apparent_encoding. The older chardet library does the same job, slightly differently.

The three labs at Catalog108

  • /challenges/static/encoding/utf8: clean UTF-8, declared correctly. Should work out of the box.
  • /challenges/static/encoding/latin1: an ISO-8859-1 page with the correct Content-Type. r.text works, but the parser must be told.
  • /challenges/static/encoding/broken: the page declares one charset but is encoded in another. Tests your detection fallback.

Run your scraper against all three; that's the lab.

Broken HTML, the other half of the problem

HTML in the wild violates the spec constantly:

<table>
  <tr><td>cell 1
  <tr><td>cell 2
</table>

Missing closing tags, no quotes on attributes, mixed case, content outside table cells, scripts injected mid-DOM. Modern HTML parsers are forgiving:

Parser               | Tolerance                     | Speed
html.parser (stdlib) | Tolerant of most quirks       | Slow
lxml HTML mode       | Very tolerant                 | Fast
lxml XML/etree       | Strict, refuses broken markup | Fast
html5lib             | The most spec-correct         | Slowest
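Even the stdlib's event-based html.parser recovers the data from the truncated table markup above; a minimal sketch (the collector class is mine, for illustration only):

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    # html.parser fires events for tags as written; it doesn't build a
    # tree, so missing </td> and </tr> tags simply never fire events
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells[-1] += data.strip()

p = CellCollector()
p.feed("<table><tr><td>cell 1<tr><td>cell 2</table>")
print(p.cells)  # ['cell 1', 'cell 2']
```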

For HTML scraping, use lxml first. If a particularly broken page mis-parses, try html5lib as a fallback; it's slower but mimics what a browser does:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html5lib")

For sitemaps and RSS feeds (which ARE XML), use the strict XML parser; broken XML there usually means the feed is genuinely broken, and you want to know.
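For contrast, here's the strict behavior with a stdlib ElementTree sketch (the feed snippet is made up):

```python
import xml.etree.ElementTree as ET

broken_feed = "<rss><channel><item></channel></rss>"  # <item> never closed
try:
    ET.fromstring(broken_feed)
except ET.ParseError as exc:
    # Strictness is a feature here: a malformed feed should fail loudly
    # instead of silently yielding half the items
    print("refused:", exc)
```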

Garbage characters: trace and fix

If your output has Â, é, ’, etc., you're almost certainly looking at UTF-8 bytes interpreted as Latin-1 / Windows-1252. The fix is to decode correctly at the source; re-decoding the already-broken string is unreliable.
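That said, when the damage was a pure Latin-1 misdecode the round trip can be reversed, because Latin-1 maps every byte value and loses nothing. With Windows-1252 the same trick often raises, since some characters have no byte mapping. A sketch:

```python
broken = "cafÃ©"  # UTF-8 bytes that were wrongly decoded as Latin-1
# Re-encode with the codec that did the damage, then decode correctly
repaired = broken.encode("latin-1").decode("utf-8")
print(repaired)  # café
```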

Workflow:

import re

# 1. Find the suspect byte sequence in the raw bytes (0xC2 is the first
#    byte you see as "Â" when UTF-8 is read through a Latin-1 lens).
#    Don't index r.content with an offset found in r.text: character
#    positions and byte positions diverge on multi-byte characters.
i = r.content.find(b"\xc2")
print(r.content[max(i - 5, 0):i + 5].hex())

# 2. Cross-check: what's the actual encoding?
print("requests says:", r.encoding)
print("apparent:", r.apparent_encoding)
m = re.search(r'charset=([\w-]+)', r.text)
print("meta charset:", m[1] if m else "none declared")

If the three values disagree, the page itself is lying. Force the right one.

Pragmatic boilerplate

import requests
import lxml.html

def fetch_and_parse(url):
  r = requests.get(url, timeout=10)
  r.raise_for_status()

  # If requests fell back to ISO-8859-1 but the detector says otherwise, override
  if r.encoding == "ISO-8859-1" and r.apparent_encoding:
    r.encoding = r.apparent_encoding

  # Prefer bytes + parser-level detection
  try:
    return lxml.html.fromstring(r.content)
  except (UnicodeDecodeError, ValueError):
    return lxml.html.fromstring(r.text)

This handles 95% of real-world encoding issues without intervention.

Don't strip, don't escape: encode explicitly

When saving scraped data, encode it as UTF-8 explicitly so downstream consumers don't have to guess:

import json

with open("output.json", "w", encoding="utf-8") as f:
  json.dump(data, f, ensure_ascii=False, indent=2)

ensure_ascii=False keeps non-ASCII characters as themselves (Año, not A\u00f1o). Otherwise JSON escapes them, which is technically correct but harder to read.
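The difference in one snippet:

```python
import json

# ensure_ascii=False keeps the characters readable in the file
print(json.dumps({"name": "Año"}, ensure_ascii=False))  # {"name": "Año"}

# the default escapes every non-ASCII character
print(json.dumps({"name": "Año"}))                      # {"name": "A\u00f1o"}
```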

Hands-on lab

Hit all three encoding challenges. For /challenges/static/encoding/broken, your first naive attempt should produce mojibake. Use r.apparent_encoding to detect the real encoding, override r.encoding, and confirm clean output. Then try the bytes + lxml approach without overriding r.encoding and compare results.


Quiz: check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.


If `Content-Type` doesn't declare a charset, what does the `requests` library default to?
