Handling Encoding and Broken HTML
Real-world HTML is messy: mixed encodings, malformed tags, garbage characters. This lesson covers how to detect, decode, and parse it without losing data.
What you’ll learn
- Understand the difference between `response.content`, `response.text`, and `response.encoding`.
- Detect a page's actual encoding from headers, BOM, and meta tags.
- Use `chardet` / `charset-normalizer` for tricky cases.
- Choose a forgiving parser for broken HTML.
You'll meet pages that look fine in a browser but parse to garbage in your scraper. Almost always: an encoding mismatch. This lesson is the systematic fix.
The three layers of encoding
When a page leaves a server and arrives at your scraper, encoding can be declared in three places:
- HTTP response header: `Content-Type: text/html; charset=UTF-8`
- HTML `<meta>` tag: `<meta charset="UTF-8">` or `<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">`
- Document BOM (byte-order mark): the first 2-4 bytes of the file indicating UTF-8/16/32.
If they disagree, decoders can pick the wrong one and you get mojibake (Año → AÃ±o).
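To see the failure mode concretely, here's a minimal round trip that manufactures the mojibake by hand:

```python
# UTF-8 bytes read with the wrong codec produce the classic mojibake.
raw = "Año".encode("utf-8")    # b'A\xc3\xb1o'
print(raw.decode("latin-1"))   # AÃ±o  (wrong codec)
print(raw.decode("utf-8"))     # Año   (correct)
```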
How requests decides
```python
import requests

r = requests.get(url)
print(r.encoding)           # what requests is using to decode .text
print(r.apparent_encoding)  # what chardet/charset-normalizer thinks
```
requests looks at the charset parameter of the HTTP `Content-Type` header. If present, that's what it uses. If absent, it falls back to ISO-8859-1 (a legacy default, often wrong for real-world UTF-8 pages served without an explicit header).
r.apparent_encoding runs an encoding detector on the body bytes and returns its best guess. Use it when r.encoding looks suspicious:
```python
if r.encoding == "ISO-8859-1" and r.apparent_encoding != "ISO-8859-1":
    r.encoding = r.apparent_encoding
text = r.text
```
That single check fixes most "weird characters" bugs.
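The header and meta layers are handled by requests and the parser; the BOM layer you can sniff yourself from the first bytes of the body. A minimal sketch (the signature table is standard Unicode, not tied to any library):

```python
# Order matters: UTF-32 BOMs begin with the same bytes as UTF-16 BOMs,
# so test the longer signatures first.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

def bom_encoding(body: bytes):
    """Return the encoding implied by a BOM, or None if there isn't one."""
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    return None
```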
Use bytes, then decode explicitly
The cleanest pattern for tricky pages: skip r.text entirely, work with r.content bytes, and let the parser detect the encoding from the HTML's own declarations:
```python
import lxml.html

tree = lxml.html.fromstring(r.content)
```
lxml respects <meta charset> in the document. So does BeautifulSoup with the lxml parser. This is the reason production scrapers prefer r.content + parser over r.text.
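For comparison, the same pattern with BeautifulSoup; passing bytes lets its internal detector (Unicode, Dammit) resolve the encoding:

```python
from bs4 import BeautifulSoup

# Pass raw bytes; BeautifulSoup picks the encoding from the document itself.
soup = BeautifulSoup(r.content, "lxml")
print(soup.original_encoding)  # the encoding it settled on
```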
When even the page lies
Sometimes the <meta charset> is wrong. Real example: a page declares UTF-8 but is actually Windows-1252. Tell the parser explicitly:
```python
import lxml.html

tree = lxml.html.fromstring(r.content.decode("windows-1252"))
```
Or use a detector:
```python
import charset_normalizer

detected = charset_normalizer.from_bytes(r.content).best()
text = str(detected)
print("Detected encoding:", detected.encoding)
```
charset-normalizer ships with modern requests and is what powers r.apparent_encoding. The older chardet library does the same job, slightly differently.
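If you're on chardet instead (it must be installed separately), the equivalent call looks like this:

```python
import chardet

# detect() returns a dict with the guess and a confidence score.
guess = chardet.detect(r.content)
print(guess)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
if guess["encoding"]:
    text = r.content.decode(guess["encoding"])
```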
The three labs at Catalog108
- `/challenges/static/encoding/utf8`: clean UTF-8, declared correctly. Should work out of the box.
- `/challenges/static/encoding/latin1`: ISO-8859-1 page with the correct `Content-Type`. `r.text` works, but the parser must be told.
- `/challenges/static/encoding/broken`: the page declares one charset but is encoded in another. Tests your detection fallback.
Run your scraper against all three; that's the lab.
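A quick way to eyeball all three at once; the host below is a placeholder, substitute wherever your Catalog108 instance lives:

```python
import requests

BASE = "https://catalog108.example"  # placeholder host, not the real one
for path in ("utf8", "latin1", "broken"):
    r = requests.get(f"{BASE}/challenges/static/encoding/{path}", timeout=10)
    print(path, "| header:", r.encoding, "| detector:", r.apparent_encoding)
```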
Broken HTML: the other half of the problem
HTML in the wild violates the spec constantly:
```html
<table>
  <tr><td>cell 1
  <tr><td>cell 2
</table>
```
Missing closing tags, no quotes on attributes, mixed case, content outside table cells, scripts injected mid-DOM. Modern HTML parsers are forgiving:
| Parser | Tolerance | Speed |
|---|---|---|
| `html.parser` (stdlib) | Tolerant of most quirks | Slow |
| `lxml` HTML mode | Very tolerant | Fast |
| `lxml` XML/etree | Strict, refuses broken markup | Fast |
| `html5lib` | The most spec-correct | Slowest |
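As a quick sanity check of the table's "very tolerant" row, lxml's HTML mode closes the dropped tags in the snippet above for you:

```python
import lxml.html

broken = "<table><tr><td>cell 1<tr><td>cell 2</table>"
tree = lxml.html.fromstring(broken)
print([td.text for td in tree.findall(".//td")])  # ['cell 1', 'cell 2']
```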
For HTML scraping, use lxml first. If a particularly broken page mis-parses, try html5lib as a fallback; it's slower but mimics what a browser does:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html5lib")
```
For sitemaps and RSS feeds (which ARE XML), use the strict XML parser; broken XML there usually means the feed is genuinely broken, and you want to know.
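A sketch of the strict route; `feed_bytes` stands in for the raw sitemap or RSS body you fetched:

```python
from lxml import etree

try:
    root = etree.fromstring(feed_bytes)  # feed_bytes: raw XML bytes (assumed fetched earlier)
except etree.XMLSyntaxError as err:
    # With XML, a parse failure is a signal worth surfacing, not hiding.
    print("Feed is genuinely broken:", err)
```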
Garbage characters: trace and fix
If your output has Â, é, ’, etc., you're almost certainly looking at UTF-8 bytes interpreted as Latin-1 / Windows-1252. The fix is to decode correctly at the source; re-decoding the already-broken string usually fails.
Workflow:
```python
import re

# 1. Locate the suspect character's raw bytes and print hex context.
#    Encoding "Â" with the codec requests used recovers the byte it came
#    from (this assumes the bad decode used a single-byte codec, which is
#    the scenario described above).
i = r.content.find("Â".encode(r.encoding or "latin-1"))
print(r.content[max(i - 5, 0):i + 5].hex())

# 2. Cross-check: what's the actual encoding?
print("requests says:", r.encoding)
print("apparent:", r.apparent_encoding)
m = re.search(r'charset=([\w-]+)', r.text)
print("meta charset:", m[1] if m else "none declared")
```
If the three values disagree, the page itself is lying. Force the right one.
Pragmatic boilerplate
```python
import requests
import lxml.html

def fetch_and_parse(url):
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    # If requests fell back to ISO-8859-1 but the detector says otherwise, override
    if r.encoding == "ISO-8859-1" and r.apparent_encoding:
        r.encoding = r.apparent_encoding
    # Prefer bytes + parser-level detection
    try:
        return lxml.html.fromstring(r.content)
    except (UnicodeDecodeError, ValueError):
        return lxml.html.fromstring(r.text)
```
This handles 95% of real-world encoding issues without intervention.
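Usage against one of the lab pages might look like this (the host is again a placeholder):

```python
tree = fetch_and_parse("https://catalog108.example/challenges/static/encoding/broken")
print(tree.findtext(".//title"))
```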
Don't strip or escape: write UTF-8
When saving scraped data, encode it as UTF-8 explicitly so downstream consumers don't have to guess:
```python
import json

with open("output.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
```
`ensure_ascii=False` keeps non-ASCII characters as themselves (Año, not `A\u00f1o`). Otherwise JSON escapes them, which is technically correct but harder to read.
Hands-on lab
Hit all three encoding challenges. For /challenges/static/encoding/broken, your first naive attempt should produce mojibake. Use r.apparent_encoding to detect the real encoding, override r.encoding, and confirm clean output. Then try the bytes + lxml approach without overriding r.encoding and compare results.
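One possible shape for that comparison; `.text` re-decodes from the raw bytes each time it's read, so changing `r.encoding` between reads is enough (host is a placeholder, as before):

```python
import requests

BASE = "https://catalog108.example"  # placeholder host
r = requests.get(f"{BASE}/challenges/static/encoding/broken", timeout=10)
naive = r.text                    # decoded with the (lying) declared charset
r.encoding = r.apparent_encoding  # trust the detector instead
fixed = r.text                    # re-decoded with the detected charset
print("naive:", naive[:80])
print("fixed:", fixed[:80])
```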
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.