Handling Different Encodings (UTF-8, ISO-8859)
Handle character encoding issues in web scraping. Detect, convert, and fix UTF-8, ISO-8859, and other encodings to avoid garbled text.
Garbled text in your scraped data, such as Ã© where é should appear or â€™ where an apostrophe should appear, is almost always an encoding issue. Understanding how to detect and handle encodings is essential for scraping international websites.
How Encoding Works
Text on the web is sent as bytes. An encoding maps those bytes to characters. If you decode bytes with the wrong encoding, you get garbage.
| Encoding | Coverage | Common On |
|---|---|---|
| UTF-8 | All languages | Modern websites (90%+) |
| ISO-8859-1 (Latin-1) | Western European | Older European sites |
| Windows-1252 | Western European | Legacy Windows sites |
| Shift_JIS | Japanese | Japanese websites |
| GB2312 / GBK | Chinese | Chinese websites |
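The effect of a mismatched decoder can be reproduced in a few lines. This is a minimal sketch using only the standard library; the sample word is an arbitrary choice for illustration.

```python
# Encode a string as UTF-8, then decode the same bytes two different ways.
original = "café"
raw = original.encode("utf-8")      # b'caf\xc3\xa9' (é takes two bytes in UTF-8)

right = raw.decode("utf-8")         # matching decoder
wrong = raw.decode("iso-8859-1")    # mismatched decoder: every byte becomes one character

print(right)  # café
print(wrong)  # cafÃ©
```

The bytes never changed; only the mapping used to read them did.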
How Requests Handles Encoding
import requests
response = requests.get("https://quotes.toscrape.com/")
# Encoding guessed from the response bytes themselves
print(f"Apparent encoding: {response.apparent_encoding}")
# Encoding taken from the HTTP headers (may be None or a default)
print(f"Response encoding: {response.encoding}")
# response.text uses response.encoding to decode
# response.content is the raw bytes
text = response.text # Decoded string
raw = response.content # Raw bytes
The Most Common Problem
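Requests sometimes picks the wrong encoding. A common case is a server that sends a text/* Content-Type header with no charset: requests then falls back to ISO-8859-1 even if the page is really UTF-8. When the page looks garbled, fix it like this: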
import requests
response = requests.get("https://example.com/french-page")
# Problem: requests guessed ISO-8859-1 but the page is UTF-8
# Fix 1: Override the encoding before accessing .text
response.encoding = "utf-8"
correct_text = response.text
# Fix 2: Decode raw bytes manually
correct_text = response.content.decode("utf-8")
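If the raw bytes are gone and only the garbled string remains (for example, in data saved by an earlier run), the damage can sometimes be reversed by undoing the wrong decode. The sketch below assumes the text was UTF-8 that got decoded as Windows-1252; `repair_mojibake` is a hypothetical helper name, not part of requests or any library.

```python
def repair_mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Undo a wrong decode: turn the string back into bytes, then read them as UTF-8.

    Assumption: the text was UTF-8 that was mistakenly decoded as `wrong_encoding`.
    If the round trip fails, the text was probably fine, so it is returned unchanged.
    """
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repair_mojibake("CafÃ©"))    # Café
print(repair_mojibake("donâ€™t"))  # don’t
print(repair_mojibake("Café"))     # Café (already correct, left alone)
```

This only works when the wrong decode was lossless. Fixing the encoding at fetch time, as shown above, is the more reliable option.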
Detecting Encoding Automatically
The chardet library analyzes bytes to guess the encoding.
pip install chardet
import requests
import chardet
response = requests.get("https://example.com/unknown-encoding")
# Detect encoding from raw bytes
detected = chardet.detect(response.content)
print(detected)
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
# Use detected encoding
response.encoding = detected["encoding"]
text = response.text
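When adding a dependency is not an option, a rougher alternative is to try a short list of candidate encodings in order. This is only a sketch: `decode_with_fallbacks` and its default candidate list are assumptions chosen for illustration, and unlike chardet it does no statistical analysis.

```python
def decode_with_fallbacks(raw: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes without error.

    UTF-8 goes first because it is strict: bytes in another encoding rarely
    happen to be valid UTF-8. latin-1 goes last because it accepts any byte.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if the caller passed a candidate list with no catch-all
    return raw.decode("utf-8", errors="replace"), "utf-8"


print(decode_with_fallbacks("é".encode("utf-8")))  # ('é', 'utf-8')
print(decode_with_fallbacks(b"\xe9"))              # ('é', 'cp1252')
```

A successful decode does not prove the encoding is right, only that the bytes are valid in it, so treat the result as a guess.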
A Robust Encoding Handler
import requests
import chardet
from bs4 import BeautifulSoup
def fetch_with_encoding(url):
    """Fetch a page with proper encoding detection."""
    response = requests.get(url, timeout=15)
    # Strategy 1: Check HTTP Content-Type header
    content_type = response.headers.get("Content-Type", "").lower()
    if "charset=" in content_type:
        # Drop any trailing parameters and surrounding quotes
        encoding = content_type.split("charset=")[-1].split(";")[0].strip().strip("\"'")
        response.encoding = encoding
        return response.text
    # Strategy 2: Check HTML meta tag
    soup = BeautifulSoup(response.content, "html.parser")
    meta_charset = soup.find("meta", charset=True)
    if meta_charset:
        response.encoding = meta_charset["charset"]
        return response.text
    meta_content_type = soup.find("meta", {"http-equiv": "Content-Type"})
    if meta_content_type:
        content = meta_content_type.get("content", "").lower()
        if "charset=" in content:
            encoding = content.split("charset=")[-1].split(";")[0].strip().strip("\"'")
            response.encoding = encoding
            return response.text
    # Strategy 3: Detect from bytes (chardet may return None as the encoding)
    detected = chardet.detect(response.content)
    if detected["encoding"] and detected["confidence"] > 0.7:
        response.encoding = detected["encoding"]
        return response.text
    # Fallback: UTF-8
    response.encoding = "utf-8"
    return response.text
text = fetch_with_encoding("https://quotes.toscrape.com/")
print(text[:200])
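Splitting the Content-Type header by hand can trip over quoted values, extra parameters, or charset names that Python does not recognize. One possible alternative is to let the standard library parse the header and validate the name; `charset_from_content_type` is a hypothetical helper, sketched here under the assumption that an unknown charset should be treated the same as a missing one.

```python
import codecs
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return Python's codec name for the header's charset, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    charset = msg.get_content_charset()  # handles quotes, case, and extra parameters
    if charset is None:
        return None
    try:
        return codecs.lookup(charset).name  # normalizes aliases such as windows-1252
    except LookupError:
        return None  # the server declared an encoding Python does not know


print(charset_from_content_type('text/html; charset="UTF-8"'))     # utf-8
print(charset_from_content_type("text/html; charset=windows-1252"))  # cp1252
print(charset_from_content_type("text/html"))                       # None
```

Returning None lets the caller move on to the next strategy instead of decoding with a name that would fail.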
Handling Encoding in BeautifulSoup
BeautifulSoup can handle encoding when you pass raw bytes.
from bs4 import BeautifulSoup
import requests
response = requests.get("https://quotes.toscrape.com/")
# Pass bytes, not a string, so BeautifulSoup can detect the encoding itself
soup = BeautifulSoup(response.content, "html.parser")
# Check what encoding BeautifulSoup detected
print(f"Detected encoding: {soup.original_encoding}")
# All text output is now proper Unicode
title = soup.select_one("title").get_text()
print(title)
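When all you need is the charset the page declares, building a full parse tree is more work than necessary. A rough sketch of a cheaper check follows, assuming the declaration appears within the first 1024 bytes (where the HTML specification requires it to be). `sniff_meta_charset` is a hypothetical helper, and its regular expression is a simplification rather than a real HTML parser.

```python
import re

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw: bytes, limit: int = 1024):
    """Return the lowercased charset declared in a meta tag, or None."""
    match = _META_CHARSET.search(raw[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


html5 = b'<html><head><meta charset="UTF-8"><title>x</title></head></html>'
html4 = b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'

print(sniff_meta_charset(html5))  # utf-8
print(sniff_meta_charset(html4))  # iso-8859-1
```

The declared charset is only a claim made by the page author, so it can still be wrong; comparing it with the detected encoding is a reasonable sanity check.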
Saving Encoded Data Correctly
import json
data = [
    {"text": "Café au lait", "author": "René"},
    {"text": "Über cool", "author": "Hans"},
]
# Always write with UTF-8 encoding; ensure_ascii=False keeps é and Ü as real characters
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
# For CSV
import csv
with open("data.csv", "w", newline="", encoding="utf-8-sig") as f:
    # utf-8-sig adds BOM for Excel compatibility
    writer = csv.DictWriter(f, fieldnames=["text", "author"])
    writer.writeheader()
    writer.writerows(data)
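Reading the files back needs the same care as writing them. The sketch below writes one sample row to a temporary directory and reads it back; the row and the file names are made up for illustration. Opening the CSV with utf-8-sig strips the BOM, which would otherwise end up glued to the first column name.

```python
import csv
import json
import tempfile
from pathlib import Path

rows = [{"text": "Café au lait", "author": "René"}]

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "data.json"
    csv_path = Path(tmp) / "data.csv"

    # Write both files the same way as above
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False)
    with open(csv_path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "author"])
        writer.writeheader()
        writer.writerows(rows)

    # Read them back with explicit encodings
    with open(json_path, encoding="utf-8") as f:
        json_rows = json.load(f)
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        csv_rows = list(csv.DictReader(f))

    # The CSV file on disk starts with the three-byte UTF-8 BOM
    has_bom = csv_path.read_bytes().startswith(b"\xef\xbb\xbf")

print(json_rows == rows, csv_rows == rows, has_bom)
```

Leaving out the encoding argument makes Python use the platform default, which is not UTF-8 on every system.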
Tips
- Always pass response.content (bytes) to BeautifulSoup rather than response.text, and let BeautifulSoup handle encoding detection.
- Use encoding="utf-8" when writing output files.
- For CSV files opened in Excel, use utf-8-sig encoding to add a BOM (Byte Order Mark) so Excel recognizes the encoding.
- Services like ScrapingAnt return content with proper encoding handling, which can save you from debugging encoding issues.
Next Steps
- Learn to scrape XML and RSS feeds, which have their own encoding considerations
- Build scrapers for international websites with mixed encodings