Handling Different Encodings (UTF-8, ISO-8859) - Python Scraping

Handle character encoding issues in web scraping. Detect, convert, and fix UTF-8, ISO-8859, and other encodings to avoid garbled text.

Garbled text in your scraped data, such as Ã© instead of é or â€™ instead of an apostrophe (’), is almost always an encoding issue. Understanding how to detect and handle encodings is essential for scraping international websites.

How Encoding Works

Text on the web is sent as bytes. An encoding maps those bytes to characters. If you decode bytes with the wrong encoding, you get garbage.

Encoding	Coverage	Common On
UTF-8	All languages	Modern websites (90%+)
ISO-8859-1 (Latin-1)	Western European	Older European sites
Windows-1252	Western European	Legacy Windows sites
Shift_JIS	Japanese	Japanese websites
GB2312 / GBK	Chinese	Chinese websites
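
To see the mismatch concretely, the short sketch below encodes two characters as UTF-8 and then decodes the same bytes with the wrong encoding. It uses only built-in str and bytes methods; the variable names are illustrative.

```python
# Encode text as UTF-8, then decode the same bytes with the wrong encoding.
accented = "é"
apostrophe = "\u2019"  # right single quotation mark

utf8_accented = accented.encode("utf-8")      # b'\xc3\xa9'
utf8_apostrophe = apostrophe.encode("utf-8")  # b'\xe2\x80\x99'

# Wrong: each UTF-8 byte is read as a separate single-byte character
garbled_accented = utf8_accented.decode("iso-8859-1")
garbled_apostrophe = utf8_apostrophe.decode("cp1252")  # cp1252 = Windows-1252

print(garbled_accented)    # Ã©
print(garbled_apostrophe)  # â€™

# Right: decoding with the encoding that produced the bytes restores the text
print(utf8_accented.decode("utf-8"))  # é
```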

How Requests Handles Encoding

import requests

response = requests.get("https://quotes.toscrape.com/")

# apparent_encoding is a guess based on the response body;
# encoding is what requests took from the Content-Type header
print(f"Apparent encoding: {response.apparent_encoding}")
print(f"Response encoding: {response.encoding}")

# response.text uses response.encoding to decode
# response.content is the raw bytes
text = response.text       # Decoded string
raw = response.content     # Raw bytes

The Most Common Problem

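Requests takes the encoding from the Content-Type header. When a text response declares no charset, it falls back to ISO-8859-1, which garbles pages that are really UTF-8. When the page looks garbled, fix it like this: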

import requests

response = requests.get("https://example.com/french-page")

# Problem: requests guessed ISO-8859-1 but the page is UTF-8
# Fix 1: Override the encoding before accessing .text
response.encoding = "utf-8"
correct_text = response.text

# Fix 2: Decode raw bytes manually
correct_text = response.content.decode("utf-8")
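
If garbled text has already been saved and the page cannot be refetched, the wrong decoding can often be reversed: encode the string back into the bytes it came from, then decode those bytes as UTF-8. The helper below is a minimal sketch and not part of requests; the name repair_mojibake and the default assumption that the text was wrongly decoded as Windows-1252 are illustrative.

```python
def repair_mojibake(text, wrong_encoding="cp1252"):
    """Reverse a wrong decode: re-encode with the wrong codec, decode as UTF-8.

    Returns the input unchanged when the round trip is not possible,
    which usually means the text was not garbled this way.
    """
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repair_mojibake("CafÃ© au lait"))  # Café au lait
print(repair_mojibake("Café au lait"))   # already correct, returned unchanged
```

This only works when the wrong decode was lossless. If bytes were dropped or replaced along the way, the original text cannot be recovered from the string alone.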

Detecting Encoding Automatically

The chardet library analyzes bytes to guess the encoding.

pip install chardet

import requests
import chardet

response = requests.get("https://example.com/unknown-encoding")

# Detect encoding from raw bytes
detected = chardet.detect(response.content)
print(detected)
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

# Use detected encoding (chardet reports None when it cannot decide)
response.encoding = detected["encoding"] or "utf-8"
text = response.text
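
When adding a dependency is not an option, a simpler heuristic is to try a short list of candidate encodings in order. This is a different technique from chardet's statistical detection, shown here only as a sketch; the function name decode_with_fallback and the candidate list are assumptions to adapt to the sites being scraped. UTF-8 goes first because bytes written in other encodings rarely happen to form valid UTF-8.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: never raises, invalid bytes become U+FFFD
    return raw.decode("utf-8", errors="replace"), "utf-8"


text, used = decode_with_fallback("café".encode("utf-8"))
print(text, used)  # café utf-8

text, used = decode_with_fallback("café".encode("cp1252"))
print(text, used)  # café cp1252
```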

A Robust Encoding Handler

import requests
import chardet
from bs4 import BeautifulSoup


def fetch_with_encoding(url):
    """Fetch a page with proper encoding detection."""
    response = requests.get(url, timeout=15)

    # Strategy 1: Check HTTP Content-Type header
    content_type = response.headers.get("Content-Type", "").lower()
    if "charset=" in content_type:
        # Keep only the charset value, without trailing parameters or quotes
        encoding = content_type.split("charset=")[-1].split(";")[0]
        response.encoding = encoding.strip().strip("\"'")
        return response.text

    # Strategy 2: Check HTML meta tag
    soup = BeautifulSoup(response.content, "html.parser")
    meta_charset = soup.find("meta", charset=True)
    if meta_charset:
        response.encoding = meta_charset["charset"]
        return response.text

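    # Match http-equiv case-insensitively; sites write it in varying case
    meta_content_type = soup.find(
        "meta", attrs={"http-equiv": lambda v: v and v.lower() == "content-type"}
    )
    if meta_content_type:
        content = meta_content_type.get("content", "").lower()
        if "charset=" in content:
            # Keep only the charset value, without trailing parameters or quotes
            encoding = content.split("charset=")[-1].split(";")[0]
            response.encoding = encoding.strip().strip("\"'")
            return response.text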

    # Strategy 3: Detect from bytes
    detected = chardet.detect(response.content)
    if detected["confidence"] > 0.7:
        response.encoding = detected["encoding"]
        return response.text

    # Fallback: UTF-8
    response.encoding = "utf-8"
    return response.text


text = fetch_with_encoding("https://quotes.toscrape.com/")
print(text[:200])
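
Header values in the wild can carry quotes, mixed case, or extra parameters around the charset. As one possible way to make Strategy 1 more defensive, this sketch parses the header with the standard library's email.message.Message instead of splitting strings by hand; the helper name charset_from_content_type is hypothetical.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter, or None if there is none."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_charset()


print(charset_from_content_type('text/html; charset="UTF-8"'))          # utf-8
print(charset_from_content_type("text/html; charset=ISO-8859-1; x=1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                           # None
```

The result could be passed to response.encoding when it is not None, leaving the meta tag and byte-detection strategies as the next steps.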

Handling Encoding in BeautifulSoup

BeautifulSoup can handle encoding when you pass raw bytes.

from bs4 import BeautifulSoup
import requests

response = requests.get("https://quotes.toscrape.com/")

# Pass bytes, not a string, so BeautifulSoup can detect the encoding
soup = BeautifulSoup(response.content, "html.parser")

# Check what encoding BeautifulSoup detected
print(f"Detected encoding: {soup.original_encoding}")

# All text output is now proper Unicode
title = soup.select_one("title").get_text()
print(title)

Saving Encoded Data Correctly

import json

data = [
    {"text": "Café au lait", "author": "René"},
    {"text": "Über cool", "author": "Hans"},
]

# Always write with UTF-8 encoding
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# For CSV
import csv
with open("data.csv", "w", newline="", encoding="utf-8-sig") as f:
    # utf-8-sig adds BOM for Excel compatibility
    writer = csv.DictWriter(f, fieldnames=["text", "author"])
    writer.writeheader()
    writer.writerows(data)
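
The effect of ensure_ascii and utf-8-sig can be checked without touching the disk. This sketch, using illustrative sample data, compares json.dumps with and without ensure_ascii and shows the three BOM bytes that the utf-8-sig codec prepends.

```python
import json

record = {"text": "Café au lait", "author": "René"}

escaped = json.dumps(record)                       # non-ASCII becomes \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # characters are kept as-is

print(escaped)   # {"text": "Caf\u00e9 au lait", "author": "Ren\u00e9"}
print(readable)  # {"text": "Café au lait", "author": "René"}

# Both forms load back to the same data
assert json.loads(escaped) == json.loads(readable) == record

# utf-8-sig writes a BOM (EF BB BF) before the UTF-8 bytes
with_bom = "é".encode("utf-8-sig")
print(with_bom)  # b'\xef\xbb\xbf\xc3\xa9'

# Reading with utf-8-sig strips the BOM; plain utf-8 keeps it as U+FEFF
print(with_bom.decode("utf-8-sig") == "é")         # True
print(with_bom.decode("utf-8") == "\ufeff" + "é")  # True
```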

Tips

Always pass response.content (bytes) to BeautifulSoup rather than response.text, so that BeautifulSoup can handle encoding detection itself.
Use encoding="utf-8" when writing output files.
For CSV files opened in Excel, use utf-8-sig encoding to add a BOM (Byte Order Mark) so Excel recognizes the encoding.
Services like ScrapingAnt return content with proper encoding handling, which can save you from debugging encoding issues.

Next Steps

Learn to scrape XML and RSS feeds, which have their own encoding considerations
Build scrapers for international websites with mixed encodings