Python UTF-8 Latin-1 Displays Wrong Character

April 16, 2024

I'm writing a very small script to convert Latin-1 characters into Unicode (I'm a complete beginner in Python). I tried a method like this: def latin1_to_unicode(character):

Solution 1:

Your source code is encoded as UTF-8, but you are decoding the data as Latin-1. Don't do that; you are creating mojibake.

Decode from UTF-8 instead, and don't encode again. print writes to sys.stdout, which is configured with your terminal or console codec (detected when Python starts).
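As a rough illustration of that advice, here is a minimal sketch. It is written for Python 3, which is an assumption (the session in this answer is Python 2), and the helper name utf8_bytes_to_text is hypothetical, not something from the question.

```python
# Python 3 sketch; the session in this answer is Python 2.
raw = b"\xc3\xa5"  # UTF-8 bytes for U+00E5 (a with ring above)


def utf8_bytes_to_text(data: bytes) -> str:
    """Hypothetical helper: decode UTF-8 bytes into a text string."""
    return data.decode("utf-8")


text = utf8_bytes_to_text(raw)

# One character, not two: the bytes were read with the codec that wrote them.
print(len(text), hex(ord(text)))  # 1 0xe5

# No manual encode step is needed afterwards: print(text) hands the string
# to sys.stdout, which encodes it with the codec detected for the terminal.
```

The point of the sketch is only that decoding happens once, with the codec that matches the bytes, and that output encoding is left to print.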

My terminal is configured for UTF-8, so when I enter the å character in my terminal, UTF-8 data is produced:

>>> 'å'
'\xc3\xa5'
>>> 'å'.decode('latin1')
u'\xc3\xa5'
>>> print 'å'.decode('latin1')
Ã¥

You can see that the character uses two bytes; when saving your Python source with an editor configured to use UTF-8, Python reads the exact same bytes from disk to put into your bytestring.
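To make the byte counts concrete, the following sketch encodes the same character with both codecs. It assumes Python 3, where text and bytes are separate types; the variable names are illustrative only.

```python
# Python 3 sketch: one character, two different byte representations.
char = "\u00e5"  # the character a with ring above

utf8_bytes = char.encode("utf-8")
latin1_bytes = char.encode("latin-1")

print(utf8_bytes, len(utf8_bytes))      # b'\xc3\xa5' 2
print(latin1_bytes, len(latin1_bytes))  # b'\xe5' 1
```

The two UTF-8 bytes are the same ones shown in the session above, which is why a file saved as UTF-8 cannot be read correctly as Latin-1.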

Decoding those two bytes as Latin-1 produces two Unicode code points, U+00C3 (Ã) and U+00A5 (¥), because Latin-1 maps every byte to exactly one code point.
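The effect can be reproduced in a few lines. This is a sketch in Python 3 (an assumption, since the session above is Python 2); the last step, re-encoding as Latin-1 and decoding as UTF-8, is a common way to undo this particular mistake and is shown here only for illustration.

```python
import unicodedata

# Python 3 sketch: decode UTF-8 bytes with the wrong codec on purpose.
raw = "\u00e5".encode("utf-8")  # b'\xc3\xa5'
wrong = raw.decode("latin-1")   # two characters instead of one

for ch in wrong:
    # Prints the code point and official Unicode name of each character.
    print(hex(ord(ch)), unicodedata.name(ch))

# Latin-1 maps bytes to code points one to one, so the mistake is reversible:
repaired = wrong.encode("latin-1").decode("utf-8")
print(len(wrong), len(repaired))  # 2 1
```

The loop reports LATIN CAPITAL LETTER A WITH TILDE and YEN SIGN, the two characters seen in the garbled output.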

You probably want to do some studying on the difference between Unicode and encodings, and how that relates to Python:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO
