
Python Utf-8 Latin-1 Displays Wrong Character

I'm writing a very small script that can convert latin-1 characters into unicode (I'm a complete beginner in Python). I tried a method like this: def latin1_to_unicode(character):

Solution 1:

Your source code is encoded as UTF-8, but you are decoding the data as Latin-1. Don't do that; it produces mojibake.

Decode from UTF-8 instead, and don't encode again. print writes to sys.stdout, which is configured with your terminal or console codec (detected when Python starts), so the Unicode value is encoded for you on output.
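As a minimal sketch of "decode from UTF-8", assuming Python 3 (where byte strings are written b'...') rather than the Python 2 session in this answer. The helper name utf8_to_text is hypothetical and only for illustration:

```python
# Python 3 sketch. In Python 2, a plain 'å' literal in a UTF-8 source file
# already holds these same two bytes.
raw = b"\xc3\xa5"  # the UTF-8 encoding of 'å'


def utf8_to_text(data):
    """Decode UTF-8 bytes into a Unicode string (hypothetical helper)."""
    return data.decode("utf-8")


text = utf8_to_text(raw)

# ascii() is used so the output does not depend on the terminal encoding.
print(ascii(text))  # '\xe5'
```

On a terminal configured for UTF-8, print(text) would display å directly, with no manual .encode() call.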

My terminal is configured for UTF-8, so when I enter the å character in my terminal, UTF-8 data is produced:

>>> 'å'
'\xc3\xa5'
>>> 'å'.decode('latin1')
u'\xc3\xa5'
>>> print 'å'.decode('latin1')
Ã¥

You can see that the character uses two bytes; when saving your Python source with an editor configured to use UTF-8, Python reads the exact same bytes from disk to put into your bytestring.
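The byte counts can be checked directly. This is a small sketch, again assuming Python 3, that encodes the same character with both codecs:

```python
char = "\u00e5"  # å, written as an escape so the file encoding does not matter

utf8_bytes = char.encode("utf-8")
latin1_bytes = char.encode("latin-1")

print(len(utf8_bytes), utf8_bytes)      # 2 b'\xc3\xa5'
print(len(latin1_bytes), latin1_bytes)  # 1 b'\xe5'
```

The same character is two bytes in UTF-8 and one byte in Latin-1, which is why the codec used for decoding has to match the one used for saving.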

Decoding those two bytes as Latin-1 produces two Unicode code points, U+00C3 (Ã) and U+00A5 (¥), because Latin-1 maps every byte to exactly one code point.
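To make the mismatch concrete, here is a hedged Python 3 sketch that decodes the same two bytes the wrong way and then reverses the mistake. The round trip in the last step is an illustration, not part of the original answer, and it only works when the text was garbled exactly this way:

```python
raw = b"\xc3\xa5"  # UTF-8 bytes for å

# Wrong codec: each byte becomes its own code point.
wrong = raw.decode("latin-1")
print([hex(ord(c)) for c in wrong])  # ['0xc3', '0xa5']

# Reversing the mistake: re-encode as Latin-1 to recover the original
# bytes, then decode them with the codec that actually produced them.
repaired = wrong.encode("latin-1").decode("utf-8")
print(ascii(repaired))  # '\xe5'
```

Decoding with the right codec in the first place avoids the need for this repair step.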

You probably want to read up on the difference between Unicode and encodings, and how that relates to Python.
