
How To Get The Non-ascii Letters From A File, Without Them Being "corrupted"?

I want to know how to read non-ASCII letters from a file without them being 'corrupted'. Here is a reproduction: print(open('somefile.txt').read()) somefile.txt (saved as

Solution 1:

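You're opening the file as cp1252, but it should be opened as UTF-16. When no encoding argument is given, open() falls back to the platform's default encoding, which is typically cp1252 on Western-locale Windows.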

(ÿþ is indicative of the UTF-16LE Byte Order Mark being wrongly interpreted as Windows-1252.)

>>> open('foo.txt', encoding='utf-16').read()
'čđža'
>>> open('foo.txt', encoding='cp1252').read()
'ÿþ\n\x01\x11\x01~\x01a\x00'
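
If the encoding is not known in advance, the BOM itself can be checked before decoding. The following is only a minimal sketch using the standard `codecs` constants; the helper names `sniff_bom` and `read_text` are made up for illustration, and the `utf-8` default for files without a BOM is an assumption that may not suit your data.

```python
import codecs

# BOM prefixes mapped to a codec that consumes the BOM while decoding.
# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM begins with
# the same two bytes (FF FE) as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(path, default="utf-8"):
    """Return a codec name based on the file's BOM, or `default` if none is found."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM checked here is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default  # assumption: BOM-less files are treated as UTF-8


def read_text(path):
    """Read a text file, choosing the encoding from its BOM when present."""
    with open(path, encoding=sniff_bom(path)) as f:
        return f.read()
```

For a file like the one above, `read_text('foo.txt')` would pick `utf-16` from the leading FF FE bytes. A file with no BOM gives this approach nothing to go on, so it only helps for encodings that mark themselves this way.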

On a unix system, you can use file to see what's in the file:

~$ file foo.txt
foo.txt: Little-endian UTF-16 Unicode text, with no line terminators

In Python, the chardet library is good for this:

>>> import chardet
>>> chardet.detect(open('foo.txt', 'rb').read())
{'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}
