How To Get The Non-ascii Letters From A File, Without Them Being "corrupted"?
I want to know how to read non-ASCII letters from a file without them being 'corrupted'. Here is the recreation: print(open('somefile.txt').read()) somefile.txt (saved as
Solution 1:
You're opening the file as cp1252; you should open it as utf-16.
(ÿþ is indicative of the UTF-16LE byte order mark being wrongly interpreted as Windows-1252.)
>>> open('foo.txt', encoding='utf-16').read()
'čđža'
>>> open('foo.txt', encoding='cp1252').read()
'ÿþ\n\x01\x11\x01~\x01a\x00'
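To see where those characters come from, here is a minimal sketch that builds the same bytes in memory and decodes them both ways. The helper name decode_both_ways is made up for illustration, and the sketch assumes the file holds the text 'čđža' as UTF-16LE with a byte order mark, as in the session above.

```python
import codecs

def decode_both_ways(text):
    """Encode text as UTF-16LE with a BOM, then decode it rightly and wrongly.

    Hypothetical helper for illustration only.
    """
    # BOM (FF FE) followed by two bytes per character, low byte first
    raw = codecs.BOM_UTF16_LE + text.encode('utf-16-le')
    correct = raw.decode('utf-16')    # the BOM is consumed and selects the byte order
    mojibake = raw.decode('cp1252')   # every single byte becomes one character
    return raw, correct, mojibake

text = 'čđža'
raw, correct, mojibake = decode_both_ways(text)
print(raw[:2])            # the two BOM bytes
print(correct == text)    # the utf-16 decode round-trips
print(ascii(mojibake))    # escaped form of the wrongly decoded string
```

Decoding the bytes directly leaves the 0x0D byte as '\r'. The open() call above shows '\n' in that position because text mode translates line endings by default.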
On a Unix system, you can use the file command to see what's in the file:
~$ file foo.txt
foo.txt: Little-endian UTF-16 Unicode text, with no line terminators
In Python, the third-party chardet library is good for guessing the encoding from the raw bytes:
>>> import chardet
>>> chardet.detect(open('foo.txt', 'rb').read())
{'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}
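If installing chardet isn't an option, a file that starts with a byte order mark can be recognised with the standard library alone. This is a rough sketch, not a full detector: sniff_bom and read_text are made-up names, and the 'utf-8' fallback is only an assumption for files that carry no BOM.

```python
import codecs

# Order matters: the UTF-32 LE mark begins with the same two bytes as the
# UTF-16 LE mark, so the longer marks are tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def sniff_bom(raw, default=None):
    """Return a codec name if raw starts with a known BOM, else default."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default

def read_text(path, fallback='utf-8'):
    """Read a text file, choosing the codec from its BOM when one is present."""
    with open(path, 'rb') as f:
        head = f.read(4)  # the longest BOM checked here is 4 bytes
    # fallback is an assumption; a file without a BOM gives no hint here
    with open(path, encoding=sniff_bom(head, default=fallback)) as f:
        return f.read()
```

The returned names ('utf-16', 'utf-32', 'utf-8-sig') are the codecs that strip the mark while decoding, so the BOM does not end up in the text. For files without a BOM this tells you nothing, which is where a statistical guesser like chardet is still useful.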