Converting Double Slash Utf-8 Encoding
Solution 1:
The problem is that the unicode_escape codec implicitly decodes the result of the escape fixes by assuming the bytes are latin-1, not utf-8. You can fix this by:
# Read the file as bytes:
with open(myfile, 'rb') as f:
    data = f.read()
# Decode with unicode-escape to get Py2 unicode/Py3 str, but interpreted
# incorrectly as latin-1
badlatin = data.decode('unicode-escape')
# Encode back as latin-1 to get back the raw bytes (it's a 1-1 encoding),
# then decode them properly as utf-8
goodutf8 = badlatin.encode('latin-1').decode('utf-8')
Which (assuming the file contains the literal backslashes and codes, not the bytes they represent) leaves you with '\u624e\u52a0\u62c9' (which should be correct; I'm just on a system without font support for those characters, so that's the safe repr based on Unicode escapes). You could skip a step in Py2 by using the string-escape codec for the first stage decode (which I believe would allow you to omit the .encode('latin-1') step), but this solution should be portable, and the cost shouldn't be terrible.
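As a quick check that doesn't need a file, here is a minimal sketch of the same round trip on an in-memory bytes object. The helper name fix_escaped_utf8 and the sample bytes are only for illustration, and they assume the data holds literal backslash escapes:

```python
def fix_escaped_utf8(data):
    """Turn bytes holding literal \\xNN escapes into the UTF-8 text they spell out."""
    # Stage 1: resolve the escapes; each \xNN becomes the code point U+00NN (latin-1 view)
    badlatin = data.decode('unicode-escape')
    # Stage 2: map those code points back to raw bytes 1-1, then decode as UTF-8
    return badlatin.encode('latin-1').decode('utf-8')


# Sample input (assumed): what f.read() would return for a file with literal backslashes
sample = b'\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89'
result = fix_escaped_utf8(sample)
# Compare by code point so no font support is needed to read the check
print(result == '\u624e\u52a0\u62c9')
```

One thing to keep in mind: the encode('latin-1') step would raise UnicodeEncodeError if the data also contained \u escapes for code points above U+00FF, so this sketch only fits input made of \xNN escapes and plain ASCII.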
Solution 2:
I'm assuming you're using Python 3. In Python 2, strings are bytes by default, so it would just work for you. But in Python 3, strings are Unicode and are interpreted as Unicode, which is what makes this problem harder when you have a byte string being read as Unicode text.
This solution was inspired by mgilson's answer. We can literally evaluate your unicode string as a byte string by using literal_eval:
from ast import literal_eval
with open('source.txt', 'r', encoding='utf-8') as f_open:
    # Drop a trailing newline, which would otherwise break the b'...' literal below
    source = f_open.read().rstrip('\n')
string = literal_eval("b'{}'".format(source)).decode('utf-8')
print(string) # 扎加拉
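To try this without a source.txt on disk, the same idea can be sketched as a small function. The name decode_escaped and the sample string are assumptions made for this example:

```python
from ast import literal_eval


def decode_escaped(text):
    # Drop surrounding line breaks: a raw newline inside b'...' is a syntax error
    literal = "b'{}'".format(text.strip('\r\n'))
    # literal_eval only accepts Python literals, so no arbitrary code is run
    return literal_eval(literal).decode('utf-8')


# Assumed file contents: escaped bytes followed by a trailing newline
source = '\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89\n'
string = decode_escaped(source)
print(string == '\u624e\u52a0\u62c9')
```

This still has a limit: if the text itself contains a single quote, the b'...' wrapper is closed early and literal_eval raises an error, so it is best suited to input you know consists of escapes and ordinary characters.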
Solution 3:
You can do some silly things like evaluating the string:
import ast
s = r'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'
# Python 2 syntax: there the quoted literal evaluates to a byte string, which has .decode()
print ast.literal_eval('"%s"' % s).decode('utf-8')
- note: use ast.literal_eval (rather than plain eval) if you don't want attackers to gain access to your system :-P
Using this in your case would probably look something like:
with open('file') as file_handle:
    data = ast.literal_eval('"%s"' % file_handle.read()).decode('utf-8')
I think that the real issue here is likely that you have a file that contains strings representing bytes (rather than having a file that just stores the bytes themselves). So, fixing whatever code generated that file in the first place is probably a better bet. However, barring that, this is the next best thing that I could come up with ...
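To illustrate the difference between the two kinds of file, here is a rough sketch. How the original file was produced is unknown, so writing out the repr of the bytes is only a guess at the cause; the path and variable names are made up for the example:

```python
import os
import tempfile

text = '\u624e\u52a0\u62c9'

# Guess at the cause (assumption): the escaped repr of the bytes was written as text
escaped_form = repr(text.encode('utf-8'))[2:-1]

# Writing the text with an explicit encoding stores the real UTF-8 bytes instead
path = os.path.join(tempfile.mkdtemp(), 'fixed.txt')
with open(path, 'w', encoding='utf-8') as file_handle:
    file_handle.write(text)

# Reading it back needs no unescaping at all
with open(path, encoding='utf-8') as file_handle:
    round_trip = file_handle.read()

print(escaped_form)
print(round_trip == text)
```

If the generating code can be changed to write the text (or the raw bytes in 'wb' mode) like this, none of the decoding workarounds are needed.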
Solution 4:
A solution in Python 3 using only string manipulation and encoding conversions, without evil eval :)
import binascii
s = '\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89'
s = s.replace('\\x', '') # s == 'e6898ee58aa0e68b89'
# we can use any encoding as long as it translates ASCII as is,
# for example we could use s.encode('ascii') here
s = s.encode('utf8') # s == b'e6898ee58aa0e68b89'
s = binascii.a2b_hex(s) # s == b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'
s = s.decode('utf8') # s == '扎加拉'
If you'd like a one-liner, then starting again from the original escaped string s, it is simply:
binascii.a2b_hex(s.replace('\\x', '').encode()).decode('utf8')
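The replace('\\x', '') step assumes the input consists of nothing but \xNN escapes; any ordinary character mixed in would make a2b_hex fail. As a variation on the same no-eval idea (not part of the answer above), a regular expression can convert only the escapes and leave other text alone. The function name unescape_hex and the sample input are assumptions for this sketch:

```python
import re

# Matches a literal backslash, 'x', and exactly two hex digits
_HEX_ESCAPE = re.compile(rb'\\x([0-9a-fA-F]{2})')


def unescape_hex(text):
    # Work on bytes so each escape can be replaced by the single byte it names
    raw = text.encode('utf-8')
    raw = _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    # The bytes now hold real UTF-8, so a normal decode finishes the job
    return raw.decode('utf-8')


# Assumed input: plain ASCII text followed by escaped UTF-8 bytes
mixed = 'name: \\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89'
decoded = unescape_hex(mixed)
print(decoded == 'name: \u624e\u52a0\u62c9')
```

The trade-off is that this only understands \xNN escapes; other escape forms such as \n or \u624e are left untouched.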
Solution 5:
At the end of the day, what you get back is a string, right? I would use the string's replace method to convert the double slashes to single slashes and add a b prefix to make it work.