Converting Double Slash Utf-8 Encoding

December 11, 2022

I cannot get this to work! I have a text file from a save game file parser with a bunch of UTF-8 Chinese names in it in byte form, like this in the source.txt: \xe6\x89\x8e\xe5\x8

Solution 1:

The problem is that the unicode_escape codec decodes each \xNN escape as the code point U+00NN, which effectively treats the escaped bytes as latin-1 rather than UTF-8. You can fix this as follows:

# Read the file as bytes:
with open(myfile, 'rb') as f:
    data = f.read()

# Decode with unicode-escape to get Py2 unicode/Py3 str, but interpreted
# incorrectly as latin-1
badlatin = data.decode('unicode-escape')

# Encode back as latin-1 to get back the raw bytes (it's a 1-1 encoding),
# then decode them properly as utf-8
goodutf8 = badlatin.encode('latin-1').decode('utf-8')

Assuming the file contains the literal backslashes and hex codes, not the bytes they represent, this leaves you with '\u624e\u52a0\u62c9'. (That should be correct; I'm on a system without font support for those characters, so that's the safe repr based on Unicode escapes.) In Py2 you could skip a step by using the string-escape codec for the first-stage decode, which I believe would let you omit the .encode('latin-1') step, but the solution above should be portable, and the cost shouldn't be terrible.
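To make the three stages easier to follow, here is a minimal in-memory sketch of the same round trip. The `data` value is a hypothetical stand-in for what reading source.txt in 'rb' mode might return, assuming the file really holds literal backslash-x text.

```python
# Hypothetical sample: literal backslash-x text, as read from the file in 'rb' mode.
data = b'\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89'

# Stage 1: unicode-escape turns each \xNN escape into the code point U+00NN,
# so 36 bytes of escaped text become 9 characters.
badlatin = data.decode('unicode-escape')

# Stage 2: latin-1 maps each code point U+00NN back to the single byte 0xNN.
raw = badlatin.encode('latin-1')

# Stage 3: the recovered bytes are now decoded as the UTF-8 they really are.
goodutf8 = raw.decode('utf-8')

print(len(data), len(badlatin), len(raw))  # 36 9 9
print(goodutf8)  # 扎加拉
```

The intermediate lengths show why the latin-1 step is needed: after stage 1 there is one character per original byte, not one per Chinese character.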

Solution 2:

I'm assuming you're using Python 3. In Python 2, strings are bytes by default, so it would just work for you. But in Python 3, strings are Unicode and interpreted as Unicode, which makes this problem harder when a byte string has been read as Unicode text.

This solution was inspired by mgilson's answer. We can literally evaluate your unicode string as a byte string by using literal_eval:

from ast import literal_eval

with open('source.txt', 'r', encoding='utf-8') as f_open:
    # strip() drops a trailing newline, which would otherwise break the literal
    source = f_open.read().strip()
    string = literal_eval("b'{}'".format(source)).decode('utf-8')
    print(string)  # 扎加拉
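If the file holds several names, one per line, the same idea can be applied line by line. The sketch below is only an illustration: `decode_escaped_lines` and the `sample` text are hypothetical, and it assumes no line contains a single quote or a non-ASCII character, either of which would make the bytes literal invalid.

```python
from ast import literal_eval

def decode_escaped_lines(text):
    """Hypothetical helper: decode each non-empty line of literal \\xNN text."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Wrap the line in a bytes literal, evaluate it safely, then decode.
        names.append(literal_eval("b'{}'".format(line)).decode('utf-8'))
    return names

# Hypothetical file content: two escaped names, one per line.
sample = '\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89\n\\xe5\\x8a\\xa0\n'
print(decode_escaped_lines(sample))  # ['扎加拉', '加']
```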

Solution 3:

You can do some silly things like evaluating the string:

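import ast
s = r'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'
# The b prefix yields a bytes object, which is needed for .decode() on Python 3
print(ast.literal_eval('b"%s"' % s).decode('utf-8'))

Note: use ast.literal_eval rather than eval if you don't want attackers to gain access to your system :-P

Using this in your case would probably look something like:

with open('file') as file_handle:
    # strip() drops a trailing newline, which would otherwise break the literal
    data = ast.literal_eval('b"%s"' % file_handle.read().strip()).decode('utf-8')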

I think that the real issue here is likely that you have a file that contains strings representing bytes (rather than having a file that just stores the bytes themselves). So, fixing whatever code generated that file in the first place is probably a better bet. However, barring that, this is the next best thing that I could come up with ...
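As a rough illustration of that point, the sketch below shows one way such a file could come about and how decoding before writing avoids it. Everything here is an assumption: the real parser is not shown in the question, `name_bytes` is a stand-in for what it extracts, and `names.txt` is a hypothetical output file.

```python
import os
import tempfile

# Stand-in for the raw bytes a save-game parser might extract (an assumption).
name_bytes = b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'

# Writing the repr of a bytes object produces escaped text similar to
# what the question describes.
escaped_text = repr(name_bytes)

# Decoding first, then writing through a UTF-8 text file, keeps the real characters.
decoded_name = name_bytes.decode('utf-8')

path = os.path.join(tempfile.mkdtemp(), 'names.txt')  # hypothetical output file
with open(path, 'w', encoding='utf-8') as out:
    out.write(decoded_name + '\n')

# Reading the file back needs no escape handling at all.
with open(path, 'r', encoding='utf-8') as check:
    round_tripped = check.read().strip()

print(escaped_text)
print(round_tripped == decoded_name)  # True
```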

Solution 4:

A Python 3 solution using only string manipulation and encoding conversions, without evil eval :)

import binascii

s = '\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89'
s = s.replace('\\x', '')  # s == 'e6898ee58aa0e68b89'

# Any encoding that maps ASCII characters to themselves works here,
# for example s.encode('ascii')
s = s.encode('utf8')  # s == b'e6898ee58aa0e68b89'

s = binascii.a2b_hex(s)  # s == b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'
s = s.decode('utf8')  # s == '扎加拉'

If you like a one-liner, we can put it simply as follows (with s being the original escaped string):

binascii.a2b_hex(s.replace('\\x', '').encode()).decode('utf8')
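The replace-then-unhexlify approach expects the whole string to consist of \xNN escapes. If the escapes are mixed with ordinary text, one possible variation is to convert only the escaped runs with a regular expression. This is a sketch under that assumption: `decode_escape_runs` and the sample `line` are hypothetical, and each run is assumed to be valid UTF-8.

```python
import binascii
import re

# Matches one or more consecutive literal \xNN escapes.
ESCAPE_RUN = re.compile(r'(?:\\x[0-9a-fA-F]{2})+')

def decode_escape_runs(text):
    """Hypothetical helper: decode runs of \\xNN escapes, leave other text alone."""
    def _decode(match):
        # Drop the \x markers, turn the hex digits into bytes, decode as UTF-8.
        hex_digits = match.group(0).replace('\\x', '')
        return binascii.a2b_hex(hex_digits).decode('utf8')
    return ESCAPE_RUN.sub(_decode, text)

# Hypothetical line mixing plain text with an escaped name.
line = 'name: \\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89 (level 3)'
print(decode_escape_runs(line))  # name: 扎加拉 (level 3)
```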

Solution 5:

At the end of the day, what you get back is a string, right? I would use the string.replace method to convert the double slashes to single slashes and add a b prefix to make it work.
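Before reaching for replace, it may be worth checking whether the doubled backslashes are really in the data or are only how the interpreter displays a single backslash in a repr. The sketch below assumes the latter, with `s` as a hypothetical value read from the file. In that case there is nothing for replace to change, and turning the text into bytes still needs a real decoding step such as literal_eval.

```python
from ast import literal_eval

# Hypothetical text as read from the file: three literal \xNN escapes.
s = r'\xe6\x89\x8e'

print(len(s))   # 12: each escape is 4 characters, with one backslash each
print(repr(s))  # the repr displays every backslash doubled
print(s)        # printing the text itself shows single backslashes

# No doubled backslash exists in the data, so this replace changes nothing.
unchanged = s.replace('\\\\', '\\')

# Wrapping the text in a bytes literal and evaluating it does the actual conversion.
decoded = literal_eval("b'{}'".format(s)).decode('utf-8')
print(decoded)  # 扎
```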
