How To Convert Repr Into Encoded String

June 25, 2024

I have this str (coming from a file I can't fix):

In [131]: s
Out[131]: '\\xce\\xb8Oph'

This is close to the repr of a string encoded in utf8:

In [132]: repr('θOph'.encode('utf8'

Solution 1:

Your solution is OK; the only problem is that eval is dangerous when used with arbitrary input. The safe alternative is ast.literal_eval:

>>> s = '\\xce\\xb8Oph'
>>> from ast import literal_eval
>>> literal_eval("b'{}'".format(s)).decode('utf8')
'\u03b8Oph'

With eval you are subject to:

>>> eval("b'{}'".format("1' and print('rm -rf /') or b'u r owned")).decode('utf8')
rm -rf /
'u r owned'

Since ast.literal_eval is the opposite of repr for literals, I guess it is what you are looking for.
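To make the difference from eval concrete, here is a minimal sketch that wraps the literal_eval call in a helper and feeds it the same malicious input. The name unrepr_utf8 is hypothetical, and the sketch assumes the input contains no unescaped single quote that would end the bytes literal early.

```python
from ast import literal_eval


def unrepr_utf8(escaped):
    """Hypothetical helper: parse escaped text as a bytes literal, then decode it as UTF-8."""
    # literal_eval only accepts literals, so anything executable is refused.
    return literal_eval("b'{}'".format(escaped)).decode("utf8")


# ascii() keeps the output printable on any terminal.
print(ascii(unrepr_utf8("\\xce\\xb8Oph")))  # '\u03b8Oph'

try:
    unrepr_utf8("1' and print('rm -rf /') or b'u r owned")
except (ValueError, SyntaxError) as exc:
    # The injected expression is rejected instead of being run.
    print("rejected:", type(exc).__name__)
```

Unlike the eval version, the injected print call never runs; literal_eval raises as soon as it meets something that is not a plain literal.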

[update]

If you have a file with escaped unicode, you may want to open it with the unicode_escape encoding as suggested in the answer by Ginger++. I will keep my answer because the question was "how to convert repr into encoded string", not "how to decode file with escaped unicode".

Solution 2:

Just open your file with unicode_escape encoding, like:

with open('name', encoding="unicode_escape") as f:
    pass  # your code here

Original answer:

>>> '\\xce\\xb8Oph'.encode('utf-8').decode('unicode_escape')
'Î¸Oph'

You can skip the encode to UTF-8 if you read your file in binary mode instead of text mode:

>>> b'\\xce\\xb8Oph'.decode('unicode_escape')
'Î¸Oph'
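The 'Î¸Oph' result appears because unicode_escape turns each \xNN escape into the code point U+00NN rather than into a byte. Encoding that text as Latin-1 maps every such code point back to the byte with the same value, and those bytes can then be decoded as UTF-8. A minimal sketch of the whole round trip, with decode_escaped_utf8 as a hypothetical name, assuming the data holds only \xNN-style escapes and no code points above U+00FF:

```python
def decode_escaped_utf8(raw):
    """Hypothetical helper: bytes holding \\xNN escapes -> the UTF-8 text they spell out."""
    # Step 1: resolve the escapes; each \xNN becomes the code point U+00NN.
    text = raw.decode("unicode_escape")
    # Step 2: Latin-1 maps U+0000..U+00FF back to the bytes 0x00..0xFF.
    # Step 3: decode those bytes as the UTF-8 they really are.
    return text.encode("latin-1").decode("utf-8")


print(ascii(decode_escaped_utf8(b"\\xce\\xb8Oph")))  # '\u03b8Oph'
```

If the text also contained characters above U+00FF, the Latin-1 step would raise UnicodeEncodeError, so this is only a sketch for input shaped like the one in the question.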

Solution 3:

Unfortunately, this is really problematic. It's the backslash that is killing you softly here.

I can only think of:

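# Python 3 port; the original used Python 2's print statement and str.decode
s = '\\xce\\xb8Oph\\r\\nMore test\\t\\xc5\\xa1'
n = ""
x = 0
while x != len(s):
    if s[x] == "\\":
        sx = s[x+1:x+4]
        marker = sx[0:1]
        if marker == "x":
            # \xNN: turn the two hex digits into one character
            n += chr(int(sx[1:], 16)); x += 4
        elif marker in ("'", '"', "\\", "n", "r", "v", "t", "0"):
            # Pull this dict out of a loop to speed things up
            n += {"'": "'", '"': '"', "\\": "\\", "n": "\n", "r": "\r", "t": "\t", "v": "\v", "0": "\0"}[marker]
            x += 2
        else: n += s[x]; x += 1
    else: n += s[x]; x += 1
print(repr(n), repr(s))
# Each character of n holds one byte value, so go through Latin-1 before decoding as UTF-8
print(repr(n.encode("latin-1").decode("UTF-8")))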

There might be some other trick to pull this off, but at the moment this is all I got.
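One such trick, offered only as a sketch and not as part of the answer above, is to let a regular expression find the \xNN escapes and swap each one for the byte it names. The function name unescape_hex is hypothetical. It handles \xNN escapes only: other escapes such as \r or \t are left untouched, and an escaped backslash followed by xNN would be misread.

```python
import re

# Matches a literal backslash, an 'x', and exactly two hex digits.
HEX_ESCAPE = re.compile(rb"\\x([0-9a-fA-F]{2})")


def unescape_hex(escaped):
    """Hypothetical helper: replace each \\xNN escape with its byte, then decode as UTF-8."""
    raw = escaped.encode("utf-8")
    # The replacement function builds a single byte from the two hex digits.
    raw = HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return raw.decode("utf-8")


print(ascii(unescape_hex("\\xce\\xb8Oph")))  # '\u03b8Oph'
```

Because the work is done on bytes, any real non-ASCII characters already present in the text survive unchanged next to the escaped ones.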

Solution 4:

To make a teeny improvement on GingerPlusPlus's answer:

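import tempfile

with tempfile.TemporaryFile(mode='rb+') as f:
    # Write the escaped text as raw bytes, then rewind to read it back
    f.write(r'\xce\xb8Oph'.encode())
    f.flush()
    f.seek(0)

    # unicode_escape yields Latin-1 code points; re-encode them to bytes and decode as UTF-8
    print(f.read().decode('unicode_escape').encode('latin1').decode())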

If you open the file in binary mode (i.e. rb, since you're reading; I added + because I was also writing to the file), you can skip the first encode call. It's still awkward, because you have to bounce through the decode/encode hop, but at least you avoid that first encode call.
