Converting To Utf-8 (again)

I have this string: Traor\u0102\u0160. Traor\u0102\u0160 should produce Traoré. Then Traoré utf-8 decoded should produce Traorè. How can I convert it to Traorè? What kind of ch

Solution 1:

You need to tell requests what encoding to expect:

>>> import requests
>>> r = requests.get(url)
>>> r.encoding = 'UTF-8'
>>> r.json[u'Item'][u'LastName']
u'Traor\xe9'

Otherwise, you'll get this:

>>> r = requests.get(url)
>>> r.json['Item']['LastName']
u'Traor\u0102\u0160'

Solution 2:

You have run into a bug in requests; when the server does not set an explicit encoding, requests uses chardet to make an educated guess about the encoding.

In this particular case, it gets that wrong; chardet thinks it's ISO-8859-2 instead of UTF-8. The issue has been reported to the maintainers of requests as issue 765.
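To see where \u0102\u0160 comes from, here is a minimal sketch of the mis-decoding and how to undo it by hand. It uses Python 3 syntax (unlike the Python 2 sessions in these answers), and the variable names are made up for illustration. The repair step only works because the ISO-8859-2 decode happened to be lossless for these bytes; it is not a general fix.

```python
# The garbled value that r.json returned in the question.
garbled = u"Traor\u0102\u0160"

# UTF-8 encodes e-acute (U+00E9) as the two bytes 0xC3 0xA9.
utf8_bytes = u"Traor\u00e9".encode("utf-8")

# Decoding those bytes as ISO-8859-2 maps 0xC3 to U+0102 and 0xA9 to U+0160,
# which reproduces the garbled string.
misdecoded = utf8_bytes.decode("iso-8859-2")

# Reversing the wrong decode recovers the original bytes, which can then be
# decoded with the right codec.
repaired = garbled.encode("iso-8859-2").decode("utf-8")

print(misdecoded == garbled)
print(repaired)
```

If you already hold the garbled string and cannot re-request the data, the last step is the conversion the question asks for.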

The maintainers closed that issue, blaming the problem on the server not setting a character encoding for the response. The work-around is to set r.encoding = 'utf-8' before accessing r.json so that the contents are correctly decoded without guessing.

However, as J.F. Sebastian correctly points out, if the response really is JSON, then the encoding has to be one of the UTF family of encodings. The JSON RFC even includes a section on how to detect what encoding was used.

I've submitted a pull request to the requests project that does just that; if you ask for the JSON decoded response, and no encoding has been set, it'll detect the correct UTF encoding used instead of guessing.

With this patch in place, the URL loads without setting the encoding explicitly:

>>> import requests
>>> r = requests.get('http://cdn.content.easports.com/fifa/fltOnlineAssets/2013/fut/items/web/199074.json')
>>> r.json[u'Item'][u'LastName']
u'Traor\xe9'
>>> print r.json[u'Item'][u'LastName']
Traoré

Solution 3:

For me, your site returns "Traor\u00e9" (the last character is é):

import json
import requests

r = requests.get(url)
# Parse the raw bytes directly instead of relying on r.text
print(json.dumps(json.loads(r.content)['Item']['LastName']))
# -> "Traor\u00e9" -> Traoré

r.json (r.text) produces incorrect content here. Either the server, requests, or both use an incorrect encoding, which results in "Traor\u0102\u0160". The encoding of a JSON text is completely defined by its content, so it is always possible to decode it whatever headers the server sends. From the JSON RFC:

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.

       00 00 00 xx  UTF-32BE
       00 xx 00 xx  UTF-16BE
       xx 00 00 00  UTF-32LE
       xx 00 xx 00  UTF-16LE
       xx xx xx xx  UTF-8

In this case there are no zero bytes at the start of r.content, so the text is UTF-8 and json.loads works on it directly. Otherwise you need to decode it to a Unicode string manually, either because the server sends an incorrect character encoding in the Content-Type header or to work around the requests bug.
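The null-byte table quoted above can be turned into a small detection helper. This is only a sketch in Python 3, not the code from the pull request mentioned in Solution 2. The names detect_json_encoding and loads_json_bytes are hypothetical, and the sketch assumes there is no byte order mark and that the text starts with two ASCII characters, as the quoted RFC passage requires.

```python
import json

def detect_json_encoding(data):
    """Guess the UTF encoding of JSON bytes from the nulls in the first four octets."""
    head = data[:4]
    if len(head) < 4:
        # Too short to apply the table; fall back to the default encoding.
        return "utf-8"
    # In Python 3, iterating over bytes yields integers.
    nulls = tuple(octet == 0 for octet in head)
    if nulls == (True, True, True, False):
        return "utf-32-be"   # 00 00 00 xx
    if nulls == (True, False, True, False):
        return "utf-16-be"   # 00 xx 00 xx
    if nulls == (False, True, True, True):
        return "utf-32-le"   # xx 00 00 00
    if nulls == (False, True, False, True):
        return "utf-16-le"   # xx 00 xx 00
    return "utf-8"           # xx xx xx xx

def loads_json_bytes(data):
    """Decode raw JSON bytes with the detected encoding, then parse them."""
    return json.loads(data.decode(detect_json_encoding(data)))

sample = u'{"Item": {"LastName": "Traor\u00e9"}}'.encode("utf-16-le")
print(detect_json_encoding(sample))
print(loads_json_bytes(sample)["Item"]["LastName"])
```

With requests, you would pass r.content (the undecoded bytes) to such a helper rather than r.text, so that the guessed response encoding never comes into play.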
