Python 2.7: Strange Unicode Behavior
Solution 1:
A Python 2 is violating the Unicode standard here, by permitting you to encode codepoints in the range U+D800 to U+DFFF, at least in a UCS4 build. From Wikipedia:
The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.
The official UTF-8 standard has no encoding for UTF-16 surrogate pair codepoints, so Python 3 raises an exception when you try:
>>> '\ud83c\udc4f'.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
But Python 2's Unicode support is a bit more rudimentary, and the behaviour you observe varies with the specific UCS2 / UCS4 build variant; on a UCS2 build, your variables are equal:
>>> import sys
>>> sys.maxunicode
65535
>>> a1 = u'\U0001f04f'
>>> a2 = u'\ud83c\udc4f'
>>> a1 == a2
True
because in such a build all non-BMP codepoints are encoded as UTF-16 surrogate pairs (extending on the UCS2 standard).
So on a UCS2 build there is no difference between your two values, and the choice to encode to the full non-BMP codepoint is entirely valid when you assume you would want to encode U+1F04F and other such codepoints. The UCS4 build just matches that behaviour.
Post a Comment for "Python 2.7: Strange Unicode Behavior"