Python 2.7: Strange Unicode Behavior
Solution 1:
Python 2 is violating the Unicode standard here by permitting you to encode codepoints in the range U+D800 to U+DFFF, at least in a UCS4 build. From Wikipedia:
The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.
The official UTF-8 standard has no encoding for UTF-16 surrogate pair codepoints, so Python 3 raises an exception when you try:
>>> '\ud83c\udc4f'.encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
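If you genuinely need to round-trip surrogates in Python 3, the standard `surrogatepass` error handler will encode them anyway; this is a CPython feature, not something the UTF-8 standard permits. A minimal sketch:

```python
# The default (strict) handler refuses surrogates, as shown above.
try:
    '\ud83c\udc4f'.encode('utf-8')
except UnicodeEncodeError as exc:
    print(exc.reason)  # surrogates not allowed

# 'surrogatepass' encodes each surrogate as its own 3-byte sequence
# and decodes those bytes back to the same two surrogates.
raw = '\ud83c\udc4f'.encode('utf-8', 'surrogatepass')
print(raw)  # b'\xed\xa0\xbc\xed\xb1\x8f'
assert raw.decode('utf-8', 'surrogatepass') == '\ud83c\udc4f'
```

Note the round trip gives back the surrogate pair, not U+1F04F: `surrogatepass` deliberately preserves the malformed data rather than repairing it.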
But Python 2's Unicode support is a bit more rudimentary, and the behaviour you observe varies with the specific UCS2 / UCS4 build variant; on a UCS2 build, your variables are equal:
>>> import sys
>>> sys.maxunicode
65535
>>> a1 = u'\U0001f04f'
>>> a2 = u'\ud83c\udc4f'
>>> a1 == a2
True
because in such a build all non-BMP codepoints are encoded as UTF-16 surrogate pairs (extending the UCS2 encoding).
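The surrogate-pair arithmetic a UCS2 build applies can be sketched in a few lines (runnable on Python 3 purely for illustration):

```python
# UTF-16 surrogate pair -> codepoint, per the Unicode standard:
#   cp = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
high, low = 0xD83C, 0xDC4F
cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(cp))  # 0x1f04f

# Round-tripping through UTF-16 shows the same conflation: the two
# surrogates, re-read as UTF-16 bytes, come back as the single
# codepoint U+1F04F.
pair = '\ud83c\udc4f'.encode('utf-16-le', 'surrogatepass')
assert pair.decode('utf-16-le') == '\U0001f04f'
```

This is exactly why `a1 == a2` holds on a narrow build: both spellings produce the same internal sequence of 16-bit code units.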
So on a UCS2 build there is no difference between your two values, and interpreting the surrogate pair as the full non-BMP codepoint is entirely valid, on the assumption that you meant to encode U+1F04F and other such codepoints. The UCS4 build simply matches that behaviour.