Unicode Character Not In Range When Calling Locale.strxfrm

March 12, 2024

I am experiencing odd behavior when using the locale library with Unicode input. Below is a minimal working example:

>>> x = '\U0010fefd'
>>> ord(x)
1113853
[...]

Solution 1:

In Python 3.x, the function locale.strxfrm(s) internally uses the POSIX C function wcsxfrm(), whose result depends on the current LC_COLLATE setting. The POSIX standard defines the transformation this way:

The transformation shall be such that if wcscmp() is applied to two transformed wide strings, it shall return a value greater than, equal to, or less than 0, corresponding to the result of wcscoll() applied to the same two original wide-character strings.

This definition can be implemented in multiple ways, and it doesn't even require the resulting string to be readable or to consist of valid characters.
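The contract quoted above can be checked from Python itself. The following is a minimal sketch, not part of the original answer: the helper names sign and strxfrm_matches_strcoll are made up for illustration, and the "C" locale is chosen only because it should be available on every system. It assumes that comparing Python strings by code point behaves like wcscmp() on the transformed wide strings.

```python
import locale


def sign(n):
    """Reduce a comparison result to -1, 0 or 1."""
    return (n > 0) - (n < 0)


def strxfrm_matches_strcoll(words):
    """Return True if comparing strxfrm() keys agrees with strcoll() for every pair.

    Hypothetical helper: it only illustrates the POSIX contract, it is not
    how locale.strxfrm() is implemented.
    """
    keys = {w: locale.strxfrm(w) for w in words}
    for a in words:
        for b in words:
            by_key = (keys[a] > keys[b]) - (keys[a] < keys[b])
            if by_key != sign(locale.strcoll(a, b)):
                return False
    return True


# Assumption: the "C" locale is used only because it exists everywhere.
locale.setlocale(locale.LC_COLLATE, "C")
print(strxfrm_matches_strcoll(["apple", "Banana", "cherry", "apple pie"]))
```

Only the relative order of the keys matters here; the key strings themselves are never meant to be displayed.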

I've created a little C code example to demonstrate how it works:

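#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
  wchar_t buf[10];
  /* Assumes a 32-bit wchar_t, as on Linux and OS X */
  const wchar_t *in = L"\x10fefd";
  size_t i, n;

  if (!setlocale(LC_COLLATE, "en_US.UTF-8"))
    fprintf(stderr, "warning: locale en_US.UTF-8 is not available\n");

  /* Print the code points of the input string */
  printf("in :");
  for (i = 0; i < 10 && in[i]; i++)
    printf(" 0x%lx", (unsigned long)in[i]);
  printf("\n");

  /* n is the length of the transformed string, excluding the terminator */
  n = wcsxfrm(buf, in, 10);
  if (n >= 10) {
    fprintf(stderr, "buffer too small\n");
    return 1;
  }

  /* Print the wide characters produced by the transformation */
  printf("out:");
  for (i = 0; i < n; i++)
    printf(" 0x%lx", (unsigned long)buf[i]);
  printf("\n");
  return 0;
}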

It prints the string before and after the transformation.

Running it on Linux (Debian Jessie), this is the result:

in : 0x10fefd
out: 0x1 0x1 0x1 0x1 0x552

while running it on OSX (10.11.1) the result is:

in : 0x10fefd
out: 0x103 0x1 0x110000

You can see that the output of wcsxfrm() on OSX contains the value 0x110000, which lies just beyond the highest Unicode code point (U+10FFFF) and is therefore not permitted in a Python string. This is the source of the error.
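The limit is easy to confirm in Python 3. In this small sketch the helper names is_valid_code_point and try_chr are hypothetical, introduced only for illustration:

```python
import sys


def is_valid_code_point(value):
    """Return True if value can be stored in a Python str."""
    return 0 <= value <= sys.maxunicode


def try_chr(value):
    """Return chr(value), or a short message if Python rejects the value."""
    try:
        return chr(value)
    except ValueError as exc:
        return "rejected: %s" % exc


print(hex(sys.maxunicode))            # highest code point a str can hold
print(is_valid_code_point(0x10FEFD))  # the input character from the question
print(is_valid_code_point(0x110000))  # the value produced by wcsxfrm() on OSX
print(try_chr(0x110000))
```

The input character is valid; the value that comes back from wcsxfrm() on OSX is exactly one past the allowed range.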

On Python 2.7 the error is not raised because its locale.strxfrm() implementation is based on the strxfrm() C function, which works on byte strings rather than wide characters.

UPDATE:

Investigating further, I see that the LC_COLLATE definition for en_US.UTF-8 on OSX is a symbolic link to the la_LN.US-ASCII definition.

$ ls -l /usr/share/locale/en_US.UTF-8/LC_COLLATE
lrwxr-xr-x 1 root wheel 28 Oct  1 14:24 /usr/share/locale/en_US.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE

I found the actual definition in Apple's sources. The content of the file la_LN.US-ASCII.src is the following:

order \
    \x00;...;\xff

2nd UPDATE:

I've tested the wcsxfrm() function on OSX further. Using the la_LN.US-ASCII collation, given a sequence of wide characters C1..Cn as input, the output is a string of this form:

W1..Wn \x01 U1..Un

where

Wx = 0x103 if Cx > 0xFF else Cx+0x3
Ux = Cx+0x103 if Cx > 0xFF else Cx+0x3

Using this algorithm, \x10fefd becomes 0x103 0x1 0x110000
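The rule described above can be modelled in a few lines of Python. This is only a sketch of the observed behavior, not Apple's implementation, and the function name model_osx_wcsxfrm is made up for illustration:

```python
def model_osx_wcsxfrm(text):
    """Model the observed OSX wcsxfrm() output for the la_LN.US-ASCII collation.

    Returns a list of integers rather than a str, because the result may
    contain values above U+10FFFF that a Python str cannot hold.
    """
    code_points = [ord(ch) for ch in text]
    # W1..Wn: every character above 0xFF collapses to the same weight 0x103
    primary = [0x103 if c > 0xFF else c + 0x3 for c in code_points]
    # U1..Un: characters above 0xFF are shifted by 0x103, the rest by 0x3
    secondary = [c + 0x103 if c > 0xFF else c + 0x3 for c in code_points]
    return primary + [0x1] + secondary


print([hex(v) for v in model_osx_wcsxfrm("\U0010fefd")])
```

Any input character from U+10FEFD upwards ends up above U+10FFFF after adding 0x103, which is why only characters at the very top of the Unicode range trigger the error.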

I've checked, and every UTF-8 locale uses this collation on OSX, so I'm inclined to say that collation support for UTF-8 on Apple systems is broken. The resulting ordering is almost the same as the one obtained with a plain byte comparison, with the bonus of being able to produce illegal Unicode characters.
