What's The Fastest Way To Strip And Replace A Document Of High Unicode Characters Using Python?
Solution 1:
# -*- encoding: utf-8 -*-
import unicodedata

def shoehorn_unicode_into_ascii(s):
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

if __name__ == '__main__':
    s = u"éèêàùçÇ"
    print(shoehorn_unicode_into_ascii(s))
    # eeeaucC
Note, as @Mark Tolonen kindly points out, the method above removes some characters like ß‘’“”. If the above code drops characters that you wish translated, then you may have to use the string's translate method to fix those problems manually. Another option is to use unidecode (see J.F. Sebastian's answer).
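For instance, here is a minimal sketch of that combination; the helper name to_ascii and the fix-up table are mine, and the table is illustrative rather than exhaustive. The fix-ups run before the NFKD pass, so characters with no decomposition survive the ascii encode:

# -*- coding: utf-8 -*-
import unicodedata

# Illustrative fix-ups (not exhaustive): translate characters that NFKD
# cannot decompose before the 'ignore' encode would drop them.
FIXUPS = {0xdf: u"ss",                  # ß LATIN SMALL LETTER SHARP S
          0x2018: u"'", 0x2019: u"'",   # ‘ ’ single quotes
          0x201c: u'"', 0x201d: u'"'}   # “ ” double quotes

def to_ascii(s):
    s = s.translate(FIXUPS)
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

print(to_ascii(u"Gauß said “héllo”"))
# -> Gauss said "hello"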
When you have a large unicode string, using its translate method will be much, much faster than using the replace method.
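To see why: translate makes a single pass over the string with one table lookup per character, while each chained replace call rescans the whole string. A minimal sketch of the two approaches (the three-character mapping is arbitrary):

# -*- coding: utf-8 -*-
s = u"éèà une chaîne assez longue" * 100000

# translate: one pass over the string, one table lookup per character
mapping = {0xe9: u"e", 0xe8: u"e", 0xe0: u"a"}  # é è à
fast = s.translate(mapping)

# replace: the whole string is rescanned once per substitution
slow = s.replace(u"é", u"e").replace(u"è", u"e").replace(u"à", u"a")

assert fast == slow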
Edit: unidecode has a more complete mapping of unicode codepoints to ascii. However, unidecode.unidecode loops through the string character-by-character (in a Python loop), which is slower than using the translate method.

The following helper function uses unidecode's data files and the translate method to attain better speed, especially for long strings. In my tests on 1-6 MB text files, using ascii_map is about 4-6 times faster than unidecode.unidecode.
# -*- coding: utf-8 -*-
import unidecode

def ascii_map():
    """Build a {unicode ordinal: ascii string} translation table from
    unidecode's per-block data files (one module per high byte)."""
    data = {}
    for num in range(256):
        h = num
        filename = 'x{num:02x}'.format(num=num)
        try:
            mod = __import__('unidecode.' + filename,
                             fromlist=True)
        except ImportError:
            pass
        else:
            for l, val in enumerate(mod.data):
                i = h << 8      # high byte of the codepoint
                i += l          # low byte of the codepoint
                if i >= 0x80:   # leave plain ascii untouched
                    data[i] = unicode(val)
    return data

if __name__ == '__main__':
    s = u"éèêàùçÇ"
    print(s.translate(ascii_map()))
    # eeeaucC
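If you want to check the speed claim on your own data, a rough timing sketch along these lines should do; the sample text and repeat factor are arbitrary, and it assumes the ascii_map() helper above is in scope:

# -*- coding: utf-8 -*-
import timeit
import unidecode

text = u"éèêàùçÇ Gauß “fancy” " * 100000   # a few MB of mixed text

table = ascii_map()   # build the table once, outside the timed call
print timeit.timeit(lambda: text.translate(table), number=1)
print timeit.timeit(lambda: unidecode.unidecode(text), number=1)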
Edit2: Rhubarb, if # -*- encoding: utf-8 -*- is causing a SyntaxError, try # -*- encoding: cp1252 -*- instead. What encoding to declare depends on the encoding your text editor uses to save the file. Linux tends to use utf-8, and (it seems perhaps) Windows tends to use cp1252.
Solution 2:
There is no such thing as a "high ascii character". The ASCII character set is limited to ordinals in range(128).
That aside, this is a FAQ. Here's one answer. In general, you should familiarise yourself with str.translate() and unicode.translate() -- very handy for multiple substitutions of single bytes/characters. Beware of answers that mention only the unicodedata.normalize() gimmick; that's just one part of the solution.
Update: The currently-accepted answer blows away characters that don't have a decomposition, as pointed out by Mark Tolonen. There seems to be a lack of knowledge of what unicode.translate() is capable of. It CAN translate one character into multiple characters. Here is the output from help(unicode.translate):
S.translate(table) -> unicode

    Return a copy of the string S, where all characters have been mapped
    through the given translation table, which must be a mapping of
    Unicode ordinals to Unicode ordinals, Unicode strings or None.
    Unmapped characters are left untouched. Characters mapped to None
    are deleted.
Here's an example:
>>> u"Gau\xdf".translate({0xdf: u"ss"})
u'Gauss'
>>>
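The deletion case mentioned in the help text works the same way: mapping an ordinal to None removes the character entirely. A one-liner to illustrate (the sample string is mine):

>>> u"na\xefve\u2026".translate({0xef: u"i", 0x2026: None})
u'naive'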
Here's a table of fix-ups from the solution that I pointed to:
CHAR_REPLACEMENT = {
    # latin-1 characters that don't have a unicode decomposition
    0xc6: u"AE",  # LATIN CAPITAL LETTER AE
    0xd0: u"D",   # LATIN CAPITAL LETTER ETH
    0xd8: u"OE",  # LATIN CAPITAL LETTER O WITH STROKE
    0xde: u"Th",  # LATIN CAPITAL LETTER THORN
    0xdf: u"ss",  # LATIN SMALL LETTER SHARP S
    0xe6: u"ae",  # LATIN SMALL LETTER AE
    0xf0: u"d",   # LATIN SMALL LETTER ETH
    0xf8: u"oe",  # LATIN SMALL LETTER O WITH STROKE
    0xfe: u"th",  # LATIN SMALL LETTER THORN
}
This can be easily extended to cater for the fancy quotes and other non-latin-1 characters found in cp1252 and siblings.
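For example, here is a sketch of such an extension; the keys are the unicode ordinals that cp1252's "smart" punctuation bytes decode to, and the selection is illustrative rather than complete:

# -*- coding: utf-8 -*-
CP1252_EXTRAS = {
    0x2018: u"'",    # LEFT SINGLE QUOTATION MARK
    0x2019: u"'",    # RIGHT SINGLE QUOTATION MARK
    0x201c: u'"',    # LEFT DOUBLE QUOTATION MARK
    0x201d: u'"',    # RIGHT DOUBLE QUOTATION MARK
    0x2013: u"-",    # EN DASH
    0x2014: u"--",   # EM DASH
    0x2026: u"...",  # HORIZONTAL ELLIPSIS
}

tmap = dict(CHAR_REPLACEMENT)   # merge the two tables
tmap.update(CP1252_EXTRAS)
print u"\u2018quoted\u2019 \u2026".translate(tmap)
# -> 'quoted' ...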
Solution 3:
I believe that unicodedata doesn't work for fancy quotes. You could use Unidecode in this case:
import unidecode
print unidecode.unidecode(u"ß‘’“”")
# -> ss''""
Solution 4:
If unicodedata.normalize() as suggested by ~unutbu doesn't do the trick, for example if you want more control over the mapping, you should look into str.translate() along with str.maketrans(), a utility to produce a translation table; str.translate is both efficient and convenient for this type of translation.

In Python 2.x, for unicode strings one needs to use unicode.translate() rather than str.translate(), and a trick similar to the one shown in the code snippet below in lieu of maketrans(). (Thanks to John Machin for pointing this out!)

These methods are also available in Python 3.x; see for example the Python 3.1.2 documentation (for some reason I had made a mental note that this may have changed in Python 3.x). Of course under Python 3, all strings are unicode strings, but that's another issue.
# Python 3.1
>>> intab = 'àâçêèéïîôù'
>>> outtab = 'aaceeeiiou'
>>> tmap = str.maketrans(intab, outtab)
>>> s = "à la fête de l'été, où il fait bon danser, les Français font les drôles"
>>> s
"à la fête de l'été, où il fait bon danser, les Français font les drôles"
>>> s.translate(tmap)
"a la fete de l'ete, ou il fait bon danser, les Francais font les droles"
>>>

# Python 2.6
>>> intab = u'àâçêèéïîôù'
>>> outtab = u'aaceeeiiou'
>>> s = u"à la fête de l'été, où il fait bon danser, les Français font les drôles"
>>> # note the trick to replace maketrans() since for unicode strings the
>>> # translation map expects integers (unicode ordinals) not characters.
>>> tmap = dict(zip(map(ord, intab), map(ord, outtab)))
>>> s.translate(tmap)
u"a la fete de l'ete, ou il fait bon danser, les Francais font les droles"
>>>
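Incidentally, in Python 3 str.maketrans also accepts a single dict, which gives you the one-to-many mappings that the 2.x trick above provides; a minimal illustration (the sample string is mine):

# Python 3: a dict passed to maketrans may map to multi-character strings
>>> tmap = str.maketrans({'ß': 'ss', '\u2019': "'"})
>>> "Gauß\u2019s".translate(tmap)
"Gauss's"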
Solution 5:
Here's a solution that handles latin-1 characters (based on a 2003 usenet thread):
>>> accentstable = str.join("", map(chr, range(192))) + "AAAAAAACEEEEIIIIDNOOOOOxOUUUUYTsaaaaaaaceeeeiiiidnooooo/ouuuuyty"
>>> import string
>>> s = u"éèêàùçÇ"
>>> print string.translate(s.encode('latin1', 'ignore'), accentstable)
eeeaucC
Some of the mappings aren't perfect, e.g. Thorn maps to T rather than Th, but it does a tolerable job.
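If those imperfect one-byte mappings matter, one option is a multi-character pre-pass at the unicode level before encoding to latin-1, reusing the unicode.translate idea from Solution 2 (a sketch; the PRE table name and sample string are mine):

# -*- coding: utf-8 -*-
import string

# pre-pass for the one-to-many cases the byte table can't express
PRE = {0xde: u"Th", 0xfe: u"th", 0xdf: u"ss"}   # Þ þ ß

accentstable = str.join("", map(chr, range(192))) + "AAAAAAACEEEEIIIIDNOOOOOxOUUUUYTsaaaaaaaceeeeiiiidnooooo/ouuuuyty"

s = u"Þórr était là"
print string.translate(s.translate(PRE).encode('latin1', 'ignore'),
                       accentstable)
# -> Thorr etait la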