
Nltk Regexp Tokenizer Not Playing Nice With Decimal Point In Regex

I'm trying to write a text normalizer, and one of the basic cases it needs to handle is turning something like 3.14 into "three point one four" or "three point fourteen". I'm curre

Solution 1:

The culprit is:

\w+([-']\w+)*

\w+ also matches digits, and . is not a word character, so this part matches only the 3 in 3.14 and then picks up 14 as a separate token. Alternation takes the first alternative that matches at a given position, so move \$?\d+(\.\d+)?%? ahead of the part above to have the number format attempted first:

(?x)([A-Z]\.)+|\$?\d+(\.\d+)?%?|\w+([-']\w+)*|[+/\-@&*]


Or in expanded form:

pattern = r'''(?x)               # set flag to allow verbose regexps
              ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
              | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages (tried before words)
              | \w+([-']\w+)*    # words with optional internal hyphens/apostrophes
              | [+/\-@&*]        # special characters kept as separate tokens
            '''
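To see the effect of the ordering, here is a minimal sketch that uses only Python's re module rather than NLTK itself. The names WORDS_FIRST, NUMBERS_FIRST and text are made up for the illustration. The groups are written as non-capturing (?:...) because re.findall returns the group contents instead of the whole match when a pattern contains capturing groups. I believe recent NLTK versions apply the tokenizer pattern in a findall-like way, so the same change may be needed there, but treat that as an assumption to check against your installed version.

```python
import re

# Words alternative placed before the number alternative (the problematic order).
WORDS_FIRST = re.compile(r"""(?x)
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \w+(?:[-']\w+)*         # words; \w+ also matches digits
    | \$?\d+(?:\.\d+)?%?      # numbers, incl. currency and percentages
    | [+/\-@&*]               # special characters
""")

# Number alternative moved ahead of the words alternative (the suggested order).
NUMBERS_FIRST = re.compile(r"""(?x)
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?      # numbers are attempted before words
    | \w+(?:[-']\w+)*         # words
    | [+/\-@&*]               # special characters
""")

text = "pi is 3.14 and it costs $12.40"

print(WORDS_FIRST.findall(text))
# ['pi', 'is', '3', '14', 'and', 'it', 'costs', '$12.40']

print(NUMBERS_FIRST.findall(text))
# ['pi', 'is', '3.14', 'and', 'it', 'costs', '$12.40']
```

Note that $12.40 survives in both orderings because $ is not a word character, so the words alternative cannot start there. Only bare decimals such as 3.14 get split.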

Solution 2:

Try this regex:

\b\$?\d+(\.\d+)?%?\b

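This wraps the original number regex in word boundaries (\b). One caveat: \b only matches between a word character and a non-word character, and $ and % are both non-word characters, so with the boundaries in place a leading $ or trailing % is usually left out of the match.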
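A quick way to check how \b interacts with the optional \$ and %? is to compare the bounded regex with the same regex without boundaries. This is only a sketch with Python's re module. The names bounded, unbounded and sample are made up for the illustration, and the inner group is made non-capturing so that re.findall returns whole matches.

```python
import re

# Number pattern wrapped in word boundaries, as in this solution.
bounded = re.compile(r"\b\$?\d+(?:\.\d+)?%?\b")

# The same number pattern without the boundaries, for comparison.
unbounded = re.compile(r"\$?\d+(?:\.\d+)?%?")

sample = "pi is 3.14, up 82% to $12.40"

print(bounded.findall(sample))
# ['3.14', '82', '12.40']

print(unbounded.findall(sample))
# ['3.14', '82%', '$12.40']
```

If the currency and percent signs should stay attached to the number, the unbounded form (placed before the words alternative) looks like the safer choice.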
