NLTK Regexp Tokenizer Not Playing Nice With Decimal Point In Regex
I'm trying to write a text normalizer, and one of the basic cases that needs to be handled is turning something like 3.14 to three point one four or three point fourteen. I'm curre
Solution 1:
The culprit is \w+([-']\w+)*. The \w+ part will match digits, and since there's no . in it, it will match only the 3 in 3.14. Move the options around a bit so that \$?\d+(\.\d+)?%? comes before the above regex part (so that the match is attempted first on the number format):
(?x)([A-Z]\.)+|\$?\d+(\.\d+)?%?|\w+([-']\w+)*|[+/\-@&*]
Or in expanded form:
pattern = r'''(?x) # set flag to allow verbose regexps
([A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
| \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
| [+/\-@&*] # special characters with meanings
'''
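To see the effect of the reordering without NLTK installed, here is a minimal sketch using only the standard re module. Two things in it are assumptions rather than part of the answer: the WORD_FIRST pattern is a guess at the original ordering (the question's own pattern is cut off above), and the groups are written as non-capturing (?:...) because re.findall returns group contents instead of whole matches when a pattern contains capturing groups. Recent NLTK versions document the same restriction for RegexpTokenizer patterns, so the non-capturing form is probably the safer one to pass to it as well. The sample sentence is made up.

```python
import re

# Assumed original order: the word branch comes before the number branch,
# so \w+ consumes the digits before the number branch is ever tried.
WORD_FIRST = r"""(?x)
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:[-']\w+)*         # words w/ optional internal hyphens/apostrophe
  | \$?\d+(?:\.\d+)?%?      # numbers, incl. currency and percentages
  | [+/\-@&*]               # special characters with meanings
"""

# Reordered as suggested in the answer: the number branch is tried first.
NUMBER_FIRST = r"""(?x)
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \$?\d+(?:\.\d+)?%?      # numbers, incl. currency and percentages
  | \w+(?:[-']\w+)*         # words w/ optional internal hyphens/apostrophe
  | [+/\-@&*]               # special characters with meanings
"""

text = "Pi is 3.14 and costs $2.50"  # made-up sample input

word_first_tokens = re.findall(WORD_FIRST, text)
number_first_tokens = re.findall(NUMBER_FIRST, text)

print(word_first_tokens)    # ['Pi', 'is', '3', '14', 'and', 'costs', '$2.50']
print(number_first_tokens)  # ['Pi', 'is', '3.14', 'and', 'costs', '$2.50']
```

With the word branch first, 3.14 is split into 3 and 14 and the dot is dropped, while $2.50 survives only because \w+ cannot start at a dollar sign. With the number branch first, both come out as single tokens.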
Solution 2:
Try this regex:
\b\$?\d+(\.\d+)?%?\b
It is the original number regex surrounded by word-boundary anchors (\b).
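One thing worth checking before relying on this version: in Python's re module, \b only matches between a word character and a non-word character, and neither $ nor % is a word character. The sketch below, again using only the standard library and made-up sample strings, shows what that means in practice. The helper name full_matches is hypothetical.

```python
import re

BOUNDED = r"\b\$?\d+(\.\d+)?%?\b"

def full_matches(pattern, text):
    """Return whole matches, even though the pattern has a capturing group."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# A plain decimal is matched as one piece.
decimal_tokens = full_matches(BOUNDED, "pi is 3.14")

# After a space, there is no word boundary before "$", and none after "%"
# when a space follows, so both symbols fall outside the match.
symbol_tokens = full_matches(BOUNDED, "pay $2.50 or 10% now")

print(decimal_tokens)  # ['3.14']
print(symbol_tokens)   # ['2.50', '10']
```

So the word boundaries keep 3.14 together, but a leading $ or trailing % is left out of the match unless it happens to sit next to a word character. If currency and percent signs should stay attached to the number, the reordering in Solution 1 is likely the better fit.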