Tokenizing Non English Text In Python

I have a Persian text file that has some lines like this: ذوب 6 خوی 7 بزاق ،آب‌دهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف I want to generate

Solution 1:

Using regex package:

>>> import regex
>>> text = 'ذوب 6 خوی 7 بزاق ،آب‌دهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف'
>>> regex.findall(r'\p{L}+', text.replace('\u200c', ''))
['ذوب', 'خوی', 'بزاق', 'آبدهان', 'یم', 'زهاب', 'آبرو', 'حیثیت', 'شرف']
  • The text contains ZERO WIDTH NON-JOINER (U+200C); the code removes that character with str.replace before matching.
  • \p{L} or \p{Letter} matches any kind of letter from any language.

See Regex Tutorial - Unicode Characters and Properties.
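
For readers without the regex package, the idea behind \p{L} can be approximated with the standard library. The sketch below is not part of the original answer: it groups consecutive characters whose Unicode general category starts with "L" (letter), using unicodedata and itertools.groupby. The helper names is_letter and letter_runs are made up for illustration.

```python
import unicodedata
from itertools import groupby


def is_letter(ch):
    # The letter categories Lu, Ll, Lt, Lm and Lo all start with "L".
    return unicodedata.category(ch).startswith('L')


def letter_runs(text):
    # Keep only the runs of consecutive characters that are letters.
    return [''.join(group) for keep, group in groupby(text, key=is_letter) if keep]


text = 'ذوب 6 خوی 7 بزاق ،آب\u200cدهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف'

# As in the regex version, U+200C is removed first so the word stays in one piece.
tokens = letter_runs(text.replace('\u200c', ''))
print(tokens)
```

This is slower than a compiled pattern and is meant only to show what "letter" means in Unicode terms.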

UPDATE

To also include U+200C, use [\p{Cf}\p{L}]+ instead (\p{Cf} or \p{Format} matches invisible formatting characters):

>>> regex.findall(r'[\p{Cf}\p{L}]+', text)
['ذوب', 'خوی', 'بزاق', 'آب\u200cدهان', 'یم', 'زهاب', 'آبرو', 'حیثیت', 'شرف']

It looks different from what you want, but the two lists are equal:

>>> got = regex.findall(r'[\p{Cf}\p{L}]+', text)
>>> want = [ 'ذوب','خوی','بزاق','آب‌دهان','یم','زهاب','آبرو','حیثیت' ,'شرف']
>>> print(want)
['ذوب', 'خوی', 'بزاق', 'آب\u200cدهان', 'یم', 'زهاب', 'آبرو', 'حیثیت', 'شرف']
>>> got == want
True
>>> got[:3]
['ذوب', 'خوی', 'بزاق']
>>> got[4:]
['یم', 'زهاب', 'آبرو', 'حیثیت', 'شرف']
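
To check which Unicode category a character falls into, and so whether \p{L} or \p{Cf} would match it, the standard unicodedata module can be queried directly. A small illustrative check, not from the original answer:

```python
import unicodedata

zwnj = '\u200c'
name = unicodedata.name(zwnj)                     # official Unicode name
zwnj_category = unicodedata.category(zwnj)        # format character
letter_category = unicodedata.category('\u0622')  # ARABIC LETTER ALEF WITH MADDA ABOVE
comma_category = unicodedata.category('\u060c')   # ARABIC COMMA

print(name, zwnj_category, letter_category, comma_category)
# ZERO WIDTH NON-JOINER Cf Lo Po
```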

UPDATE2

Some words in the edited question contain a space.

>>> ' ' in 'منهدم کردن'
True

I added \s to the character class in the following code so that it also matches spaces, then stripped the leading and trailing spaces from the matched strings and filtered out the empty strings.

>>> text = 'منهدم کردن : 1 خراب کردن، ویران کردن، تخریب کردن 2 نابود کردن، از بین بردن'
>>> want = ['منهدم کردن','خراب کردن', 'ویران کردن', 'تخریب کردن','نابود کردن', 'از بین بردن']
>>> [x for x in map(str.strip, regex.findall(r'[\p{Cf}\p{L}\s]+', text)) if x] == want
True

Solution 2:

Use re.split to split on whitespace (\s), digits (\d) and the ، character.

# python 3
import re
INPUT = 'ذوب 6 خوی 7 بزاق ،آب‌دهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف'
EXPECTED = [ 'ذوب','خوی','بزاق','آب‌دهان','یم','زهاب','آبرو','حیثیت' ,'شرف'] 

OUTPUT = re.split(r'[\s\d،]+', INPUT)  # raw string so \s and \d reach the regex engine unchanged
assert OUTPUT == EXPECTED
print('\n'.join(OUTPUT))
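
One caveat with the split approach, shown here as a hedged aside with a made-up sample string: when the input starts or ends with a delimiter, re.split returns empty strings at the edges, so it is usually worth filtering them out.

```python
import re

# Hypothetical sample that starts with a digit and ends with a space.
sample = '1 ذوب 6 خوی '

raw_parts = re.split(r'[\s\d،]+', sample)

# Drop the empty strings produced by the leading and trailing delimiters.
tokens = [part for part in raw_parts if part]

print(raw_parts)
print(tokens)
```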

Note that the \u200c you see in the output list is a non-printing character that is actually contained in the original string. Python escapes it because it is showing the representation (repr) of the list and the strings it contains, not printing the strings for display. Here's the difference:

INPUT = 'ذوب 6 خوی 7 بزاق ،آب‌دهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف'
print(INPUT)
ذوب 6 خوی 7 بزاق ،آب‌دهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف

print(repr(INPUT)) # notice the \u200c below
'ذوب 6 خوی 7 بزاق ،آب\u200cدهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف'
print(['in', 'an', 'array', INPUT]) # the \u200c is also shown when printing a list
['in', 'an', 'array', 'ذوب 6 خوی 7 بزاق ،آب\u200cدهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف']

This is similar to how Python handles newline characters:

>>> 'new\nline'
'new\nline'
>>> print('new\nline')
new
line
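
The rule behind this can be checked with str.isprintable: repr escapes characters that Python considers non-printable, which includes both the newline and U+200C. A brief sketch added for illustration:

```python
word = 'آب\u200cدهان'  # the ZWNJ is written as an escape so it is visible in the source

zwnj_printable = '\u200c'.isprintable()    # False: format characters are non-printable
newline_printable = '\n'.isprintable()     # False
letter_printable = 'آ'.isprintable()       # True

print(repr(word))  # shows the escaped form \u200c
print(word)        # shows the word as it would be displayed
```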

Edit:

Here is the regex for your updated sample that uses falsetru's findall strategy, but uses the built-in re module:

OUTPUT = [s.strip() for s in re.findall(r'(?:[^\W\d_]|[\s])+', INPUT) if s.strip()]

The pattern (?:[^\W\d_]|[\s])+ is a little unusual: Python's re module has no equivalent to regex's "Letters" class \p{L}, so it uses the workaround proposed at https://stackoverflow.com/a/8923988/66349

[^\W\d_] - (not ((not alphanumeric) or digits or underscore))

So in summary, match one or more characters (+) that are either (|) Unicode letters, [^\W\d_], or whitespace, \s.
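
Putting the pieces together, a self-contained version of the re-based approach might look like the following. The sample string is the updated one quoted in Solution 1; the variable name PATTERN is introduced here for illustration only.

```python
import re

INPUT = 'منهدم کردن : 1 خراب کردن، ویران کردن، تخریب کردن 2 نابود کردن، از بین بردن'

# [^\W\d_] matches Unicode letters; \s lets one match span a multi-word entry.
PATTERN = re.compile(r'(?:[^\W\d_]|\s)+')

# Strip surrounding whitespace and drop matches that were whitespace only.
OUTPUT = [s.strip() for s in PATTERN.findall(INPUT) if s.strip()]

print('\n'.join(OUTPUT))
```

Note that [^\W\d_] does not match U+200C, so for text containing that character the class would need \u200c added if such words should stay in one piece.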

falsetru's method is probably more readable, but it requires the third-party regex library.
