
How Do I Compare Characters With Combining Diacritic Marks ɔ̃, ɛ̃ And ɑ̃ To Unaccented Ones In Python (imported From A Utf-8 Encoded Text File)?

Summary: I want to compare ɔ̃, ɛ̃ and ɑ̃ to ɔ, ɛ and a, which are all different, but my text file has ɔ̃, ɛ̃ and ɑ̃ written as ɔ~, ɛ~ and a~. I wrote a script whi

Solution 1:

Unicode normalization does not help for these particular character combinations. Filtering the Unicode database file UnicodeData.txt with the simple regex "Latin.*Letter.*with tilde$" yields only ÃÑÕãñõĨĩŨũṼṽẼẽỸỹ, so there are no precomposed forms of the Latin letters Open O, Open E or Alpha with a tilde. You therefore need to iterate through both compared strings separately, treating a base letter plus a following combining mark as one unit, as follows (most of your code is omitted; this is reduced to a Minimal, Reproducible Example):

import unicodedata

def lens(word):
    return len(word)

input_lines = ['alyʁ/alɔʁ', 'ɑ̃bisjø/ɑ̃bisjɔ̃ ', 'osi/ɛ̃si', 'bɛ̃ /bɔ̃ ', 'bo/ba', 'bjɛ/bjɛ̃ ']
print(len(input_lines))
for line in input_lines:
    print('')
    # find the IPA transcriptions of the two words
    line = unicodedata.normalize('NFKC', line.rstrip('\n'))
    line = line.split("/")
    line.sort(key = lens)
    word1, word2 = line[0:2] # the shortest two strings after splitting are the ipa words
    index = i1 = i2 = 0
    while i1 < len(word1) and i2 < len(word2):
        letter1 = word1[i1]
        i1 += 1
        if i1 < len(word1) and unicodedata.category(word1[i1]) == 'Mn':
            letter1 += word1[i1]
            i1 += 1
        letter2 = word2[i2]
        i2 += 1
        if i2 < len(word2) and unicodedata.category(word2[i2]) == 'Mn':
            letter2 += word2[i2]
            i2 += 1
        # a no-break space marks a match, '#' marks a mismatch
        same = chr(0xA0) if letter1 == letter2 else '#'
        print(index, same, word1, word2, letter1, letter2)
        index += 1
        #if same != chr(0xA0):
        #    break

Output (from running .\SO\67335977.py):

6

0   alyʁ alɔʁ a a
1   alyʁ alɔʁ l l
2 # alyʁ alɔʁ y ɔ
3   alyʁ alɔʁ ʁ ʁ

0   ɑ̃bisjø ɑ̃bisjɔ̃  ɑ̃ ɑ̃
1   ɑ̃bisjø ɑ̃bisjɔ̃  b b
2   ɑ̃bisjø ɑ̃bisjɔ̃  i i
3   ɑ̃bisjø ɑ̃bisjɔ̃  s s
4   ɑ̃bisjø ɑ̃bisjɔ̃  j j
5 # ɑ̃bisjø ɑ̃bisjɔ̃  ø ɔ̃

0 # osi ɛ̃si o ɛ̃
1   osi ɛ̃si s s
2   osi ɛ̃si i i

0   bɛ̃  bɔ̃  b b
1 # bɛ̃  bɔ̃  ɛ̃ ɔ̃
2   bɛ̃  bɔ̃

0   bo ba b b
1 # bo ba o a

0   bjɛ bjɛ̃  b b
1   bjɛ bjɛ̃  j j
2 # bjɛ bjɛ̃  ɛ ɛ̃
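
The claim that normalization cannot merge these letters with their tilde can be checked directly. The following is a minimal sketch, not part of the original answer; the helper name `composes` is made up for illustration. It asks whether NFC turns a base letter plus U+0303 COMBINING TILDE into a single code point.

```python
import unicodedata

TILDE = "\u0303"  # COMBINING TILDE

def composes(base):
    """Return True if NFC merges base + combining tilde into one code point."""
    return len(unicodedata.normalize("NFC", base + TILDE)) == 1

# a, o, then Latin small letters Open O, Open E and Alpha
for base in ["a", "o", "\u0254", "\u025b", "\u0251"]:
    print(repr(base), composes(base))
```

For `a` and `o` a precomposed letter exists (ã, õ), so the result is a single code point; for ɔ, ɛ and ɑ the string stays two code points long, which is why the loop above has to glue the mark onto its base letter by hand.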

Note: a diacritic is detected here by testing for Unicode general category Mn; you can test against another condition instead (e.g. one from the following list):

  • Mn Nonspacing_Mark: a nonspacing combining mark (zero advance width)
  • Mc Spacing_Mark: a spacing combining mark (positive advance width)
  • Me Enclosing_Mark: an enclosing combining mark
  • M Mark: Mn | Mc | Me
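
As a sketch of testing against the broader M (Mark) condition from the list above, the grouping step can be pulled out into a helper that attaches any number of following marks to their base character. The function names `clusters` and `first_difference` are hypothetical and not from the original answer; the code assumes the input is already normalized the same way on both sides.

```python
import unicodedata

def clusters(word):
    """Split word into base characters, each with any following marks attached."""
    out = []
    for ch in word:
        # Categories Mn, Mc and Me all start with 'M'
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch  # attach the mark to the preceding base character
        else:
            out.append(ch)
    return out

def first_difference(word1, word2):
    """Return the index of the first differing cluster, or None if the
    common-length prefix of both words is identical."""
    for i, (c1, c2) in enumerate(zip(clusters(word1), clusters(word2))):
        if c1 != c2:
            return i
    return None

print(clusters("b\u025b\u0303"))                      # 'b' plus Open E with tilde
print(first_difference("bj\u025b", "bj\u025b\u0303"))  # plain vs. nasalized Open E
```

Unlike the loop in the answer, this version accepts more than one mark per base character, which matters if a transcription stacks several diacritics on one letter.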

Solution 2:

I am in the process of solving this by doing a find-and-replace on these characters before processing the text, and a reverse find-and-replace when I'm done.
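
One way this find-and-replace idea could look is sketched below. The placeholder code points (taken from the Unicode Private Use Area) and the names `PLACEHOLDERS`, `encode` and `decode` are assumptions made for illustration, not the poster's actual code. Each two-code-point nasal vowel is swapped for a single placeholder so that ordinary per-character comparison works, then swapped back afterwards.

```python
# Hypothetical mapping: base letter + combining tilde -> one private-use code point
PLACEHOLDERS = {
    "\u0254\u0303": "\ue000",  # Open O with tilde
    "\u025b\u0303": "\ue001",  # Open E with tilde
    "\u0251\u0303": "\ue002",  # Alpha with tilde
}

def encode(text):
    """Replace each nasal vowel sequence with its single-character placeholder."""
    for seq, placeholder in PLACEHOLDERS.items():
        text = text.replace(seq, placeholder)
    return text

def decode(text):
    """Reverse encode(): restore the original two-code-point sequences."""
    for seq, placeholder in PLACEHOLDERS.items():
        text = text.replace(placeholder, seq)
    return text

word1 = encode("b\u025b\u0303")
word2 = encode("b\u0254\u0303")
# Positions at which the two encoded words differ
diffs = [i for i, (c1, c2) in enumerate(zip(word1, word2)) if c1 != c2]
print(diffs)
print(decode(word1) == "b\u025b\u0303")
```

This only works if the placeholder characters never occur in the input, and every new base-plus-mark combination has to be added to the mapping by hand.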

