Unable To Detect Gibberish Names Using Python

August 08, 2023

I am trying to build a Python model that can classify account names as either legitimate or gibberish. Capitalization is not important in this particular case as some legitimate ac

Solution 1:

For the first characteristic, you can train a character-based n-gram language model and treat every name with a low average per-character probability as suspicious.

A quick-and-dirty example of such a language model is below. It is a mixture of 1-gram, 2-gram and 3-gram character models, trained on the Brown corpus. I am sure you can find more relevant training data (e.g. a list of the names of all actors).

from collections import Counter

import numpy as np
from nltk.corpus import brown  # needs the corpus: nltk.download('brown')

# Join the corpus into one string; every sentence starts after '\n  ',
# which matches the two-space padding used in strangeness() below.
text = '\n  '.join(' '.join(s) for s in brown.sents())

# Character n-gram counts.
unigrams = Counter(text)
bigrams = Counter(text[i:(i+2)] for i in range(len(text)-1))
trigrams = Counter(text[i:(i+3)] for i in range(len(text)-2))

# Mixture weights for the 1-gram, 2-gram and 3-gram models.
weights = [0.001, 0.01, 0.989]

def strangeness(text):
    """Average negative log-probability per character; higher means stranger."""
    r = 0
    text = '  ' + text + '\n'
    for i in range(2, len(text)):
        char = text[i]
        context1 = text[(i-1):i]
        context2 = text[(i-2):i]
        num = unigrams[char] * weights[0] + bigrams[context1+char] * weights[1] + trigrams[context2+char] * weights[2]
        den = sum(unigrams.values()) * weights[0] + unigrams[context1] * weights[1] + bigrams[context2] * weights[2]
        r -= np.log(num / den)
    return r / (len(text) - 2)
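
Written out, the quantity that strangeness computes can be read as the formula below. The notation is introduced here only for illustration and is not part of the original answer: N(·) is an n-gram count in the training text, N is the total number of characters, w_1, w_2, w_3 are the three weights, and c_1 … c_n are the characters of the name followed by the trailing newline, with the two padding spaces acting as c_{-1} and c_0.

```latex
\hat{P}(c_i \mid c_{i-2} c_{i-1}) =
  \frac{w_1\, N(c_i) + w_2\, N(c_{i-1} c_i) + w_3\, N(c_{i-2} c_{i-1} c_i)}
       {w_1\, N + w_2\, N(c_{i-1}) + w_3\, N(c_{i-2} c_{i-1})},
\qquad
\mathrm{strangeness} = -\frac{1}{n} \sum_{i=1}^{n} \log \hat{P}(c_i \mid c_{i-2} c_{i-1})
```

One consequence worth keeping in mind: a character that never occurs in the training text makes the numerator zero, so the score for that name becomes infinite.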

Now you can apply this strangeness measure to your examples.

t1 = '128, 127, h4rugz4sx383a6n64hpo, tt, t66, t65, asdfds'.split(', ')
t2 = 'Michael, sara, jose colmenares, Dimitar, Jose Rafael, Morgan, Eduardo Medina, Luis R. Mendez, Hikaru, SELENIA, Zhang Ming, Xuting Liu, Chen Zheng'.split(', ')
for t in t1 + t2:
    print('{:20} -> {:9.5}'.format(t, strangeness(t)))

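As the output shows, the gibberish names are in most cases "stranger" than the normal ones, so you could use, for example, a threshold of 3.9 here. The one clear miss is SELENIA (6.18), most likely because all-caps words are rare in the training text.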

128                  ->5.5528
127                  ->5.6572
h4rugz4sx383a6n64hpo ->5.9016
tt                   ->4.9392
t66                  ->6.9673
t65                  ->6.8501
asdfds               ->3.9776
Michael              ->3.3598
sara                 ->3.8171
jose colmenares      ->2.9539
Dimitar              ->3.4602
Jose Rafael          ->3.4604
Morgan               ->3.3628
Eduardo Medina       ->3.2586
Luis R. Mendez       ->3.566
Hikaru               ->3.8936
SELENIA              ->6.1829
Zhang Ming           ->3.4809
Xuting Liu           ->3.7161
Chen Zheng           ->3.6212
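
To turn the score into a yes/no decision, something like the following sketch could work. The function name is_gibberish and its parameters are made up for illustration; score is assumed to be any callable with the same shape as strangeness, and the default threshold is just the 3.9 suggested above. Lowercasing the input is an assumption based on the question's remark that capitalization does not matter, and the threshold would probably need re-tuning once names are lowercased.

```python
from typing import Callable


def is_gibberish(name: str,
                 score: Callable[[str], float],
                 threshold: float = 3.9) -> bool:
    """Flag a name whose score is above the threshold (hypothetical helper)."""
    # Assumption: capitalization carries no signal, so normalize it away
    # before scoring instead of letting all-caps names look unusual.
    return score(name.lower()) > threshold
```

With the model above in scope this would be called as is_gibberish('SELENIA', strangeness).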

Of course, a simpler solution is to collect a list of popular names in all your target languages and use no machine learning at all - just lookups.
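
A minimal sketch of the lookup idea follows. KNOWN_NAMES is a tiny hypothetical sample standing in for a real list of names, and looks_legitimate is an illustrative helper, not an existing API. It assumes that single letters are initials and that every other whitespace-separated token must appear in the list.

```python
# Hypothetical sample; in practice this would be loaded from a names dataset.
KNOWN_NAMES = {
    "michael", "sara", "jose", "rafael", "morgan",
    "eduardo", "medina", "luis", "mendez",
}


def looks_legitimate(name, known=KNOWN_NAMES):
    """Return True if every non-initial token of the name is a known name."""
    # Case-insensitive comparison; periods are dropped so "R." becomes "r".
    tokens = name.lower().replace(".", " ").split()
    # Treat single letters as initials and ignore them.
    full = [t for t in tokens if not (len(t) == 1 and t.isalpha())]
    # Require at least one real token, and all of them must be known.
    return bool(full) and all(t in known for t in full)
```

The trade-off is coverage: any name missing from the list is rejected, so this works best combined with a large list or as a first pass before the language-model score.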
