Unable To Detect Gibberish Names Using Python
I am trying to build a Python model that can classify account names as either legitimate or gibberish. Capitalization is not important in this particular case as some legitimate ac
Solution 1:
For the 1st characteristic, you can train a character-based n-gram language model, and treat all names with low average per-character probability as suspicious.
A quick-and-dirty example of such a language model is below. It is a mixture of 1-gram, 2-gram and 3-gram language models, trained on the Brown corpus. I am sure you can find more relevant training data (e.g. a list of all names of actors).
from collections import Counter

import numpy as np
from nltk.corpus import brown  # needs a one-time nltk.download('brown')

# Flatten the corpus into one string: words joined by ' ', sentences by '\n '.
text = '\n '.join(' '.join(s) for s in brown.sents())

# Character n-gram counts over the whole corpus.
unigrams = Counter(text)
bigrams = Counter(text[i:(i+2)] for i in range(len(text)-2))
trigrams = Counter(text[i:(i+3)] for i in range(len(text)-3))

# Mixture weights for the 1-gram, 2-gram and 3-gram models.
weights = [0.001, 0.01, 0.989]

def strangeness(text):
    """Average negative log-probability per character (higher = stranger)."""
    r = 0
    # Pad the name so that its ending ('\n') is scored as well.
    text = ' ' + text + '\n'
    for i in range(2, len(text)):
        char = text[i]
        context1 = text[(i-1):i]
        context2 = text[(i-2):i]
        num = unigrams[char] * weights[0] + bigrams[context1+char] * weights[1] + trigrams[context2+char] * weights[2]
        den = sum(unigrams.values()) * weights[0] + unigrams[context1] * weights[1] + bigrams[context2] * weights[2]
        r -= np.log(num / den)
    return r / (len(text) - 2)
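Written out, the score is the average negative log of a ratio of weighted counts. The formula below is only a transcription of the code above. The notation is introduced here for illustration: $x = x_0 x_1 \dots x_{L+1}$ is the padded string (a space, the $L$ characters of the name, then a newline), $c(\cdot)$ is an n-gram count from the training text, $N$ is the total number of characters in the training text, and $w_1, w_2, w_3$ are the weights 0.001, 0.01 and 0.989.

```latex
\[
\mathrm{strangeness}(x) \;=\; -\frac{1}{L} \sum_{i=2}^{L+1}
\log \frac{w_1\, c(x_i) \;+\; w_2\, c(x_{i-1} x_i) \;+\; w_3\, c(x_{i-2} x_{i-1} x_i)}
          {w_1\, N \;+\; w_2\, c(x_{i-1}) \;+\; w_3\, c(x_{i-2} x_{i-1})}
\]
```

Because the sum starts at index 2 of the padded string, the first character of the name only ever appears as context and is not scored itself. Also note that a character never seen in the training text makes the numerator zero, so the score becomes infinite.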
Now you can apply this strangeness measure to your examples.
t1 = '128, 127, h4rugz4sx383a6n64hpo, tt, t66, t65, asdfds'.split(', ')
t2 = 'Michael, sara, jose colmenares, Dimitar, Jose Rafael, Morgan, Eduardo Medina, Luis R. Mendez, Hikaru, SELENIA, Zhang Ming, Xuting Liu, Chen Zheng'.split(', ')
for t in t1 + t2:
    print('{:20} -> {:9.5}'.format(t, strangeness(t)))
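As the output shows, gibberish names are in most cases more "strange" than normal ones. You could use, for example, a threshold of 3.9 here. Note that SELENIA scores 6.18 and would be misclassified at that threshold, most likely because all-caps text is rare in the Brown corpus. Since capitalization does not matter for this task, lowercasing names before scoring is worth trying.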
128                  ->    5.5528
127                  ->    5.6572
h4rugz4sx383a6n64hpo ->    5.9016
tt                   ->    4.9392
t66                  ->    6.9673
t65                  ->    6.8501
asdfds               ->    3.9776
Michael              ->    3.3598
sara                 ->    3.8171
jose colmenares      ->    2.9539
Dimitar              ->    3.4602
Jose Rafael          ->    3.4604
Morgan               ->    3.3628
Eduardo Medina       ->    3.2586
Luis R. Mendez       ->     3.566
Hikaru               ->    3.8936
SELENIA              ->    6.1829
Zhang Ming           ->    3.4809
Xuting Liu           ->    3.7161
Chen Zheng           ->    3.6212
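To turn scores like these into a yes/no decision, one option is a thin wrapper that applies the threshold. This is only a sketch under assumptions: `is_gibberish` and `classify_names` are hypothetical helper names, the scoring function is passed in as an argument so that `strangeness` can be plugged in, and lowercasing before scoring is an addition not in the code above, based on the question's remark that capitalization is not important.

```python
from typing import Callable, Dict, Iterable


def is_gibberish(name: str,
                 score: Callable[[str], float],
                 threshold: float = 3.9,
                 lowercase: bool = True) -> bool:
    """Return True if the name's score is above the threshold."""
    # Assumption: case carries no signal, so normalise it away by default.
    text = name.lower() if lowercase else name
    return score(text) > threshold


def classify_names(names: Iterable[str],
                   score: Callable[[str], float],
                   threshold: float = 3.9,
                   lowercase: bool = True) -> Dict[str, bool]:
    """Map each name to True (gibberish) or False (looks legitimate)."""
    return {n: is_gibberish(n, score, threshold, lowercase) for n in names}
```

With the model above this would be called as `classify_names(t1 + t2, strangeness)`. The value 3.9 was read off the small sample above, so it should be re-tuned on a labelled set of your own names, especially if you change the training data or the lowercasing choice.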
Of course, a simpler solution is to collect a list of popular names in all your target languages and use no machine learning at all - just lookups.
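A minimal sketch of the lookup idea follows. The tiny `KNOWN_NAMES` set is placeholder data for illustration only (in practice it would be loaded from whatever name lists you collect), and `tokenize`, `looks_legitimate` and the `min_known_fraction` parameter are hypothetical names, not part of any library.

```python
import re

# Placeholder data: replace with real name lists for your target languages.
KNOWN_NAMES = {
    "michael", "sara", "jose", "rafael", "morgan",
    "eduardo", "medina", "luis", "mendez",
}


def tokenize(name):
    """Lowercase the name and return its runs of letters, skipping initials."""
    # [^\W\d_]+ matches runs of Unicode letters (no digits, no underscore).
    tokens = re.findall(r"[^\W\d_]+", name.lower())
    # Single letters such as the "R" in "Luis R. Mendez" are ignored.
    return [t for t in tokens if len(t) > 1]


def looks_legitimate(name, known=KNOWN_NAMES, min_known_fraction=1.0):
    """Return True if enough of the name's tokens appear in the known set."""
    tokens = tokenize(name)
    if not tokens:
        # Nothing but digits, punctuation or single letters, e.g. "128" or "t66".
        return False
    hits = sum(t in known for t in tokens)
    return hits / len(tokens) >= min_known_fraction
```

Requiring every token to be known (`min_known_fraction=1.0`) is strict: a name with one unlisted surname is rejected. Lowering the fraction, or falling back to the n-gram score for names the lookup rejects, are possible ways to soften that.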