Skip to content Skip to sidebar Skip to footer

Grouping Similar Strings

I'm trying to analyze a bunch of search terms, so many that individually they don't tell much. That said, I'd like to group the terms because I think similar terms should have sim

Solution 1:

You'll want to cluster these terms, and for the similarity metric I recommend Dice's Coefficient at the character-gram level. For example, partition the strings into two-letter sequences to compare (term1="NB", "BA", "A ", " B", "Ba"...).

nltk appears to provide dice as nltk.metrics.association.BigramAssocMeasures.dice(), but it's simple enough to implement in a way that'll allow tuning. Here's how to compare these strings at the character rather than word level.

import sys, operator

def tokenize(s, glen):
  g2 = set()
  for i in xrange(len(s)-(glen-1)):
    g2.add(s[i:i+glen])
  return g2

def dice_grams(g1, g2): return (2.0*len(g1 & g2)) / (len(g1)+len(g2))

def dice(n, s1, s2): return dice_grams(tokenize(s1, n), tokenize(s2, n))

def main():
  GRAM_LEN = 4
  scores = {}
  for i in xrange(1,len(sys.argv)):
    for j in xrange(i+1, len(sys.argv)):
      s1 = sys.argv[i]
      s2 = sys.argv[j]
      score = dice(GRAM_LEN, s1, s2)
      scores[s1+":"+s2] = score
  for item in sorted(scores.iteritems(), key=operator.itemgetter(1)):
    print item

When this program is run with your strings, the following similarity scores are produced:

./dice.py "NBA Basketball""Basketball NBA""Basketball""Baseball"

('NBA Basketball:Baseball', 0.125)
('Basketball NBA:Baseball', 0.125)
('Basketball:Baseball', 0.16666666666666666)
('NBA Basketball:Basketball NBA', 0.63636363636363635)
('NBA Basketball:Basketball', 0.77777777777777779)
('Basketball NBA:Basketball', 0.77777777777777779)

At least for this example, the margin between the basketball and baseball terms should be sufficient for clustering them into separate groups. Alternatively you may be able to use the similarity scores more directly in your code with a threshold.

Post a Comment for "Grouping Similar Strings"