Skip to content Skip to sidebar Skip to footer

Error In Extracting Phrases Using Gensim

I am trying to get the bigrams in the sentences using Phrases in Gensim as follows. from gensim.models import Phrases from gensim.models.phrases import Phraser documents = ['the ma

Solution 1:

The technique used by gensim Phrases is purely based on statistics of co-occurrences: how often words appear together, versus alone, in a formula also affected by min_count and compared against the threshold value.

It is only because your training set has 'new' and 'york' occur alongside each other twice, while other words (like 'machine' and 'learning') only occur alongside each other once, that 'new_york' becomes a bigram, and other pairings do not. What's more, even if you did find a combination of min_count and threshold that would promote 'machine_learning' to a bigram, it would also pair together every other bigram-that-appears-once – which is probably not what you want.

Really, to get good results from these statistical techniques, you need lots of varied, realistic data. (Toy-sized examples may superficially succeed, or fail, for superficial toy-sized reasons.)

Even then, they will tend to miss combinations a person would consider reasonable, and make combinations a person wouldn't. Why? Because our minds have much more sophisticated ways (including grammar and real-world knowledge) for deciding when clumps of words represent a single concept.

So even with more better data, be prepared for nonsensical n-grams. Tune or judge the model on whether it is overall improving on your goal, not any single point or ad-hoc check of matching your own sensibility.

(Regarding the referenced gensim documentation comment, I'm pretty sure that if you try Phrases on just the two sentences listed there, it won't find any of the desired phrases – not 'new_york' or 'machine_learning'. As a figurative example, the ellipses ... imply the training set is larger, and the results indicate that the extra unshown texts are important. It's just because of the 3rd sentence you've added to your code that 'new_york' is detected. If you added similar examples to make 'machine_learning' look more like a statistically-outlying pairing, your code could promote 'machine_learning', too.)

Solution 2:

It is probably below your threshold?

Try using more data.

Post a Comment for "Error In Extracting Phrases Using Gensim"