Error In Extracting Phrases Using Gensim
Solution 1:
The technique used by gensim Phrases
is purely based on statistics of co-occurrences: how often words appear together, versus alone, in a formula also affected by min_count
and compared against the threshold
value.
It is only because your training set has 'new' and 'york' occur alongside each other twice, while other words (like 'machine' and 'learning') only occur alongside each other once, that 'new_york' becomes a bigram, and other pairings do not. What's more, even if you did find a combination of min_count
and threshold
that would promote 'machine_learning' to a bigram, it would also pair together every other bigram-that-appears-once – which is probably not what you want.
Really, to get good results from these statistical techniques, you need lots of varied, realistic data. (Toy-sized examples may superficially succeed, or fail, for superficial toy-sized reasons.)
Even then, they will tend to miss combinations a person would consider reasonable, and make combinations a person wouldn't. Why? Because our minds have much more sophisticated ways (including grammar and real-world knowledge) for deciding when clumps of words represent a single concept.
So even with more better data, be prepared for nonsensical n-grams. Tune or judge the model on whether it is overall improving on your goal, not any single point or ad-hoc check of matching your own sensibility.
(Regarding the referenced gensim documentation comment, I'm pretty sure that if you try Phrases
on just the two sentences listed there, it won't find any of the desired phrases – not 'new_york' or 'machine_learning'. As a figurative example, the ellipses ...
imply the training set is larger, and the results indicate that the extra unshown texts are important. It's just because of the 3rd sentence you've added to your code that 'new_york' is detected. If you added similar examples to make 'machine_learning' look more like a statistically-outlying pairing, your code could promote 'machine_learning', too.)
Solution 2:
It is probably below your threshold
?
Try using more data.
Post a Comment for "Error In Extracting Phrases Using Gensim"