Ignore Out-of-vocabulary Words When Averaging Vectors In Spacy
I would like to use a pre-trained word2vec model in Spacy to encode titles by (1) mapping words to their vector embeddings and (2) perform the mean of word embeddings. To do this
Solution 1:
see this post by the author of Spacy which says:
The Doc object has immutable text, but it should be pretty easy and quite efficient to create a new Doc object with the subset of tokens you want.
Try this for example:
import spacy
nlp = spacy.load('en_core_web_md')
import numpy as np
sentence = "I love Stack Overflow butitsalsodistractive"print(sentence)
tokens = nlp(sentence)
print([t.text for t in tokens])
cleanText = " ".join([token.text for token in tokens if token.has_vector])
print(clean)
tokensClean = nlp(cleanText)
print([t.text for t in tokensClean])
np.array_equal(tokens.vector, tokensClean.vector)
#False
If you want to speed things up, disable the pipeline components in spacy with you don't use (such as NER, dependency parse, etc ..)
Post a Comment for "Ignore Out-of-vocabulary Words When Averaging Vectors In Spacy"