
Ignore Out-of-vocabulary Words When Averaging Vectors In Spacy

I would like to use a pre-trained word2vec model in spaCy to encode titles by (1) mapping each word to its vector embedding and (2) taking the mean of those embeddings. How can I skip out-of-vocabulary words when computing the mean?

Solution 1:

See this post by the author of spaCy, which says:

The Doc object has immutable text, but it should be pretty easy and quite efficient to create a new Doc object with the subset of tokens you want.

Try this for example:

import spacy
nlp = spacy.load('en_core_web_md')
import numpy as np

sentence = "I love Stack Overflow butitsalsodistractive"
print(sentence)
tokens = nlp(sentence)
print([t.text for t in tokens])

# Keep only tokens that have a vector, i.e. drop out-of-vocabulary words
cleanText = " ".join([token.text for token in tokens if token.has_vector])
print(cleanText)
tokensClean = nlp(cleanText)
print([t.text for t in tokensClean])


# The averaged doc vectors differ: the OOV token no longer
# contributes a zero vector that dilutes the mean
np.array_equal(tokens.vector, tokensClean.vector)
# False
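If you want to see the mechanics without loading a spaCy model, the same idea can be sketched with plain NumPy. The embedding table below is a hypothetical toy vocabulary standing in for a pretrained word2vec model, and `average_in_vocab` is an illustrative helper, not a spaCy API:

```python
import numpy as np

# Toy embedding table standing in for a pretrained word2vec vocabulary
# (hypothetical values, for illustration only).
vectors = {
    "i": np.array([1.0, 0.0]),
    "love": np.array([0.0, 1.0]),
    "stack": np.array([1.0, 1.0]),
    "overflow": np.array([2.0, 0.0]),
}

def average_in_vocab(sentence, vectors):
    """Average the vectors of in-vocabulary words, skipping OOV words."""
    found = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not found:
        # No known words: fall back to a zero vector of the right shape
        return np.zeros_like(next(iter(vectors.values())))
    return np.mean(found, axis=0)

# The OOV token "butitsalsodistractive" is simply left out of the mean
vec = average_in_vocab("I love Stack Overflow butitsalsodistractive", vectors)
print(vec)  # mean of the four known vectors: [1.0, 0.5]
```

This is exactly what the `has_vector` filter above achieves: OOV tokens never enter the average, instead of contributing zero vectors that drag the mean down.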

If you want to speed things up, disable the pipeline components in spaCy that you don't use (such as the NER or the dependency parser), e.g. `nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])`.
