Ignore Out-of-vocabulary Words When Averaging Vectors In Spacy

April 01, 2024

I would like to use a pre-trained word2vec model in spaCy to encode titles by (1) mapping words to their vector embeddings and (2) taking the mean of the word embeddings. To do this

Solution 1:

See this post by the author of spaCy, which says:

The Doc object has immutable text, but it should be pretty easy and quite efficient to create a new Doc object with the subset of tokens you want.

Try this for example:

import numpy as np
import spacy

nlp = spacy.load('en_core_web_md')

sentence = "I love Stack Overflow butitsalsodistractive"
print(sentence)
tokens = nlp(sentence)
print([t.text for t in tokens])

# Keep only the tokens that have a vector, then re-parse the cleaned text
cleanText = " ".join([token.text for token in tokens if token.has_vector])
print(cleanText)
tokensClean = nlp(cleanText)
print([t.text for t in tokensClean])

# The two document vectors differ once the out-of-vocabulary token is dropped
np.array_equal(tokens.vector, tokensClean.vector)
# False
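
Re-parsing the cleaned text is one way to do it. Another option, sketched here as an assumption and not as the approach from the quoted post, is to average the vectors of the in-vocabulary tokens directly with NumPy. The helper name mean_in_vocab_vector and the stand-in tokens are hypothetical; the function only relies on each token exposing has_vector and vector, which spaCy tokens do.

```python
from types import SimpleNamespace

import numpy as np


def mean_in_vocab_vector(tokens, dim):
    """Average the vectors of tokens that have one.

    `tokens` is any iterable of objects exposing `has_vector` and `vector`.
    Returns a zero vector of length `dim` if no token has a vector.
    """
    vectors = [t.vector for t in tokens if t.has_vector]
    if not vectors:
        return np.zeros(dim, dtype="float32")
    return np.mean(vectors, axis=0)


# Hypothetical stand-ins for spaCy tokens, so the sketch runs without a model.
demo_tokens = [
    SimpleNamespace(has_vector=True, vector=np.array([1.0, 3.0])),
    SimpleNamespace(has_vector=False, vector=np.zeros(2)),  # out-of-vocabulary
    SimpleNamespace(has_vector=True, vector=np.array([3.0, 5.0])),
]
demo_mean = mean_in_vocab_vector(demo_tokens, dim=2)
print(demo_mean)  # mean of the two in-vocabulary vectors only
```

With a loaded pipeline, this would presumably be called as mean_in_vocab_vector(nlp(sentence), nlp.vocab.vectors_length), which avoids running the pipeline a second time on the cleaned text.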

If you want to speed things up, disable the spaCy pipeline components that you don't use (such as NER, the dependency parser, etc.).
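
As a rough sketch of that suggestion, spacy.load accepts a disable argument listing component names to skip. The helper name load_light_pipeline is hypothetical, and the choice of "parser" and "ner" assumes you only need the word vectors; adjust the list to whatever your pipeline actually contains.

```python
def load_light_pipeline(model_name="en_core_web_md", unused=("parser", "ner")):
    """Load a spaCy pipeline with the components named in `unused` disabled."""
    # Imported inside the function so this sketch can be read and imported
    # even on a machine where spaCy is not installed.
    import spacy

    return spacy.load(model_name, disable=list(unused))
```

Calling nlp = load_light_pipeline() and then nlp(sentence) should still give tokens with has_vector and vector set, since the vectors come from the vocabulary and not from the parser or NER.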
