
How To Incrementally Train A Word2vec Model With New Vocabularies

I have a dataset of over 40 GB. My tokenizer process gets killed because it runs out of memory, so I am trying to split the dataset. How can I train the word2vec model incrementally, that is, keep updating it with each split while still learning new vocabulary?
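Something like the following split is what I have in mind (a sketch with hypothetical paths, assuming the raw corpus is a single text file with one whitespace-tokenized sentence per line):

import os

SRC = "corpus.txt"          # hypothetical: the original ~40 GB file, one sentence per line
OUT_DIR = "corpus_split"    # hypothetical: directory that will hold the chunks
LINES_PER_CHUNK = 1_000_000

os.makedirs(OUT_DIR, exist_ok=True)
out, idx = None, 0
with open(SRC, encoding="utf-8") as src:
    for i, line in enumerate(src):
        if i % LINES_PER_CHUNK == 0:  # start a new chunk every million lines
            if out:
                out.close()
            out = open(os.path.join(OUT_DIR, f"part_{idx:04d}.txt"), "w", encoding="utf-8")
            idx += 1
        out.write(line)
if out:
    out.close()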

Solution 1:

I have found a solution: use PathLineSentences. It is very fast. Simply calling train() again on new sentences does not add new words to the model's vocabulary, but PathLineSentences sidesteps the problem: it streams every file in a directory line by line, so the full corpus never has to fit in memory.

import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

# input_dir: directory of split plain-text files (gensim < 4.0 names: size=, iter=)
model = Word2Vec(PathLineSentences(input_dir), size=100, window=5, min_count=5,
                 workers=multiprocessing.cpu_count() * 2, iter=20, sg=1)
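PathLineSentences expects input_dir to contain plain-text files, one sentence per line with tokens separated by whitespace, and it reads the files in alphabetical order by filename. A quick sanity check (a sketch reusing the input_dir placeholder from above):

sentences = PathLineSentences(input_dir)
print(next(iter(sentences)))  # first streamed sentence, as a list of string tokens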

For a single file, use LineSentence:

import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# file: path to a single plain-text corpus file, one sentence per line
model = Word2Vec(LineSentence(file), size=100, window=5, min_count=5,
                 workers=multiprocessing.cpu_count() * 2, iter=20, sg=1)
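Either way, the trained model can be saved and queried afterwards; a minimal sketch with a hypothetical output path and query word:

model.save("word2vec.model")           # hypothetical output path
model = Word2Vec.load("word2vec.model")
print(model.wv.most_similar("apple"))  # nearest neighbours of a hypothetical query word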
...
