How To Incrementally Train A Word2vec Model With New Vocabularies
I have a dataset of over 40 GB. My tokenizer process gets killed because it runs out of memory, so I tried splitting the dataset. How can I train a word2vec model incrementally, that is, how can I continue training it on new chunks of data that contain new vocabulary?
Solution 1:
I have found the solution: use PathLineSentences. It is very fast. Incrementally training an existing word2vec model cannot learn new vocabulary, but PathLineSentences streams every file in a directory as a single corpus, so the full vocabulary is built in one pass without loading everything into memory.
import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

# Note: in gensim >= 4.0, size is vector_size and iter is epochs.
model = Word2Vec(PathLineSentences(input_dir), size=100, window=5, min_count=5,
                 workers=multiprocessing.cpu_count() * 2, iter=20, sg=1)
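PathLineSentences expects a directory of files. If the corpus starts out as one huge file, it can be split into chunk files with a single streaming pass that never holds more than one line in memory. A minimal sketch using only the standard library (the function name, output naming scheme, and chunk size are arbitrary choices, not part of gensim):

```python
import os

def split_corpus(src_path, out_dir, lines_per_file=1_000_000):
    """Split one large line-per-sentence corpus into numbered chunk
    files that PathLineSentences can stream from out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    chunk, count = None, 0
    with open(src_path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % lines_per_file == 0:
                if chunk:
                    chunk.close()
                # Zero-padded names keep the files in a stable order.
                chunk = open(os.path.join(out_dir, f"part-{count:05d}.txt"),
                             "w", encoding="utf-8")
                count += 1
            chunk.write(line)
    if chunk:
        chunk.close()
    return count
```

After the split, `PathLineSentences(out_dir)` reads every chunk in turn as if they were one corpus.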
For a single file, use LineSentence.
import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec(LineSentence(file), size=100, window=5, min_count=5,
                 workers=multiprocessing.cpu_count() * 2, iter=20, sg=1)
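The reason memory stays flat regardless of corpus size is that LineSentence is a restartable iterable: each pass reopens the file and yields one whitespace-tokenized sentence per line, so Word2Vec can scan the corpus multiple times (once for the vocabulary, then once per epoch) without it ever being loaded whole. A simplified stand-in sketch, not gensim's actual implementation:

```python
class StreamingCorpus:
    """Restartable iterable: each iteration reopens the file and
    yields one whitespace-tokenized sentence per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                yield line.split()
```

A plain generator would not work here, because it is exhausted after one pass; Word2Vec needs to iterate the corpus more than once.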