
Python Gensim Word2vec Vocabulary Key

I want to train a word2vec model with gensim. I heard that the vocabulary corpus should be unicode, so I converted it to unicode. # -*- encoding:utf-8 -*- # !/usr/bin/env python import sys reloa

Solution 1:

Word2Vec requires each text example to be a list of word tokens. It appears you are passing plain strings instead, so when Word2Vec iterates over each one, it sees only single characters as words.
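To see why this happens, here is a minimal sketch in plain Python 3 (no gensim needed). The sample sentences and the helper name `tokens_seen` are made up for illustration; the helper only mimics what any consumer gets when it iterates over one text example.

```python
# Hypothetical sample data: each "sentence" is a plain string, not a token list.
raw_sentences = ["나는 학교에 간다", "I go to school"]


def tokens_seen(sentence):
    """Return the items a consumer gets when it iterates over `sentence`."""
    return list(sentence)


# Iterating over a str yields one character at a time.
chars = tokens_seen(raw_sentences[1])

# Iterating over a list of words yields the words themselves.
words = tokens_seen(raw_sentences[1].split())

print(chars[:4])
print(words)
```

If the vocabulary keys of a trained model look like the first printed list (single characters, including spaces), the input was most likely a list of strings rather than a list of token lists.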

Does Korean use spaces to delimit words? If so, break your texts by spaces before handing the list-of-words as a text example to Word2Vec.
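A rough sketch of the space-splitting approach, assuming Python 3 and that whitespace is an acceptable word boundary for your corpus. The sample texts and the names `to_token_lists` and `train` are hypothetical; `train` imports gensim lazily and passes only `min_count=1` so that the tiny sample is not filtered out.

```python
def to_token_lists(texts):
    """Split each non-empty text on whitespace into a list of word tokens."""
    return [text.split() for text in texts if text.strip()]


# Hypothetical sample corpus; replace with your own lines of text.
texts = ["나는 학교에 간다", "나는 집에 간다", "   "]
sentences = to_token_lists(texts)
print(sentences)


def train(token_lists):
    """Train a Word2Vec model on a list of token lists (requires gensim)."""
    from gensim.models import Word2Vec  # imported here so the split step runs without gensim

    # min_count=1 keeps every word; the library default drops rare words.
    return Word2Vec(token_lists, min_count=1)
```

Calling `model = train(sentences)` should then give a vocabulary of whole words, so a check such as `"나는" in model.wv` is expected to succeed where it previously failed with a key error.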

If not, you'll need to use some external word-tokenizer (not part of gensim) before passing your sentences to Word2Vec.
