GenSim: An NLP Tool
GenSim is an open-source Python library for NLP modelling. API docs are available online.
(from official site:)
GenSim: topic modelling for humans.
Train large-scale semantic NLP models.
Represent text as semantic vectors.
Find semantically related documents.
corpora.Dictionary
Methods:
doc2bow(): represent a doc as a vector using the BOW model
token2id: (attribute) dict from token to int
filter_tokens(bad_ids): remove the tokens with the given ids from the dictionary
compactify(): remove gaps in the id sequence left by removed words
corpora.Dictionary(*)
Models
Word2Vec, word2vec model
gensim.models.Word2Vec
Word2Vec(sentences=)
[word]: <vector>
most_similar(positive=, topn=): list[tuple[str, float]], pairs of (word, similarity)
(in gensim 4.x, vector lookup and most_similar() are reached through model.wv)
save()
Word2Vec.load(): class method.
gensim.models.KeyedVectors.load_word2vec_format(<filepath>, binary):
<filepath>: path to the word2vec model file
binary: True or False. True for a word2vec model file in binary format, False for text format.
Train
custom preprocessing: a custom class whose __iter__() yields one list of tokens (list[str]) per sentence.
gensim.models.Word2Vec()
returns a word2vec model instance trained on the given corpus. Parameters:
sentences: Iterable[list[str]], must be re-iterable (not a one-shot generator)
min_count: ignore words whose total frequency is below this value
vector_size: dimension of the vectors
workers: int, training parallelism.
...
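The custom preprocessing class mentioned above could look like the following sketch. The class name TokenizedCorpus and the regex-based tokenization are assumptions made for illustration; the only point is that __iter__() starts a fresh pass every time it is called, so the corpus can be read more than once.

```python
import re


class TokenizedCorpus:
    """Re-iterable corpus: every call to __iter__ starts a fresh pass."""

    def __init__(self, lines):
        # `lines` is any re-iterable source of raw text, e.g. a list of strings.
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            # Hypothetical preprocessing: lowercase and keep word characters only.
            tokens = re.findall(r"\w+", line.lower())
            if tokens:  # skip lines that produce no tokens
                yield tokens


corpus = TokenizedCorpus(["The quick brown fox.", "", "Jumps over the lazy dog!"])
first_pass = list(corpus)
second_pass = list(corpus)  # a second pass yields the same sentences again
```

An instance like corpus can then be passed as sentences, for example Word2Vec(sentences=corpus, vector_size=100, min_count=1, workers=4), where the parameter values are only illustrative.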
KeyedVectors
load(fname_or_handle): load from file
load_word2vec_format(): load from a file in the C word2vec format
vector_size: dimension
['word']: get vector
vocab: get vocabulary
from gensim.models import KeyedVectors

model = KeyedVectors.load('/path/to/w2v.model')  # load from a file in gensim format
model = KeyedVectors.load_word2vec_format('/path/to/w2v.model')  # load from a word2vec text-format file
model = KeyedVectors.load_word2vec_format('/path/to/w2v.bin', binary=True)  # load from a word2vec binary-format file
vec = model['word']  # query the vector for a word
dim = model.vector_size  # get the dimension of the vectors
# test whether a word exists (OOV check); the attribute depends on the gensim version
if my_word in model.key_to_index: ...  # gensim 4.x
if my_word in model.vocab: ...  # KeyedVectors before gensim 4.0
if my_word in model.wv.vocab: ...  # for gensim 2.x
Creating a KeyedVectors instance (word2vec vectors) from a Python dict:
import tempfile

import numpy as np
from gensim.models import KeyedVectors
from gensim.models.keyedvectors import Vocab  # gensim 3.x API

# NOTE: written for gensim 3.x; the vocab attribute was removed in gensim 4.0
d = dict()
d['my_word1'] = np.random.randn(300)
d['my_word2'] = np.random.randn(300)

word_list, vector_list = zip(*d.items())
m = KeyedVectors(vector_size=300)  # match the dimension of the vectors above
m.vocab = dict()
m.vectors = np.array(vector_list)
for i in range(len(vector_list)):
    m.vocab[word_list[i]] = Vocab(index=i, count=1)

# round-trip through a temporary word2vec-format file
with tempfile.NamedTemporaryFile(delete=False) as f:
    m.save_word2vec_format(f, binary=True)
    tempfilew2v = f.name
m = KeyedVectors.load_word2vec_format(tempfilew2v, binary=True)
m.most_similar('my_word1')
# remove temp file
# import os
# os.remove(tempfilew2v)
TfidfModel
models.TfidfModel(), parameters:
corpus
normalize
LsiModel, Latent Semantic Indexing (or Latent Semantic Analysis, LSA)
RpModel, Random Projections
LdaModel, Latent Dirichlet Allocation (LDA): gensim.models.LdaModel
HdpModel, Hierarchical Dirichlet Process (HDP)
Doc2Vec, doc2vec (paragraph vector) model: gensim.models.doc2vec.Doc2Vec
FastText, fasttext model: gensim.models.fasttext.FastText
Corpora Formats
MmCorpus: Matrix Market format
SvmLightCorpus: Joachims' SVMlight format
BleiCorpus: Blei's LDA-C format
LowCorpus: GibbsLDA++ format
Install
pip install gensim
# or using conda
conda install gensim
Bugs
A generator is not supported as the corpus
from gensim.models import Word2Vec
Word2Vec(create_generator(), vector_size=SIZE) # ERROR, a generator cannot be consumed for 2 or more passes
#------
w2v = Word2Vec(vector_size=SIZE)
w2v.train(create_generator()) # ERROR, must build vocabulary first
In version 4.0.1, a TypeError is raised from Word2Vec._check_corpus_sanity() when a generator is passed to Word2Vec as sentences. The message reads: TypeError: Using a generator as corpus_iterable can't support 6 passes. Try a re-iterable sequence.
It comes from the check: if corpus_iterable is not None and isinstance(corpus_iterable, GeneratorType) and passes > 1:
A generator cannot be read for 2 or more passes, which makes sense. However, the 6 here comes from the default epochs=5 plus 1 (one extra pass to build the vocabulary before training): self._check_corpus_sanity(corpus_iterable=corpus_iterable, corpus_file=corpus_file, passes=(epochs + 1)), code location: word2vec.py:417 (Gensim 4.0.1).
And epochs cannot be 0, or ._check_training_sanity() will raise an error. So passes is always at least 2, and a plain generator is always rejected.
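One possible workaround is to wrap the generator function in a small re-iterable object, so that every pass gets a fresh generator. The sketch below is an assumption-laden illustration: Restartable and is_rejected are hypothetical names, create_generator is a stand-in for the function used in the snippets above, and is_rejected only mirrors the quoted condition rather than calling gensim.

```python
from types import GeneratorType


class Restartable:
    """Wrap a generator function so each iteration starts a new generator."""

    def __init__(self, generator_func, *args, **kwargs):
        self.generator_func = generator_func
        self.args = args
        self.kwargs = kwargs

    def __iter__(self):
        return self.generator_func(*self.args, **self.kwargs)


def create_generator():
    # Hypothetical stand-in for the corpus generator in the failing snippets.
    yield ["hello", "world"]
    yield ["goodbye", "world"]


def is_rejected(corpus_iterable, passes):
    # Mirrors the condition quoted from _check_corpus_sanity().
    return (
        corpus_iterable is not None
        and isinstance(corpus_iterable, GeneratorType)
        and passes > 1
    )


epochs = 5  # the default mentioned above
raw_rejected = is_rejected(create_generator(), passes=epochs + 1)
wrapped = Restartable(create_generator)
wrapped_rejected = is_rejected(wrapped, passes=epochs + 1)
# The wrapper can be read once per pass: vocabulary build plus each epoch.
passes_seen = [list(wrapped) for _ in range(epochs + 1)]
```

Under these assumptions, Word2Vec(Restartable(create_generator), vector_size=SIZE) should get past the generator check, because the wrapper is not a GeneratorType instance and can be iterated repeatedly.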