Gensim: An NLP Tool

Gensim is an open-source Python library for NLP modelling. API reference: see the online docs.

(from official site:)

Gensim: topic modelling for humans.
Train large-scale semantic NLP models.
Represent text as semantic vectors.
Find semantically related documents.

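corpora.Dictionary

Methods:

  • doc2bow(document) converts a tokenized document (a list of str) into a bag-of-words (BOW) vector, i.e. a list of (token_id, count) pairs
  • token2id (attribute) dict mapping each token to its int id
  • filter_tokens(bad_ids=, good_ids=) removes tokens from the dictionary; the arguments are token ids, not token strings
  • compactify() reassigns ids to close the gaps left in the id sequence by removed tokens
corpora.Dictionary(*)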

Models

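Word2Vec, word2vec model

gensim.models.Word2Vec

  • Word2Vec(sentences=) trains a model on the given corpus
  • wv[word]: <vector> (in gensim 4.x the word vectors live on the wv attribute, a KeyedVectors instance)
  • wv.most_similar(positive=, topn=): list[tuple[str, float]], i.e. (word, similarity) pairs
  • save()
  • Word2Vec.load() class method.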

gensim.models.KeyedVectors.load_word2vec_format(<filepath>, binary):

  • <filepath> path to the word2vec model file
  • binary True or False. True for a model file in the word2vec binary format, False (the default) for the text format.

Train

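Custom preprocessing: write a custom class whose __iter__() yields one sentence at a time, each as a list of tokens (list[str]).

gensim.models.Word2Vec() returns a word2vec model instance trained on the given corpus. Parameters:

  • sentences: iterable[list[str]], must be re-iterable (not a one-shot generator)
  • min_count: words with a total frequency below this value are ignored
  • vector_size: dimension of the word vectors
  • workers: int, number of worker threads used for training
  • ...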

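KeyedVectors

  • load(fname_or_handle) load from a file saved in gensim's own format
  • load_word2vec_format() load from a file in the original C word2vec format
  • vector_size dimension of the vectors
  • ['word'] get the vector for a word
  • vocab get the vocabulary (gensim < 4.0; in 4.x use key_to_index and index_to_key instead)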
from gensim.models import KeyedVectors
model = KeyedVectors.load('/path/to/w2v.model')       # load from a file saved in gensim's own format
model = KeyedVectors.load_word2vec_format('/path/to/w2v.model') # or: load from a word2vec text-format file
model = KeyedVectors.load_word2vec_format('/path/to/w2v.bin', binary=True)   # or: load from a word2vec binary-format file

my_word = 'word'
vec = model[my_word]    # query the vector for a word
dim = model.vector_size # get the dimension of the vectors
if my_word in model.key_to_index:   # test whether the word is in the vocabulary (not OOV), gensim 4.x
    print(my_word, 'is in the vocabulary')
# gensim < 4.0: use `my_word in model.vocab` (KeyedVectors) or `my_word in model.wv.vocab` (Word2Vec model)

Creating a KeyedVectors instance from a Python dict:

import tempfile

import numpy as np
from gensim.models import KeyedVectors
from gensim.models.keyedvectors import Vocab

# NOTE: this snippet uses the gensim 3.x API; the vocab attribute was removed from KeyedVectors in gensim 4.0.
d = dict()
d['my_word1'] = np.random.randn(300)
d['my_word2'] = np.random.randn(300)

word_list, vector_list = zip(*d.items())
m = KeyedVectors(vector_size=0)     # placeholder size; the real dimension is read back from the file below
m.vocab = dict()
m.vectors = np.array(vector_list)
for i in range(len(vector_list)):
    m.vocab[word_list[i]] = Vocab(index=i, count=1)

# round-trip through a temporary word2vec-format file so that gensim rebuilds a consistent instance
with tempfile.NamedTemporaryFile(delete=False) as f:
    m.save_word2vec_format(f, binary=True)
    tempfilew2v = f.name
m = KeyedVectors.load_word2vec_format(tempfilew2v, binary=True)
m.most_similar('my_word1')

# remove temp file
# import os
# os.remove(tempfilew2v)

TfidfModel

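models.TfidfModel(), parameters:

  • corpus: iterable of BOW vectors used to compute the document frequencies
  • normalize: whether to normalize the resulting vectors to unit length (default True)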

LsiModel, Latent Semantic Indexing (or Latent Semantic Analysis, LSA)

RpModel, Random Projections

LdaModel, Latent Dirichlet Allocation, LDA

HdpModel, Hierarchical Dirichlet Process, HDP

Doc2Vec, doc2vec (paragraph to vector) model

gensim.models.doc2vec.Doc2Vec

FastText, fasttext model

gensim.models.fasttext.FastText

gensim.models.LdaModel, LDA model

Corpora Formats

  • MmCorpus Matrix Market
  • SvmLightCorpus Joachims’ SVMlight format
  • BleiCorpus Blei’s LDA-C format
  • LowCorpus GibbsLDA++ format

Install

pip install gensim
# or using conda
conda install gensim 

Bugs

Generators are not supported as the corpus

from gensim.models import Word2Vec
Word2Vec(create_generator(), vector_size=SIZE)    # ERROR, a generator cannot be consumed for 2 or more passes

#------
w2v = Word2Vec(vector_size=SIZE)
w2v.train(create_generator())           # ERROR, must build vocabulary first

In version 4.0.1, passing a generator to Word2Vec as sentences raises a TypeError from Word2Vec._check_corpus_sanity(). The message reads TypeError: Using a generator as corpus_iterable can't support 6 passes. Try a re-iterable sequence. and it comes from the check if corpus_iterable is not None and isinstance(corpus_iterable, GeneratorType) and passes > 1:. A generator cannot be read for 2 or more passes, so the check itself makes sense.

The 6 is the default epochs=5 plus 1 extra pass to build the vocabulary before training: self._check_corpus_sanity(corpus_iterable=corpus_iterable, corpus_file=corpus_file, passes=(epochs + 1)), code location: word2vec.py:417 (Gensim 4.0.1). Setting epochs to 0 is not a way around it, because ._check_training_sanity() then raises an error.
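
One possible workaround is to wrap the generator function in a small class whose __iter__() creates a fresh generator for every pass. The sketch below uses only plain Python to show the idea; the class name ReIterable and the body of create_generator() are hypothetical, since the original notes do not show what create_generator() yields.

```python
class ReIterable:
    """Wraps a generator function so that each pass gets a fresh generator."""

    def __init__(self, generator_factory):
        self._factory = generator_factory

    def __iter__(self):
        return self._factory()


def create_generator():
    # Placeholder corpus: each item is one tokenized sentence.
    for sentence in [["a", "b"], ["c", "d"]]:
        yield sentence


# A bare generator is exhausted after a single pass.
gen = create_generator()
first_pass = list(gen)
second_pass = list(gen)  # nothing left to read

# The wrapper can be read as many times as needed, e.g. epochs + 1 = 6 passes.
corpus = ReIterable(create_generator)
passes = [list(corpus) for _ in range(6)]
```

Since the wrapper is not itself a generator object, passing ReIterable(create_generator) as sentences should get past the isinstance(corpus_iterable, GeneratorType) check quoted above; this has not been verified here against every gensim version.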

posted @ 2022-07-06 23:26  二球悬铃木