GenSim: NLP Tools

GenSim is an open-source Python library for NLP modelling. See the online API docs.

(from the official site:)

GenSim: topic modelling for humans.
Train large-scale semantic nlp models.
Represent text as semantic vectors.
Find semantically related documents.

`corpora.Dictionary`

Methods:

doc2bow(document) converts a tokenized document into a bag-of-words (BOW) vector, a list of (token_id, count) pairs
token2id (attribute) dict mapping each token to its integer id
filter_tokens(bad_ids=, good_ids=) removes tokens from the dictionary by id
compactify() reassigns ids to remove the gaps left in the id sequence by removed words

corpora.Dictionary(*)
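To make the bag-of-words representation concrete, here is a minimal plain-Python sketch of what token2id and doc2bow() compute conceptually. It does not use Gensim: build_token2id and to_bow are hypothetical helper names, and the first-seen id order is an assumption that may differ from the ids Gensim assigns.

```python
from collections import Counter


def build_token2id(documents):
    """Assign an integer id to every distinct token, in first-seen order."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(document, token2id):
    """Return (token_id, count) pairs sorted by id; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in document if t in token2id)
    return sorted(counts.items())


docs = [["human", "computer", "interface"], ["computer", "survey", "computer"]]
token2id = build_token2id(docs)

# "unseen" is not in the mapping, so it does not appear in the result
bow = to_bow(["computer", "survey", "computer", "unseen"], token2id)
print(bow)
```

The sparse (id, count) pairs are the same shape of data that the corpus-based models further down (TfidfModel, LsiModel, LdaModel) take as input.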

Models

`Word2Vec`, word2vec model

gensim.models.Word2Vec

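Word2Vec(sentences=) trains a model on the given sentences
wv[word]: <vector> (in gensim 4.x the word vectors live on the model's wv attribute, a KeyedVectors instance)
wv.most_similar(positive=, topn=): list[tuple[str, float]], i.e. (word, similarity) pairs
save()
Word2Vec.load() class method.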

gensim.models.KeyedVectors.load_word2vec_format(<filepath>, binary):

<filepath> path to the word2vec model file
binary True if the file is in the binary word2vec format, False (the default) if it is in the text format
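As an illustration of what binary=False expects, the word2vec text format is a header line with the vocabulary size and vector dimension, followed by one line per word. The sketch below parses that layout in plain Python. read_word2vec_text is a hypothetical helper written for this note, not a Gensim function, and it ignores edge cases such as tokens containing spaces.

```python
import io


def read_word2vec_text(handle):
    """Parse 'count dim' header, then one 'word v1 ... vD' line per word."""
    vocab_size, vector_size = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != vector_size + 1:
            raise ValueError(
                f"expected {vector_size} values, got {len(parts) - 1}"
            )
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header count does not match the number of rows")
    return vector_size, vectors


# A tiny in-memory file: 2 words, 3 dimensions each
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld -1.0 0.0 1.0\n")
dim, vectors = read_word2vec_text(sample)
print(dim, sorted(vectors))
```

The binary format carries the same header and rows, with the floats stored as raw bytes instead of decimal text.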

Train

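custom preprocessing: write a custom class whose __iter__() yields one list[str] of tokens per sentence.

gensim.models.Word2Vec() returns a word2vec model instance trained on the given corpus. Parameters:

sentences: iterable[list[str]], must be re-iterable (a list or an object with __iter__), not a one-shot generator
min_count ignore words whose total frequency is below this value
vector_size dimension of the word vectors
workers:int number of worker threads used for training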
...
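A minimal sketch of such a custom preprocessing class follows. LineCorpus is a hypothetical name, and the lowercase-and-split tokenization is only a placeholder for whatever preprocessing is really needed. The point is that __iter__ starts over on every call, so the corpus can be read more than once.

```python
class LineCorpus:
    """Re-iterable corpus: each pass yields one token list per non-empty line."""

    def __init__(self, lines):
        # Any re-iterable of str works here, e.g. a list of lines
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            tokens = line.lower().split()  # placeholder tokenization
            if tokens:
                yield tokens


corpus = LineCorpus(["Human machine interface", "", "A survey of user opinion"])

# Iterating twice gives the same sentences both times
first_pass = list(corpus)
second_pass = list(corpus)
print(first_pass)
```

An instance like corpus can be passed as sentences, since it supports one pass to build the vocabulary and further passes for the training epochs.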

KeyedVectors

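load(fname_or_handle) load from a file saved in gensim's native format
load_word2vec_format() load from a file in the original C word2vec format
vector_size dimension of the vectors
['word'] get the vector for a word
key_to_index dict from word to index, i.e. the vocabulary (gensim 4.x; the attribute was named vocab in gensim 3.x and earlier)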

from gensim.models import KeyedVectors
model = KeyedVectors.load('/path/to/w2v.model')       # load from a file in gensim's native format
KeyedVectors.load_word2vec_format('/path/to/w2v.model') # load from a word2vec text-format file
KeyedVectors.load_word2vec_format('/path/to/w2v.bin', binary=True)   # load from a word2vec binary-format file

vec = model['word']     # query the vector for a word
dim = model.vector_size # get the dimension of the vectors

my_word = 'word'
known = my_word in model.key_to_index   # test whether a word is in the vocabulary (OOV check), gensim 4.x
# known = my_word in model.vocab        # gensim 3.x and earlier
# known = my_word in model.wv.vocab     # older gensim, when model is a full Word2Vec object

Creating a KeyedVectors instance from a Python dict:

import tempfile

import numpy as np
from gensim.models import KeyedVectors

d = dict()
d['my_word1'] = np.random.randn(300)
d['my_word2'] = np.random.randn(300)

word_list, vector_list = zip(*d.items())

# gensim 4.x: add_vectors() replaces filling m.vocab with Vocab entries by hand,
# which was the gensim 3.x approach and no longer works
m = KeyedVectors(vector_size=len(vector_list[0]))
m.add_vectors(list(word_list), np.array(vector_list))

# round trip through the word2vec binary format
with tempfile.NamedTemporaryFile(delete=False) as f:
    tempfilew2v = f.name
m.save_word2vec_format(tempfilew2v, binary=True)
m = KeyedVectors.load_word2vec_format(tempfilew2v, binary=True)
m.most_similar('my_word1')

# remove temp file
# import os
# os.remove(tempfilew2v)
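For intuition about what most_similar() returns, the sketch below ranks the entries of a plain dict by cosine similarity to a query word. This is a conceptual illustration in plain Python, not Gensim's implementation; cosine and most_similar here are hypothetical stand-ins, and the three toy vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=10):
    """Return up to topn (word, similarity) pairs, most similar first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy 2-dimensional vectors, invented for the example
vectors = {"king": [1.0, 0.0], "queen": [0.8, 0.6], "apple": [0.0, 1.0]}
ranked = most_similar(vectors, "king", topn=2)
print(ranked)
```

With only two random 300-dimensional vectors, as in the dict example, the single neighbour returned has no semantic meaning; the ranking only becomes informative with trained vectors.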

`TfidfModel`

models.TfidfModel(), parameters:

corpus
normalize
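The sketch below shows, in plain Python, the weighting that TfidfModel applies by default as far as I understand it: raw term frequency times log2(N / document frequency), then scaling each document to unit length when normalize is true. The function name tfidf is hypothetical, and the exact defaults should be checked against the Gensim documentation.

```python
import math
from collections import Counter


def tfidf(bow_corpus, normalize=True):
    """Weight a corpus of (token_id, count) lists by tf * log2(N / df)."""
    n_docs = len(bow_corpus)
    # Document frequency: in how many documents each token id occurs
    df = Counter(token_id for doc in bow_corpus for token_id, _ in doc)

    weighted = []
    for doc in bow_corpus:
        vec = [(tid, tf * math.log2(n_docs / df[tid])) for tid, tf in doc]
        # Tokens present in every document get weight 0 and are dropped
        vec = [(tid, w) for tid, w in vec if w != 0.0]
        if normalize:
            norm = math.sqrt(sum(w * w for _, w in vec))
            if norm > 0:
                vec = [(tid, w / norm) for tid, w in vec]
        weighted.append(vec)
    return weighted


# Token 0 occurs in both documents, token 1 only in the first
corpus = [[(0, 1), (1, 1)], [(0, 2)]]
result = tfidf(corpus)
print(result)
```

Token 0 carries no information because it appears everywhere, so the second document ends up empty and the first keeps only token 1.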

`LsiModel`, Latent Semantic Indexing (or Latent Semantic Analysis, LSA)

`RpModel`, Random Projections

`LdaModel`, Latent Dirichlet Allocation, LDA

`HdpModel`, Hierarchical Dirichlet Process, HDP

`Doc2Vec`, doc2vec (paragraph to vector) model

gensim.models.doc2vec.Doc2Vec

`FastText`, fasttext model

gensim.models.fasttext.FastText

`gensim.models.LdaModel`, LDA model

Corpora Formats

MmCorpus Matrix Market
SvmLightCorpus Joachim’s SVMlight format
BleiCorpus Blei’s LDA-C format
LowCorpus GibbsLDA++ format

Install

pip install gensim
# or using conda
conda install gensim

Bugs

Generators are not supported as a corpus


from gensim.models import Word2Vec
Word2Vec(create_generator(), vector_size=SIZE)    # ERROR, a generator cannot be consumed for 2 or more passes

#------
w2v = Word2Vec(vector_size=SIZE)
w2v.train(create_generator())           # ERROR, must build vocabulary first

In version 4.0.1, passing a generator to Word2Vec as sentences raises a TypeError from Word2Vec._check_corpus_sanity(). The message reads: TypeError: Using a generator as corpus_iterable can't support 6 passes. Try a re-iterable sequence. It comes from the check if corpus_iterable is not None and isinstance(corpus_iterable, GeneratorType) and passes > 1:.

A generator cannot be read for 2 or more passes, so the check makes sense. The 6 comes from the default epochs=5 plus 1 extra pass to build the vocabulary before training: self._check_corpus_sanity(corpus_iterable=corpus_iterable, corpus_file=corpus_file, passes=(epochs + 1)), code location: word2vec.py:417 (Gensim 4.0.1). Setting epochs to 0 is not a workaround, because ._check_training_sanity() will then raise an error.
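The underlying problem, and one possible workaround, can be shown without Gensim. In the sketch below, create_generator is a stand-in for the corpus generator in the snippets above, and Reiterable is a hypothetical wrapper class that calls the generator function again on every pass.

```python
def create_generator():
    """Stand-in corpus generator: yields one token list per sentence."""
    yield ["first", "sentence"]
    yield ["second", "sentence"]


class Reiterable:
    """Wrap a generator function so each iteration starts a fresh generator."""

    def __init__(self, generator_func):
        self.generator_func = generator_func

    def __iter__(self):
        return self.generator_func()


# A bare generator is exhausted after the first pass
gen = create_generator()
gen_passes = [len(list(gen)) for _ in range(2)]

# The wrapper yields the full corpus on every pass
corpus = Reiterable(create_generator)
corpus_passes = [len(list(corpus)) for _ in range(2)]

print(gen_passes, corpus_passes)
```

Passing Reiterable(create_generator) rather than create_generator() as sentences should get past the check quoted above, since the wrapper is not a GeneratorType instance and can serve all epochs + 1 passes.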

posted @ 2022-07-06 23:26 by 二球悬铃木
