gensim中常用的Word2Vec，Phrases，Phraser，KeyedVectors

1. Phrases 和Phraser

gensim.models.phrases.Phrases 和gensim.models.phrases.Phraser的用处是从句子中自动检测常用的短语表达，N-gram多元词组。Phrases模型可以构建和实现bigram，trigram，quadgram等，提取文档中经常出现的2个词，3个词，4个词。

具体可以查看官网gensim.models.phrases.Phrases API, tutorial

两者不同：

Phrases基于共现频率提取bigram词组。基于共现的统计：受min_count与threshold的影响，参数设置越大，单词组合成二元词组的难度越大。

class gensim.models.phrases.Phrases(sentences=None, min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter=b'_', progress_per=10000, scoring='default', common_terms=frozenset({}))
Phraser的目的是通过丢弃一些Phasers模型（phrases_model）状态，减少短语的内存消耗。如果后面不需要用新文档更新bigram统计信息，就可以使用Phraser代替Phrases。

一次性初始化后，Phraser会比使用Phrases模型占用内存小且速度更快。

class gensim.models.phrases.Phraser(phrases_model)
- Parameters
  
  phrases_model (Phrases) – Trained phrases instance.

Example：

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, Phraser
>>>
>>> # Load training data.
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> # The training corpus must be a sequence (stream, generator) of sentences,
>>> # with each sentence a list of tokens:
>>> print(list(sentences)[0][:10])
['computer', 'human', 'interface', 'computer', 'response', 'survey', 'system', 'time', 'user', 'interface']
>>>
>>> # Train a toy bigram model.
>>> phrases = Phrases(sentences, min_count=1, threshold=1)
>>> # Apply the trained phrases model to a new, unseen sentence.
>>> phrases[['trees', 'graph', 'minors']]
['trees_graph', 'minors']
>>> # The toy model considered "trees graph" a single phrase => joined the two
>>> # tokens into a single token, `trees_graph`.
>>>
>>> # Update the model with two new sentences on the fly.
>>> phrases.add_vocab([["hello", "world"], ["meow"]])
>>>
>>> # Export the trained model = use less RAM, faster processing. Model updates no longer possible.
>>> bigram = Phraser(phrases)
>>> bigram[['trees', 'graph', 'minors']]  # apply the exported model to a sentence
['trees_graph', 'minors']
>>>
>>> # Apply the exported model to each sentence of a corpus:
>>> for sent in bigram[sentences]:
...     pass
>>>
>>> # Save / load an exported collocation model.
>>> bigram.save("/tmp/my_bigram_model.pkl")
>>> bigram_reloaded = Phraser.load("/tmp/my_bigram_model.pkl")
>>> bigram_reloaded[['trees', 'graph', 'minors']]  # apply the exported model to a sentence
['trees_graph', 'minors']

2. Word2Vec

涵盖常用的word2vec算法skip-gram和CBOW，使用hierarchical softmax和负采样。

注意：官网里Word2Vec函数的vector_size参数应该已经换成了size，在使用时用size.

class gensim.models.word2vec.Word2Vec(sentences=None, corpus_file=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=, epochs=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(), comment=None, max_final_vocab=None)

常用的几个方式：

普通嵌入：Word2Vec要求输入是可迭代的有序句子列表，即是已经分词的句子列表，即某个二维数组。不过在数据集较大的情况会占用大量RAM，可以使用自定义的生成器，只在内存中保存单条语句。

from gensim.models import Word2Vec
# load data
raw_sentences = ["the quick brown fox jumps over the lazy dogs","yoyoyo you go home now to sleep"]
# words cut 
sentences= [s.split() for s in raw_sentences]
# initialize and train a word2vec model
model = Word2Vec(sentences, size=300, window=5, min_count=1, workers=4)
# save model
model.save("word2vec.model")

调用Word2Vec创建模型实际上会对数据执行两次迭代操作，第一轮操作会统计词频来构建内部的词典数结构，第二轮操作会进行神经网络训练，而这两个步骤是可以分步进行的，这样对于某些不可重复的流（譬如 Kafka 等流式数据中）可以手动控制:

from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
# initialize model
model = Word2Vec(size=300, window=5, min_count=1, workers=4)
# prepare the model vocabulary
model.build_vocab(sentences)  
# train word vectors
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

多词嵌入：利用gensim.models.phrases模块，获取二元词组，word2vec模型可以学习短语表达，比如new_york，machine_learning。

from gensim.test.utils import common_texts
from gensim.models import Phrases, Phraser
# Train a bigram detector.
phrases = Phrases(common_texts)
bigram = Phraser(phrases)
sentences = bigram[common_texts]
# Apply the trained MWE detector to a corpus, using the result to train a Word2vec model.
model = Word2Vec(sentences, min_count=1, size=300, window=5, iter=7)

预训练模型：Gensim数据存储库中附带了几个已经过预训练的模型：Gensim-data repository。对于中文来说，gensim-data没有中文数据，更需要自己找语料库训练模型.

import gensim.downloader
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))
#>>>['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']
# Download the "glove-twitter-25" embeddings
glove_vectors = gensim.downloader.load('glove-twitter-25')
# Use the downloaded vectors as usual:
glove_vectors.most_similar('twitter')

3. KeyedVector

gensim.models.keyedvectors模块实现词向量的保存和各种相似性查询。由于训练后的词向量与训练方式无关，因此可以用一个独立结构来表示。这个结构叫做 “KeyedVectors”，本质上是键和向量之间的映射。每个向量都由它的查找键来标识。

参考API：KeyedVector

class gensim.models.keyedvectors.KeyedVectors(vector_size, count=0, dtype=<class 'numpy.float32'>, mapfile_path=None)

classmethod load(fname, mmap=None)

classmethod load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<class 'numpy.float32'>, no_header=False)

save(args, kwargs)：Save KeyedVectors to a file.

save_word2vec_format(fname, fvocab=None, binary=False, total_vec=None, write_header=True, prefix='', append=False, sort_attr='count')：Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.

KeyedVector和完整模型的主要区别：KeyedVectors不支持进一步的训练。另一方面，通过舍弃训练所需的内部数据结构，KeyedVectors提供了更小的RAM占用空间和更简单的接口。

词向量保存的各种数据格式

一个是词向量保存的格式有好几种，大小会不同嘛，哪个速度快，选哪个呢？另外保存时使用二进制和非二进制，在读取词向量时可能带来编码问题。

训练一个完整的模型，然后访问其model.wv属性，该属性持有独立的键控向量。

词向量获取：保留在磁盘上，词向量从磁盘上现有的文件中实例化，以原始Google的word2vec C格式作为KeyedVectors实例。

model.save()不能利用文本编辑器查看，但是保存了训练的全部信息，可以在读取后追加训练;
model.wv.save_word2vec_format()保存为word2vec文本格式，但是保存时丢失了词汇树等部分信息，不能追加训练。

from gensim.test.utils import common_texts
from gensim.models import Word2Vec, KeyedVectors

# initialize and train a word2vec model
model = Word2Vec(common_texts, size=300, window=5, min_count=1, workers=4)
# save model
model.save("word2vec.model")
# 1.保存为KeyedVector文件
model.wv.save('vectors_300.kv')
# 2.将词向量保存为txt格式/非二进制
model.wv.save_word2vec_format('token_vec_300.txt', binary=False)
# 3.将词向量保存为bin格式/二进制
model.wv.save_word2vec_format('token_vec_300.bin', binary=True)
# 4.将词向量保存为bin.gz格式/压缩二进制
model.wv.save_word2vec_format('token_vec_300.bin.gz', binary=True)

# 1. 加载KeyedVectors实例
keyed_vectors = KeyedVectors.load('vectors_300.kv')
# 2. 加载词向量txt格式/非二进制
wv_from_text = KeyedVectors.load_word2vec_format('token_vec_300.txt', binary=False) 
# 3. 加载词向量bin格式/二进制
wv_from_bin = KeyedVectors.load_word2vec_format('token_vec_300.bin', binary=True)
# 4. 加载词向量bin.gz格式/压缩二进制
wv_from_bin_gz = KeyedVectors.load_word2vec_format('token_vec_300.bin.gz', binary=True)

模型保存/词向量保存格式	大小
model.save("word2vec.model")	48KB
model.wv.save('vectors_300.kv')	23KB
model.wv.save_word2vec_format('token_vec_300.txt', binary=False)	49KB
model.wv.save_word2vec_format('token_vec_300.bin', binary=True)	15KB
model.wv.save_word2vec_format('token_vec_300.bin.gz', binary=True)	13KB

总结：词向量还是以bin或bin.gz格式保存为佳，占用空间少。

词向量用处

用训练好的向量执行各种句法/语义NLP词任务。

import gensim.downloader as api
# load pre-trained word-vectors from gensim-data
word_vectors = api.load("glove-wiki-gigaword-100")  
# Check the "most similar words", using the default "cosine similarity" measure.
result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
most_similar_key, similarity = result[0]  # look at the first match
print(f"{most_similar_key}: {similarity:.4f}")
#>>> queen: 0.7699

posted @ 2021-04-18 23:51 MissHsu 阅读(1771) 评论(0) 编辑收藏举报

刷新页面返回顶部

SOULXU

gensim中常用的Word2Vec，Phrases，Phraser，KeyedVectors

gensim中常用的Word2Vec，Phrases，Phraser，KeyedVectors

1. Phrases 和Phraser

2. Word2Vec

3. KeyedVector

词向量保存的各种数据格式

词向量用处

公告