Common APIs
- gensim.models.Word2Vec(sentences, min_count, workers)
- gensim.models.word2vec.Word2Vec(sentences, min_count, workers)
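Both paths name the same class, since gensim.models re-exports it from the word2vec module; a quick check (assuming gensim is installed):

import gensim

# The top-level name is a re-export of the class in the word2vec module.
assert gensim.models.Word2Vec is gensim.models.word2vec.Word2Vec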
Word2Vec parameters
- sentences: the corpus; must be an iterable of token lists
- min_count: minimum number of occurrences for a word to be kept; words appearing fewer times are ignored
- max_vocab_size: caps the vocabulary during building to limit RAM usage; infrequent words are pruned if the limit is exceeded
- size: dimensionality of the word vectors (renamed vector_size in gensim 4.x)
- alpha: initial learning rate; it decays linearly toward min_alpha as training progresses
- min_alpha: lower bound on the learning rate
- window: maximum distance between the current and predicted word (the sliding-window size)
- sg: training algorithm (0: CBOW; 1: skip-gram)
- hs: selects between Word2Vec's two training objectives: 1 means hierarchical softmax; 0 with a negative-sample count (negative) greater than 0 means negative sampling. The default is 0, i.e. negative sampling.
- iter: number of training epochs over the corpus (renamed epochs in gensim 4.x)
- workers: number of worker threads used for training
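A minimal sketch tying the parameters together (gensim 3.x parameter names; the toy corpus is a stand-in):

from gensim.models import Word2Vec

# Toy corpus: an iterable of token lists.
sentences = [['the', 'quick', 'brown', 'fox'],
             ['jumps', 'over', 'the', 'lazy', 'dog']]

model = Word2Vec(
    sentences,
    size=32,           # vector_size in gensim 4.x
    window=5,
    min_count=1,
    sg=1,              # skip-gram
    hs=0, negative=5,  # hs=0 with negative > 0 -> negative sampling
    alpha=0.025, min_alpha=0.0001,
    iter=5,            # epochs in gensim 4.x
    workers=4,
)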
Loading a corpus
Build it yourself
In the form: sentences = [['ab', 'ba'], ['sheu', 'dhudhi', 'hdush'], ..., []]
Loading a single-file corpus
Use the LineSentence() function; the file must already be word-segmented.
Loading all corpus files in a folder
Use the PathLineSentence() function; the files must already be word-segmented, as sketched below.
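Both loaders in a short sketch (the file and directory paths are hypothetical):

from gensim.models import word2vec

# One pre-segmented file: one sentence per line, tokens separated by spaces.
single = word2vec.LineSentence('./corpus_segment.txt')

# Every pre-segmented file under a directory, processed in filename order.
folder = word2vec.PathLineSentence('./corpus_dir/')

model = word2vec.Word2Vec(single, min_count=1)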
Custom iterable
class MySentence:
    """Iterable over a pre-segmented text file: yields one token list per line."""
    def __init__(self, data_path, max_line=None):
        self.data_path = data_path
        self.max_line = max_line

    def __iter__(self):
        # Word2Vec iterates over the corpus several times (once to build the
        # vocabulary, then once per epoch), so the line counter must restart
        # on every call rather than persist on the instance.
        cur_line = 0
        with open(self.data_path, 'r', encoding='utf-8') as f:
            for line in f:
                if self.max_line is not None and cur_line >= self.max_line:
                    return
                cur_line += 1
                yield line.strip('\n').split()
The code above defines a MySentence class whose instances are iterable over token lists, so an instance can be passed directly to Word2Vec() as the corpus (LineSentence() itself expects a file path or file-like object, not an arbitrary iterable).
Training a model
from gensim.models import word2vec
ms = MySentence(data_path)
model = word2vec.Word2Vec(ms, hs=1, min_count=1, window=3, size=64)
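After training, a quick sanity check on what was learned (gensim 3.x attribute names; in 4.x, index2word became index_to_key):

print(len(model.wv.vocab))       # words kept after min_count filtering
print(model.wv.index2word[:10])  # vocabulary, most frequent words first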
To continue training a model that has already been trained:
First load the model:
model = word2vec.Word2Vec.load(model_path)
Then continue training. Since gensim 1.0, train() requires explicit progress arguments, and any unseen words must be added to the vocabulary first:
model.build_vocab(other_sentences, update=True)  # only needed if other_sentences has new words
model.train(other_sentences, total_examples=model.corpus_count, epochs=model.epochs)
Saving a model
- model.save(model_name): saves the full model, which can be reloaded and trained further
- model.wv.save_word2vec_format(model_name): saves only the word vectors in the C word2vec format; they cannot be trained further. Both are sketched below.
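A short sketch of both save paths (the filenames are hypothetical):

# Full model: a checkpoint that can be reloaded and trained further.
model.save('./w2v.model')

# Vectors only, in the original C word2vec format: compact but frozen.
model.wv.save_word2vec_format('./w2v.txt', binary=False)  # text
model.wv.save_word2vec_format('./w2v.bin', binary=True)   # binary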
Loading a model
Method 1:
model = word2vec.Word2Vec.load(model_path)
Method 2:
model = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=False) # C text format
model = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=True) # C binary format
Getting a word vector
word_vec = model.wv[word]
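The wv object also answers similarity queries directly; a small sketch, assuming the 64-dimensional model trained above:

vec = model.wv['侯亮平']
print(vec.shape)  # (64,) for size=64

# Nearest neighbours by cosine similarity.
for w, score in model.wv.most_similar('侯亮平', topn=5):
    print(w, score)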
Example
import jieba
from gensim.models import word2vec

def cut_words():
    # Tell jieba to keep each character name from the novel as a single token.
    names = ['沙瑞金', '田国富', '高育良', '侯亮平', '钟小艾', '陈岩石',
             '欧阳菁', '易学习', '王大路', '蔡成功', '孙连城', '季昌明',
             '丁义珍', '郑西坡', '赵东来', '高小琴', '赵瑞龙', '林华华',
             '陆亦可', '刘新建', '刘庆祝']
    for name in names:
        jieba.suggest_freq(name, True)
    # Segment the raw novel and write a space-separated copy for Word2Vec.
    with open('./in_the_name_of_people.txt', 'r', encoding='utf-8') as f:
        document = f.read()
    result = ' '.join(jieba.cut(document))
    with open('./in_the_name_of_people_segment.txt', 'w', encoding='utf-8') as f2:
        f2.write(result)
    print('ok')
class MySentence:
    """Same corpus iterator as defined above: yields one token list per line."""
    def __init__(self, data_path, max_line=None):
        self.data_path = data_path
        self.max_line = max_line

    def __iter__(self):
        # Reset per call: Word2Vec iterates over the corpus multiple times.
        cur_line = 0
        with open(self.data_path, 'r', encoding='utf-8') as f:
            for line in f:
                if self.max_line is not None and cur_line >= self.max_line:
                    return
                cur_line += 1
                yield line.strip('\n').split()
def word_embedding():
    # Train a 64-dimensional hierarchical-softmax model on the segmented text.
    ms = MySentence('./in_the_name_of_people_segment.txt')
    model = word2vec.Word2Vec(ms, hs=1, min_count=1, window=3, size=64)
    model.save('./name_of_people_wv.model')
    print('ok')

def load_model():
    # Reload the saved model and print the vector of each sample token.
    model = word2vec.Word2Vec.load('./name_of_people_wv.model')
    words = ['侯亮平', '蓦地', '睁开眼睛', '。', '大厅', '突起', '一阵', '骚动', '许多', '人', '拥向', '不同', '的', '登机口']
    for word in words:
        print(model.wv[word])

if __name__ == "__main__":
    cut_words()       # segment the raw novel once
    word_embedding()  # train and save the model
    load_model()      # reload it and inspect some vectors