
How to Use Embedding and Word2Vec

Embedding

The input to an Embedding layer is a 2D integer tensor of shape (samples, sequence_length), i.e. (number of samples, sequence length).

Shorter sequences should be padded with 0, and longer sequences should be truncated, so that all input sequences have the same length.

The output of the Embedding layer is a 3D float tensor of shape (samples, sequence_length, embedding_dimensionality).
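
A minimal PyTorch sketch of these shapes (the vocabulary size, batch, and dimensions below are made up purely for illustration):

import torch
from torch import nn

# hypothetical sizes: a vocabulary of 100 words, embedding dimension 8
embedding = nn.Embedding(num_embeddings=100, embedding_dim=8)

# 2 samples, each padded/truncated to sequence length 4
batch = torch.LongTensor([[1, 5, 3, 0],
                          [2, 7, 0, 0]])  # shape: (samples, sequence_length)

out = embedding(batch)
print(out.shape)  # torch.Size([2, 4, 8]) = (samples, sequence_length, embedding_dimensionality)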

  • First, tokenize the text, then turn the tokenized results into integer index sequences
  • Then unify the length of the input sequences, and finally feed the fixed-length sequences into the Embedding layer (see the sketch after this list)
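
A toy sketch of this pipeline (the sentences and vocabulary are made up; the real implementation is the Preprocess class further below):

# toy vocabulary: index 0 is reserved for padding
word2idx = {"<PAD>": 0, "我": 1, "爱": 2, "你": 3, "们": 4}
sen_len = 4  # the uniform sequence length

sentences = [["我", "爱", "你"],              # shorter: will be padded
             ["我", "爱", "你", "们", "我"]]  # longer: will be truncated

sequences = []
for sen in sentences:
    idx = [word2idx[w] for w in sen]                    # tokens -> index sequence
    idx = idx[:sen_len]                                 # truncate long sequences
    idx += [word2idx["<PAD>"]] * (sen_len - len(idx))   # pad short ones with 0
    sequences.append(idx)

print(sequences)  # [[1, 2, 3, 0], [1, 2, 3, 4]] -> ready for the Embedding layer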

The whole process can be described by the figure below:

(figure: the preprocessing pipeline from raw text to the Embedding layer)

From the perspective of a single sample, the process can be described by the figure below:

(figure: the same pipeline applied to one sample)

The gensim library provides an implementation of word2vec; with just a few API calls we can conveniently build a word2vec model.

Implementing Word2Vec with gensim

Illustrative code is shown below:

<code-pre class="code-pre code-pre-line" id="pre-bSn5x8"><code-line class="line-numbers-rows"></code-line>from torch import nn
<code-line class="line-numbers-rows"></code-line>from gensim.models import Word2Vec
<code-line class="line-numbers-rows"></code-line># w2v模型位置
<code-line class="line-numbers-rows"></code-line>w2v_path = "model/w2v_all.model"
<code-line class="line-numbers-rows"></code-line>class Preprocess():
<code-line class="line-numbers-rows"></code-line>  def __init__(self,sentences,sen_len,w2v_path=w2v_path):
<code-line class="line-numbers-rows"></code-line>    self.w2v_path = w2v_path
<code-line class="line-numbers-rows"></code-line>    self.sentences = sentences
<code-line class="line-numbers-rows"></code-line>    self.sen_len = sen_len
<code-line class="line-numbers-rows"></code-line>    self.idx2word = []
<code-line class="line-numbers-rows"></code-line>    self.word2idx = {}
<code-line class="line-numbers-rows"></code-line>    self.embedding_matrix = []
<code-line class="line-numbers-rows"></code-line> 
<code-line class="line-numbers-rows"></code-line>  def get_w2v_model(self):
<code-line class="line-numbers-rows"></code-line>    # 把之前保存的w2v模型加载进来
<code-line class="line-numbers-rows"></code-line>    self.embedding = Word2Vec.load(self.w2v_path)
<code-line class="line-numbers-rows"></code-line>    self.embedding_dim = self.embedding.vector_size
<code-line class="line-numbers-rows"></code-line>
<code-line class="line-numbers-rows"></code-line>  def add_embedding(self,word):
<code-line class="line-numbers-rows"></code-line>    # 把word加进embedding  并赋值一个随机向量
<code-line class="line-numbers-rows"></code-line>    vector = torch.empty(1,self.embedding_dim) #return a tensor filled with uninitialed dada. shape is  (1*embedding_dim)
<code-line class="line-numbers-rows"></code-line>    #从均匀分布U(a,b)中生成值,填充参数vector,默认a=0,b=1
<code-line class="line-numbers-rows"></code-line>    torch.nn.init.uniform_(vector)
<code-line class="line-numbers-rows"></code-line>    self.word2idx[word] = len(self.word2idx)  #为word2idx字典填充后,word2idx长度会加1
<code-line class="line-numbers-rows"></code-line>    self.idx2word.append(word)
<code-line class="line-numbers-rows"></code-line>    print("word:",word)
<code-line class="line-numbers-rows"></code-line>    # torch.cat是将两个张量(tensor)拼接在一起    按维数0拼接(竖着拼)
<code-line class="line-numbers-rows"></code-line>    self.embedding_matrix = torch.cat([self.embedding_matrix,vector],0)
<code-line class="line-numbers-rows"></code-line>    print("embedding_matrix.shape",self.embedding_matrix.shape)
<code-line class="line-numbers-rows"></code-line> 
<code-line class="line-numbers-rows"></code-line>  def make_embedding(self,load=True):
<code-line class="line-numbers-rows"></code-line>    print("get embedding..")
<code-line class="line-numbers-rows"></code-line>    #加载embedding模型
<code-line class="line-numbers-rows"></code-line>    if load:
<code-line class="line-numbers-rows"></code-line>      print("加载word to vec模型")
<code-line class="line-numbers-rows"></code-line>      self.get_w2v_model()
<code-line class="line-numbers-rows"></code-line>    else:
<code-line class="line-numbers-rows"></code-line>      raise NotImplementedError
<code-line class="line-numbers-rows"></code-line>
<code-line class="line-numbers-rows"></code-line>    # 制作一个word2idx的字典
<code-line class="line-numbers-rows"></code-line>    # 制作一个idx2word的list
<code-line class="line-numbers-rows"></code-line>    # 制作一个word2vector的list
<code-line class="line-numbers-rows"></code-line>    for i,word in enumerate(self.embedding.wv.key_to_index ):
<code-line class="line-numbers-rows"></code-line>      print('get words #{}'.format(i+1), end='\r')
<code-line class="line-numbers-rows"></code-line>      # 例:self.word2idx['李']=1
<code-line class="line-numbers-rows"></code-line>      # self.idx2word[1]='李'
<code-line class="line-numbers-rows"></code-line>      # self.vector[1]='李'
<code-line class="line-numbers-rows"></code-line>      self.word2idx[word]=len(self.word2idx)
<code-line class="line-numbers-rows"></code-line>      self.idx2word.append(word)
<code-line class="line-numbers-rows"></code-line>      self.embedding_matrix.append(self.embedding.wv[word])
<code-line class="line-numbers-rows"></code-line>    # 将embedding_matrix转为tensor类型
<code-line class="line-numbers-rows"></code-line>    self.embedding_matrix = torch.tensor(self.embedding_matrix)
<code-line class="line-numbers-rows"></code-line>    # 将PAD和UNK加入embedding中
<code-line class="line-numbers-rows"></code-line>    self.add_embedding("<PAD>")
<code-line class="line-numbers-rows"></code-line>    self.add_embedding("<UNK>")
<code-line class="line-numbers-rows"></code-line>    print("total words: {}".format(len(self.embedding_matrix)))
<code-line class="line-numbers-rows"></code-line>    return self.embedding_matrix
<code-line class="line-numbers-rows"></code-line>
<code-line class="line-numbers-rows"></code-line>  def pad_sequence(self,sentence):
<code-line class="line-numbers-rows"></code-line>    # 将每个句子变成统一的长度
<code-line class="line-numbers-rows"></code-line>    if len(sentence)>self.sen_len:
<code-line class="line-numbers-rows"></code-line>      sentence = sentence[:self.sen_len]  #截断
<code-line class="line-numbers-rows"></code-line>    else:
<code-line class="line-numbers-rows"></code-line>      pad_len = self.sen_len-len(sentence)
<code-line class="line-numbers-rows"></code-line>      for _ in range(pad_len):
<code-line class="line-numbers-rows"></code-line>        sentence.append(self.word2idx["<PAD>"])
<code-line class="line-numbers-rows"></code-line>    assert len(sentence)==self.sen_len
<code-line class="line-numbers-rows"></code-line>    return sentence
<code-line class="line-numbers-rows"></code-line>
<code-line class="line-numbers-rows"></code-line>  def sentence_word2idx(self):
<code-line class="line-numbers-rows"></code-line>    # 把句子里面的字转成对应的index
<code-line class="line-numbers-rows"></code-line>    sentence_list = []
<code-line class="line-numbers-rows"></code-line>    for i,sen in enumerate(self.sentences):
<code-line class="line-numbers-rows"></code-line>      print('sentence count #{}'.format(i+1), end='\r')
<code-line class="line-numbers-rows"></code-line>      sentence_idx = []
<code-line class="line-numbers-rows"></code-line>      for word in sen:
<code-line class="line-numbers-rows"></code-line>        if(word in self.word2idx.keys()):
<code-line class="line-numbers-rows"></code-line>          sentence_idx.append(self.word2idx[word])
<code-line class="line-numbers-rows"></code-line>        else:
<code-line class="line-numbers-rows"></code-line>          sentence_idx.append(self.word2idx["<UNK>"])
<code-line class="line-numbers-rows"></code-line>      # 把每个句子长度统一
<code-line class="line-numbers-rows"></code-line>      sentence_idx = self.pad_sequence(sentence_idx)
<code-line class="line-numbers-rows"></code-line>      sentence_list.append(sentence_idx)
<code-line class="line-numbers-rows"></code-line>    return torch.LongTensor(sentence_list)
<code-line class="line-numbers-rows"></code-line>
<code-line class="line-numbers-rows"></code-line>  def labels_to_tensor(self,y):
<code-line class="line-numbers-rows"></code-line>    #把标签label也转为tensor
<code-line class="line-numbers-rows"></code-line>    y = [int(label) for label in y]
<code-line class="line-numbers-rows"></code-line>    return torch.LongTensor(y)
</code-pre>
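
A hypothetical usage sketch (train_x and train_y are placeholders for a list of tokenized sentences and their labels; neither is defined in this post):

# assumed inputs:
# train_x = [["今", "天", "天", "气", "好"], ...], train_y = ["1", "0", ...]
preprocess = Preprocess(train_x, sen_len=20)
embedding_matrix = preprocess.make_embedding(load=True)
data_x = preprocess.sentence_word2idx()        # (num_sentences, sen_len) LongTensor
data_y = preprocess.labels_to_tensor(train_y)

# the matrix can then initialize a PyTorch Embedding layer
embedding_layer = nn.Embedding.from_pretrained(embedding_matrix)
out = embedding_layer(data_x)                  # (num_sentences, sen_len, embedding_dim)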

<code-pre class="code-pre code-pre-line" id="pre-i6E3YZ"><code-line class="line-numbers-rows"></code-line>def train_word2vec(x):
<code-line class="line-numbers-rows"></code-line>  # 训练word embedding
<code-line class="line-numbers-rows"></code-line>  """
<code-line class="line-numbers-rows"></code-line>  Embedding 层的输入是一个二维整数张量, 形状为(samples,sequence_length),即(样本数,序列长度)
<code-line class="line-numbers-rows"></code-line>  Embedding 层输出是(samples,sequence_length,embedding_dimensionality) 的三维浮点数张量。
<code-line class="line-numbers-rows"></code-line>  """
<code-line class="line-numbers-rows"></code-line>  model = word2vec.Word2Vec(x,vector_size=250,window=5,min_count=5,workers=12,epochs=10,sg=1) #iter is epochs
<code-line class="line-numbers-rows"></code-line>  return model
</code-pre>
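
A hedged usage sketch (the corpus x is a placeholder; it needs enough text for words to pass min_count=5):

# assumed corpus: x is a list of tokenized sentences,
# e.g. x = [["今", "天", "天", "气", "好"], ...]
model = train_word2vec(x)
model.save("model/w2v_all.model")  # the same path that Preprocess loads from

# querying the trained vectors (gensim 4 API):
# model.wv["天"]                       -> the 250-dim vector of "天"
# model.wv.most_similar("天", topn=5)  -> its nearest neighbours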

