Implementing a Transformer from Scratch in PyTorch
Preface
It was not as hard as I had imagined (standing on the shoulders of giants helps), but I still hit plenty of small snags and at one point nearly gave up.
Time spent: two full days (daytime)
Goal: train a transformer model that, given the input [1,2,3,4], predicts [5,6,7,8]
Final result: every layer and dimension of the transformer model matches expectations, and training works; prediction still has some minor issues
Main references:
https://github.com/aladdinpersson/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/transformer_from_scratch/transformer_from_scratch.py
https://github.com/aladdinpersson/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/seq2seq_transformer.py
https://zhuanlan.zhihu.com/p/415318478
http://nlp.seas.harvard.edu/2018/04/03/attention.html
https://arxiv.org/pdf/1706.03762.pdf
The Transformer
The implementation mainly follows this figure from the paper:
Start with the key parts:
1. The attention mechanism
Assume batch_size=2, seq_len=100, d_model=256, heads=8.
Q, K, and V all have the same shape here. Because of the multi-head split, d_model is divided into heads equal parts, so each tensor is [2, 8, 100, 32].
```python
def attention(query, key, value, mask=None, dropout=None):
    # Take the last dim of query, i.e. the embedding dimension
    d_k = query.size(-1)
    # Per the attention formula: multiply query by the transpose of key
    # (transposing key's last two dims), then divide by the scaling factor
    # to get the attention score tensor.
    # If query is [len, embed], then scores is [len, len]
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Compare mask (also [len, len]) against scores element-wise:
        # where mask[i][j] is 0, set scores[i][j] to -1e9, a very negative
        # number that effectively vanishes after softmax
        scores = scores.masked_fill(mask == 0, -1e9)
    # softmax over the last dim
    scores = F.softmax(scores, dim=-1)
    if dropout is not None:
        scores = dropout(scores)
    # Finally, multiply the weights by value to get the attended
    # representation of query; also return the weights themselves
    return torch.matmul(scores, value), scores
```
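As a standalone sanity check of the formula (not part of the model code), the shapes can be walked through with random tensors, using the example dimensions assumed above (batch_size=2, heads=8, seq_len=100, d_k=32):

```python
import math
import torch
import torch.nn.functional as F

# Random Q, K, V with the example shapes from above
q = torch.randn(2, 8, 100, 32)
k = torch.randn(2, 8, 100, 32)
v = torch.randn(2, 8, 100, 32)

# [2, 8, 100, 32] @ [2, 8, 32, 100] -> [2, 8, 100, 100]
scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
weights = F.softmax(scores, dim=-1)  # each row sums to 1

# [2, 8, 100, 100] @ [2, 8, 100, 32] -> [2, 8, 100, 32]
out = torch.matmul(weights, v)
```

The output has the same shape as the input values, which is what lets the heads be merged back afterwards.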
2. Multi-Head Attention
d_model is merely split into 8 parts; there is no need for an explicit loop over the heads. Reshape the tensors to [batch_size, heads, len, d_k] and the attention function above computes all heads at once.
```python
class MultihHeadAttention(nn.Module):
    def __init__(self, d_model, h, dropout=0.1):
        super(MultihHeadAttention, self).__init__()
        # h must divide d_model, since each head gets an equal share of features
        assert d_model % h == 0
        # Per-head feature dimension d_k
        self.d_k = d_model // h
        self.h = h
        self.w_key = nn.Linear(d_model, d_model)
        self.w_query = nn.Linear(d_model, d_model)
        self.w_value = nn.Linear(d_model, d_model)
        self.fc_out = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.atten = None  # attention weights, saved here for visualization

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)  # the head split adds one dim to query etc.
        batch_size = query.size(0)
        query = self.w_query(query).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        key = self.w_key(key).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        value = self.w_value(value).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        x, self.atten = attention(query, key, value, mask, self.dropout)
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)
        return self.fc_out(x)
```
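The split-into-heads reshape can be checked in isolation; it is just view + transpose and back, with no loop over heads (a standalone sketch with the example sizes assumed above):

```python
import torch

batch_size, seq_len, d_model, h = 2, 100, 256, 8
d_k = d_model // h

x = torch.randn(batch_size, seq_len, d_model)

# [2, 100, 256] -> [2, 100, 8, 32] -> [2, 8, 100, 32]
split = x.view(batch_size, -1, h, d_k).transpose(1, 2)

# And back: [2, 8, 100, 32] -> [2, 100, 8, 32] -> [2, 100, 256]
merged = split.transpose(1, 2).contiguous().view(batch_size, -1, h * d_k)
```

The round trip is lossless: merging recovers exactly the original tensor.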
There are also two relatively simple layers.
3. LayerNorm
ref https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html
PyTorch's built-in LayerNorm could be used directly; here I implement it myself.
It is just standardization from probability theory, (x - mean) / std, plus learnable scale and shift factors.
The scale and shift factors can have the same shape as x, or just x's last dimension; both run, because broadcasting aligns the trailing dimensions either way.
```python
class LayerNorm(nn.Module):
    def __init__(self, embedding_dim, eps=1e-6):
        # embedding_dim can be a full size like [batch_size, len, embedding_dim],
        # or just embedding_dim itself
        super(LayerNorm, self).__init__()
        # Wrapped in Parameter so they become trainable model parameters
        self.a = nn.Parameter(torch.ones(embedding_dim))
        self.b = nn.Parameter(torch.zeros(embedding_dim))
        self.eps = eps

    def forward(self, x):
        # Standardize over the last dim
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a * (x - mean) / (std + self.eps) + self.b
```
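Why both parameter shapes "just work": broadcasting. A minimal standalone sketch (toy shapes chosen arbitrarily) showing that a last-dim-shaped factor and a full-shaped factor produce the same result when both are all-ones:

```python
import torch

x = torch.randn(2, 5, 4)  # [batch, len, embed]
mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)
normed = (x - mean) / (std + 1e-6)

# A factor shaped like the last dim broadcasts across batch and len...
a_small = torch.ones(4)
# ...while a full-shaped factor multiplies element-wise; both are valid
a_full = torch.ones(2, 5, 4)

y1 = a_small * normed
y2 = a_full * normed
```

The difference only matters for how many parameters are learned; normalization itself is the same.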
4. FeedForwardLayer
```python
class FeedForwardLayer(nn.Module):
    def __init__(self, d_model, forward_expansion):
        super(FeedForwardLayer, self).__init__()
        self.w1 = nn.Linear(d_model, d_model * forward_expansion)
        self.w2 = nn.Linear(d_model * forward_expansion, d_model)

    def forward(self, x):
        return self.w2(F.relu(self.w1(x)))
```
5. Embedding layers
Then there are two embedding layers.
```python
class PositionEmbedding(nn.Module):
    def __init__(self, d_model, max_len=1000):
        # max_len is the maximum sentence length
        super(PositionEmbedding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # pe: [1, max_len, d_model]
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Buffers carry no gradient, so the deprecated Variable wrapper
        # is unnecessary here
        x = x + self.pe[:, :x.size(1)]
        return x
```
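A small standalone sanity check of the sinusoidal table (toy sizes assumed): even columns come from sin and odd columns from cos, so row 0 (position 0) alternates exactly between 0 and 1.

```python
import math
import torch

max_len, d_model = 10, 8
pe = torch.zeros(max_len, d_model)
position = torch.arange(max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)  # even columns: sin
pe[:, 1::2] = torch.cos(position * div_term)  # odd columns: cos
```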
6. Encoder
First define a TransformerBlock module; the Encoder simply repeats it num_encoder_layers times.
Note the residual connections.
```python
class TransformerBlock(nn.Module):
    def __init__(self, embed_size, head, forward_expansion, dropout):
        super(TransformerBlock, self).__init__()
        self.attn = MultihHeadAttention(embed_size, head)
        self.norm1 = LayerNorm(embed_size)
        self.norm2 = LayerNorm(embed_size)
        self.feed_forward = FeedForwardLayer(embed_size, forward_expansion)
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, mask):
        attention = self.attn(query, key, value, mask)
        # Residual connection around the attention sublayer
        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        # Residual connection around the feed-forward sublayer
        out = self.dropout(self.norm2(forward + x))
        return out
```
The Encoder really is just repetition. Note that I moved the input processing (the embeddings) outside this module.
```python
class Encoder(nn.Module):
    def __init__(
        self,
        embed_size,
        num_layers,
        heads,
        forward_expansion,
        dropout=0.1,
    ):
        super(Encoder, self).__init__()
        self.layers = nn.ModuleList(
            [
                TransformerBlock(embed_size, heads, forward_expansion, dropout)
                for _ in range(num_layers)
            ]
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, x, x, mask)
        return x
```
7. Decoder
The basic module is DecoderBlock; the Decoder again just repeats it several times.
One thing to note: here query = x, the output of the previous decoder layer, while value and key both come from encoder_out, the output of the encoder's last layer, as shown in the figure:
```python
class DecoderBlock(nn.Module):
    def __init__(self, embed_size, heads, forward_expansion, dropout=0.1):
        super(DecoderBlock, self).__init__()
        self.norm = LayerNorm(embed_size)
        self.attn = MultihHeadAttention(embed_size, heads, dropout)
        self.transformer = TransformerBlock(embed_size, heads, forward_expansion, dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, value, key, src_mask, trg_mask):
        # Masked self-attention with a residual connection
        attn = self.attn(x, x, x, trg_mask)
        query = self.dropout(self.norm(attn + x))
        # Cross-attention plus feed-forward: query from the decoder,
        # key/value from the encoder output
        out = self.transformer(query, key, value, src_mask)
        return out
```
```python
class Decoder(nn.Module):
    def __init__(
        self,
        embed_size,
        num_layers,
        heads,
        forward_expansion,
        dropout=0.1,
    ):
        super(Decoder, self).__init__()
        self.layers = nn.ModuleList(
            [
                DecoderBlock(embed_size, heads, forward_expansion, dropout)
                for _ in range(num_layers)
            ]
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_out, src_mask, trg_mask):
        for layer in self.layers:
            x = layer(x, encoder_out, encoder_out, src_mask, trg_mask)
        return x
```
8. The Transformer module
Stitch the Encoder and Decoder together, and handle both of their inputs in one place.
Note there are two masks: one keeps pad=0 tokens out of the computation, the other stops the attention-weighted sum from looking at later positions.
Shape notes:
If src and trg are [batch_size, len],
the final output is [batch_size, len, trg_vocab_size].
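The two masks can be sketched on a toy batch (pad index 0 assumed): the padding mask marks the real, non-pad positions, and the causal mask is a lower-triangular matrix.

```python
import torch

pad_idx = 0
src = torch.tensor([[5, 6, 7, 0], [3, 4, 0, 0]])  # [N, src_len], 0 = pad
trg = torch.tensor([[1, 2, 3], [4, 5, 6]])        # [N, trg_len]

# Padding mask: True where the token is real, with an extra broadcast dim
src_mask = (src != pad_idx).unsqueeze(1)          # [N, 1, src_len]

# Causal mask: position i may only attend to positions <= i
N, trg_len = trg.shape
trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(N, trg_len, trg_len)
```

Inside the attention function, positions where the mask is 0 get their scores set to -1e9 and so receive (almost) zero weight.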
```python
class Transformer(nn.Module):
    def __init__(
        self,
        src_vocab_size,
        trg_vocab_size,
        src_pad_idx,
        trg_pad_idx,
        embed_size=512,
        num_encoder_layers=6,
        num_decoder_layers=6,
        forward_expansion=4,
        heads=8,
        dropout=0,
        max_length=100,
        device="cpu",
    ):
        super(Transformer, self).__init__()
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device
        self.encoder = Encoder(
            embed_size,
            num_encoder_layers,
            heads,
            forward_expansion,
            dropout,
        )
        self.decoder = Decoder(
            embed_size,
            num_decoder_layers,
            heads,
            forward_expansion,
            dropout,
        )
        self.src_word_embedding = nn.Embedding(src_vocab_size, embed_size)
        self.src_position_embedding = nn.Embedding(max_length, embed_size)
        self.trg_word_embedding = nn.Embedding(trg_vocab_size, embed_size)
        self.trg_position_embedding = nn.Embedding(max_length, embed_size)
        self.fc_out = nn.Linear(embed_size, trg_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def make_src_mask(self, src):
        src_mask = (src != self.src_pad_idx).unsqueeze(1)  # (N, 1, src_len)
        return src_mask.to(self.device)

    def make_trg_mask(self, trg):
        N, trg_len = trg.shape
        trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(
            N, trg_len, trg_len
        )
        # The return was missing here, which silently disabled the causal mask
        return trg_mask.to(self.device)

    def forward(self, src, trg):
        N, src_seq_length = src.shape
        N, trg_seq_length = trg.shape
        src_positions = (
            torch.arange(0, src_seq_length)
            .unsqueeze(0)
            .expand(N, src_seq_length)
            .to(self.device)
        )
        trg_positions = (
            torch.arange(0, trg_seq_length)
            .unsqueeze(0)
            .expand(N, trg_seq_length)
            .to(self.device)
        )
        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)
        # encoder side
        x = self.dropout(
            self.src_word_embedding(src) + self.src_position_embedding(src_positions)
        )
        encoder_out = self.encoder(x, src_mask)
        # decoder side
        x = self.dropout(
            self.trg_word_embedding(trg) + self.trg_position_embedding(trg_positions)
        )
        decoder_out = self.decoder(x, encoder_out, src_mask, trg_mask)
        out = self.fc_out(decoder_out)
        return out
```
Training
Compared with the model, the training part was much harder to write: the model structure is fixed and there are plenty of references online, while the training code is tightly coupled to your own data.
1. Generating the dataset
ref:
https://towardsdatascience.com/how-to-use-datasets-and-dataloader-in-pytorch-for-custom-text-data-270eed7f7c00
https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
https://sparrow.dev/pytorch-dataloader/
I also wrote up a separate summary: https://www.cnblogs.com/lfri/p/15479166.html
```python
import csv
import random
import config

header = ['sentence_a', 'sentence_b']
data = [[1, 2, 3, 4], [5, 6, 7, 8]]
max_length = config.max_length
entry_num = config.entry_num

with open(config.file_root, 'w', encoding='UTF8') as f:
    writer = csv.writer(f)
    # write the header
    writer.writerow(header)
    # write the data: sentence_b is the continuation of sentence_a
    for _ in range(entry_num):
        # use // so randint gets integer bounds
        s = random.randint(1, max_length // 2)
        length = random.randint(1, max_length // 4)  # don't shadow the builtin len
        data[0] = [i for i in range(s, s + length)]
        data[1] = [i for i in range(s + length, s + 2 * length)]
        writer.writerow(data)
```
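The pairing logic above can be checked in isolation, without the csv/config machinery: each target sequence is simply the equal-length continuation of its source (a standalone sketch with an assumed max_length).

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
max_length = 20

s = random.randint(1, max_length // 2)
length = random.randint(1, max_length // 4)
src = list(range(s, s + length))
trg = list(range(s + length, s + 2 * length))
```

This is exactly the relationship the model is supposed to learn: trg[i] = src[i] + length.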
2. Training
Create the Dataset and the train_iterator used below:
```python
dataset = SeqDataset(config.file_root, max_length=config.max_length)
train_iterator = DataLoader(dataset, batch_size=config.batch_size,
                            shuffle=False, num_workers=0, collate_fn=None)

--snip--

for batch_idx, batch in enumerate(train_iterator):
    # Get inputs and targets and move them to the device
    src, trg = batch
    src = src.to(config.device)
    trg = trg.to(config.device)
```
This yields src and trg, which can then be fed to the model to get the output:
```python
output = model(src, trg)
```
Then compute the cross-entropy between output and trg, which is the loss.
With output of shape [batch_size, len, trg_vocab_size] and trg of shape [batch_size, len], they cannot be passed to the criterion directly; they first need to be reshaped to 2D and 1D respectively.
ref https://www.cnblogs.com/lfri/p/15480326.html
```python
output = output.reshape(-1, config.trg_vocab_size)
trg = trg.reshape(-1)
loss = criterion(output, trg)
```
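A minimal standalone version of this reshape, with toy sizes and random logits (not the project's actual config values):

```python
import torch
import torch.nn as nn

batch_size, seq_len, trg_vocab_size = 2, 4, 10
output = torch.randn(batch_size, seq_len, trg_vocab_size)        # fake model logits
trg = torch.randint(0, trg_vocab_size, (batch_size, seq_len))    # fake targets

criterion = nn.CrossEntropyLoss()
# [N, len, V] -> [N*len, V] and [N, len] -> [N*len]
loss = criterion(output.reshape(-1, trg_vocab_size), trg.reshape(-1))
```

CrossEntropyLoss expects (samples, classes) logits and a 1D class-index target, which is why the flattening is needed.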
Then back-propagate and take a gradient step.
```python
# Clear gradients left over from the previous step
optimizer.zero_grad()
# Back prop
loss.backward()
# Gradient descent step
optimizer.step()
```
TensorBoard is used to visualize the loss.
ref
https://towardsdatascience.com/a-complete-guide-to-using-tensorboard-with-pytorch-53cb2301e8c3
https://towardsdatascience.com/pytorch-performance-analysis-with-tensorboard-7c61f91071aa
```python
writer.add_scalar("Training loss", loss, global_step=step)
# writer.add_graph(model, [src, target])
# writer.add_histogram("weight", model.decoder.layers[2].attn.atten, step)
```
Besides the loss, you can also visualize the model itself, or even a particular weight of a particular layer.
3. Prediction
Finally, prediction.
I did not use a separate test set; instead I take one fixed sequence and check the model against it on the fly.
This uses the argmax function: take the index of the maximum along a given dimension, which effectively converts one-hot to an index.
ref https://www.cnblogs.com/lfri/p/15480326.html
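What argmax(2)[:, -1] does, on a toy logits tensor: over the vocab dimension, pick the most likely token at each position, then keep only the last position (a standalone sketch with made-up numbers).

```python
import torch

# Fake logits: [batch=1, trg_len=3, vocab=5]
output = torch.tensor([[[0.1, 0.9, 0.0, 0.0, 0.0],
                        [0.0, 0.0, 0.8, 0.2, 0.0],
                        [0.0, 0.0, 0.0, 0.1, 0.9]]])

best_all = output.argmax(2)                 # token index per position: [1, 3]
best_guess = output.argmax(2)[:, -1].item()  # most likely token at the last step
```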
```python
# Evaluation
model.eval()
translated_sentence = my_predict(
    model, config.device, config.max_length
)

--snip--

def my_predict(model, device, max_length):
    indexes = [3, 4, 5, 6, 7]
    sentence_tensor = torch.LongTensor(indexes).unsqueeze(0).to(device)
    outputs = [8]
    for i in range(max_length):
        trg_tensor = torch.LongTensor(outputs).unsqueeze(0).to(device)
        with torch.no_grad():
            output = model(sentence_tensor, trg_tensor)
        # Greedily pick the most likely token at the last position
        best_guess = output.argmax(2)[:, -1].item()
        outputs.append(best_guess)
        if best_guess == 0:
            break
    return outputs
```
Training results
Data entries: 100
num_epochs = 100
Test results
Real testing was never on the table; if it runs, that counts as success.
my_predict is the test, and it did not behave as hoped: it does not generate the equal-length continuation of the input.
Not enough training? Too little data? A bug in the model? Or is it all fake?
Other details
1. .to(device)
What needs to be moved to the GPU?
So far: the model, src, trg, and any intermediate tensors created inside forward, such as src_positions and trg_positions in this project.
2. Dropout
Dropout layers are usually added to prevent overfitting; is there any rule of thumb for when and where to add them?
3. Bugs
I found some obvious errors, and yet it still ran...
To do
- Train with more data, a larger num_epochs, and more layers
- Visualize attention