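Implementing a Transformer from Scratch in PyTorch

Preface

It was not as hard as I expected, since I was standing on the shoulders of those who came before, but I still ran into plenty of small obstacles and at one point nearly gave up.

Time spent: two full days (daytime only)

Goal: train a transformer model that, given the input [1,2,3,4], predicts [5,6,7,8]

Final result: the layers and dimensions of the transformer model match expectations and it trains, but predict still has some small problems

Main references: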

https://github.com/aladdinpersson/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/transformer_from_scratch/transformer_from_scratch.py

https://github.com/aladdinpersson/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/seq2seq_transformer.py

https://zhuanlan.zhihu.com/p/415318478

http://nlp.seas.harvard.edu/2018/04/03/attention.html

https://arxiv.org/pdf/1706.03762.pdf

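The Transformer

The model follows the architecture diagram in the paper (Figure 1 of "Attention Is All You Need").

Start with the key parts:

1. Attention mechanism

Assume batch_size=2, seq_len=100, d_model=256, heads=8

Q, K, and V all have the same shape here. Because of the multi-head split, d_model is divided into heads parts, so the shape is [2, 8, 100, 32]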

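import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def attention(query, key, value, mask=None, dropout=None):
    # d_k is the last dimension of query, i.e. the per-head embedding size
    d_k = query.size(-1)
    # Per the attention formula: multiply query by key with its last two dimensions
    # transposed, then divide by the scaling factor to get the score tensor
    # If query is [len, embed], scores is [len, len]
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    if mask is not None:
        # mask (broadcastable to [len, len]) is compared position by position with scores;
        # wherever mask[i][j] is 0, scores[i][j] is set to -1e9,
        # a very negative number that contributes essentially nothing after softmax
        scores = scores.masked_fill(mask == 0, -1e9)

    # softmax over the last dimension
    scores = F.softmax(scores, dim=-1)

    if dropout is not None:
        scores = dropout(scores)

    # Finally, multiply the attention weights by value to get the attended
    # representation, and return the weights as well
    return torch.matmul(scores, value), scores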
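
As a quick sanity check on the shapes above, here is a small sketch of my own (not from the original code) that runs scaled dot-product attention on random tensors with the assumed sizes batch_size=2, heads=8, seq_len=100, d_k=32. The function is restated without dropout so the block runs on its own, and the lower-triangular mask is only an example input.

```python
import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None):
    # Same computation as the function above, without dropout
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, value), weights

torch.manual_seed(0)
q = torch.randn(2, 8, 100, 32)
k = torch.randn(2, 8, 100, 32)
v = torch.randn(2, 8, 100, 32)
causal = torch.tril(torch.ones(100, 100))  # broadcasts over batch and heads

out, weights = attention(q, k, v, causal)
print(out.shape)      # torch.Size([2, 8, 100, 32])
print(weights.shape)  # torch.Size([2, 8, 100, 100])
```

Each row of weights sums to 1, and with the causal mask the first query position puts all of its weight on the first key position.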

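2. MultiHead Attention

d_model is simply split into 8 parts, but there is no need for a loop over the 8 heads: reshape to [batch_size, heads, len, d_k] and the attention function above computes all heads at once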

class MultihHeadAttention(nn.Module):
    def __init__(self, d_model, h, dropout=0.1):
        super(MultihHeadAttention, self).__init__()
        # Check that d_model is divisible by h, since each head gets an equal share of the features
        assert d_model % h == 0
        # d_k is the per-head feature dimension
        self.d_k = d_model // h
        self.h = h

        self.w_key = nn.Linear(d_model, d_model)
        self.w_query = nn.Linear(d_model, d_model)
        self.w_value = nn.Linear(d_model, d_model)
        self.fc_out = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)

        self.atten = None  # attention weights from the last forward pass, kept for visualization

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1) # add a head dimension so the mask broadcasts over all heads

        batch_size = query.size(0)
        query = self.w_query(query).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        key = self.w_key(key).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        value = self.w_value(value).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)

        x, self.atten = attention(query, key, value, mask, self.dropout)
        

        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)

        return self.fc_out(x)
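
The view/transpose pair in forward is the part that replaces the loop over heads. The sketch below, using the sizes assumed earlier (batch_size=2, seq_len=100, d_model=256, h=8), shows that the split just slices the last dimension into contiguous chunks of d_k and that the merge restores the original tensor exactly.

```python
import torch

batch_size, seq_len, d_model, h = 2, 100, 256, 8
d_k = d_model // h

x = torch.randn(batch_size, seq_len, d_model)
# Split: [2, 100, 256] -> [2, 100, 8, 32] -> [2, 8, 100, 32]
heads = x.view(batch_size, -1, h, d_k).transpose(1, 2)
# Merge: [2, 8, 100, 32] -> [2, 100, 8, 32] -> [2, 100, 256]
merged = heads.transpose(1, 2).contiguous().view(batch_size, -1, h * d_k)

print(heads.shape, merged.shape)
print(torch.equal(x, merged))  # the round trip loses nothing
```

Head i therefore sees features i*d_k through (i+1)*d_k - 1 of the projected input. The contiguous() call is needed because view cannot operate on the transposed memory layout.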

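Two more relatively simple layers follow.

3. LayerNorm layer

ref https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html

PyTorch's built-in LayerNorm layer could be used directly, but it is implemented by hand here.

This is essentially standardization from probability theory, (x - mean) / std, with learnable scale and shift factors added.

Should the scale and shift factors have the same shape as x, or just the size of x's last dimension? Both run when I tried them, because broadcasting makes either shape work, and I have not fully figured out the difference. The built-in nn.LayerNorm sizes its parameters by normalized_shape, which is usually just the last dimension.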

class LayerNorm(nn.Module):
    def __init__(self, embedding_dim, eps=1e-6):  # embedding_dim: either a full size such as [batch_size, len, embedding_dim], or just embedding_dim
        super(LayerNorm, self).__init__()
        # Wrapped in nn.Parameter so they are learnable model parameters, used as scale and shift
        self.a = nn.Parameter(torch.ones(embedding_dim))
        self.b = nn.Parameter(torch.zeros(embedding_dim))
        self.eps = eps

    def forward(self, x):
        # Standardize over the last dimension
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a * (x-mean) / (std+self.eps) + self.b
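
On the question of parameter shape, the following sketch may help. It applies the same formula with a scale of shape [embedding_dim] and one of shape [batch_size, len, embedding_dim] and compares the results, then compares against nn.LayerNorm. The tensor sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5, 16)          # [batch_size, len, embedding_dim]
eps = 1e-6

mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)      # unbiased estimate (divides by n - 1)
normed = (x - mean) / (std + eps)

a_last = torch.ones(16)            # one scale per feature, broadcast over batch and len
a_full = torch.ones(2, 5, 16)      # one scale per element, ties the layer to this batch size and length
same = torch.allclose(a_last * normed, a_full * normed)

builtin = nn.LayerNorm(16, eps=eps)(x)   # biased variance, divides by sqrt(var + eps)
max_diff = (builtin - normed).abs().max().item()
print(same, max_diff)
```

At initialization the two parameter shapes give the same output, but the full-size version only works for inputs of exactly that batch size and length, so the last-dimension shape is the usual choice. The small difference from nn.LayerNorm comes from the unbiased std and from where eps is added.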

4. FeedForwardLayer

First expand the dimension by a factor of forward_expansion, apply the ReLU activation, then project back down to d_model 😅
class FeedForwardLayer(nn.Module):
    def __init__(self, d_model, forward_expansion):
        super(FeedForwardLayer, self).__init__()
        self.w1 = nn.Linear(d_model, d_model*forward_expansion)
        self.w2 = nn.Linear(d_model*forward_expansion, d_model)

    def forward(self, x):
        return self.w2((F.relu(self.w1(x))))

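5. Embedding layers

There are also two embedding layers.

WordEmbeddings is simple, just an ordinary word embedding.
The PositionEmbedding in the original paper looks rather magical, and using an ordinary learned embedding makes little difference,
so it is implemented here, but nn.Embedding is what I ended up using.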
class PositionEmbedding(nn.Module):
    def __init__(self, d_model, max_len=1000): # max_len is the maximum sentence length
        super(PositionEmbedding, self).__init__()

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0)/d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # pe: [1, max_len, d_model]
        # A buffer is saved and moved with the module but is not a trainable parameter
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Buffers do not require grad, so the deprecated Variable wrapper is unnecessary
        x = x + self.pe[:, :x.size(1)]
        return x
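
To make the sinusoidal table less mysterious, here is a small check built with the same expressions as the class above. The sizes max_len=50 and d_model=16 are arbitrary. Even columns hold sin(pos * freq) and odd columns hold cos(pos * freq), with the frequency falling from 1 toward 1/10000 across the columns.

```python
import math
import torch

max_len, d_model = 50, 16
position = torch.arange(max_len).unsqueeze(1)                  # [max_len, 1]
div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# Position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns
print(pe[0, :4])
# Column 0 uses frequency 1, so pe[pos, 0] = sin(pos)
print(pe[3, 0].item(), math.sin(3.0))
```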

6. Encoder

First define a TransformerBlock module; the Encoder just repeats it num_encoder_layers times.

Note the residual connections.

class TransformerBlock(nn.Module):
    def __init__(self, embed_size, head, forward_expansion, dropout):
        super(TransformerBlock, self).__init__()

        self.attn = MultihHeadAttention(embed_size, head)
        self.norm1 = LayerNorm(embed_size)
        self.norm2 = LayerNorm(embed_size)
        self.feed_forward = FeedForwardLayer(embed_size, forward_expansion)
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, mask):
        # ipdb.set_trace()
        attention =  self.attn(query, key, value, mask)
        
        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))
        return out

The Encoder really is just a few repetitions. Note that I handle the input processing outside of this module.

class Encoder(nn.Module):
    def __init__(
        self, 
        embed_size, 
        num_layers, 
        heads, 
        forward_expansion, 
        dropout=0.1,
    ):
        super(Encoder, self).__init__()

        self.layers = nn.ModuleList(
            [
                TransformerBlock(embed_size, heads, forward_expansion, dropout)
                for _ in range(num_layers)
            ]
        )
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask):
        # ipdb.set_trace()
        for layer in self.layers:
            x = layer(x, x, x, mask)

        return x

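7. Decoder

The basic module is DecoderBlock, and the Decoder likewise just repeats it several times.

One thing to note is that here query=x, the output of the previous decoder layer, while value and key both come from encoder_out, the output of the last encoder layer, as shown in the architecture figure.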

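class DecoderBlock(nn.Module):
    def __init__(self, embed_size, heads, forward_expansion, dropout=0.1):
        super(DecoderBlock, self).__init__()
        self.norm = LayerNorm(embed_size)
        self.attn = MultihHeadAttention(embed_size, heads, dropout)
        self.transformer = TransformerBlock(embed_size, heads, forward_expansion, dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, value, key, src_mask, trg_mask):
        # Masked self-attention over the target sequence
        attn = self.attn(x, x, x, trg_mask)
        query = self.dropout(self.norm(attn+x))
        # Cross-attention plus feed-forward: use the separate TransformerBlock rather than
        # reusing self.attn, which would share weights with the self-attention and skip
        # the feed-forward sub-layer. TransformerBlock expects (query, key, value, mask).
        out = self.transformer(query, key, value, src_mask)
        return out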
class Decoder(nn.Module):
    def __init__(
        self,
        embed_size,
        num_layers,
        heads,
        forward_expansion,
        dropout=0.1,
    ):
        super(Decoder, self).__init__()
        self.layers = nn.ModuleList(
            [
                DecoderBlock(embed_size, heads, forward_expansion, dropout)
                for _ in range(num_layers)
            ]
            
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_out, src_mask, trg_mask):
        for layer in self.layers:
            x = layer(x, encoder_out, encoder_out, src_mask, trg_mask)

        return x

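8. Transformer module

Join the Encoder and Decoder together, and handle the inputs of both here in one place.

Note that there are two masks: one keeps pad=0 positions out of the computation, and the other keeps the attention weighted sum from looking at later positions.

A note on dimensions:

If src and trg are [batch_size, len],

the final result is [batch_size, len, trg_vocab_size].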

class Transformer(nn.Module):
    def __init__(
        self,
        src_vocab_size,
        trg_vocab_size,
        src_pad_idx,
        trg_pad_idx,
        embed_size=512,
        num_encoder_layers=6,
        num_decoder_layers=6,
        forward_expansion=4,
        heads=8,
        dropout=0,
        max_length=100,  
        device="cpu",  
    ):
        super(Transformer, self).__init__()
        
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device

        self.encoder = Encoder(
            embed_size,
            num_encoder_layers,
            heads,
            forward_expansion,
            dropout,
        )
        self.decoder = Decoder(
            embed_size,
            num_decoder_layers,
            heads,
            forward_expansion,
            dropout,
        )
        # self.word_embedding = WordEmbeddings(embed_size, src_vocab_size)
        # self.position_embedding = PositionEmbedding(embed_size, max_length)
        # self.word_embedding_2 = WordEmbeddings(embed_size, trg_vocab_size)
        # self.position_embedding_2 = PositionEmbedding(embed_size, max_length)
        self.src_word_embedding = nn.Embedding(src_vocab_size, embed_size)
        self.src_position_embedding = nn.Embedding(max_length, embed_size)
        self.trg_word_embedding = nn.Embedding(trg_vocab_size, embed_size)
        self.trg_position_embedding = nn.Embedding(max_length, embed_size)

        self.fc_out = nn.Linear(embed_size, trg_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def make_src_mask(self, src):
        src_mask = (src != self.src_pad_idx).unsqueeze(1)
        # (N, 1, src_len)
        return src_mask.to(self.device)

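    def make_trg_mask(self, trg):
        N, trg_len = trg.shape
        trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(
            N, trg_len, trg_len
        )
        # (N, trg_len, trg_len); without this return the caller gets None and no causal mask is applied
        return trg_mask.to(self.device)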
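
The two masks have different shapes, so it is worth checking how each one lines up with the attention scores once MultihHeadAttention adds the head dimension with unsqueeze(1). The sketch below uses made-up token ids and assumes pad_idx=0.

```python
import torch

pad_idx = 0  # assumed padding index
src = torch.tensor([[5, 6, 7, 0, 0],
                    [3, 4, 5, 6, 7]])
N, src_len = src.shape
heads, trg_len = 8, 4

src_mask = (src != pad_idx).unsqueeze(1)                     # [N, 1, src_len]
trg_mask = torch.tril(torch.ones(trg_len, trg_len)).expand(N, trg_len, trg_len)

# Cross-attention scores have shape [N, heads, trg_len, src_len]
scores = torch.zeros(N, heads, trg_len, src_len)
masked = scores.masked_fill(src_mask.unsqueeze(1) == 0, -1e9)

print(src_mask.unsqueeze(1).shape)   # torch.Size([2, 1, 1, 5])
print(trg_mask.unsqueeze(1).shape)   # torch.Size([2, 1, 4, 4])
print(masked[0, 0, 0])               # last two (padded) key positions are -1e9
```

The padding mask broadcasts over heads and query positions, while the causal mask broadcasts over heads only.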

    def forward(self, src, trg):
        # ipdb.set_trace()
        N, src_seq_length = src.shape
        N, trg_seq_length = trg.shape
        src_positions = (
            torch.arange(0, src_seq_length)
            .unsqueeze(0)
            .expand(N, src_seq_length)
            .to(self.device)
        )

        trg_positions = (
            torch.arange(0, trg_seq_length)
            .unsqueeze(0)
            .expand(N, trg_seq_length)
            .to(self.device)
        )

        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)
        # encoder part
        x = self.dropout(
            self.src_word_embedding(src) + self.src_position_embedding(src_positions)
        )
        encoder_out = self.encoder(x, src_mask)
        # decoder part
        x = self.dropout(
            self.trg_word_embedding(trg) + self.trg_position_embedding(trg_positions)
        )
        decoder_out = self.decoder(x, encoder_out, src_mask, trg_mask)

        out = self.fc_out(decoder_out)

        return out

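Training

Compared with the model, the training code is much harder to write. The model structure is fixed and there are many references online, whereas the training code is tightly coupled to your own data.

1. Generating the dataset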

ref:

https://towardsdatascience.com/how-to-use-datasets-and-dataloader-in-pytorch-for-custom-text-data-270eed7f7c00

https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

https://sparrow.dev/pytorch-dataloader/

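I also wrote a separate summary: https://www.cnblogs.com/lfri/p/15479166.html

The data must be pairs of sequences of equal length, where the second continues numerically right after the first, for example [[1,2,3,4], [5,6,7,8]].
generate_data.py saves the generated data to a csv file: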
import csv
import random
import config

header = ['sentence_a', 'sentence_b']
data = [[1,2,3,4], [5,6,7,8]]
max_length = config.max_length
entry_num = config.entry_num

# newline='' keeps the csv module from writing blank rows on Windows
with open(config.file_root, 'w', encoding='UTF8', newline='') as f:
    writer = csv.writer(f)

    # write the header
    writer.writerow(header)

    # write the data
    # writer.writerow(data)

    for _ in range(entry_num):
        # randint needs integer bounds, so use floor division
        s = random.randint(1, max_length // 2)
        # named seq_len to avoid shadowing the built-in len
        seq_len = random.randint(1, max_length // 4)
        data[0] = [i for i in range(s, s+seq_len)]
        data[1] = [i for i in range(s+seq_len, s+2*seq_len)]
        writer.writerow(data)

        

2. Training

Create the Dataset and, on top of it, the iterator train_iterator

dataset = SeqDataset(config.file_root, max_length=config.max_length)
train_iterator = DataLoader(dataset, batch_size=config.batch_size,
                        shuffle=False, num_workers=0,  collate_fn=None)
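
SeqDataset itself is not shown in this post. As a rough idea of what such a class has to do, here is a hypothetical stand-in (ToySeqDataset, pad_idx, and the in-memory file are my own assumptions, not the real implementation). Because csv.writer stores each Python list as its string form, the cells are parsed back with ast.literal_eval, and every sequence is padded with 0 to max_length so the default collate function can stack a batch.

```python
import ast
import csv
import io

import torch
from torch.utils.data import Dataset, DataLoader

class ToySeqDataset(Dataset):
    """Hypothetical stand-in for SeqDataset: parses list-valued CSV cells and pads with 0."""

    def __init__(self, csv_file, max_length, pad_idx=0):
        rows = list(csv.DictReader(csv_file))
        self.pairs = [
            (ast.literal_eval(r["sentence_a"]), ast.literal_eval(r["sentence_b"]))
            for r in rows
        ]
        self.max_length = max_length
        self.pad_idx = pad_idx

    def _pad(self, seq):
        padded = seq + [self.pad_idx] * (self.max_length - len(seq))
        return torch.tensor(padded, dtype=torch.long)

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        a, b = self.pairs[idx]
        return self._pad(a), self._pad(b)

# In-memory file with the same layout that generate_data.py writes
text = 'sentence_a,sentence_b\n"[1, 2, 3, 4]","[5, 6, 7, 8]"\n"[9, 10]","[11, 12]"\n'
dataset = ToySeqDataset(io.StringIO(text), max_length=6)
src, trg = next(iter(DataLoader(dataset, batch_size=2)))
print(src)
print(trg)
```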


--snip--

    for batch_idx, batch in enumerate(train_iterator):
        # Get input and targets and get to cuda
        src, trg = batch
        src = src.to(config.device)
        trg = trg.to(config.device)

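This yields src and trg, which can then be fed to the model to get the output

output = model(src, trg)

The cross-entropy between output and trg is then the loss

Given output: [batch_size, len, trg_vocab_size] and trg: [batch_size, len], the loss cannot be computed directly; they must first be reshaped to 2-D and 1-D respectively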

ref https://www.cnblogs.com/lfri/p/15480326.html

        output = output.reshape(-1, config.trg_vocab_size)
        trg = trg.reshape(-1)
        loss = criterion(output, trg)
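
The reshape can be tried in isolation with random logits. In this sketch the sizes are made up, and the criterion is assumed to be nn.CrossEntropyLoss with ignore_index set to the padding index (the post does not show how criterion is defined), so padded positions do not contribute to the loss.

```python
import torch
import torch.nn as nn

batch_size, seq_len, trg_vocab_size, pad_idx = 2, 4, 10, 0  # assumed values
torch.manual_seed(0)
output = torch.randn(batch_size, seq_len, trg_vocab_size)   # stand-in for model logits
trg = torch.tensor([[5, 6, 7, 8],
                    [3, 4, 0, 0]])                           # 0 marks padding

criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
flat_output = output.reshape(-1, trg_vocab_size)             # [8, 10]: one row of logits per token
flat_trg = trg.reshape(-1)                                   # [8]: one class index per token
loss = criterion(flat_output, flat_trg)
print(flat_output.shape, flat_trg.shape, loss.item())
```

The loss is the mean over the six non-padded tokens only.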

Then backpropagate and take a gradient descent step

# Back prop
loss.backward()

# Gradient descent step
optimizer.step()
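
Two details of the training step are easy to miss, and either could plausibly contribute to poor predictions if the omitted part of the loop does not already handle them. PyTorch accumulates gradients, so optimizer.zero_grad() should run once per step. With teacher forcing, the decoder input is usually the target shifted by one position, so the model learns to predict the next token instead of copying its input. The sketch below shows both with a tiny hypothetical stand-in model (TinyModel is not the Transformer from this post; it only has the same call signature).

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Hypothetical stand-in with the same call signature as the Transformer above."""
    def __init__(self, vocab_size, embed_size=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.fc_out = nn.Linear(embed_size, vocab_size)

    def forward(self, src, trg):
        # Ignores src; only here to produce [batch, len, vocab] logits
        return self.fc_out(self.embed(trg))

torch.manual_seed(0)
vocab_size, pad_idx = 20, 0
model = TinyModel(vocab_size)
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

src = torch.tensor([[1, 2, 3, 4]])
trg = torch.tensor([[5, 6, 7, 8]])

losses = []
for step in range(50):
    trg_in, trg_out = trg[:, :-1], trg[:, 1:]    # decoder sees 5,6,7 and must predict 6,7,8
    output = model(src, trg_in)
    loss = criterion(output.reshape(-1, vocab_size), trg_out.reshape(-1))
    optimizer.zero_grad()                         # gradients accumulate unless cleared
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```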

TensorBoard is used to visualize the loss

ref 

https://towardsdatascience.com/a-complete-guide-to-using-tensorboard-with-pytorch-53cb2301e8c3

https://towardsdatascience.com/pytorch-performance-analysis-with-tensorboard-7c61f91071aa

writer.add_scalar("Training loss", loss, global_step=step)
# writer.add_graph(model, [src, target])
# writer.add_histogram("weight", model.decoder.layers[2].attn.atten ,step)

Besides the loss, TensorBoard can also visualize the model graph, and even an individual weight of a particular layer.

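3. Prediction

The last step is prediction.

I did not use a separate test set; I just take one fixed sequence and check the model's output on it as training proceeds.

This uses the argmax function, which returns the index of the largest value along a dimension, equivalent to converting one-hot to an index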

ref https://www.cnblogs.com/lfri/p/15480326.html
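
A tiny example with hand-written logits (the numbers are invented for illustration) shows what output.argmax(2)[:, -1] picks out:

```python
import torch

# Fake logits for one sequence of length 2 over a vocabulary of 5 tokens
output = torch.tensor([[[0.1, 0.2, 3.0, 0.0, 0.5],
                        [0.0, 0.1, 0.2, 0.3, 4.0]]])   # [batch=1, len=2, vocab=5]

token_ids = output.argmax(2)              # index of the largest logit at each position
best_guess = token_ids[:, -1].item()      # prediction for the last position only
print(token_ids)    # tensor([[2, 4]])
print(best_guess)   # 4
```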

    # evaluation
    model.eval()
    translated_sentence = my_predict(
        model, config.device, config.max_length
    )


--snip--

def my_predict(model, device, max_length):
    indexes = [3, 4, 5, 6, 7]
    sentence_tensor = torch.LongTensor(indexes).unsqueeze(0).to(device)
    # Seed the decoder with the first expected target token
    outputs = [8]
    for i in range(max_length):
        trg_tensor = torch.LongTensor(outputs).unsqueeze(0).to(device)

        with torch.no_grad():
            output = model(sentence_tensor, trg_tensor)

        # Greedy decoding: take the most likely token at the last position
        best_guess = output.argmax(2)[:, -1].item()
        outputs.append(best_guess)
        # print("best_guess: ", best_guess)

        # Stop once the model emits index 0
        if best_guess == 0:
            break

    return outputs

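Training results

Number of data entries: 100

num_epochs = 100

Training finished in a few minutes on CPU...

Test results

No real testing was done; getting it to run at all counts as success.

my_predict is the only test, and it did not produce the immediately following sequence of equal length as expected.

Not enough training? Not enough data? A problem with the model? Or is it ALL FAKE?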

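Other details

1. .to(device)

What needs to be moved to the GPU?

So far I know of: model, src, trg, and the intermediate tensors created during the model's forward pass, such as src_positions and trg_positions in this project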
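
One way to avoid forgetting a .to(device) on intermediate tensors is to create them directly on the device of an existing input. This is a sketch of that pattern, not the code used in the model above; it falls back to CPU when no GPU is available.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
src = torch.tensor([[1, 2, 3, 4]]).to(device)
N, src_len = src.shape

# Create the position tensor directly on the same device as the input
src_positions = torch.arange(src_len, device=src.device).unsqueeze(0).expand(N, src_len)
print(src_positions.device == src.device)  # True
```

With this pattern the module no longer needs its own device argument for such tensors.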

2. dropout

Dropout layers are usually added to prevent overfitting. Are there any rules for when to add them and where?
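
For reference, the original paper applies dropout in two places: to the output of each sub-layer before it is added to the residual and normalized, and to the sum of the word embeddings and positional encodings. A minimal sketch of the first placement follows; SublayerConnection is a name I chose for illustration, and it uses the built-in nn.LayerNorm.

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection with dropout on the sub-layer output, then LayerNorm."""
    def __init__(self, d_model, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(sublayer(x)))

torch.manual_seed(0)
block = SublayerConnection(d_model=16, dropout=0.1)
feed_forward = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
x = torch.randn(2, 5, 16)

block.train()
y_train = block(x, feed_forward)   # dropout active
block.eval()
y_eval = block(x, feed_forward)    # dropout disabled, output is deterministic
print(y_train.shape, torch.equal(y_eval, block(x, feed_forward)))
```

Calling model.eval() before prediction, as done above, is what switches every Dropout layer off.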

3. bug

I found some more obvious mistakes, and yet the code still ran...

To do

  • Train with more data, more epochs (num_epochs), and more layers
  • Attention visualization

 

posted @ 2021-10-29 23:49  Rogn