Self-attention is a mechanism that directly computes, for each position of a sentence, an attention weight over all positions during encoding; the hidden representation of the whole sentence is then obtained as a weighted sum using these weights.
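A minimal sketch of this idea, assuming scaled dot-product attention where the queries, keys, and values are all the raw position encodings themselves (no learned projections) and the tensor sizes are made up:

import torch

x = torch.randn(2, 6, 16)                      # [batch, seq_len, dim]: one encoding per position
scores = x @ x.transpose(-1, -2) / 16 ** 0.5   # [batch, seq_len, seq_len]: attention scores
weights = scores.softmax(dim=-1)               # attention weights for every position
hidden = weights @ x                           # weighted sum -> hidden representation, same shape as x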

A drawback of self-attention is that when the model encodes the information at a given position, it tends to concentrate too much attention on that position itself; the authors therefore proposed multi-head attention to address this problem.

Experiments show that multi-head attention outperforms single-head attention; its computation framework is shown in the figure below.

In the figure, V, K, and Q are a single fixed set of inputs; there are 3 linear layers and 3 Scaled Dot-Product Attention blocks, i.e. 3 heads. (Scaled Dot-Product Attention is explained later.)

This is similar to stacking 3 single-head attention modules in parallel.

Their outputs are then concatenated (or summed), and finally passed through a linear layer.

In essence, multi-head attention just runs single-head attention several times and merges the results, as sketched below.
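A rough sketch of that idea with hypothetical sizes; note that a real implementation, such as the class further below, splits one large projection across heads instead of looping:

import torch
import torch.nn as nn

def single_head(q, k, v, d):
    # one head of scaled dot-product attention
    return (q @ k.transpose(-1, -2) / d ** 0.5).softmax(dim=-1) @ v

batch, seq_len, d_model, num_heads, d_head = 2, 6, 12, 3, 4
x = torch.randn(batch, seq_len, d_model)
# one set of q/k/v projections per head (3 heads here)
proj = [nn.ModuleDict({n: nn.Linear(d_model, d_head) for n in ("q", "k", "v")}) for _ in range(num_heads)]
heads = [single_head(p["q"](x), p["k"](x), p["v"](x), d_head) for p in proj]   # 3 single-head outputs
out = nn.Linear(num_heads * d_head, d_model)(torch.cat(heads, dim=-1))         # concat, then a final linear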

 

Code

In the concrete code, the following flowchart is used, which makes the logic clearer:

(flowchart figure)

import math
import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-12):
        """Construct a layernorm module in the TF style (epsilon inside the square root).
        """
        super(LayerNorm, self).__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.variance_epsilon = eps

    def forward(self, x):
        u = x.mean(-1, keepdim=True)
        s = (x - u).pow(2).mean(-1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.variance_epsilon)
        return self.weight * x + self.bias


class SelfAttention(nn.Module):
    def __init__(self, num_attention_heads, input_size, hidden_size, hidden_dropout_prob):
        super(SelfAttention, self).__init__()
        if hidden_size % num_attention_heads != 0:
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (hidden_size, num_attention_heads))
        self.num_attention_heads = num_attention_heads
        self.attention_head_size = int(hidden_size / num_attention_heads)   # size of each head
        self.all_head_size = hidden_size

        self.query = nn.Linear(input_size, self.all_head_size)
        self.key = nn.Linear(input_size, self.all_head_size)
        self.value = nn.Linear(input_size, self.all_head_size)

        # note: p=0.8 drops 80% of the attention weights, which is unusually aggressive; values around 0.1 are more typical
        self.attn_dropout = nn.Dropout(0.8)

        # after self-attention: a linear projection, dropout, then residual + LayerNorm for the output
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.LayerNorm = LayerNorm(hidden_size, eps=1e-12)
        self.out_dropout = nn.Dropout(hidden_dropout_prob)

    def transpose_for_scores(self, x):
        # the last dim of x is hidden_size (e.g. 10); split it into num_heads * head_size (e.g. 2 * 5)
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)     # [b, seq_len, hidden] --> [b, seq_len, heads, head_size]
        return x.permute(0, 2, 1, 3)  # [b, heads, seq_len, head_size]; heads acts like a channel dim

    def forward(self, input_tensor):
        mixed_query_layer = self.query(input_tensor)    # [b, seq_len, hidden_size]
        mixed_key_layer = self.key(input_tensor)
        mixed_value_layer = self.value(input_tensor)

        query_layer = self.transpose_for_scores(mixed_query_layer)  # [b, heads, seq_len, head_size]
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # qk: [b, heads, seq_len, head_size] x [b, heads, head_size, seq_len] = [b, heads, seq_len, seq_len]
        # the head_size dimension is contracted away
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        # scale the scores; this just divides by a constant (the square root of the head size)
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
        # [batch_size heads seq_len seq_len] scores
        # [batch_size 1 1 seq_len]

        # attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)  # [b, heads, seq_len, seq_len]
        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.attn_dropout(attention_probs)  # randomly zero out some attention weights
        # softmax * v: [b, heads, seq_len, seq_len] x [b, heads, seq_len, head_size] = [b, heads, seq_len, head_size]
        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()  # [b, seq_len, heads, head_size]
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)    # [b, seq_len, hidden_size]
        # To sum up: what does the attention mechanism actually do?
        # It maps the original encoding to a new encoding of the same length; attention amounts to re-weighting the old encoding.

        ### the part below is no longer the attention mechanism itself
        hidden_states = self.dense(context_layer)
        hidden_states = self.out_dropout(hidden_states)
        # residual connection + LayerNorm (requires input_size == hidden_size)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)

        return hidden_states
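
A quick usage sketch for the class above (made-up sizes; input_size is set equal to hidden_size here so that the residual connection at the end works):

torch.manual_seed(0)
attn = SelfAttention(num_attention_heads=2, input_size=10, hidden_size=10, hidden_dropout_prob=0.1)
x = torch.randn(3, 4, 10)     # [batch, seq_len, input_size]
out = attn(x)
print(out.shape)              # torch.Size([3, 4, 10]) -- same shape as the input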

Here is another figure to help with understanding.

Z0 is the new encoding produced by a single head. Concatenating the 8 Z matrices side by side gives 8 x 3 = 24 columns, i.e. a (2, 24) matrix; multiplying it by Wo of shape (24, 4) yields the new encoding Z of shape (2, 4).
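The same shape bookkeeping, reproduced with random tensors (8 heads, each producing a (2, 3) output, and a hypothetical Wo of shape (24, 4)):

import torch

heads = [torch.randn(2, 3) for _ in range(8)]   # eight per-head outputs Z0..Z7, each (2, 3)
z_cat = torch.cat(heads, dim=-1)                # concatenated side by side: (2, 8*3) = (2, 24)
w_o = torch.randn(24, 4)                        # output projection Wo
z = z_cat @ w_o                                 # final encoding Z: (2, 4)
print(z_cat.shape, z.shape)                     # torch.Size([2, 24]) torch.Size([2, 4])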

 

Summary

1. In principle, the input and output of the attention mechanism have the same shape, which keeps it easy to understand and compute (even if they differ, multiplying by a matrix makes them match).

2. Splitting one head into multiple heads is like taking a big project run by one lead and splitting it into sub-projects, each handled by its own team lead, with the results merged afterwards.

Advantages

Multi-head attention fuses different knowledge produced by multiple attention-pooling operations, where the differences come from different subspace representations of the same queries, keys, and values.

The heads can be processed in parallel; see the sketch below.
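A small illustration of the parallelism point (hypothetical sizes): in the SelfAttention class above, the heads live in their own tensor dimension, so a single batched matmul scores all heads at once.

import torch

q = torch.randn(2, 8, 6, 4)                      # [batch, heads, seq_len, head_size]
k = torch.randn(2, 8, 6, 4)
scores = torch.matmul(q, k.transpose(-1, -2))    # all 8 heads scored in one batched matmul
print(scores.shape)                              # torch.Size([2, 8, 6, 6])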

 

 

 

 

References:

https://zhuanlan.zhihu.com/p/484524337  (multi-head) self-attention

https://zhuanlan.zhihu.com/p/365386753  introduction to multi-headed self-attention

https://blog.csdn.net/beilizhang/article/details/115282604  source of the code above

https://zhuanlan.zhihu.com/p/486137878  a PyTorch implementation of (multi-head) self-attention

https://blog.csdn.net/HappyCtest/article/details/109847449  PyTorch's built-in multi-head attention