Self-attention is a mechanism that directly computes, for each position of a sentence, an attention weight over all positions during encoding; the hidden representation of the whole sentence is then obtained as a weighted sum using these weights.
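A minimal sketch of this idea, assuming scaled dot-product attention where the queries, keys, and values are all the raw position encodings themselves (no learned projections) and the tensor sizes are made up:

import torch

x = torch.randn(2, 6, 16)                      # [batch, seq_len, dim]: one encoding per position
scores = x @ x.transpose(-1, -2) / 16 ** 0.5   # [batch, seq_len, seq_len]: attention scores
weights = scores.softmax(dim=-1)               # attention weights for every position
hidden = weights @ x                           # weighted sum -> hidden representation, same shape as x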

A drawback of self-attention is that when the model encodes the information at a given position, it tends to concentrate too much attention on that position itself; the authors therefore proposed multi-head attention to address this problem.

Experiments show that multi-head attention outperforms single-head attention; its computation framework is shown in the figure below.

In the figure, V, K, and Q are a single fixed set of inputs; there are 3 linear layers and 3 Scaled Dot-Product Attention blocks, i.e. 3 heads. (Scaled Dot-Product Attention is explained later.)

This is similar to stacking 3 single-head attention modules in parallel.

Their outputs are then concatenated (or summed), and finally passed through a linear layer.

In essence, multi-head attention just runs single-head attention several times and merges the results, as sketched below.
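A rough sketch of that idea with hypothetical sizes; note that a real implementation, such as the class further below, splits one large projection across heads instead of looping:

import torch
import torch.nn as nn

def single_head(q, k, v, d):
    # one head of scaled dot-product attention
    return (q @ k.transpose(-1, -2) / d ** 0.5).softmax(dim=-1) @ v

batch, seq_len, d_model, num_heads, d_head = 2, 6, 12, 3, 4
x = torch.randn(batch, seq_len, d_model)
# one set of q/k/v projections per head (3 heads here)
proj = [nn.ModuleDict({n: nn.Linear(d_model, d_head) for n in ("q", "k", "v")}) for _ in range(num_heads)]
heads = [single_head(p["q"](x), p["k"](x), p["v"](x), d_head) for p in proj]   # 3 single-head outputs
out = nn.Linear(num_heads * d_head, d_model)(torch.cat(heads, dim=-1))         # concat, then a final linear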

 

Code

In the concrete code, the following flowchart is used, which makes the logic clearer:

(flowchart figure)

import math
import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-12):
        """Construct a layernorm module in the TF style (epsilon inside the square root).
        """
        super(LayerNorm, self).__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.variance_epsilon = eps

    def forward(self, x):
        u = x.mean(-1, keepdim=True)
        s = (x - u).pow(2).mean(-1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.variance_epsilon)
        return self.weight * x + self.bias


class SelfAttention(nn.Module):
    def __init__(self, num_attention_heads, input_size, hidden_size, hidden_dropout_prob):
        super(SelfAttention, self).__init__()
        if hidden_size % num_attention_heads != 0:
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (hidden_size, num_attention_heads))
        self.num_attention_heads = num_attention_heads
        self.attention_head_size = int(hidden_size / num_attention_heads)   # size of each head
        self.all_head_size = hidden_size

        self.query = nn.Linear(input_size, self.all_head_size)
        self.key = nn.Linear(input_size, self.all_head_size)
        self.value = nn.Linear(input_size, self.all_head_size)

        # note: p=0.8 drops 80% of the attention weights, which is unusually aggressive; values around 0.1 are more typical
        self.attn_dropout = nn.Dropout(0.8)

        # after self-attention: a linear projection, dropout, then residual + LayerNorm for the output
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.LayerNorm = LayerNorm(hidden_size, eps=1e-12)
        self.out_dropout = nn.Dropout(hidden_dropout_prob)

    def transpose_for_scores(self, x):
        # the last dim of x is hidden_size (e.g. 10); split it into num_heads * head_size (e.g. 2 * 5)
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)     # [b, seq_len, hidden] --> [b, seq_len, heads, head_size]
        return x.permute(0, 2, 1, 3)  # [b, heads, seq_len, head_size]; heads acts like a channel dim

    def forward(self, input_tensor):
        mixed_query_layer = self.query(input_tensor)    # [b, seq_len, hidden_size]
        mixed_key_layer = self.key(input_tensor)
        mixed_value_layer = self.value(input_tensor)

        query_layer = self.transpose_for_scores(mixed_query_layer)  # [b, heads, seq_len, head_size]
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # qk: [b, heads, seq_len, head_size] x [b, heads, head_size, seq_len] = [b, heads, seq_len, seq_len]
        # the head_size dimension is contracted away
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        # scale the scores; this just divides by a constant (the square root of the head size)
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
        # [batch_size heads seq_len seq_len] scores
        # [batch_size 1 1 seq_len]

        # attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)  # [b, heads, seq_len, seq_len]
        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.attn_dropout(attention_probs)  # randomly zero out some attention weights
        # softmax * v: [b, heads, seq_len, seq_len] x [b, heads, seq_len, head_size] = [b, heads, seq_len, head_size]
        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()  # [b, seq_len, heads, head_size]
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)    # [b, seq_len, hidden_size]
        # To sum up: what does the attention mechanism actually do?
        # It maps the original encoding to a new encoding of the same length; attention amounts to re-weighting the old encoding.

        ### the part below is no longer the attention mechanism itself
        hidden_states = self.dense(context_layer)
        hidden_states = self.out_dropout(hidden_states)
        # residual connection + LayerNorm (requires input_size == hidden_size)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)

        return hidden_states
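
A quick usage sketch for the class above (made-up sizes; input_size is set equal to hidden_size here so that the residual connection at the end works):

torch.manual_seed(0)
attn = SelfAttention(num_attention_heads=2, input_size=10, hidden_size=10, hidden_dropout_prob=0.1)
x = torch.randn(3, 4, 10)     # [batch, seq_len, input_size]
out = attn(x)
print(out.shape)              # torch.Size([3, 4, 10]) -- same shape as the input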

Here is another figure to help with understanding.

Z0 is the new encoding produced by a single head. Concatenating the 8 Z matrices side by side gives 8 x 3 = 24 columns, i.e. a (2, 24) matrix; multiplying it by Wo of shape (24, 4) yields the new encoding Z of shape (2, 4).
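The same shape bookkeeping, reproduced with random tensors (8 heads, each producing a (2, 3) output, and a hypothetical Wo of shape (24, 4)):

import torch

heads = [torch.randn(2, 3) for _ in range(8)]   # eight per-head outputs Z0..Z7, each (2, 3)
z_cat = torch.cat(heads, dim=-1)                # concatenated side by side: (2, 8*3) = (2, 24)
w_o = torch.randn(24, 4)                        # output projection Wo
z = z_cat @ w_o                                 # final encoding Z: (2, 4)
print(z_cat.shape, z.shape)                     # torch.Size([2, 24]) torch.Size([2, 4])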

 

Summary

1. In principle, the input and output of the attention mechanism have the same shape, which keeps it easy to understand and compute (even if they differ, multiplying by a matrix makes them match).

2. Splitting one head into multiple heads is like taking a big project run by one lead and splitting it into sub-projects, each handled by its own team lead, with the results merged afterwards.

Advantages

Multi-head attention fuses different knowledge produced by multiple attention-pooling operations, where the differences come from different subspace representations of the same queries, keys, and values.

The heads can be processed in parallel; see the sketch below.
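A small illustration of the parallelism point (hypothetical sizes): in the SelfAttention class above, the heads live in their own tensor dimension, so a single batched matmul scores all heads at once.

import torch

q = torch.randn(2, 8, 6, 4)                      # [batch, heads, seq_len, head_size]
k = torch.randn(2, 8, 6, 4)
scores = torch.matmul(q, k.transpose(-1, -2))    # all 8 heads scored in one batched matmul
print(scores.shape)                              # torch.Size([2, 8, 6, 6])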

 

 

 

 

References:

https://zhuanlan.zhihu.com/p/484524337  (multi-head) self-attention

https://zhuanlan.zhihu.com/p/365386753  introduction to multi-headed self-attention

https://blog.csdn.net/beilizhang/article/details/115282604  source of the code above

https://zhuanlan.zhihu.com/p/486137878  a PyTorch implementation of (multi-head) self-attention

https://blog.csdn.net/HappyCtest/article/details/109847449  PyTorch's built-in multi-head attention