Understanding the positional encoding computation in the Transformer model
Blogs:
https://blog.csdn.net/qq_39783265/article/details/106790875
https://avoid.overfit.cn/post/dc84ff7287e540b48da2eadfabd306bc
https://www.cnblogs.com/zjuhaohaoxuexi/p/15991787.html
Video:
https://www.bilibili.com/video/BV1Di4y1c7Zm?p=2
In human language, the order of words and their positions in a sentence matter a great deal: if the words are reordered, the meaning of the whole sentence changes, and it may even become meaningless.
Unlike an LSTM, the Transformer has no built-in mechanism for handling sequence order; it treats each word in the sequence as independent of the others. Positional encoding is therefore used to preserve information about the order of the words in a sentence.
What is positional encoding?
Positional encoding tells the Transformer model the position/index of an entity/word in the sequence, so that each position gets a unique representation. The simplest approach would be to use the index itself as the position value, but for long sequences the indices grow very large, which causes a number of problems.
Positional encoding instead maps each position/index to a vector. The output of the positional encoding layer is therefore a matrix in which each row is the sum of an encoded word in the sequence and its position information.
The figure below shows an example of a matrix that encodes only the position information.
The positional encoding layer in the Transformer
Suppose we have an input sequence of length L and we need the position of each object within that sequence. The positional encoding is given by sine and cosine functions of different frequencies:
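For reference, the sinusoidal encoding used in the original Transformer paper (and implemented in the NumPy code below) can be written as:

PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{n^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{n^{2i/d}}\right), \quad n = 10000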
d: the dimension of the output embedding space (512 in the Transformer)
pos: the position of the word in the input sequence, 0 ≤ pos ≤ L-1 (positions start at 0 and run up to length minus 1)
i: maps to the column index, where 0 ≤ i < d/2, and each single value of i is used for both a sine and a cosine component (i runs from 0 up to d/2 - 1, i.e. up to 255 when d = 512)
In the expression above, even positions use the sine function and odd positions use the cosine function (here "position" does not mean the word's position in the sequence but a dimension of the position vector; for a 512-dimensional position vector, these dimensions run from 0 to 511).
Writing the positional encoding matrix from scratch
Below is a short piece of Python code that implements positional encoding with NumPy. The code is simplified to make the positional encoding easy to follow.
import numpy as np

def getPositionEncoding(seq_len, dim, n=10000):
    PE = np.zeros(shape=(seq_len, dim))
    for pos in range(seq_len):
        # print("pos=", pos)
        for i in range(int(dim/2)):
            # print("i=", i)
            denominator = np.power(n, 2*i/dim)
            print("pos=", pos, " i=", i, " 2*i=", 2*i, " 2*i+1=", 2*i+1)
            PE[pos, 2*i] = np.sin(pos/denominator)
            PE[pos, 2*i+1] = np.cos(pos/denominator)
    return PE

PE = getPositionEncoding(seq_len=4, dim=4, n=100)  # seq_len=4: four words; dim=4: dimension of the position vector (an even number is best); n=100 replaces the 10000 in the formula
print(PE)
The result is as follows:
pos= 0  i= 0  2*i= 0  2*i+1= 1
pos= 0  i= 1  2*i= 2  2*i+1= 3
pos= 1  i= 0  2*i= 0  2*i+1= 1
pos= 1  i= 1  2*i= 2  2*i+1= 3
pos= 2  i= 0  2*i= 0  2*i+1= 1
pos= 2  i= 1  2*i= 2  2*i+1= 3
pos= 3  i= 0  2*i= 0  2*i+1= 1
pos= 3  i= 1  2*i= 2  2*i+1= 3
[[ 0.          1.          0.          1.        ]
 [ 0.84147098  0.54030231  0.09983342  0.99500417]
 [ 0.90929743 -0.41614684  0.19866933  0.98006658]
 [ 0.14112001 -0.9899925   0.29552021  0.95533649]]
To understand positional encoding better, we can visualize it; let us plot the position matrix for larger values, using the matshow() method from Python's matplotlib library. Setting n=10,000 as done in the original paper, you get the following:

import matplotlib.pyplot as plt

P = getPositionEncoding(seq_len=100, dim=512, n=10000)  # 100 words; 512-dimensional position vectors
cax = plt.matshow(P)
plt.gcf().colorbar(cax)
plt.show()
The positional encoding layer therefore adds the word embeddings and the positional encoding matrix for every token in the sequence and feeds the sum to the next layer. Note that the dimensionality of the positional encoding matrix must be the same as that of the word embeddings.
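As a rough illustration of that sum (a minimal sketch; the word embeddings below are random placeholder values, not trained embeddings):

import numpy as np

seq_len, dim = 4, 4
# Placeholder word embeddings for a 4-token sentence (random values, for illustration only).
word_embeddings = np.random.rand(seq_len, dim)
# Positional encodings from the getPositionEncoding() function defined above.
PE = getPositionEncoding(seq_len=seq_len, dim=dim, n=100)
# The input to the next layer is the element-wise sum, so both matrices must have shape (seq_len, dim).
next_layer_input = word_embeddings + PE
print(next_layer_input.shape)  # (4, 4)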
To get a feel for the frequencies, we can evaluate a single sine/cosine pair by hand for pos=5 and d=512: a large i (e.g. i=400) gives a large denominator and hence a very low frequency, while a small i (e.g. i=3) gives a much higher frequency:

import numpy as np

n = 10000
pos = 5
dim = 512

i = 400
denominator = np.power(n, 2*i/dim)
print("pos/denominator=", pos/denominator, np.sin(pos/denominator))
print("pos/denominator=", pos/denominator, np.cos(pos/denominator))

i = 3
denominator = np.power(n, 2*i/dim)
print("pos/denominator=", pos/denominator, np.sin(pos/denominator))
print("pos/denominator=", pos/denominator, np.cos(pos/denominator))
pos/denominator= 2.811706625951745e-06 2.8117066259480403e-06
pos/denominator= 2.811706625951745e-06 0.9999999999960472
pos/denominator= 4.488435662236571 -0.9750270944422548
pos/denominator= 4.488435662236571 -0.2220859408055681
Writing your own positional encoding layer in Keras
First, let us import all the required libraries.
import tensorflow as tf
from tensorflow import convert_to_tensor, string
from tensorflow.keras.layers import TextVectorization, Embedding, Layer
from tensorflow.data import Dataset
import numpy as np
import matplotlib.pyplot as plt
The following code uses a Tokenizer object to convert each text into a sequence of integers (each integer is the index of a token in the vocabulary).
from tensorflow.keras.preprocessing.text import Tokenizer

output_sequence_length = 4
vocab_size = 10
sentences = ["How are you doing", "I am doing good"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
tokenized_sent = tokenizer.texts_to_sequences(sentences)
print("Vectorized words: ", tokenized_sent)
When implementing a Transformer model, you have to write your own positional encoding layer. This Keras example shows how to do so by subclassing Layer and combining a word Embedding layer with a position Embedding layer:
class PositionEmbeddingLayer(Layer):
    def __init__(self, sequence_length, vocab_size, output_dim, **kwargs):
        super(PositionEmbeddingLayer, self).__init__(**kwargs)
        self.word_embedding_layer = Embedding(
            input_dim=vocab_size, output_dim=output_dim
        )
        self.position_embedding_layer = Embedding(
            input_dim=sequence_length, output_dim=output_dim
        )

    def call(self, inputs):
        position_indices = tf.range(tf.shape(inputs)[-1])
        embedded_words = self.word_embedding_layer(inputs)
        embedded_indices = self.position_embedding_layer(position_indices)
        return embedded_words + embedded_indices
With this, our position embedding is complete.
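A minimal usage sketch (the dimensions and the token indices below are illustrative assumptions, not values from the article):

embedding_layer = PositionEmbeddingLayer(sequence_length=4, vocab_size=10, output_dim=6)
token_ids = tf.constant([[5, 2, 7, 1]])  # one made-up tokenized sentence of length 4
output = embedding_layer(token_ids)
print(output.shape)  # (1, 4, 6): batch, sequence length, embedding dimension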
Author: Srinidhi Karjol
https://medium.com/mlearning-ai/how-positional-embeddings-work-in-self-attention-ef74e99b6316
https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/
==================================================================
Here is an awesome recent Youtube video that covers position embeddings in great depth, with beautiful animations:
Visual Guide to Transformer Neural Networks - (Part 1) Position Embeddings
Taking excerpts from the video, let us try understanding the “sin” part of the formula to compute the position embeddings:
Here “pos” refers to the position of the “word” in the sequence. P0 refers to the position embedding of the first word; “d” means the size of the word/token embedding. In this example d=5. Finally, “i” refers to each of the 5 individual dimensions of the embedding (i.e. 0, 1,2,3,4)
While “d” is fixed, “pos” and “i” vary. Let us try understanding the latter two.
"pos"
If we plot a sin curve and vary “pos” (on the x-axis), you will land up with different position values on the y-axis. Therefore, words with different positions will have different position embeddings values.
There is a problem though. Since the “sin” curve repeats in intervals, you can see in the figure above that P0 and P6 have the same position embedding values, despite being at two very different positions. This is where the ‘i’ part in the equation comes into play.
"i"
If you vary “i” in the equation above, you will get a bunch of curves with varying frequencies. Reading off the position embedding values against these different frequencies ends up giving different values at different embedding dimensions for P0 and P6.
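A small sketch of that idea (reusing the sinusoidal formula from the NumPy code earlier, with the illustrative values d=5 and n=10000): plotting the sine curve for a few values of i shows the different frequencies.

import numpy as np
import matplotlib.pyplot as plt

d, n = 5, 10000                      # illustrative values matching the d=5 example above
positions = np.arange(0, 20, 0.1)
for i in range(3):                   # a few embedding dimensions
    plt.plot(positions, np.sin(positions / np.power(n, 2 * i / d)), label=f"i={i}")
plt.xlabel("pos")
plt.ylabel("sin(pos / n^(2i/d))")
plt.legend()
plt.show()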
https://datascience.stackexchange.com/questions/51065/what-is-the-positional-encoding-in-the-transformer-model
================================
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
The intuition
You may wonder how this combination of sines and cosines could ever represent a position/order? It is actually quite simple. Suppose you want to represent a number in binary format - how would that look?
You can spot the rate of change between different bits. The LSB bit is alternating on every number, the second-lowest bit is rotating on every two numbers, and so on.
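A tiny sketch of that observation: printing the numbers 0-7 in binary shows the lowest bit flipping on every step and higher bits flipping less often, which is the same "different rates of change" idea that the sinusoids reproduce.

# Print 0..7 in 4-bit binary: the lowest bit alternates every number,
# the next bit every two numbers, and so on.
for num in range(8):
    print(f"{num}: {num:04b}")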
But using binary values would be a waste of space in the world of floats. So instead, we can use their float continuous counterparts - sinusoidal functions. Indeed, they are the equivalent of alternating bits. Moreover, by decreasing their frequencies, we can go from red bits to orange ones.