Deep Learning Week13 Notes

1. Attention for Memory and Sequence Translation

Attention mechanisms aggregate features with an importance score that:

  • depends on the features themselves, not on their positions in the tensor,
  • relaxes locality constraints.

\(\Large\text{Note:}\)

  • The attention mechanism allows information to move from one part of the tensor to another part far away
  • For instance, in the case of sequence-to-sequence translation, it can use information from early in the sentence to make a proper grammatical decision later
  • For images, it can combine information from different parts of the image even if they are far away

Neural Turing Machine

\(\large\textbf{Illustration: refer }\) Lecture-P6

This module has a hidden internal state that takes the form of a tensor:

\[M_t\in \mathbb{R}^{N\times M} \]

where \(t\) is the time step, \(N\) is the number of entries in the memory and \(M\) is their dimension.

A “controller” is implemented as a standard feed-forward or recurrent model and at every iteration \(t\) it computes activations that modulate the reading / writing operations.

More formally, the memory module implements:

  • Reading, where given attention weights \(w_t\in\mathbb{R}_{+}^N, \sum_nw_t(n)=1\), it gets

\[r_t = \sum_{n=1}^Nw_t(n)M_t(n) \]

  • Writing, where given attention weights \(w_t\), an erase vector \(e_t\in [0,1]^M\), and an add vector \(a_t\in \mathbb{R}^M\), the memory is updated with:

\[\forall n, M_t(n) = M_{t-1}(n)[1-w_t(n)e_t]+w_t(n)a_t \]

The controller has multiple “heads”: at every step \(t\) it computes, for each writing head, \(w_t, e_t, a_t\), and, for each reading head, \(w_t\), from which it gets back the read value \(r_t\).
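A minimal sketch of the read and write operations in PyTorch (the names and sizes below are assumptions for illustration, not the lecture's code):

import torch

N, M = 8, 4                                 # number of memory entries and their dimension
memory = torch.randn(N, M)                  # M_{t-1}

w = torch.softmax(torch.randn(N), dim = 0)  # attention weights, non-negative and summing to 1
e = torch.rand(M)                           # erase vector in [0, 1]^M
a = torch.randn(M)                          # add vector

# Writing: M_t(n) = M_{t-1}(n) [1 - w_t(n) e_t] + w_t(n) a_t
memory = memory * (1 - w.unsqueeze(1) * e) + w.unsqueeze(1) * a

# Reading: r_t = sum_n w_t(n) M_t(n)
r = (w.unsqueeze(1) * memory).sum(dim = 0)  # read value of dimension M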

Attention for seq2seq

Given an input sequence \(x_1,...,x_T\), the standard approach for sequence-to-sequence translation (Sutskever et al., 2014) uses a recurrent model:

\[h_t = f(x_t,h_{t-1}) \]

and considers that the final hidden state:

\[v = h_T \]

carries enough information to drive an auto-regressive generative model:

\[y_t\sim p(y_t\mid y_1,...,y_{t-1},v) \]

itself implemented with another RNN.
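As a rough sketch of this standard approach (the modules and sizes below are assumptions), the whole input sequence is compressed into the single vector \(v = h_T\) that then drives the decoder:

import torch
from torch import nn

D, H = 16, 32                          # input and hidden dimensions (arbitrary)
encoder = nn.GRU(D, H, batch_first = True)
decoder = nn.GRU(D, H, batch_first = True)

x = torch.randn(1, 10, D)              # input sequence x_1, ..., x_T
_, v = encoder(x)                      # v = h_T, of shape (1, 1, H)

# The decoder is seeded with v alone: every y_t has to be generated
# from this single fixed-size state.
y_prev = torch.zeros(1, 1, D)
out, s = decoder(y_prev, v)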

$\LARGE \star $ The main weakness of such an approach is that all the information has to flow through a single state \(v\), whose capacity has to accommodate any situation. There are no direct “channels” to transport local information from the input sequence to the place where it is useful in the resulting sequence.

Attention mechanisms (Bahdanau et al., 2014) can transport information from parts of the signal to other parts specified dynamically.

Bahdanau et al. (2014) proposed to extend a standard recurrent model with such a mechanism. They first run a bi-directional RNN to get a hidden state:

\[h_{i}=\left(h_{i}^{\rightarrow}, h_{i}^{\leftarrow}\right), \quad i=1, \ldots, T \]

From this, they compute a new process \(s_i, i = 1,...,T\), which looks at weighted averages of the \(h_j\), where the weights are functions of the signal.

Given \(y_1,...,y_{i-1}\) and \(s_1,...,s_{i-1}\), first compute the attention weights:

\[\forall j, \alpha_{i, j}=\operatorname{softmax}_{j} a\left(s_{i-1}, h_{j}\right) \]

where \(a\) is a one-hidden-layer \(\tanh\) MLP. Then compute the context vector from \(h\):

\[c_i = \sum_{j=1}^T \alpha_{i,j} h_j \]

The model can now make the prediction:

\[\begin{align} s_i &= f(s_{i-1},y_{i-1},c_i)\\ y_i&\sim g(y_{i-1},s_i,c_i) \end{align} \]

where \(f\) is a GRU.
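A rough sketch of one attention step (the hidden size of \(a\) and the other dimensions are assumptions):

import torch
from torch import nn

T, H, S = 10, 32, 32       # input length, encoder state size, decoder state size (arbitrary)
h = torch.randn(T, H)      # bi-directional encoder states h_1, ..., h_T
s_prev = torch.randn(S)    # previous decoder state s_{i-1}

# a(s_{i-1}, h_j): a one-hidden-layer tanh MLP returning a scalar score
a = nn.Sequential(nn.Linear(S + H, 64), nn.Tanh(), nn.Linear(64, 1))

# alpha_{i,j} = softmax_j a(s_{i-1}, h_j)
scores = a(torch.cat([s_prev.expand(T, S), h], dim = 1)).squeeze(1)
alpha = torch.softmax(scores, dim = 0)

# c_i = sum_j alpha_{i,j} h_j
c = (alpha.unsqueeze(1) * h).sum(dim = 0)   # context vector of dimension H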

\(\Large\textbf{Illustration: refer }\) Lecture-P20

2. Attention Mechanisms

  • The simplest form of attention is content-based attention. Given an “attention function”:

\[a:\mathbb{R}^{D'}\times\mathbb{R}^D\rightarrow \mathbb{R} \]

and model parameters:

\[\theta\in \mathbb{R}^{T\times D} \]

this operation takes a “value” tensor as input:

\[V\in \mathbb{R}^{T'\times D'} \]

and computes the output:

\[Y\in\mathbb{R}^{T\times D} \]

with

\[\begin{aligned} \forall j=1, \ldots, T, \quad Y_{j} &=\sum_{i=1}^{T^{\prime}} \frac{\exp \left(a\left(V_{i} ; \theta_{j}\right)\right)}{\sum_{k=1}^{T^{\prime}} \exp \left(a\left(V_{k} ; \theta_{j}\right)\right)} V_{i} \\ &=\sum_{i=1}^{T^{\prime}} \operatorname{softmax}_{i}\left(a\left(V_{i} ; \theta_{j}\right)\right) V_{i} \end{aligned} \]

  • This differs from context attention, which, given two inputs: a “context” tensor:

\[C\in \mathbb{R}^{T\times D} \]

and a "value" tensor:

\[V\in \mathbb{R}^{T'\times D} \]

computes a tensor

\[Y\in \mathbb{R}^{T\times D} \]

with

\[\begin{aligned} \forall j=1, \ldots, T, \quad Y_{j} &=\sum_{i=1}^{T^{\prime}} \operatorname{softmax}_{i}\left(a\left(C_j,V_{i} ; \theta\right)\right) V_{i} \end{aligned} \]

\(\large\text{Illustration of the difference: refer }\)Lecture-P4
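As a hedged numerical sketch of the difference, taking a plain dot product as the attention function \(a\) and matching dimensions (both choices are assumptions):

import torch

T, Tp, D = 5, 7, 16                         # output length, number of values, dimension

V = torch.randn(Tp, D)                      # "value" tensor

# Content-based attention: the weights depend only on the values and on parameters theta
theta = torch.randn(T, D)                   # one parameter vector theta_j per output position
A = torch.softmax(theta @ V.t(), dim = 1)   # a(V_i; theta_j) = <theta_j, V_i>
Y_content = A @ V                           # (T, D)

# Context attention: the weights depend on a "context" tensor C matched against the values
C = torch.randn(T, D)
A = torch.softmax(C @ V.t(), dim = 1)       # a(C_j, V_i; theta) = <C_j, V_i>
Y_context = A @ V                           # (T, D)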

Using the terminology of Graves et al. (2014), attention is an averaging of values associated to keys matching a query. Hence the keys used for computing attention and the values to average are different quantities.

Given a query sequence \(Q\in\mathbb{R}^{T\times D}\), a key sequence \(K\in \mathbb{R}^{T'\times D}\), and a value sequence \(V\in\mathbb{R}^{T'\times D'}\), compute a matrix \(A\in \mathbb{R}^{T\times T'}\) by matching \(Q\) to \(K\), and weight \(V\) with it to get the result sequence \(Y\in\mathbb{R}^{T\times D'}\):

\[\begin{align} \forall i, A_i &= \text{softmax}(\frac{KQ_i}{\sqrt{D}})\\ Y_i &= V^TA_i \end{align} \]

or

\[\begin{align} A &= \text{softmax}_{\text{row}}(\frac{QK^T}{\sqrt{D}})\in \mathbb{R}^{T\times T'}\\ Y&= AV\in\mathbb{R}^{T\times D'} \end{align} \]

The queries and keys have the same dimension \(D\), and there are as many keys (\(T'\)) as there are values. The result \(Y\) has as many rows (\(T\)) as there are queries, and its rows have the same dimension \(D'\) as the values.

\(\large\text{Illustration: refer }\) Lecture-P9.
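A quick check with random tensors (sizes are arbitrary) that the row-wise and matrix formulations above coincide:

import math
import torch

T, Tp, D, Dp = 4, 6, 8, 5
Q = torch.randn(T, D)
K = torch.randn(Tp, D)
V = torch.randn(Tp, Dp)

# Matrix form: A = softmax_row(Q K^T / sqrt(D)), Y = A V
A = torch.softmax(Q @ K.t() / math.sqrt(D), dim = 1)   # (T, T')
Y = A @ V                                              # (T, D')

# Row-wise form: A_i = softmax(K Q_i / sqrt(D)), Y_i = V^T A_i
A_0 = torch.softmax(K @ Q[0] / math.sqrt(D), dim = 0)
print(torch.allclose(Y[0], V.t() @ A_0))               # True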

A standard attention layer takes as input two sequences \(X\) and \(X'\), and computes the tensors \(K,V,Q\) as the linear functions:

\[\begin{align} K&= W^KX\\ V&=W^VX\\ Q&=W^QX'\\ Y&=\text{softmax}_{\text{row}}(\frac{QK^T}{\sqrt{D}})V \end{align} \]

When \(X = X'\), this is self-attention; otherwise it is cross-attention.

Multi-head attention combines several such operations in parallel, and \(Y\) is the concatenation of the results along the feature dimension.
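PyTorch ships such a layer as torch.nn.MultiheadAttention, which applies the \(W^Q, W^K, W^V\) projections, the per-head attention, the concatenation, and a final output projection internally; a minimal usage sketch (all sizes are arbitrary):

import torch
from torch import nn

D, T, Tp = 64, 10, 12
mha = nn.MultiheadAttention(embed_dim = D, num_heads = 4, batch_first = True)

X_prime = torch.randn(1, T, D)     # the queries come from X'
X = torch.randn(1, Tp, D)          # the keys and values come from X

Y, A = mha(X_prime, X, X)          # cross-attention: Y is (1, T, D), A is (1, T, T')
Y_self, _ = mha(X, X, X)           # self-attention: X = X'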

\(\Large\textbf{Note:}\)

  • The terminology of attention mechanism comes from the paradigm of key-value dictionaries for data storage in which objects (the values) are stored using a key.
  • Querying the database consists of matching a query with the keys of the database to retrieve the values associated to them.
  • This is why matrices \(Q\) and \(K\) have the same number of columns, which corresponds to the dimension \(D\) of individual keys or queries, because we compute matches between them. The matrices \(K\) and \(V\) have the same number of rows \(T'\) because each value is “indexed” by one key.
  • Each row \(Y_j\) of the output corresponds to a weighted average of the values modulated by how much the query matched the associated key.
  • \(\LARGE\star\) This is exactly what an attention layer would do: equip the model with the ability to combine information from parts of the signal that it actively identifies as relevant.

\(\text{batch matrix product}\): torch.matmul()

>>> a = torch.rand(11, 9, 2, 3)
>>> b = torch.rand(11, 9, 3, 4)
>>> m = a.matmul(b)
>>> m.size()
torch.Size([11, 9, 2, 4])
>>>
>>> m[7, 1]
tensor([[0.8839, 1.0253, 0.7473, 1.1397],
        [0.4966, 0.5515, 0.4631, 0.6616]])
>>> a[7, 1].mm(b[7, 1])
tensor([[0.8839, 1.0253, 0.7473, 1.1397],
        [0.4966, 0.5515, 0.4631, 0.6616]])
>>>
>>> m[3, 0]
tensor([[0.6906, 0.7657, 0.9310, 0.7547],
        [0.6259, 0.5570, 1.1012, 1.2319]])
>>> a[3, 0].mm(b[3, 0])
tensor([[0.6906, 0.7657, 0.9310, 0.7547],
        [0.6259, 0.5570, 1.1012, 1.2319]])

\(\text{Attention layer Code:}\)

import torch
from torch import nn

class AttentionLayer(nn.Module):
    def __init__(self, in_channels, out_channels, key_channels):
        super().__init__()
        # 1x1 convolutions implement the per-position linear maps W^Q, W^K, W^V
        self.conv_Q = nn.Conv1d(in_channels, key_channels, kernel_size = 1, bias = False)
        self.conv_K = nn.Conv1d(in_channels, key_channels, kernel_size = 1, bias = False)
        self.conv_V = nn.Conv1d(in_channels, out_channels, kernel_size = 1, bias = False)

    def forward(self, x):
        # x is of shape (N, in_channels, T)
        Q = self.conv_Q(x)                               # (N, key_channels, T)
        K = self.conv_K(x)                               # (N, key_channels, T)
        V = self.conv_V(x)                               # (N, out_channels, T)
        A = Q.transpose(1, 2).matmul(K).softmax(2)       # (N, T, T) attention matrix
        y = A.matmul(V.transpose(1, 2)).transpose(1, 2)  # (N, out_channels, T)
        return y

The computation of the attention matrix \(A\) and the layer’s output \(Y\) could also be expressed somewhat more clearly with Einstein summations:

A = torch.einsum('nct,ncs->nts', Q, K).softmax(2)
y = torch.einsum('nts,ncs->nct', A, V)
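A short usage check (batch size, channels, and length are arbitrary), continuing from the AttentionLayer defined above and confirming that the einsum version matches the layer's output:

layer = AttentionLayer(in_channels = 16, out_channels = 8, key_channels = 4)
x = torch.randn(2, 16, 50)                     # (batch, in_channels, length)
y = layer(x)
print(y.size())                                # torch.Size([2, 8, 50])

Q, K, V = layer.conv_Q(x), layer.conv_K(x), layer.conv_V(x)
A = torch.einsum('nct,ncs->nts', Q, K).softmax(2)
y2 = torch.einsum('nts,ncs->nct', A, V)
print(torch.allclose(y, y2, atol = 1e-6))      # True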

Positional Encoding

>>> len = 20
>>> c = math.ceil(math.log(len) / math.log(2.0))
>>> o = 2**torch.arange(c).unsqueeze(1)
>>> pe = (torch.arange(len).unsqueeze(0).div(o, rounding_mode = 'floor')) % 2
>>> pe
tensor([[0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
        [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
        [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])

3. Transformer Networks

\(\Large\text{Illustration: refer }\) Lecture-P2

\[\begin{aligned} \operatorname{Attention}(Q, K, V) &=\operatorname{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V \\ \operatorname{MultiHead}(Q, K, V) &=\operatorname{Concat}\left(H_{1}, \ldots, H_{h}\right) W^{O} \\ H_{i} &=\text { Attention }\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right), i=1, \ldots, h \end{aligned} \]

where

\[W_{i}^{Q} \in \mathbb{R}^{d_{\text {model }} \times d_{k}}, \quad W_{i}^{K} \in \mathbb{R}^{d_{\text {model }} \times d_{k}}, \quad W_{i}^{V} \in \mathbb{R}^{d_{\text {model }} \times d_{v}}, \quad W^{O} \in \mathbb{R}^{h d_{v} \times d_{\text {model }}} \]

\(\textbf{Positional information:}\)

\[\begin{gathered} P E_{t, 2 i}=\sin \left(\frac{t}{10{,}000^{\frac{2 i}{d_{\text {model }}}}}\right) \\ P E_{t, 2 i+1}=\cos \left(\frac{t}{10{,}000^{\frac{2 i}{d_{\text {model }}}}}\right) . \end{gathered} \]
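A minimal sketch producing this encoding (the length and \(d_{\text{model}}\) are arbitrary):

import math
import torch

T, d_model = 50, 32
t = torch.arange(T, dtype = torch.float).unsqueeze(1)        # positions, (T, 1)
i2 = torch.arange(0, d_model, 2, dtype = torch.float)        # the exponents 2i
div = torch.exp(- math.log(10000.0) * i2 / d_model)          # 1 / 10000^(2i / d_model)

pe = torch.empty(T, d_model)
pe[:, 0::2] = torch.sin(t * div)
pe[:, 1::2] = torch.cos(t * div)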

\(\Large\text{Overall Illustration: refer }\) Lecture-P5

BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2018) is a transformer pre-trained with:

  • Masked Language Model (MLM), which consists in predicting words [\(15\)% of them] that have been replaced with a “[MASK]” token.
  • Next Sentence Prediction (NSP), which consists in predicting whether a given sentence follows the current one.

\(\Large\text{Illustration: refer }\) Lecture-P14
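The MLM objective can be probed directly with HuggingFace’s fill-mask pipeline (the model name and the sentence are just an example):

from transformers import pipeline

fill = pipeline('fill-mask', model = 'bert-base-uncased')

# BERT predicts the token hidden behind [MASK]
for p in fill('Attention is all you [MASK].'):
    print(p['token_str'], p['score'])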

\(\text{GPT: a transformer trained for auto-regressive text generation}\) Lecture-P18

We can use HuggingFace’s pre-trained models:

import torch

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

tokens = tokenizer.encode('Studying Deep-Learning is')

for k in range(100): # no more than 100 tokens
    outputs = model(torch.tensor([tokens])).logits
    next_token = torch.argmax(outputs[0, -1]).item()
    tokens.append(next_token)
    if tokenizer.decode([next_token]) == '.': break

print(tokenizer.decode(tokens))

Vision Transformers

\(\Large\text{Illustration: refer }\) Lecture-P31
