Deep Learning Week 12 Notes

1. Recurrent Neural Networks

Temporal Convolutional Networks

Such a model is a standard \(1\)d convolutional network that processes an input of the maximum possible length.
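
A minimal sketch of such a network, assuming dilated causal convolutions and inputs padded to the maximum length \(T\) (the depth, kernel size and channel counts here are illustrative choices, not the lecture's):

import torch
from torch import nn

# A sketch of a temporal convolutional network: a stack of dilated causal 1d
# convolutions applied to a (B, D, T) input padded to the maximum length T.
class TinyTCN(nn.Module):
    def __init__(self, dim_input, dim_hidden, dim_output, nb_layers = 3):
        super().__init__()
        layers, in_ch = [], dim_input
        for k in range(nb_layers):
            dilation = 2 ** k
            # pad on the left only, so that each output depends on the past alone
            layers += [ nn.ConstantPad1d((2 * dilation, 0), 0.0),
                        nn.Conv1d(in_ch, dim_hidden, kernel_size = 3, dilation = dilation),
                        nn.ReLU() ]
            in_ch = dim_hidden
        self.conv = nn.Sequential(*layers)
        self.readout = nn.Conv1d(dim_hidden, dim_output, kernel_size = 1)

    def forward(self, x):                      # x is of size (B, D, T)
        return self.readout(self.conv(x))      # result is of size (B, dim_output, T)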

RNN and backprop through time

The historical approach to processing sequences of variable size relies on a recurrent model which maintains a recurrent state updated at each time step.

With \(\mathscr{X} = \mathbb{R}^D\), given a mapping

\[\Phi(\cdot;w):\mathbb{R}^D\times \mathbb{R}^Q\rightarrow\mathbb{R}^Q, \]

an input sequence \(x\in S(\mathbb{R}^D)\) and an initial recurrent state \(h_0\in \mathbb{R}^Q\), the model
computes the sequence of recurrent states iteratively:

\[h_t = \Phi(x_t,h_{t-1};w) \]

A prediction can be computed at any time step from the recurrent state:

\[y_t = \Psi(h_t;w) \]

with a readout function:

\[\Psi(\cdot;w):\mathbb{R}^Q\rightarrow\mathbb{R}^C \]

\(\large\text{Illustration: “backpropagation through time” see: }\) Lecture-P11
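
A toy illustration of what the unfolding means for autograd (the update here is an arbitrary scalar example, not the lecture's \(\Phi\)): the loss computed at the final step backpropagates through every copy of the update, so \(w\) accumulates gradient contributions from all time steps.

import torch

# Unroll h_t = tanh(w * x_t + h_{t-1}) over T = 10 steps and backpropagate
# from a loss on the final state h_T.
w = torch.tensor(0.5, requires_grad = True)
x = torch.randn(10)
h = torch.zeros(())
for t in range(x.size(0)):
    h = torch.tanh(w * x[t] + h)
loss = h ** 2
loss.backward()        # traverses the whole unrolled graph
print(w.grad)          # sum of the per-time-step contributions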

One-hot vectors:

>>> nb_symbols = 6
>>> s = torch.tensor([0, 1, 2, 3, 2, 1, 0, 5, 0, 5, 0])
>>> x = F.one_hot(s, num_classes = nb_symbols)
>>> x
tensor([[1, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0]])

\(\textbf{Elman Network:}\) with \(h_0=0\), the update:

\[h_t = \text{ReLU}(W_{(x\ h)}x_t+W_{(h\ h)}h_{t-1}+b_{(h)}) \]

and the final prediction:

\[y_T = W_{(h\ y)}h_T+b_{(y)} \]

\(\text{Code:}\)


import torch
from torch import nn
from torch.nn import functional as F

class RecNet(nn.Module):
    def __init__(self, dim_input, dim_recurrent, dim_output):
        super().__init__()
        self.fc_x2h = nn.Linear(dim_input, dim_recurrent)
        self.fc_h2h = nn.Linear(dim_recurrent, dim_recurrent, bias = False)
        self.fc_h2y = nn.Linear(dim_recurrent, dim_output)

    def forward(self, input):
        # input is of size (batch, T, dim_input); start from h_0 = 0
        h = input.new_zeros(input.size(0), self.fc_h2y.weight.size(1))
        for t in range(input.size(1)):
            h = F.relu(self.fc_x2h(input[:, t]) + self.fc_h2h(h))
        # readout from the final recurrent state
        return self.fc_h2y(h)

Gating

When unfolded through time, the model depth is proportional to the input length, and training it involves in particular dealing with vanishing gradients.

An important idea in the RNN models used in practice is to add, in one form or another, a pass-through, so that the recurrent state does not go repeatedly through a squashing non-linearity.

For instance, the recurrent state update can be a per-component weighted average of its previous value \(h_{t−1}\) and a full update \(\bar{h}_t\), with the weighting \(z_t\) depending on the input and the recurrent state, acting as a “forget gate”.

So the model has an additional “gating” output:

\[f:\mathbb{R}^D\times\mathbb{R}^Q\rightarrow [0,1]^Q \]

and the update rule takes the form:

\[\begin{align} \bar{h}_t &= \Phi(x_t,h_{t-1})\\ z_t &= f(x_t,h_{t-1})\\ h_t &= z_t\odot h_{t-1}+(1-z_t)\odot \bar{h}_t \end{align} \]

We can improve our minimal example with such a mechanism, replacing:

\[h_t = \text{ReLU}(W_{(x\ h)}x_t+W_{(h\ h)}h_{t-1}+b_{(h)}) \]

with

\[\begin{aligned} &\bar{h}_{t}=\operatorname{ReLU}\left(W_{(x h)}x_t+W_{(h h)} h_{t-1}+b_{(h)}\right) \quad \text { (full update) }\\ &z_{t}=\operatorname{sigm}\left(W_{(x z)} x_{t}+W_{(h z)} h_{t-1}+b_{(z)}\right) \quad \text { (forget gate) }\\ &h_{t}=z_{t} \odot h_{t-1}+\left(1-z_{t}\right) \odot \bar{h}_{t} \quad \text { (recurrent state) } \end{aligned} \]

\(\text{Code:}\)

class RecNetWithGating(nn.Module):
    def __init__(self, dim_input, dim_recurrent, dim_output):
        super().__init__()
        self.fc_x2h = nn.Linear(dim_input, dim_recurrent)
        self.fc_h2h = nn.Linear(dim_recurrent, dim_recurrent, bias = False)
        self.fc_x2z = nn.Linear(dim_input, dim_recurrent)
        self.fc_h2z = nn.Linear(dim_recurrent, dim_recurrent, bias = False)
        self.fc_h2y = nn.Linear(dim_recurrent, dim_output)

    def forward(self, input):
        h = input.new_zeros(input.size(0), self.fc_h2y.weight.size(1))
        for t in range(input.size(1)):
            z = torch.sigmoid(self.fc_x2z(input[:, t]) + self.fc_h2z(h))  # forget gate
            hb = F.relu(self.fc_x2h(input[:, t]) + self.fc_h2h(h))        # full update
            h = z * h + (1 - z) * hb                                      # recurrent state
        return self.fc_h2y(h)

An intuitive explanation of this improvement is that if the gradient vanishes at one of the time steps, some information can still flow back through time thanks to the pass-through.

2. LSTM and GRU

The Long Short-Term Memory unit (LSTM) of Hochreiter and Schmidhuber (1997) is a recurrent network that originally had a gating of the form:

\[c_t = c_{t-1}+i_t\odot g_t \]

where \(c_t\) is a recurrent state, \(i_t\) is a gating function and \(g_t\) is a full update. This ensures that the derivatives of the loss w.r.t. \(c_t\) do not vanish.

In the LSTM, this recurrent state is called a “cell state”, and we denote it \(c_t\).

\(\Large\textbf{Note:}\)
The recurrent state is composed of a “cell state” \(c_t\) and an “output state” \(h_t\). Gate \(f_t\) modulates if the cell state should be forgotten, \(i_t\) if the new update should be taken into account, and \(o_t\) if the output state should be reset:

\[\begin{aligned} &f_{t}=\operatorname{sigm}\left(W_{(x f)}x_t+W_{(h f)} h_{t-1}+b_{(f)}\right) \quad \text { (forget gate) }\\ &i_{t}=\operatorname{sigm}\left(W_{(x i)} x_{t}+W_{(h i)} h_{t-1}+b_{(i)}\right) \quad \text { (input gate) }\\ &g_{t}=\tanh \left(W_{(x c)}x_t+W_{(h c)} h_{t-1}+b_{(c)}\right) \quad \text { (full cell state update) }\\ &c_{t}=f_{t} \odot c_{t-1}+i_{t} \odot g_{t} \quad \text { (cell state) }\\ &o_{t}=\operatorname{sigm}\left(W_{(x o)} x_{t}+W_{(h o)} h_{t-1}+b_{(o)}\right) \quad \text { (output gate) }\\ &h_{t}=o_{t} \odot \tanh \left(c_{t}\right) \quad \text { (output state) } \end{aligned} \]

As pointed out by Gers et al. (2000), the forget bias \(b_{(f)}\) should be initialized with large values so that initially \(f_t\approx 1\) and the gating has no effect.

Prediction is done from the \(h_t\) state, hence called the output state.

When several layers of LSTM are combined, the first layer takes as input the sequence \(x_t\) itself, while the next layers take as input the \(\textbf{output state}\) of the previous layer, the \(h_t\). \(\large\text{See: }\) Lecture-P6
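
Written out in the same style as RecNetWithGating above, a minimal single-layer sketch of these equations (my own illustration; torch.nn.LSTM below is the standard, optimized implementation):

import torch
from torch import nn

class LSTMNet(nn.Module):
    def __init__(self, dim_input, dim_recurrent, dim_output):
        super().__init__()
        # one pair of linear maps (from x_t and from h_{t-1}) per gate
        self.fc_x2f = nn.Linear(dim_input, dim_recurrent)
        self.fc_h2f = nn.Linear(dim_recurrent, dim_recurrent, bias = False)
        self.fc_x2i = nn.Linear(dim_input, dim_recurrent)
        self.fc_h2i = nn.Linear(dim_recurrent, dim_recurrent, bias = False)
        self.fc_x2g = nn.Linear(dim_input, dim_recurrent)
        self.fc_h2g = nn.Linear(dim_recurrent, dim_recurrent, bias = False)
        self.fc_x2o = nn.Linear(dim_input, dim_recurrent)
        self.fc_h2o = nn.Linear(dim_recurrent, dim_recurrent, bias = False)
        self.fc_h2y = nn.Linear(dim_recurrent, dim_output)
        # following Gers et al. (2000), the forget bias could be initialized
        # to a large value so that initially f_t is close to 1

    def forward(self, input):
        h = input.new_zeros(input.size(0), self.fc_h2y.weight.size(1))
        c = torch.zeros_like(h)
        for t in range(input.size(1)):
            x_t = input[:, t]
            f = torch.sigmoid(self.fc_x2f(x_t) + self.fc_h2f(h))  # forget gate
            i = torch.sigmoid(self.fc_x2i(x_t) + self.fc_h2i(h))  # input gate
            g = torch.tanh(self.fc_x2g(x_t) + self.fc_h2g(h))     # full cell state update
            c = f * c + i * g                                      # cell state
            o = torch.sigmoid(self.fc_x2o(x_t) + self.fc_h2o(h))  # output gate
            h = o * torch.tanh(c)                                  # output state
        return self.fc_h2y(h)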

\(\textbf{PyTorch:}\) torch.nn.LSTM:
It processes several sequences, and returns two tensors, with \(D\) the number of layers and \(T\) the sequence length (see the shape check below the list):

  • the outputs for all the layers at the last time step: \(h_T^1,...,h_T^D\)
  • the outputs of the last layer at each time step: \(h_1^D,...,h_T^D\)
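
In PyTorch the second return value is actually a pair \((h_n, c_n)\). A quick shape check, with arbitrary sizes and the default batch_first = False, so the time dimension comes first:

>>> lstm = nn.LSTM(input_size = 6, hidden_size = 4, num_layers = 2)
>>> x = torch.randn(11, 3, 6)          # T = 11 time steps, batch of B = 3
>>> output, (h_n, c_n) = lstm(x)
>>> output.size()                      # last layer at every time step
torch.Size([11, 3, 4])
>>> h_n.size()                         # every layer at the last time step
torch.Size([2, 3, 4])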

PyTorch’s RNNs can process batches of sequences of the same length, which can be encoded in a regular tensor, or batches of sequences of various lengths using the type nn.utils.rnn.PackedSequence

Such an object can be created with nn.utils.rnn.pack_padded_sequence

>>> from torch.nn.utils.rnn import pack_padded_sequence
>>> pack_padded_sequence(torch.tensor([[[ 1. ], [ 2. ]],
...                                     [[ 3. ], [ 4. ]],
...                                     [[ 5. ], [ 0. ]]]),
...                      torch.tensor([3, 2]))
PackedSequence(data=tensor([[1.],
[2.],
[3.],
[4.],
[5.]]), batch_sizes=tensor([2, 2, 1]),
sorted_indices=None, unsorted_indices=None)

The LSTM was simplified into the Gated Recurrent Unit (GRU) by Cho et al. (2014), with a single gating for the recurrent state, and a reset gate.

\[\begin{aligned} &r_{t}=\operatorname{sigm}\left(W_{(x r)} x_{t}+W_{(h r)} h_{t-1}+b_{(r)}\right) \quad \text { (reset gate) }\\ &z_{t}=\operatorname{sigm}\left(W_{(x z)} x_{t}+W_{(h z)} h_{t-1}+b_{(z)}\right) \quad \text { (forget gate) }\\ &\bar{h}_{t}=\tanh \left(W_{(x h)}x_t+W_{(h h)}\left(r_{t} \odot h_{t-1}\right)+b_{(h)}\right) \quad \text { (full update) }\\ &h_{t}=z_{t} \odot h_{t-1}+\left(1-z_{t}\right) \odot \bar{h}_{t} \quad \text { (hidden update) } \end{aligned} \]
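
As a sketch, the same equations written in the style of the previous modules (torch.nn.GRU is the standard implementation):

import torch
from torch import nn

class GRUNet(nn.Module):
    def __init__(self, dim_input, dim_recurrent, dim_output):
        super().__init__()
        self.fc_x2r = nn.Linear(dim_input, dim_recurrent)
        self.fc_h2r = nn.Linear(dim_recurrent, dim_recurrent, bias = False)
        self.fc_x2z = nn.Linear(dim_input, dim_recurrent)
        self.fc_h2z = nn.Linear(dim_recurrent, dim_recurrent, bias = False)
        self.fc_x2h = nn.Linear(dim_input, dim_recurrent)
        self.fc_h2h = nn.Linear(dim_recurrent, dim_recurrent, bias = False)
        self.fc_h2y = nn.Linear(dim_recurrent, dim_output)

    def forward(self, input):
        h = input.new_zeros(input.size(0), self.fc_h2y.weight.size(1))
        for t in range(input.size(1)):
            x_t = input[:, t]
            r = torch.sigmoid(self.fc_x2r(x_t) + self.fc_h2r(h))    # reset gate
            z = torch.sigmoid(self.fc_x2z(x_t) + self.fc_h2z(h))    # forget gate
            hb = torch.tanh(self.fc_x2h(x_t) + self.fc_h2h(r * h))  # full update
            h = z * h + (1 - z) * hb                                 # hidden update
        return self.fc_h2y(h)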

The specific form of these units prevents the gradient from vanishing, but it may still be excessively large on certain mini-batches. The standard strategy to solve this issue is gradient norm clipping:

\[\widetilde{\nabla f}=\frac{\nabla f}{\|\nabla f\|} \min (\|\nabla f\|, \delta) \]

\(\textbf{PyTorch: }\) torch.nn.utils.clip_grad_norm_

>>> x = torch.empty(10)
>>> x.grad = x.new(x.size()).normal_()
>>> y = torch.empty(5)
>>> y.grad = y.new(y.size()).normal_()
>>> torch.cat((x.grad, y.grad)).norm()
tensor(4.0303)
>>> torch.nn.utils.clip_grad_norm_((x, y), 5.0)
tensor(4.0303)
>>> torch.cat((x.grad, y.grad)).norm()
tensor(4.0303)
>>> torch.nn.utils.clip_grad_norm_((x, y), 1.25)
tensor(4.0303)
>>> torch.cat((x.grad, y.grad)).norm()
tensor(1.2500)

3. Word embeddings and translation

Let

\[k_t\in \{1,...,W \}, t = 1,...,T \]

be the training sequence of \(T\) words, encoded as IDs from a vocabulary of \(W\) words.

Given an embedding dimension \(D\), the objective is to learn vectors:

\[E_k\in \mathbb{R}^D,k\in \{1,...,W\} \]

so that “similar” words are embedded with “similar” vectors.

A common word embedding is the Continuous Bag of Words (CBOW) version of word2vec.

\(\large\textbf{Note: }\)In this model, the embedding vectors are chosen so that a word can be [linearly] predicted from the sum of the embeddings of words around it.

Formally, let \(C\in \mathbb{N}^*\) be the "context size", and:

\[\mathscr{C}_t = (k_{t-C},...,k_{t-1},k_{t+1},...,k_{t+C}) \]

be the “context” around \(k_t\). The embedding vectors \(E_k\in \mathbb{R}^D,k = 1,...,W\) are jointly optimized with an array:

\[M\in\mathbb{R}^{W\times D} \]

so that the vector of scores:

\[\psi(t) = M\sum_{k\in\mathscr{C}_t}E_k\in \mathbb{R}^W \]

is a good predictor of the value of \(k_t\).

Ideally we would minimize the cross-entropy between the vector of scores \(\psi(t) \in\mathbb{R}^W\) and the class \(k_t\):

\[\sum_{t}-\log \left(\frac{\exp \psi(t)_{k_t}}{\sum_{k=1}^{W} \exp \psi(t)_{k}}\right) \]

Negative Sampling

The “negative sampling” approach uses the prediction for the correct class \(k_t\) and only \(Q\ll W\) incorrect classes \(\kappa_{t,1},...,\kappa_{t,Q}\) sampled at random.

In our implementation we take the latter uniformly in \(\{1,...,W\}\), and use the loss:

\[\sum_{t}\left(\log \left(1+e^{-\psi(t)_{k_{t}}}\right)+\sum_{q=1}^{Q} \log \left(1+e^{\psi(t)_{\kappa_{t, q}}}\right)\right) \]

We want \(\psi(t)_{k_t}\) to be large and all the \(\psi(t)_{\kappa_{t,q}}\) to be small.
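
A minimal sketch of this loss in PyTorch, assuming the scores for a batch are arranged as a \(B\times(1+Q)\) tensor whose first column is \(\psi(t)_{k_t}\) and whose remaining columns are the \(\psi(t)_{\kappa_{t,q}}\) (this layout is my own convention):

import torch
from torch.nn import functional as F

# log(1 + exp(-s)) = softplus(-s) for the true word (first column),
# log(1 + exp(s))  = softplus(s)  for the Q sampled negative words.
def negative_sampling_loss(scores):
    return F.softplus(-scores[:, 0]).sum() + F.softplus(scores[:, 1:]).sum()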

Embedding

The PyTorch module nn.Embedding maps word IDs to learnable embedding vectors, which is precisely what we need here. It is parametrized with a number \(N\) of words to embed, and an embedding dimension \(D\).

>>> e = nn.Embedding(num_embeddings = 10, embedding_dim = 3)
>>> x = torch.tensor([[1, 1, 2, 2], [0, 1, 9, 9]])
>>> y = e(x)
>>> y.size()
torch.Size([2, 4, 3])
>>> y
tensor([[[ 1.3179, -0.0637, 0.9210],
[ 1.3179, -0.0637, 0.9210],
[ 0.2491, -0.8094, 0.1276],
[ 0.2491, -0.8094, 0.1276]],
[[ 1.2158, -0.4927, 0.4920],
[ 1.3179, -0.0637, 0.9210],
[ 1.1499, -0.9049, 0.6532],
[ 1.1499, -0.9049, 0.6532]]], grad_fn=<EmbeddingBackward0>)
>>> e.weight[1]
tensor([ 1.3179, -0.0637, 0.9210], grad_fn=<SelectBackward0>)

In this example, we consider an embedding that maps \(10\) words to a space of dimension \(3\). The words are referred to with their index, between \(0\) and \(9\).

Our CBOW model has as parameters two embeddings:

\[E\in\mathbb{R}^{W\times D},M\in\mathbb{R}^{W\times D} \]

Its forward gets as input a pair \((c, d)\) of integer tensors corresponding to a batch of size \(B\):

  • \(c\) of size \(B × 2C\) contains the IDs of the words in a context, and
  • \(d\) of size \(B × R\) contains the IDs, for each of the \(B\) contexts, of \(R\) words for which we want predicted scores.

It returns a tensor \(y\) of size \(B × R\) containing the dot products:

\[y[n, j]=\frac{1}{D} M_{d[n, j]} \cdot\left(\sum_{i} E_{c[n, i]}\right) \]

class CBOW(nn.Module):
    def __init__(self, voc_size = 0, embed_dim = 0):
        super().__init__()
        self.embed_dim = embed_dim
        self.embed_E = nn.Embedding(voc_size, embed_dim)  # E: context-word embeddings
        self.embed_M = nn.Embedding(voc_size, embed_dim)  # M: score array

    def forward(self, c, d):
        # c: B x 2C context word IDs, d: B x R word IDs to score
        sum_w_E = self.embed_E(c).sum(1, keepdim = True).transpose(1, 2)  # B x D x 1
        w_M = self.embed_M(d)                                             # B x R x D
        return w_M.bmm(sum_w_E).squeeze(2) / self.embed_dim               # B x R
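
A quick check of the sizes involved (the vocabulary size, batch size and numbers of words here are arbitrary):

>>> model = CBOW(voc_size = 100, embed_dim = 16)
>>> c = torch.randint(100, (5, 4))     # B = 5 contexts of 2C = 4 word IDs
>>> d = torch.randint(100, (5, 3))     # R = 3 candidate words per context
>>> model(c, d).size()
torch.Size([5, 3])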

Regarding the loss, we can use nn.BCEWithLogitsLoss which implements:

\[\sum_{t} y_{t} \log \left(1+\exp \left(-x_{t}\right)\right)+\left(1-y_{t}\right) \log \left(1+\exp \left(x_{t}\right)\right) \]
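
For instance, continuing the shape check above and assuming that for every context the first of the \(R\) words in d is the correct one and the others are negative samples, the targets are \(1\) in the first column and \(0\) elsewhere (a sketch, not the lecture's training code):

y = model(c, d)                              # B x R scores
target = torch.zeros_like(y)
target[:, 0] = 1.0                           # true word in the first column
loss = nn.BCEWithLogitsLoss(reduction = 'sum')(y, target)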

\(\Large\text{Illustration: }\) Lecture-P19

An alternative algorithm is the skip-gram model, which optimizes the embedding so that a word can be predicted by any individual word in its context. The skip-gram model aims at predicting the embedding vectors of the words around, given the embedding vector of the word in the middle.
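
A minimal sketch of the corresponding scoring module, mirroring the CBOW code above (a hypothetical illustration, not the lecture's implementation): here each of the \(R\) candidate words is scored from the embedding of the center word alone.

from torch import nn

class SkipGram(nn.Module):
    def __init__(self, voc_size = 0, embed_dim = 0):
        super().__init__()
        self.embed_dim = embed_dim
        self.embed_E = nn.Embedding(voc_size, embed_dim)  # embeddings of center words
        self.embed_M = nn.Embedding(voc_size, embed_dim)  # score array

    def forward(self, center, d):
        # center: B center-word IDs, d: B x R word IDs to score
        w_E = self.embed_E(center).unsqueeze(2)            # B x D x 1
        w_M = self.embed_M(d)                              # B x R x D
        return w_M.bmm(w_E).squeeze(2) / self.embed_dim    # B x R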

posted on 2022-06-06 05:36 by Blackzxy