The Attention Mechanism (Repost)

Reposted from: http://www.cosmosshadow.com/ml/%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C/2016/03/08/Attention.html

Attention

References

Survey on Advanced Attention-based Models
Recurrent Models of Visual Attention (2014.06.24)
Recurrent Model of Visual Attention (blog)
https://github.com/Element-Research/rnn/blob/master/scripts/evaluate-rva.lua
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015.02.10)
Soft Attention Mechanism for Neural Machine Translation
DRAW: A Recurrent Neural Network For Image Generation (2015.05.20)
Teaching Machines to Read and Comprehend (2015.06.04)
Learning Wake-Sleep Recurrent Attention Models (2015.09.22)
Action Recognition using Visual Attention (2015.10.12)
Recurrent Convolutional Neural Network for Object Recognition (2015)
Understanding Deep Architectures using a Recursive Convolutional Network (2014.02.19)
Multiple Object Recognition with Visual Attention (2015.04.23)
Recursive Recurrent Nets with Attention Modeling for OCR in the Wild (2016.03.09)
https://github.com/Element-Research/rnn/blob/master/examples/recurrent-visual-attention.lua (code)

Attention

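Before attention was introduced, image recognition and machine translation models took the complete image or sentence as a single input and produced the output from it.
Images were also commonly rescaled to a fixed size, which loses information.
A person looking at a scene, by contrast, moves their gaze across the regions of interest, sometimes studying particular details closely, and only then reaches a conclusion.
Attention adds to the network a mechanism for moving, scaling, and rotating a region of focus, so that the input becomes a sequence of successive partial views.
The movement, scaling, and rotation of the region of focus are learned with reinforcement learning.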

Recurrent models of visual attention

Reference: Recurrent Models of Visual Attention (2014.06.24)

Model

The model is called the Recurrent Attention Model, or RAM for short.

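A. Glimpse Sensor: at step $t$, the sensor extracts from the image $x_{t}$ a retina-like representation $\rho(x_{t}, l_{t-1})$, made of several patches of increasing size centered at location $l_{t-1}$, each resized to a common resolution.
B. Glimpse Network: combines the glimpse $\rho(x_{t}, l_{t-1})$ and the location $l_{t-1}$ into a single glimpse feature vector $g_{t}$.
C. Model Architecture: the recurrent core takes $g_{t}$ and the previous hidden state $h_{t-1}$, produces the new state $h_{t}$, and from $h_{t}$ emits the next location $l_{t}$ and the action $a_{t}$.

At each iteration the model can also output scaling information and a stop flag.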
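As a rough illustration of what the glimpse sensor does, the plain-Python sketch below cuts square patches of increasing size around one location and subsamples each of them to the same resolution. The function names, the zero padding outside the image, and the nearest-neighbour subsampling are assumptions made for this example, not details taken from the paper.

```python
def crop(image, cy, cx, size):
    """Cut a size x size patch around (cy, cx); pixels outside the image are 0."""
    half = size // 2
    patch = []
    for y in range(cy - half, cy - half + size):
        row = []
        for x in range(cx - half, cx - half + size):
            inside = 0 <= y < len(image) and 0 <= x < len(image[0])
            row.append(image[y][x] if inside else 0)
        patch.append(row)
    return patch


def subsample(patch, out_size):
    """Nearest-neighbour subsampling of a square patch to out_size x out_size."""
    step = len(patch) // out_size
    return [[patch[y * step][x * step] for x in range(out_size)]
            for y in range(out_size)]


def glimpse(image, cy, cx, base=2, scales=3):
    """Patches of side base, 2*base, 4*base, ..., each resized to base x base."""
    return [subsample(crop(image, cy, cx, base * 2 ** k), base)
            for k in range(scales)]


# An 8x8 test image whose pixel value encodes its position: value = 10*row + col.
image = [[10 * r + c for c in range(8)] for r in range(8)]
patches = glimpse(image, cy=4, cx=4)
```

With `base=2` and `scales=3` the three patches cover 2x2, 4x4, and 8x8 pixels around the location, yet each is stored as a 2x2 grid, so the level of detail falls off with distance from the centre.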

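Training

The network parameters can be written as $\theta = \{\theta_{g}, \theta_{h}, \theta_{a}, \theta_{l}\}$, the parameters of the glimpse network, the core network, the action network, and the location network.

At the final step the reward is 1 if the predicted class is correct and 0 otherwise. At all other time steps the reward is 0.
The expected reward is

$$J(\theta) = \mathbb{E}_{p(s_{1:T}; \theta)}\left[\sum_{t=1}^{T} r_{t}\right] = \mathbb{E}_{p(s_{1:T}; \theta)}[R]$$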

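The goal of reinforcement learning is to maximize $J$. Its gradient is

$$\nabla_{\theta} J = \sum_{t=1}^{T} \mathbb{E}_{p(s_{1:T}; \theta)}\left[\nabla_{\theta} \log \pi(u_{t} \mid s_{1:t}; \theta)\, R\right] \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T} \nabla_{\theta} \log \pi(u_{t}^{i} \mid s_{1:t}^{i}; \theta)\, R^{i}$$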

where $i = 1 \dots M$ indexes the $M$ sampled episodes.
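The sampled gradient estimate can be checked on a toy problem. The sketch below assumes a one-step episode ($T = 1$), a softmax policy over three actions with logits `theta`, and a reward of 1 when the sampled action equals a target class. All of these are choices made for the example, not the RAM setup itself. For a softmax policy, $\nabla_{\theta} \log \pi(u) = \mathrm{onehot}(u) - \pi$, so the estimate can be compared with the exact gradient of $J = \pi(\text{target})$.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(probs, action):
    """Gradient of log softmax(theta)[action] w.r.t. theta: onehot(action) - probs."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def reinforce_gradient(theta, target, num_samples, rng):
    """Monte Carlo estimate (1/M) * sum_i grad log pi(u_i) * R_i for one-step episodes."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(num_samples):
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = 1.0 if action == target else 0.0
        for k, g in enumerate(grad_log_pi(probs, action)):
            grad[k] += g * reward / num_samples
    return grad


theta = [0.0, 0.5, -0.5]
target = 1
probs = softmax(theta)
# Exact gradient of J = pi(target): pi_target * (onehot(target) - pi).
exact = [probs[target] * ((1.0 if k == target else 0.0) - p)
         for k, p in enumerate(probs)]
estimate = reinforce_gradient(theta, target, num_samples=20000, rng=random.Random(0))
```

With 20000 samples the estimate should agree with `exact` to within sampling noise.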

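During training, $\nabla_{\theta} \log \pi(u_{t}^{i} \mid s_{1:t}^{i}; \theta)$ does not need to be derived by hand: it is the gradient of the RNN that defines the policy and can be obtained with standard backpropagation.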

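The estimate above is unbiased but can have high variance, so the following estimator is used instead

$$\frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T} \nabla_{\theta} \log \pi(u_{t}^{i} \mid s_{1:t}^{i}; \theta)\left(R_{t}^{i} - b_{t}\right)$$

where $R_{t}^{i}$ is the cumulative reward of episode $i$ and $b_{t} = \mathbb{E}_{\pi}[R_{t}]$ is a baseline, the expected cumulative reward under the policy $\pi$.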
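The effect of the baseline can be seen exactly on a small case, without sampling. The sketch assumes a one-step episode and a softmax policy over three actions (both are assumptions of the example). It enumerates every action to compute the exact mean and total variance of the per-sample term $\nabla_{\theta} \log \pi(u)\,(R - b)$, once with $b = 0$ and once with $b = \mathbb{E}[R]$.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimator_stats(probs, rewards, baseline):
    """Exact mean vector and total variance of grad log pi(u) * (R(u) - baseline)."""
    n = len(probs)
    mean = [0.0] * n
    second_moment = 0.0
    for u, p_u in enumerate(probs):
        # For a softmax policy, grad log pi(u) = onehot(u) - probs.
        g = [((1.0 if k == u else 0.0) - probs[k]) * (rewards[u] - baseline)
             for k in range(n)]
        for k in range(n):
            mean[k] += p_u * g[k]
        second_moment += p_u * sum(v * v for v in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


probs = softmax([0.0, 0.5, -0.5])
rewards = [0.0, 1.0, 0.0]                       # reward 1 only for the correct class
b = sum(p * r for p, r in zip(probs, rewards))  # b = E[R]

mean_plain, var_plain = estimator_stats(probs, rewards, baseline=0.0)
mean_base, var_base = estimator_stats(probs, rewards, baseline=b)
```

The two mean vectors coincide, because subtracting a constant baseline does not bias the gradient, while the total variance is lower with the baseline.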

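Results

A figure in the paper shows how the glimpse moves on the enlarged, cluttered MNIST dataset.
Solid green dots mark the start of a trajectory and hollow green dots mark its end.
The RAM model moves toward the regions of interest.
On this dataset its recognition accuracy is better than that of both fully connected networks and CNN-based networks.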

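Attention-based image caption generation

Reference: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015.02.10)

[Figure: an input image $\Longrightarrow$ the generated caption "A bird flying over a body of water"]

As in the example above, the task is to generate a caption that describes a given image.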

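Model

As the model figure in the paper shows, the image is first passed through a CNN and turned into a feature map.
An LSTM-based RNN then runs the attention model over this feature map and finally produces the caption.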

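Encoding

The feature map is cut evenly into regions, written as

$$a = \{a_{1}, \dots, a_{L}\}, \quad a_{i} \in \mathbb{R}^{D}$$

$L$ is the number of regions and $D$ is the dimension of the feature vector of each region.
For example, a $14 \times 14$ feature map gives $L = 196$ regions ($D = 512$ in the paper).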
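A small sketch of this flattening step, using nested Python lists. The helper name `flatten_feature_map` and the toy channel count are assumptions of the example.

```python
def flatten_feature_map(feature_map):
    """Turn an H x W x D feature map (nested lists) into L = H*W vectors of length D."""
    return [list(cell) for row in feature_map for cell in row]


# Toy 14 x 14 map with D = 4 channels, chosen only to keep the example small.
# Each cell stores [row, col, 0, 0] so the ordering is easy to check.
H, W, D = 14, 14, 4
feature_map = [[[r, c, 0, 0] for c in range(W)] for r in range(H)]
a = flatten_feature_map(feature_map)
```

Here `a` holds $L = 196$ annotation vectors in row-major order, each of length $D$.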

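The output caption $y$ can be encoded as

$$y = \{y_{1}, \dots, y_{C}\}, \quad y_{i} \in \mathbb{R}^{K}$$

$K$ is the number of words in the dictionary and $C$ is the length of the sentence.
Each $y_{i}$ is a one-hot vector: the entry at the index of the corresponding word is 1 and all other entries are 0.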
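For illustration, a one-hot caption encoding can be sketched as follows. The five-word dictionary is hypothetical and only serves the example.

```python
def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec


def encode_caption(words, vocab):
    """Map each word to a K-dimensional one-hot vector, K = len(vocab)."""
    word_to_index = {w: i for i, w in enumerate(vocab)}
    return [one_hot(word_to_index[w], len(vocab)) for w in words]


vocab = ["a", "bird", "flying", "over", "water"]    # hypothetical toy dictionary, K = 5
y = encode_caption(["a", "bird", "flying"], vocab)  # caption of length C = 3
```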


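Decoding

The model uses an LSTM (drawn in a figure in the paper) that computes

$$\begin{pmatrix} i_{t} \\ f_{t} \\ o_{t} \\ g_{t} \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m+n,\, n} \begin{pmatrix} E y_{t-1} \\ h_{t-1} \\ \hat{z}_{t} \end{pmatrix}$$

$$c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot g_{t}$$

$$h_{t} = o_{t} \odot \tanh(c_{t})$$

where $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, $T_{s,t}$ is a learned affine map from $\mathbb{R}^{s}$ to $\mathbb{R}^{t}$, $m$ and $n$ are the embedding and LSTM dimensions, and $E \in \mathbb{R}^{m \times K}$ is an embedding matrix that is initialized randomly and then learned.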
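One decoder step can be sketched in plain Python as below. The affine map is represented by a weight matrix `W` with the four gates stacked row-wise plus a bias `b`. This layout, the function name `lstm_step`, and the toy sizes are assumptions of the example.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(W, b, embedded_word, h_prev, c_prev, z_hat):
    """One decoder step. W has 4n rows (gates i, f, o, g stacked) and m + n + D columns."""
    n = len(h_prev)
    x = embedded_word + h_prev + z_hat            # [E y_{t-1}; h_{t-1}; z_hat_t]
    pre = [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]
    i = [sigmoid(v) for v in pre[0:n]]            # input gate
    f = [sigmoid(v) for v in pre[n:2 * n]]        # forget gate
    o = [sigmoid(v) for v in pre[2 * n:3 * n]]    # output gate
    g = [math.tanh(v) for v in pre[3 * n:4 * n]]  # candidate memory
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


rng = random.Random(0)
m, n, D = 3, 2, 4                                 # toy sizes, for illustration only
W = [[rng.uniform(-0.1, 0.1) for _ in range(m + n + D)] for _ in range(4 * n)]
b = [0.0] * (4 * n)
h, c = lstm_step(W, b, [1.0, 0.0, 0.0], [0.0] * n, [0.0] * n, [0.5] * D)
```

The context vector `z_hat` enters the step exactly like the word embedding and the previous hidden state, as one more block of the concatenated input.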

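${\hat{z}}_{t}$ is the context vector, a dynamic representation of the relevant part of the image at time $t$. It is produced by an attention model, computed as follows

$$e_{t,i} = f_{att}(a_{i}, h_{t-1})$$

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{L} \exp(e_{t,k})}$$

$$\hat{z}_{t} = \phi(\{a_{i}\}, \{\alpha_{t,i}\})$$

where $i$ indexes the feature regions. The two variants of the model differ in how $\phi$ is implemented.
Sampling a single region according to $\alpha$ is called Stochastic “Hard” Attention; taking the $\alpha$-weighted sum of all regions is called Deterministic “Soft” Attention.
In the paper's figure, the top row shows the soft model and the bottom row shows the hard model.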
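The attention weights and the two choices of $\phi$ can be sketched as follows. The scoring function here is a plain dot product between $a_{i}$ and $h_{t-1}$, which requires both to have the same length. That choice is an assumption made to keep the example short and is not the $f_{att}$ used in the paper.

```python
import math
import random


def attention_weights(annotations, h_prev):
    """alpha_i = softmax_i(e_i), with e_i a dot product of a_i and h_prev (assumed f_att)."""
    scores = [sum(x * y for x, y in zip(a_i, h_prev)) for a_i in annotations]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_context(annotations, alpha):
    """Deterministic phi: the alpha-weighted sum of all annotation vectors."""
    dim = len(annotations[0])
    return [sum(w * a_i[d] for w, a_i in zip(alpha, annotations)) for d in range(dim)]


def hard_context(annotations, alpha, rng):
    """Stochastic phi: sample one region index from alpha and return that vector."""
    index = rng.choices(range(len(annotations)), weights=alpha)[0]
    return list(annotations[index])


annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # L = 3 regions, D = 2
h_prev = [2.0, 0.0]
alpha = attention_weights(annotations, h_prev)
z_soft = soft_context(annotations, alpha)
z_hard = hard_context(annotations, alpha, random.Random(0))
```

The soft context blends every region, while the hard context is always exactly one of the annotation vectors.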

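The initial values of the LSTM memory cell and hidden state are predicted by two separate multilayer perceptrons from the average of all feature regions:

$$c_{0} = f_{init,c}\left(\frac{1}{L} \sum_{i=1}^{L} a_{i}\right)$$

$$h_{0} = f_{init,h}\left(\frac{1}{L} \sum_{i=1}^{L} a_{i}\right)$$

The final word probabilities are produced by a deep output layer

$$p(y_{t} \mid a, y_{1}^{t-1}) \propto \exp\left(L_{o}(E y_{t-1} + L_{h} h_{t} + L_{z} \hat{z}_{t})\right)$$

where $L_{o} \in \mathbb{R}^{K \times m}$, $L_{h} \in \mathbb{R}^{m \times n}$, $L_{z} \in \mathbb{R}^{m \times D}$, and $E$ are learned parameters.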
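A minimal sketch of the deep output layer, assuming the form $\exp(L_{o}(E y_{t-1} + L_{h} h_{t} + L_{z} \hat{z}_{t}))$ normalized over the $K$ words. The matrices below are tiny hand-picked values for illustration, not learned parameters.

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def word_distribution(L_o, L_h, L_z, embedded_prev, h_t, z_hat):
    """Softmax over the K words of L_o (E y_{t-1} + L_h h_t + L_z z_hat_t)."""
    hidden = [e + p + q for e, p, q in
              zip(embedded_prev, matvec(L_h, h_t), matvec(L_z, z_hat))]
    logits = matvec(L_o, hidden)
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy sizes: K = 3 words, embedding size m = 2, LSTM size n = 2, annotation size D = 2.
L_o = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # K x m
L_h = [[1.0, 0.0], [0.0, 1.0]]               # m x n
L_z = [[0.0, 0.0], [0.0, 0.0]]               # m x D
p = word_distribution(L_o, L_h, L_z,
                      embedded_prev=[0.0, 0.0], h_t=[2.0, 0.0], z_hat=[1.0, 1.0])
```

With these values the logits are `[2, 0, 0]`, so the first word receives most of the probability mass and the other two share the rest equally.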


Stochastic “Hard” Attention

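$s_{t}$ denotes the region the model attends to when generating the $t$-th word. $s_{t,i}$ is an indicator variable that is 1 if region $i$ is the one selected at step $t$ and 0 otherwise. These variables are computed as follows

$$p(s_{t,i} = 1 \mid s_{j<t}, a) = \alpha_{t,i}$$

$$\hat{z}_{t} = \sum_{i=1}^{L} s_{t,i}\, a_{i}$$

The objective $L_{s}$ is a variational lower bound on the marginal log-likelihood $\log p(y \mid a)$

$$L_{s} = \sum_{s} p(s \mid a) \log p(y \mid s, a) \leq \log \sum_{s} p(s \mid a)\, p(y \mid s, a) = \log p(y \mid a)$$

Differentiating with respect to the parameters $W$ gives

$$\frac{\partial L_{s}}{\partial W} = \sum_{s} p(s \mid a)\left[\frac{\partial \log p(y \mid s, a)}{\partial W} + \log p(y \mid s, a)\, \frac{\partial \log p(s \mid a)}{\partial W}\right]$$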
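The intermediate step behind this gradient is worth writing out. Starting from $L_{s} = \sum_{s} p(s \mid a) \log p(y \mid s, a)$, apply the product rule and then the identity $\partial p / \partial W = p\, \partial \log p / \partial W$ to the second term:

```latex
\begin{aligned}
\frac{\partial L_{s}}{\partial W}
&= \sum_{s} \left[ p(s \mid a)\, \frac{\partial \log p(y \mid s, a)}{\partial W}
   + \frac{\partial p(s \mid a)}{\partial W}\, \log p(y \mid s, a) \right] \\
&= \sum_{s} p(s \mid a) \left[ \frac{\partial \log p(y \mid s, a)}{\partial W}
   + \log p(y \mid s, a)\, \frac{\partial \log p(s \mid a)}{\partial W} \right]
\end{aligned}
```

Writing the gradient as an expectation under $p(s \mid a)$ is what makes it possible to estimate it by sampling $s$.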

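This gradient can be approximated by Monte Carlo sampling

$$\tilde{s}_{t} \sim \mathrm{Multinoulli}_{L}(\{\alpha_{i}\})$$

$$\frac{\partial L_{s}}{\partial W} \approx \frac{1}{N} \sum_{n=1}^{N}\left[\frac{\partial \log p(y \mid \tilde{s}^{n}, a)}{\partial W} + \log p(y \mid \tilde{s}^{n}, a)\, \frac{\partial \log p(\tilde{s}^{n} \mid a)}{\partial W}\right]$$

To reduce the variance of the estimate, a moving-average baseline can be used. At the $k$-th mini-batch it is updated as

$$b_{k} = 0.9 \times b_{k-1} + 0.1 \times \log p(y \mid \tilde{s}_{k}, a)$$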
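The moving-average update is a one-liner. In the sketch below the function name and the list of per-mini-batch log-likelihoods are hypothetical, chosen only to show how the baseline evolves.

```python
def update_baseline(previous, log_likelihood, decay=0.9):
    """b_k = decay * b_{k-1} + (1 - decay) * log p(y | s_k, a) for mini-batch k."""
    return decay * previous + (1.0 - decay) * log_likelihood


# Hypothetical per-mini-batch log-likelihoods, used only to show the update.
log_likelihoods = [-4.0, -3.0, -3.5, -2.0]
b = 0.0
history = []
for value in log_likelihoods:
    b = update_baseline(b, value)
    history.append(b)
```

Because each update keeps 90% of the old value, the baseline follows the recent log-likelihoods smoothly instead of jumping with every mini-batch.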

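To reduce the variance further, the entropy $H(s)$ of the multinoulli distribution is added

$$\frac{\partial L_{s}}{\partial W} \approx \frac{1}{N} \sum_{n=1}^{N}\left[\frac{\partial \log p(y \mid \tilde{s}^{n}, a)}{\partial W} + \lambda_{r}\left(\log p(y \mid \tilde{s}^{n}, a) - b\right) \frac{\partial \log p(\tilde{s}^{n} \mid a)}{\partial W} + \lambda_{e}\, \frac{\partial H(\tilde{s}^{n})}{\partial W}\right]$$

$\lambda_{r}$ and $\lambda_{e}$ are two hyperparameters.
Optimizing with this gradient is a form of reinforcement learning: each choice of the next feature region is nudged in the direction that improves the objective.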

Deterministic “Soft” Attention

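The stochastic model above has to sample the location $s_{t}$ at every step. Alternatively, the expectation of the context vector $\hat{z}_{t}$ can be computed directly

$$\mathbb{E}_{p(s_{t} \mid a)}[\hat{z}_{t}] = \sum_{i=1}^{L} \alpha_{t,i}\, a_{i}$$

This is the Deterministic “Soft” Attention model, which uses $\alpha$ to weight the feature regions of interest.
The whole model is differentiable, so it can be trained end to end with standard backpropagation.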
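That the soft context vector is the expectation of the hard one can be checked numerically. The sketch below uses three hand-picked annotation vectors and fixed attention weights (both assumptions of the example) and compares the closed-form weighted sum with the average of many sampled regions.

```python
import random

annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # L = 3 regions, D = 2
alpha = [0.2, 0.3, 0.5]                               # attention weights at one step

# Soft attention: the expectation sum_i alpha_i * a_i in closed form.
z_soft = [sum(w * a_i[d] for w, a_i in zip(alpha, annotations)) for d in range(2)]

# Hard attention: average many sampled context vectors, region index ~ Multinoulli(alpha).
rng = random.Random(0)
num_samples = 50000
z_mean = [0.0, 0.0]
for _ in range(num_samples):
    sampled = rng.choices(annotations, weights=alpha)[0]
    z_mean = [m + v / num_samples for m, v in zip(z_mean, sampled)]
```

The sample mean `z_mean` approaches `z_soft` as the number of samples grows, which is why the soft model needs no sampling at all.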

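When $\alpha$ is computed, the softmax already guarantees $\sum_{i} \alpha_{t,i} = 1$ at every step. A regularizer is added that also pulls the total weight each region receives over all $C$ steps toward 1:

$$\sum_{t} \alpha_{t,i} \approx 1$$

With this regularizer the generated captions become richer and more descriptive, that is, the results improve.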

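In addition, a scalar gate $\beta_{t}$, predicted from the previous hidden state $h_{t-1}$, is applied when computing ${\hat{z}}_{t}$, so that the context vector becomes

$$\mathbb{E}_{p(s_{t} \mid a)}[\hat{z}_{t}] = \beta_{t} \sum_{i=1}^{L} \alpha_{t,i}\, a_{i}$$

$$\beta_{t} = \sigma(f_{\beta}(h_{t-1}))$$

Finally, the end-to-end objective can be written as

$$L_{d} = -\log(P(y \mid x)) + \lambda \sum_{i}^{L}\left(1 - \sum_{t}^{C} \alpha_{t,i}\right)^{2}$$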
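The objective can be sketched numerically as below. The example assumes that $-\log P(y \mid x)$ is the sum of the per-step negative log-probabilities of the correct words, and it uses made-up probabilities and attention weights.

```python
import math


def soft_attention_loss(word_probs, alphas, lam):
    """L_d = -sum_t log p(y_t) + lam * sum_i (1 - sum_t alphas[t][i])**2.

    word_probs[t] is the probability the model gave to the correct word at step t,
    alphas[t][i] is the attention weight of region i at step t.
    """
    nll = -sum(math.log(p) for p in word_probs)
    num_regions = len(alphas[0])
    totals = [sum(step[i] for step in alphas) for i in range(num_regions)]
    penalty = sum((1.0 - total) ** 2 for total in totals)
    return nll + lam * penalty


word_probs = [0.5, 0.25]                     # C = 2 steps
even = [[0.5, 0.5], [0.5, 0.5]]              # every region gets total attention 1
peaked = [[1.0, 0.0], [1.0, 0.0]]            # region 0 gets 2, region 1 gets 0
loss_even = soft_attention_loss(word_probs, even, lam=1.0)
loss_peaked = soft_attention_loss(word_probs, peaked, lam=1.0)
```

With identical word probabilities, the attention pattern that ignores one region pays a penalty of 2, so the regularizer favours looking at every region at some point during generation.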

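Attention-based character recognition

Reference: Recursive Recurrent Nets with Attention Modeling for OCR in the Wild (2016.03.09)

Model

Recursive / Recurrent CNN

A CNN shares weights within a convolutional layer.
A recursive CNN stacks several layers inside a convolutional layer, and every one of them shares the same convolution kernel: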

$h_{i,j,k}(t) = \{\sigma((w^{hh} \dots$

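A recurrent CNN also stacks several layers inside a convolutional layer, but the original input takes part in every one of them. The kernels may or may not be shared across layers: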

$h_{i,j,k}(t) = \sigma((w^{r}_{k}) \dots$

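Both recursive and recurrent CNNs enlarge the receptive field while using fewer parameters than a stack of separate convolutional layers of the same depth.
The referenced paper reports that the recursive CNN performs better than the recurrent CNN.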
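The difference between the two variants can be sketched on a 1-D signal. This is an illustration under stated assumptions, not the paper's formulation: the signals are one-dimensional, the nonlinearity is a ReLU, biases are omitted, and the names `w_in`, `w_hh`, `w_f`, and `w_r` are hypothetical.

```python
def conv1d(signal, kernel):
    """'Same' 1-D cross-correlation with zero padding; kernel length must be odd."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            pos = i + j - half
            if 0 <= pos < len(signal):
                acc += w * signal[pos]
        out.append(acc)
    return out


def relu(values):
    return [max(0.0, v) for v in values]


def recursive_conv(x, w_in, w_hh, steps):
    """Input enters only at t = 0; the same kernel w_hh is reused at every later step."""
    h = relu(conv1d(x, w_in))
    for _ in range(steps):
        h = relu(conv1d(h, w_hh))
    return h


def recurrent_conv(x, w_f, w_r, steps):
    """The original input x is fed in again at every step, next to the previous state."""
    h = relu(conv1d(x, w_f))
    for _ in range(steps):
        h = relu([p + q for p, q in zip(conv1d(x, w_f), conv1d(h, w_r))])
    return h


impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
ones = [1.0, 1.0, 1.0]
h_recursive = recursive_conv(impulse, ones, ones, steps=1)
h_recurrent = recurrent_conv(impulse, ones, ones, steps=1)
```

After one extra step a single input pixel influences five output positions instead of three, although no new kernel was introduced. That is the receptive-field gain at constant parameter count described above.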


posted on 2016-10-18 10:02 Survival003
