Visual Question Answering with Memory-Augmented Networks

Visual Question Answering with Memory-Augmented Networks
2018-05-15 20:15:03

Paper: CVPR-2018

1. Background and Motivation：

虽然 VQA 已经取得了很大的进步，但是这种方法依然对完全 general，freeform VQA 表现很差，作者认为是因为如下两点：

　　1. deep models trained with gradient based methods learn to respond to the majority of training data rather than specific scarce exemplars ;

　　用梯度下降的方法训练得到的深度模型，对主要的训练数据有较好的相应，但是对特定的稀疏样本却不是；

　　2. existing VQA systems learn about the properties of objects from question-answer pairs, sometimes independently of the image.

　　选择性的关注图像中的某些区域是很重要的策略。

我们从最近的 memory-augmented neural networks 以及 co-attention mechanism 得到启发，本文中，我们利用 memory-networks 来记忆 rare events，然后用 memory-augmented networks with attention to rare answers for VQA.

2. The Proposed Algorithm :

本文的算法流程如上图所示，首先利用 embedding 的方法，提取问题和图像的 feature，然后进行 co-attention 的学习，然后将两个加权后的feature进行组合，然后输入到 memory network 中，最终进行答案的选择。

Image Embedding：用 pre-trained model 进行特征的提取；

Question Embedding：用双向 LSTM 网络进行语言特征的学习；

Sequential Co-attention：

这里的协同 attention 机制，考虑到图像和文本共同的特征，相互影响，得到共同的注意力机制。我们根据视觉特征和语言特征的平均值，进行点乘，得到一个 base vector m0 ：

我们用一个两层的神经网络进行 soft attention 的计算。对于 visual attention，the soft attention 以及加权后的视觉特征向量分别为：

其中 Wv， Wm，Wh 都表示 hidden states。类似的，我们计算加权后的问题特征向量，如下：

我们将加权后的 v 和 q 组合，用来表示输入图像和问题对，图4，展示了 co-attention 机制的整个过程。

Memory Augmented Network：

The RNNs lack external memory to maintain a long-term memory for scarce training data. This paper use a memory-augmented NN for VQA.

特别的，我们采用了标准的 LSTM 模型作为 controller，起作用是 receives input data，然后跟外部记忆模块进行交互。外部记忆，Mt，是有一系列的 row vectors 作为 memory slots。

xt 代表的是视觉特征和文本特征的组合；yt 是对应的编码的问题答案（one-hot encoded answer vector）。然后将该 xt 输入到 LSTM controller，如：

对于从外部记忆单元中读取，我们将 the hidden state ht 作为 Mt 的 query。首先，我们计算搜索向量 ht 和记忆中每一行的余弦距离：

然后，我们通过 the cosine distance 用 softmax 计算一个 read weight vector wr：

有这些 read-weights, 一个新的检索的记忆 rt 可以通过下面的式子得到：

最后，我们将 the new memory vector rt 和 controller hidden state ht 组合，然后产生 the output vector ot for learning classifier.

我们采用 the usage weights wu 来控制写入到 memory。我们通过衰减之前的 state 来更新 the usage weights ：

为了计算 the write weights，我们引入一个截断机制来更新 the least-used positions。此处，我们采用 m(v, n) 来表示 the n-th smallest element of a vector v. 我们采用 a learnable sigmoid gate parameter 来计算之前的 read weights 和 usage weights 的 convex combination：

A larger n results in maintaining a longer term of memory of scarce training data. 跟 LSTM 内部的记忆单元相比，这里的两个参数都可以用来调整 the rate of writing to exernal memory. 这给我们更多的自由来调整模型的更新。公式（12）中输出的隐层状态 ht 可以根据 the write weights 写入到 memory 中：

Answer Reasoning：

有了 the hidden state ht 以及那个外部记忆单元中得到的 the reading memory rt，我们将这两个组合起来，作为当前问题和图片的表达，输入到分类网络中，然后得到答案的分布。

--- Done ！

posted @ 2018-05-15 20:58 AHU-WangXiao 阅读(632) 评论(0) 收藏举报

刷新页面返回顶部

The Blog of Xiao Wang

Associate Professor, School of Computer Science and Technology, Anhui University, Email: xiaowang@ahu.edu.cn

Visual Question Answering with Memory-Augmented Networks

公告