
Link of the Paper: https://ieeexplore.ieee.org/document/7298856/

A Related Paper: Learning a Recurrent Visual Representation for Image Caption Generation (Link of the Paper: https://arxiv.org/abs/1411.5654)

Main Points:

  1. A bi-directional mapping model between images and sentences using recurrent neural networks: unlike previous approaches, which map both sentences and images to a common embedding (and then compute similarity to match or rank, I guess) that may be used for image search or for ranking image captions, but not for generation in both directions.
  2. A bi-directional representation: generates both novel descriptions from images and visual representations from descriptions.
  3. A novel recurrent visual memory: it automatically learns to remember long-term visual concepts.
  4. A set of latent variables $U_{t-1}$ that encodes the visual interpretation of the previously generated or read words $W_{t-1}$. Using $U$, our goal is to compute $P(w_t \mid V, W_{t-1}, U_{t-1})$ and $P(V \mid W_{t-1}, U_{t-1})$. Combining these two likelihoods, our global objective is to maximize
     $$P(w_t, V \mid W_{t-1}, U_{t-1}) = P(w_t \mid V, W_{t-1}, U_{t-1})\,P(V \mid W_{t-1}, U_{t-1}).$$
     That is, we want to maximize the likelihood of the word $w_t$ and the observed visual features $V$ given the previous words and their visual interpretation. Note that in previous papers, the objective was only to compute $P(w_t \mid V, W_{t-1})$, not $P(V \mid W_{t-1})$. (A minimal sketch of this joint objective follows the list.)
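
To make point 4 concrete, below is a minimal PyTorch-style sketch of the joint objective: a word RNN over $W_{t-1}$, a recurrent visual memory $U$, a head that predicts the next word from $(V, W_{t-1}, U_{t-1})$, and a head that reconstructs $V$ from the words read so far. All class and variable names, the use of GRU cells, and the squared-error (Gaussian) treatment of $P(V \mid W_{t-1}, U_{t-1})$ are my own assumptions for illustration, not the authors' exact architecture.

```python
# Sketch only: maximize log P(w_t | V, W_{t-1}, U_{t-1}) + log P(V | W_{t-1}, U_{t-1}).
import torch
import torch.nn as nn

class BiDirectionalCaptionRNN(nn.Module):
    def __init__(self, vocab_size, visual_dim, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # s_t: hidden state summarizing the previously read/generated words W_{t-1}
        self.word_rnn = nn.GRUCell(embed_dim, hidden_dim)
        # u_t: recurrent visual memory (the latent U_{t-1}), updated from the word state
        self.visual_rnn = nn.GRUCell(hidden_dim, hidden_dim)
        # Head for P(w_t | V, W_{t-1}, U_{t-1}): conditioned on word state, memory, and V
        self.word_head = nn.Linear(hidden_dim * 2 + visual_dim, vocab_size)
        # Head for P(V | W_{t-1}, U_{t-1}): here assumed to reconstruct V from the memory
        self.visual_head = nn.Linear(hidden_dim, visual_dim)

    def forward(self, words, visual_feats):
        """words: (B, T) token ids; visual_feats: (B, visual_dim) CNN features V."""
        B, T = words.shape
        s = words.new_zeros(B, self.word_rnn.hidden_size, dtype=torch.float)
        u = words.new_zeros(B, self.visual_rnn.hidden_size, dtype=torch.float)
        word_logits, visual_preds = [], []
        for t in range(T):
            s = self.word_rnn(self.embed(words[:, t]), s)  # fold in the current word
            u = self.visual_rnn(s, u)                      # update the visual memory
            word_logits.append(self.word_head(torch.cat([s, u, visual_feats], dim=-1)))
            visual_preds.append(self.visual_head(u))       # predict V from words so far
        return torch.stack(word_logits, 1), torch.stack(visual_preds, 1)

def joint_loss(word_logits, visual_preds, next_words, visual_feats, lam=1.0):
    # Cross-entropy on the next word plus (under a Gaussian assumption) a
    # squared-error term for reconstructing the observed visual features V.
    ce = nn.functional.cross_entropy(word_logits.flatten(0, 1), next_words.flatten())
    recon = nn.functional.mse_loss(
        visual_preds, visual_feats.unsqueeze(1).expand_as(visual_preds))
    return ce + lam * recon
```

Because the model both predicts words from visual features and reconstructs visual features from words, the same network can be run in either direction, which is what makes the mapping bi-directional rather than a one-way embedding.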

Other Key Points:

  1. Since previous approaches project both semantics and visual features to a common embedding, they are not able to perform the inverse projection; that is, they cannot generate novel sentences or visual depictions from the embedding.
posted on 2018-08-16 15:24 by LZ_Jaja