[Paper Notes] Learning a Text-Video Embedding from Incomplete and Heterogeneous Data






One difficulty with this approach, however, is the lack of large-scale annotated video-caption datasets for training. To address this issue, we aim at learning text-video embeddings from heterogeneous data sources.


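The paper proposes the Mixture-of-Embedding-Experts (MEE) model. It can handle "videos" for which part of the input information is missing and still match them against text in the usual way, which enlarges the usable training set.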




The text is first aggregated into a feature vector with NetVLAD (This is motivated by the recent results [34] demonstrating superior performance of NetVLAD aggregation over other common aggregation architectures such as long short-term memory (LSTM) [48] or gated recurrent units (GRU) [49].), and is then passed through the mapping described below.
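As a reference for the layer descriptions in these notes, the mapping can be written roughly as follows. Here $Z_0$ denotes the aggregated input feature. The name $Z_0$ and the plain linear form of layer (1) are assumptions, since the notes only spell out layers (2) and (3); $\circ$ is element-wise multiplication and $\sigma$ is the element-wise sigmoid.

```latex
\begin{align}
Z_1 &= W_1 Z_0 + b_1, \tag{1} \\
Z_2 &= Z_1 \circ \sigma(W_2 Z_1 + b_2), \tag{2} \\
Z   &= \frac{Z_2}{\lVert Z_2 \rVert_2}. \tag{3}
\end{align}
```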



The second layer, given by (2), performs context gating [34], where individual dimensions of Z1 are reweighted using learnt gating weights σ(W2Z1 + b2) with values between 0 and 1, where W2 and b2 are learnt parameters.

The motivation for such gating is two-fold: (i) we wish to introduce nonlinear interactions among dimensions of Z1 and (ii) we wish to recalibrate the strengths of different activations of Z1 through a self-gating mechanism. Finally, the last layer, given by (3), performs L2 normalization to obtain the final output Z.
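The three layers can be sketched in a few lines of NumPy. This is only a minimal illustration, not the authors' implementation: the function name `gated_embedding`, the input name `z0`, the dimensions, and the assumption that layer (1) is a plain linear projection are all hypothetical choices made for this example.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function, values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def gated_embedding(z0, w1, b1, w2, b2, eps=1e-12):
    """Sketch of the three-layer mapping (names are hypothetical)."""
    # (1) Linear projection of the input feature (assumed form).
    z1 = w1 @ z0 + b1
    # (2) Context gating: each dimension of z1 is reweighted by a
    #     learnt gate in (0, 1) that depends on all dimensions of z1.
    z2 = z1 * sigmoid(w2 @ z1 + b2)
    # (3) L2 normalization; eps guards against division by zero.
    return z2 / (np.linalg.norm(z2) + eps)

# Toy usage with random parameters standing in for learnt ones.
rng = np.random.default_rng(0)
d_in, d_out = 8, 4
z0 = rng.normal(size=d_in)
w1, b1 = rng.normal(size=(d_out, d_in)), np.zeros(d_out)
w2, b2 = rng.normal(size=(d_out, d_out)), np.zeros(d_out)
z = gated_embedding(z0, w1, b1, w2, b2)
```

Because the output has unit length, a dot product between two such embeddings is a cosine similarity, which keeps the per-stream similarity scores on a comparable scale.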


[34] Miech, A., Laptev, I., Sivic, J.: Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905 (2017)


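The authors also describe how the similarity is computed, which is straightforward. One point worth noting is that the weights of the different streams of input descriptors are computed entirely from the sentence: the authors argue that the sentence can serve as a prior that decides which aspects of the described video matter more. This is also a kind of self-attention mechanism.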
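The idea can be sketched as a weighted sum of per-stream similarities whose weights depend only on the sentence. The sketch below makes several assumptions that are not stated in these notes: the weights are a softmax over dot products between the sentence feature and one learnt vector per stream, each per-stream similarity is a dot product of unit-norm embeddings, and a missing stream is handled by taking the softmax over the available streams only. The function name `mee_similarity` and the stream names are hypothetical.

```python
import numpy as np

def mee_similarity(text_vec, text_embs, video_embs, gate_vecs):
    """Sentence-weighted similarity over the available video streams.

    text_vec   : (d,) aggregated sentence feature
    text_embs  : dict stream -> (k,) text embedding for that stream
    video_embs : dict stream -> (k,) video embedding, or None if missing
    gate_vecs  : dict stream -> (d,) learnt weighting vector (assumed)
    """
    # Keep only the streams that this video actually provides.
    available = [s for s, v in video_embs.items() if v is not None]
    if not available:
        raise ValueError("at least one video stream is required")
    # Weights come from the sentence alone; softmax over available streams.
    logits = np.array([text_vec @ gate_vecs[s] for s in available])
    exp = np.exp(logits - logits.max())  # shift for numerical stability
    weights = exp / exp.sum()
    # Per-stream similarity between text and video embeddings.
    sims = np.array([text_embs[s] @ video_embs[s] for s in available])
    return float(weights @ sims), dict(zip(available, weights))

# Toy usage: the audio stream is missing for this video.
text_vec = np.array([1.0, 0.0])
unit = np.array([1.0, 0.0])
text_embs = {"appearance": unit, "motion": unit, "audio": unit}
video_embs = {"appearance": np.array([1.0, 0.0]),
              "motion": np.array([0.0, 1.0]),
              "audio": None}
gate_vecs = {"appearance": np.array([np.log(3.0), 0.0]),
             "motion": np.zeros(2),
             "audio": np.zeros(2)}
score, weights = mee_similarity(text_vec, text_embs, video_embs, gate_vecs)
```

In this toy case the sentence puts three times as much weight on appearance as on motion, and the missing audio stream simply drops out of the weighting, so an incomplete video can still be scored against the text.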







posted @ 2020-09-20 23:16  叠加态的猫