[Paper Notes] Exploiting Visual Semantic Reasoning for Video-Text Retrieval




Previous work has represented videos by encoding frame-level features directly.


After sampling frames, our model detects frame regions with bottom-up attention [Anderson et al., 2018] and extracts region features. In this way, each frame is represented by several regions. Specifically, the bottom-up attention module is implemented with Faster RCNN [Ren et al., 2015] pre-trained on Visual Genome [Krishna et al., 2017], a dataset annotated with image regions and the relations between them.


A random walk is a rule by which a vertex visits its neighbors. The transition probability from one vertex to another is determined by the weights of the edges.
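To make the random-walk rule concrete, here is a minimal sketch that turns edge weights into transition probabilities by normalizing each row of the weight matrix. The function name `transition_matrix` and the example weights are hypothetical. Row normalization is the standard random-walk construction, assumed here rather than taken from the paper.

```python
def transition_matrix(weights):
    """Row-normalize a non-negative edge-weight matrix.

    P[i][j] = w[i][j] / sum_k w[i][k] is the probability that a walker
    standing on vertex i moves to vertex j in the next step.
    """
    probs = []
    for row in weights:
        total = sum(row)
        if total == 0:
            # Vertex with no outgoing edges: leave the row as all zeros.
            probs.append([0.0] * len(row))
        else:
            probs.append([w / total for w in row])
    return probs


# Hypothetical 3-vertex graph; the weights are made up for illustration.
W = [
    [0.0, 2.0, 2.0],
    [1.0, 0.0, 3.0],
    [4.0, 0.0, 0.0],
]
P = transition_matrix(W)
for row in P:
    print(row)
```

Each non-empty row of `P` sums to 1, so a larger edge weight means a higher chance of stepping to that neighbor.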


We utilize the bottom-up attention model to generate frame regions and extract features from frames (Sec. 3.1).

For the regions, we construct a graph to model semantic correlations (Sec. 3.2).

Subsequently, we perform semantic reasoning over these regions using Graph Convolutional Networks (GCN) based on the random walk rule, generating region features that carry relation information (Sec. 3.3).

Finally, video and text features are generated, and the whole model is trained with common space learning (Sec. 3.4).
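The reasoning step (Sec. 3.3) can be illustrated with a single graph-convolution layer that uses random-walk normalization. This is only a sketch under assumptions: the propagation form `ReLU(D^-1 A V W)`, the ReLU activation, and all names (`random_walk_gcn_layer`, `adj`, `feats`, `weight`) are illustrative choices, and the paper's layer may differ in detail, for example by adding residual connections.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def random_walk_gcn_layer(adj, feats, weight):
    """One assumed GCN layer: ReLU(D^-1 A V W).

    adj    : n x n non-negative edge weights between regions
    feats  : n x d region features
    weight : d x d_out learnable projection (fixed numbers in this sketch)
    """
    # Random-walk normalization: each row of adj becomes a distribution.
    norm = []
    for row in adj:
        total = sum(row)
        norm.append([w / total if total else 0.0 for w in row])
    aggregated = matmul(norm, feats)        # mix features of neighbors
    projected = matmul(aggregated, weight)  # linear projection
    return [[max(0.0, x) for x in row] for row in projected]


# Hypothetical two-region example with made-up numbers.
adj = [[1.0, 1.0], [0.0, 2.0]]
feats = [[2.0, 0.0], [0.0, 4.0]]
weight = [[1.0, -1.0], [0.0, 0.25]]
out = random_walk_gcn_layer(adj, feats, weight)
print(out)
```

After the layer, each region's feature is a weighted average of its neighbors' features, which is how relation information enters the region representation.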

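The relatedness (semantic relations) between the objects within a frame is computed as follows, where n objects are extracted from each frame and each object is represented by a d-dimensional feature (pool5):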
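The formula itself is not reproduced in these notes, so the sketch below only shows one common way to compute such pairwise relations: embed each region feature with two linear maps and take dot products, giving an n x n affinity matrix. The form `R[i][j] = (W_phi v_i) . (W_psi v_j)` is an assumption, as are the names `affinity_matrix`, `w_phi`, and `w_psi`. Treat it as an illustration rather than the paper's exact definition.

```python
def linear(vectors, weight):
    """Apply a d x k linear map to each d-dimensional vector."""
    return [[sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]
            for vec in vectors]


def affinity_matrix(regions, w_phi, w_psi):
    """Assumed pairwise affinity: R[i][j] = phi(v_i) . psi(v_j).

    regions : n x d region features (pool5 features in the notes)
    w_phi   : d x k embedding weights (hypothetical)
    w_psi   : d x k embedding weights (hypothetical)
    """
    phi = linear(regions, w_phi)
    psi = linear(regions, w_psi)
    return [[sum(a * b for a, b in zip(p, q)) for q in psi] for p in phi]


# Toy example with n = 3 regions and d = 2; real pool5 features are far
# higher-dimensional. All numbers are made up for illustration.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_phi = [[1.0, 0.0], [0.0, 1.0]]
w_psi = [[2.0, 0.0], [0.0, 1.0]]
R = affinity_matrix(regions, w_phi, w_psi)
for row in R:
    print(row)
```

The resulting matrix has one row and one column per region, so it can serve directly as the edge weights of the region graph.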







posted @ 2020-10-18 15:30 by 叠加态的猫