Notes on Temporal RoI Align for Video Object Recognition
TL;DR
- Goal: exploit temporal information about the same object instance across a video.
- RPN -> proposals
- each proposal -> deformable attention along the time axis -> aggregate temporal features into the current frame
- classify and regress boxes
Introduction
Related Work
- image-level information
- D&T, DFF, FGFA, MANet, STSN
- the performance of these methods degrades quickly as the time interval grows; they can only utilize nearby frames within about 1 s (30 frames)
- proposal-level information
- MANet, SELSA, Leveraging Long-Range Temporal Relationships Between Proposals for Video Object Detection
ROI Align
- Divide the RoI into a uniform grid of bins
- Place 4 sampling points in each bin
- Compute each sampling point by bilinear interpolation, then pool
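A minimal sketch of the operation using `torchvision.ops.roi_align`; passing `sampling_ratio=2` gives the 2×2 = 4 sampling points per bin described above:

```python
import torch
from torchvision.ops import roi_align

# Toy feature map: batch 1, 256 channels, 64x64 spatial grid.
feat = torch.randn(1, 256, 64, 64)

# One RoI as (batch_index, x1, y1, x2, y2), in feature-map coordinates.
rois = torch.tensor([[0.0, 10.0, 12.0, 42.0, 40.0]])

# 7x7 uniform bins; sampling_ratio=2 -> 2x2 = 4 bilinear sampling points per bin.
pooled = roi_align(feat, rois, output_size=(7, 7),
                   spatial_scale=1.0, sampling_ratio=2, aligned=True)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])
```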
Temporal ROI Align
Extract features corresponding to the target-frame RoI from the support frames based on a feature-affinity map, rather than by reusing the RoI's box coordinates in the support frames.
Notations
- \(T\), the number of support frames
- \(F_{t} \in \mathbb{R}^{H\times W \times C}\), the feature map of the full image at frame \(t\)
- \(X_{t} \in \mathbb{R}^{h\times w \times C}\), the RoI-aligned feature
    - Note: RoI Align is a prerequisite for detection; it adaptively rescales each proposal's feature to a fixed size that the detection head can consume
Most Similar RoI Align (Top-K + concatenation)
Pixel-level deformable alignment based on feature SIMILARITY, rather than the fixed BBOX REGION used in the original RoI Align (a minimal sketch follows the Input/Output list below).
- Input
- current ROI \(X_{t}\)
- feature maps of support frames \(\{F_{t+i}\}_{i = -\frac{T}{2}}^{\frac{T}{2}}\)
- Output
- \(\{X_{t+i}\}_{i = -\frac{T}{2}}^{\frac{T}{2}}\), an RoI-aligned feature from every support frame
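A minimal sketch of the per-support-frame step, assuming flattened features and cosine similarity. The function name `most_similar_roi_align` and the similarity-weighted average are my guesses at the details; the heading's "Top-K + concatenation" suggests concatenating the K features is an equally plausible aggregation:

```python
import torch
import torch.nn.functional as F

def most_similar_roi_align(x_t: torch.Tensor, f_support: torch.Tensor,
                           k: int = 4) -> torch.Tensor:
    """Hypothetical sketch of Most Similar RoI Align for one support frame.

    x_t:       (h*w, C) RoI-aligned target-frame feature, flattened.
    f_support: (H*W, C) full feature map of the support frame, flattened.
    Returns    (h*w, C) feature sampled from the support frame by similarity.
    """
    # Cosine similarity between every RoI location and every support location.
    sim = F.normalize(x_t, dim=-1) @ F.normalize(f_support, dim=-1).t()  # (h*w, H*W)

    # Pick the top-K most similar support features per RoI location ...
    topk_sim, topk_idx = sim.topk(k, dim=-1)   # both (h*w, K)
    topk_feat = f_support[topk_idx]            # (h*w, K, C)

    # ... and aggregate them, weighted by softmaxed similarities
    # (concatenating topk_feat along channels instead is another
    # plausible reading of "Top-K + concatenation").
    weights = topk_sim.softmax(dim=-1).unsqueeze(-1)   # (h*w, K, 1)
    return (weights * topk_feat).sum(dim=1)            # (h*w, C)
```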
Temporal Feature Aggregation
How to use the \(T\) aligned feature blocks to help detection in the current frame:
- query: \(X_{t}\)
- key: \(\{X_{t+i}\}_{i = -\frac{T}{2}}^{\frac{T}{2}}\)
- value: \(\{X_{t+i}\}_{i = -\frac{T}{2}}^{\frac{T}{2}}\)
- multi-head
- split the feature map along channels into \(N\) chunks, each \(\mathbf{F} \in \mathbb{R}^{h\times w\times \frac{C}{N}}\)
- apply \(N\) attention heads
The result is a temporally enhanced RoI feature \(\bar{X}_{t}\).
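A minimal sketch of the aggregation, assuming per-location dot-product attention with the channel split into \(N\) heads as above; the paper may additionally use learned query/key projections:

```python
import torch

def temporal_aggregate(x_t: torch.Tensor, x_support: torch.Tensor,
                       num_heads: int = 8) -> torch.Tensor:
    """Hypothetical sketch of the temporal attention aggregation.

    x_t:       (h*w, C)    target-frame RoI feature (query).
    x_support: (T, h*w, C) aligned RoI features (keys/values; may include x_t).
    Returns    (h*w, C)    temporally enhanced RoI feature.
    """
    hw, c = x_t.shape
    t = x_support.shape[0]
    d = c // num_heads

    # Split channels into num_heads chunks of size C/N each.
    q = x_t.view(hw, num_heads, d)             # (h*w, N, C/N)
    kv = x_support.view(t, hw, num_heads, d)   # (T, h*w, N, C/N)

    # Per-location, per-head attention weights over the T support frames.
    attn = torch.einsum('pnd,tpnd->tpn', q, kv) / d ** 0.5
    attn = attn.softmax(dim=0)                 # normalize over time

    # Similarity-weighted sum of the support features.
    out = torch.einsum('tpn,tpnd->pnd', attn, kv)
    return out.reshape(hw, c)
```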
Pipeline
- RPN -> proposals
- RoI Align on the target frame -> \(X_{t}\)
- Most Similar (deformable) RoI Align on the support frames -> \(\{X_{t+i}\}\)
- Temporal Attention
- Contextualized RoI feature -> detection head
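Putting the hypothetical sketches above together for one proposal (shapes only, random stand-in tensors):

```python
import torch

C, H, W, h, w, T = 256, 64, 64, 7, 7, 4

# Stand-ins: flattened support-frame feature maps and one RoI-aligned feature.
f_support = [torch.randn(H * W, C) for _ in range(T)]
x_t = torch.randn(h * w, C)   # RoI Align output for one proposal on frame t

# Most Similar RoI Align against every support frame.
x_aligned = torch.stack([most_similar_roi_align(x_t, f) for f in f_support])

# Temporal attention -> contextualized RoI feature for the detection head.
x_bar = temporal_aggregate(x_t, x_aligned)    # (h*w, C)
```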
Experiments
Difference from Non-local Network
The non-local operation (see the formula below) is essentially the same idea as Temporal RoI Align: both introduce a dynamic, non-local receptive field as large as the whole image.
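For reference, the generic non-local operation (Wang et al., 2018) computes each output position as a similarity-weighted sum over all positions:

\[
\mathbf{y}_i = \frac{1}{\mathcal{C}(\mathbf{x})} \sum_{\forall j} f(\mathbf{x}_i, \mathbf{x}_j)\, g(\mathbf{x}_j)
\]

where \(f\) measures the pairwise affinity, \(g\) is a learned embedding, and \(\mathcal{C}(\mathbf{x})\) normalizes the response.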
However, I think the problem lies in the target frame:
- the RPN cannot propose regions when the target frame suffers severe distortion
- we should not assume that distortion can be handled based only on single-pixel affinity