Loading

Temporal RoI Align for Video Object Recognition 解读

可以采用翻译软件翻译

Temporal RoI Align for Video Object Recognition

TL;DR

  • Goal: exploit temporal information for the same object instance in a video.

  • RPN -> proposals
  • proposal -> deformable attention along time axis -> aggregate temporal features to current frame
  • regress

Introduction

  • image-level information
    • D&T, DFF, FGFA, MANet, STSN
    • the performance of these methods degrades quickly with longer time interval

can only utilize nearby frames within 1 sec(30 frames)

  • proposal-level information?
    • MANet, SELSA, Leveraging Long-Range Temporal Relationships Between Proposals for Video Object Detection

ROI Align

  • Create uniform grids
  • Create 4 sampling points in each grid
  • Using Bilinear Interpolation

Temporal ROI Align

Extract features corresponding to target frame based on affine map, not positions in ROI regions in support frames

Notations

  • \(T\), number of supporting frames
  • \(F_{t} \in \mathbb{R}^{H\times W \times C}\), feature map(full image)
  • \(X_{t} \in \mathbb{R}^{h\times w \times C}\)
    • ROI-aligned feature
    • Note: ROI-align is the prerequisite to perform detection, which adaptively rescale the feature to suit CNN

Most Similar ROI Align(Top K + concatenation)

pixel-level

deformable align, based on SIMILARITY rather than BBOX REGION in original ROI-align

  • Input
    • current ROI \(X_{t}\)
    • feature maps of support frames \(\{F_{t+i}\}_{i = -\frac{T}{2}}^{\frac{T}{2}}\)
  • Output
    • \(\{X_{t+i}\}_{i = -\frac{T}{2}}^{\frac{T}{2}}\) ROI in every support frame

Temporal Feature Aggregation

How to use the T aligned feature blocks to help detection in this frame

  • query: \(X_{t}\)
  • key: \(\{X_{t+i}\}_{i = -\frac{T}{2}}^{\frac{T}{2}}\)
  • value: \(\{X_{t+i}\}_{i = -\frac{T}{2}}^{\frac{T}{2}}\)
  • multi-head
    • split feature map to \(N \times \mathbf{F} \in \mathbb{R}^{h\times w\times \frac{C}{N}}\)
    • apply \(N\) heads.

get an enhanced \(\bar{X}_{t}\)

Pipeline

  • RPN
  • ROI
  • Deformable ROI Align
  • Temporal Attention
  • Contextualized ROI feature

Experiments

Difference from Non-local Network

Non-local Operation works

It's essentially the same: introducing dynamic, non-local reception as big as whole image.

However, I think the problem lies in the target frame*

  • RPN cannot propose regions when encountering severe distortion
  • We should not assume that distortion can be verified only based on single-pixel affinity
posted @ 2022-07-21 09:58  ZXYFrank  阅读(198)  评论(0编辑  收藏  举报