
JDE: Towards Real-Time Multi-Object Tracking (Notes in English)

SDE (Separate Detection and Embedding) methods pose critical challenges for building a real-time MOT system.

Background

Faster R-CNN = Fast R-CNN + RPN

Separate Detection and Embedding (SDE)

Detector -> cropped image -> ReID model -> ReID feature
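A minimal sketch of this pipeline (the `detector` and `reid_model` callables are hypothetical placeholders); note that every detected box triggers its own ReID forward pass, which is exactly why SDE struggles in real time:

```python
import numpy as np

def sde_pipeline(frame, detector, reid_model):
    """SDE: run detection, then one ReID forward pass per cropped target.

    detector(frame)  -> list of (x1, y1, x2, y2) integer boxes
    reid_model(crop) -> 1-D appearance embedding
    """
    boxes = detector(frame)
    features = []
    for (x1, y1, x2, y2) in boxes:
        crop = frame[y1:y2, x1:x2]         # crop the detected target
        features.append(reid_model(crop))  # cost grows linearly with #targets
    return boxes, features
```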

Two-stage

RPN -> detection -(shared feature map)-> ReID embedding

Joint Detection and Embedding

Algorithm

\[\mathbf{F}_{i} = \text{Head}(\text{FPN}_{i}) \]
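A PyTorch sketch of what \(\text{Head}(\text{FPN}_{i})\) could look like; the channel sizes and anchor count here are assumptions, not the paper's exact configuration. Each FPN level gets its own head that jointly predicts boxes, foreground scores, and a dense embedding map:

```python
import torch.nn as nn

class JDEHead(nn.Module):
    """One prediction head per FPN level: detection branches + embedding branch."""
    def __init__(self, in_ch=256, num_anchors=4, embed_dim=512):
        super().__init__()
        self.box = nn.Conv2d(in_ch, num_anchors * 4, 3, padding=1)  # box regression
        self.cls = nn.Conv2d(in_ch, num_anchors * 2, 3, padding=1)  # fg/bg scores
        self.emb = nn.Conv2d(in_ch, embed_dim, 3, padding=1)        # embedding map

    def forward(self, fpn_feat):
        return self.box(fpn_feat), self.cls(fpn_feat), self.emb(fpn_feat)

# F_i = Head(FPN_i): one head per scale
heads = nn.ModuleList([JDEHead() for _ in range(3)])
```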

Design/Training

Detection
  • Anchor-based
  • modified from the original RPN/Faster R-CNN
  • adapted for the MOT task
  • all anchors are set to an aspect ratio of 1 : 3, suited to pedestrian targets (see the sketch below).
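A small sketch of how 1 : 3 anchors could be generated; the base sizes here are illustrative assumptions, not the paper's values:

```python
import numpy as np

def make_anchors(base_sizes=(32, 64, 128), ratio=1 / 3):
    """Return an (N, 2) array of (w, h) anchor shapes with w : h = 1 : 3."""
    shapes = []
    for s in base_sizes:
        area = s * s
        w = np.sqrt(area * ratio)      # from w / h = ratio and w * h = area
        shapes.append((w, w / ratio))  # h = w / ratio, i.e. h = 3w
    return np.array(shapes)
```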
ReID

Contrastive Learning

The margin term is neglected for convenience.

Within each mini-batch, all the negative samples \(f^{-}_{i}\) and the hardest positive sample \(f^{+}\) are mined; \(f^{T}\) is the selected anchor instance in the batch.

Rather than the raw triplet loss, a smooth upper bound of it is optimized.
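Written out (reconstructing the formulas from the surrounding definitions; \(f^{T}f\) denotes an inner product), the mined triplet loss and its smooth upper bound are:

\[\mathcal{L}_{\text{triplet}} = \sum_{i}\max\left(0,\; f^{T}f^{-}_{i} - f^{T}f^{+}\right)\]

\[\mathcal{L}_{\text{upper}} = \log\left(1 + \sum_{i}\exp\left(f^{T}f^{-}_{i} - f^{T}f^{+}\right)\right) = -\log\frac{\exp(f^{T}f^{+})}{\exp(f^{T}f^{+}) + \sum_{i}\exp(f^{T}f^{-}_{i})}\]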

This has the same form as the cross-entropy loss \(\mathcal{L} = -\sum_{c =1}^{\text{Cls.}}\mathbb{I}(y_{i} = c)\log p(f(x) = c)\) with \(p = \text{Softmax}(g^{+},\{g^{-}_{i}\})\), except that the cross-entropy compares the anchor against class-wise weights \(g\) rather than instance embeddings.
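A minimal PyTorch sketch of this cross-entropy formulation, assuming identity labels and a learned linear classifier whose weight rows play the roles of \(g^{+}\) and \(g^{-}_{i}\):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, num_ids = 512, 1000  # assumed sizes
classifier = nn.Linear(embed_dim, num_ids, bias=False)  # rows are class-wise weights g

def embedding_ce_loss(feats, identity):
    """feats: (B, embed_dim) anchor embeddings; identity: (B,) ground-truth IDs.

    logits[b, c] = <feats_b, g_c>: each anchor is compared against class-wise
    weights (g+, {g-}) instead of other instances' embeddings.
    """
    logits = classifier(feats)
    return F.cross_entropy(logits, identity)

loss = embedding_ce_loss(torch.randn(8, embed_dim),
                         torch.randint(0, num_ids, (8,)))
```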

Multi-task training

\(M\) is the number of prediction heads (one per FPN scale).

+ Question: So each feature map at a different FPN scale is trained with its own head.
+ But at inference time, which feature map should we use?
+ Or rather, should we design a strategy to
+ further fuse the predictions AT DIFFERENT SCALES?

ref: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics (Kendall et al., CVPR 2018).

we employ task-dependent uncertainty [16] to dynamically weight the heterogeneous losses.
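A sketch of that weighting scheme, following the Kendall et al. formulation: each loss term gets a learnable parameter \(s\) (a log-variance), and the total loss is \(\sum \frac{1}{2}\left(e^{-s}\mathcal{L} + s\right)\) over all \(M\) heads and all tasks:

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Task-dependent uncertainty weighting with one learnable s per (head, task)."""
    def __init__(self, num_heads=3, num_tasks=3):  # tasks: box, cls, embedding (assumed)
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_heads, num_tasks))

    def forward(self, losses):
        # losses: (num_heads, num_tasks) tensor of individual loss terms
        return 0.5 * (torch.exp(-self.s) * losses + self.s).sum()
```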

Learning the appearance embedding is essentially a metric learning problem.

Inference

Get Embedding

+ Question: It seems that the exact FPN-plus-heads structure is not clearly described in the paper.

Association

\[\mathbf{T}_{i} = \{e_{i},m_{i}\} \]

  • \(e_{i}\) is the appearance state, updated as a moving average (see the sketch after this list)
    • \[e_{i}^{t} = \alpha e^{t-1}_{i} + (1-\alpha)f_{i}^{t} \]

    • \(f_{i}^{t}\) is the appearance embedding of the matched detection at frame \(t\)
  • \(m_{i}\) is the motion state, maintained by a Kalman filter
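A minimal sketch of the tracklet state and its update; the re-normalization step and the Kalman interface are assumptions for illustration:

```python
import numpy as np

class Tracklet:
    """Tracklet state T_i = {e_i, m_i}: appearance EMA + motion (Kalman)."""
    def __init__(self, f0, kalman):
        self.e = f0 / np.linalg.norm(f0)  # appearance state from the first embedding
        self.kalman = kalman              # motion state m_i lives in the filter

    def update(self, f_t, box_t, alpha=0.9):
        # e_i^t = alpha * e_i^{t-1} + (1 - alpha) * f_i^t
        self.e = alpha * self.e + (1 - alpha) * f_t
        self.e /= np.linalg.norm(self.e)  # keep the state unit-length (assumption)
        self.kalman.update(box_t)         # hypothetical Kalman filter interface
```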

The Hungarian algorithm is used for linking, based on the fused cost:

\[\text{Cost} = \lambda A_{e} + (1 - \lambda) A_{m} \]
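A sketch of the linking step with SciPy's Hungarian solver; the cost matrices, \(\lambda\), and the gating threshold here are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(A_e, A_m, lam=0.98, max_cost=0.7):
    """A_e, A_m: (num_tracklets, num_detections) appearance/motion cost matrices."""
    cost = lam * A_e + (1.0 - lam) * A_m
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    # gate out matches whose fused cost is too high
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
```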

Experiments

One may notice that JDE has a lower IDF1 score and more ID switches than existing methods. At first, we suspected the reason to be that the jointly learned embedding might be weaker than a separately learned one.

However, when we replaced the jointly learned embedding with the separately learned embedding, the IDF1 score and the number of ID switches remained almost the same.
