SOTMOT - Improving Multiple Object Tracking with Single Object Tracking: Detailed Walkthrough in English
Takeaways
In fact, it is no doubt that a multiple object tracker can be realized with multiple single ones
Background
The spirit of our approach, that learning auxiliary associative embeddings simultaneously with the main task, also shows good performance in many other vision tasks
SOT and MOT
- SOT
- discriminates the target from local backgrounds
- MOT
- discriminates targets from one another, because most backgrounds can be filtered out by the detector.
If we integrate SOT into MOT directly:
- inappropriate/overdone discrimination
- multiple targets will make SOT really slow
Related Works
MOT
one-shot and tracking-by-detection
- JDE: YOLO (anchor-based)
- FairMOT: ResNet-34 + DLA
- ReID branch
Model
Architecture
😂 inconsistent symbol
Backbone
DLA-34
SOT Branch
figure of SOT branch
+ Question: Can SOT really be performed with only a single branch?
The SOT branch trains a separate SOT model per target in one frame and locates the targets in another frame
- take in \(\mathbf{F}_{backbone}\)
- \(\mathbf{F}_{SOT} = \text{3-Convs}(\mathbf{F}_{backbone}) \in \R^{C_{SOT} \times H \times W}\)
- 3x3, stride = 1, BN & ReLU
- given a center \(\mathbf{c} = (x, y)\)
- \(\mathbf{F}_{object} = \mathbf{F}_{SOT}(x,y) \in \R^{C_{SOT}}\)
- index-based entry extraction
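The index-based extraction above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's code: the function name `extract_object_features`, the integer-pixel centers, and the toy feature map are assumptions made for the example.

```python
import numpy as np

def extract_object_features(f_sot, centers):
    """Pick one C_SOT-dimensional vector per center from the SOT feature map.

    f_sot:   array of shape (C_SOT, H, W)
    centers: iterable of integer (x, y) positions on the feature map (assumed)
    returns: array of shape (N, C_SOT), one row per center
    """
    centers = np.asarray(centers, dtype=np.int64)
    xs, ys = centers[:, 0], centers[:, 1]
    # The map is laid out as (channel, row, column), so y indexes rows and x columns.
    return f_sot[:, ys, xs].T

# Toy feature map with C_SOT = 2, H = 3, W = 4.
f_sot = np.arange(2 * 3 * 4, dtype=np.float64).reshape(2, 3, 4)
f_object = extract_object_features(f_sot, [(1, 2), (3, 0)])
```

No pooling or cropping is involved here: each target is represented by the single feature vector at its center, which is what keeps the per-target cost small.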
Train
- given the target centers in a training image \(\mathbf{C}_{targets} = \{\mathbf{c}_{i}\}_{i=1}^{N}\)
- calculate a neighborhood matrix \(\mathbf{A}\)
- select the neighbors to construct the data for classification
- \(\mathbf{X}_{i} = \{\mathbf{x}_{j} \mid \forall j: \mathbf{A}_{i,j} = 1\}\)
- ridge regression to obtain \(\mathbf{w}_{i}^{*}\)
- the dimension of \(\mathbf{X}_{i}^{\top}\mathbf{X}_{i}\) is fixed, so during training the products for different targets can be computed in a batch.
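To make the batch remark concrete, here is a small NumPy sketch of solving one ridge regression per target at once. It uses the standard closed form \(\mathbf{w}_{i}^{*} = (\mathbf{X}_{i}^{\top}\mathbf{X}_{i} + \lambda\mathbf{I})^{-1}\mathbf{X}_{i}^{\top}\mathbf{y}_{i}\), where every \(\mathbf{X}_{i}^{\top}\mathbf{X}_{i}\) is \(C_{SOT} \times C_{SOT}\) regardless of how many neighbors target \(i\) has. The labels (1 for the target itself, 0 for its other neighbors), the value of \(\lambda\), and the name `solve_discriminators` are assumptions for illustration; the paper's exact labels may differ.

```python
import numpy as np

def solve_discriminators(feats, adj, lam=0.1):
    """Solve one ridge regression per target in a single batched call.

    feats: (N, C) SOT features sampled at the N target centers
    adj:   (N, N) 0/1 neighborhood matrix, adj[i, j] = 1 if j is a neighbor of i
    lam:   ridge regularization weight (assumed value)
    returns: (N, C) array whose row i is the discriminator w_i
    """
    n, c = feats.shape
    adj = adj.astype(feats.dtype)
    # X_i^T X_i = sum_j A_ij x_j x_j^T; every Gram matrix has shape (C, C).
    gram = np.einsum("ij,jc,jd->icd", adj, feats, feats)
    # Assumed labels y_ij = 1 if j == i else 0, hence X_i^T y_i = A_ii * x_i.
    rhs = np.diag(adj)[:, None] * feats
    reg = lam * np.eye(c, dtype=feats.dtype)
    # Batched solve over the leading axis: (N, C, C) against (N, C, 1).
    return np.linalg.solve(gram + reg, rhs[..., None])[..., 0]

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]])
w = solve_discriminators(feats, adj, lam=0.1)
# scores[j, i] is the response of discriminator i to the feature of target j.
scores = feats @ w.T
```

Because the neighbor count only enters through the sum inside the Gram matrix, targets with different numbers of neighbors still stack into one tensor, which is the reason a model per target remains cheap on a GPU.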
+ Question: What to Train?
Train a CNN
- 2 images form a pair
- backbone and heads
- fuse like CenterNet (1x1 Conv)
We use the model pre-trained on COCO [24] to initialize the weights of backbone network and finetune them during offline training.
+ Question: If the branch is so simple, how can it benefit from the SOTA VOT trackers?
Inference
at timestep \(t\)
- we have living tracks \(\{\mathcal{T}_{i}\}_{i=1}^{M}\) until time \(t\)
- use a Kalman filter to predict the current location
- perform CenterNet-like detection and obtain SOT features
- construct neighbors
SOT features from centers
link detections to predictions
- match to existing tracks
appearance metric
\(\mathbf{w}_{i}^{*}\) is the discriminator of Track i
only consider the neighbors of a target track
motion metric (Kalman distance)
fuse the scores
- 1st: Hungarian matching between the tracks and \(\mathbf{C}_{det}^{t}\) using \(\mathbf{v}\)
\(\mathbf{P}\): matched
\(\mathbf{Q}\): unmatched tracks
\(\mathbf{K}\): unmatched detections
- 2nd: Hungarian matching on the leftovers using IoU
\(\mathbf{Q}\): unmatched tracks
\(\mathbf{K}\): unmatched detections
insert into \(\mathbf{P}\)
- Updating \(\mathcal{T}_{i} = \{\mathbf{c}_{i}^{\tau},((\mathbf{X}_{i}^{\tau},\mathbf{y}_{i}^{\tau}),\mathbf{w}_{i}^{t*})\}_{\tau = s}^{t-1}\)
for existing tracks
for new tracks, append \((\mathbf{X}_{i}^{\tau},\mathbf{y}_{i}^{\tau}),\mathbf{w}_{i}^{t*}\)
for unmatched tracks, keep for 30 frames
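The two matching rounds and the 30-frame rule can be sketched as below. This is an illustration under several assumptions, not the paper's implementation: the thresholds `v_min` and `iou_min`, the function names, and the toy score matrices are made up, and a brute-force search over permutations stands in for the Hungarian algorithm (it finds the same optimal assignment but is only practical for a handful of targets).

```python
from itertools import permutations

def assign(score, min_score):
    """Optimal one-to-one assignment maximizing the total score (brute force)."""
    n_rows = len(score)
    n_cols = len(score[0]) if n_rows else 0
    if n_rows == 0 or n_cols == 0:
        return []
    if n_rows <= n_cols:
        candidates = (list(zip(range(n_rows), cols))
                      for cols in permutations(range(n_cols), n_rows))
    else:
        candidates = (list(zip(rows, range(n_cols)))
                      for rows in permutations(range(n_rows), n_cols))
    best_pairs, best_total = [], float("-inf")
    for pairs in candidates:
        total = sum(score[i][j] for i, j in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    # Gate: a pair only counts as a match if its own score is high enough.
    return [(i, j) for i, j in best_pairs if score[i][j] >= min_score]

def two_stage_match(fused, iou, n_dets, v_min=0.5, iou_min=0.3):
    """Return (P, Q, K): matched pairs, unmatched tracks, unmatched detections."""
    n_tracks = len(fused)
    P = assign(fused, v_min)                      # 1st round: fused score v
    Q = [i for i in range(n_tracks) if i not in {p[0] for p in P}]
    K = [j for j in range(n_dets) if j not in {p[1] for p in P}]
    sub = [[iou[i][j] for j in K] for i in Q]     # 2nd round: IoU on leftovers
    P = P + [(Q[a], K[b]) for a, b in assign(sub, iou_min)]
    Q = [i for i in range(n_tracks) if i not in {p[0] for p in P}]
    K = [j for j in range(n_dets) if j not in {p[1] for p in P}]
    return sorted(P), Q, K

def update_lost_counters(lost, matched_tracks, unmatched_tracks, max_age=30):
    """Count consecutive misses per track id; return ids dropped after max_age."""
    for i in matched_tracks:
        lost[i] = 0
    for i in unmatched_tracks:
        lost[i] = lost.get(i, 0) + 1
    dead = [i for i, n in lost.items() if n > max_age]
    for i in dead:
        del lost[i]
    return dead

fused = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.2]]
iou = [[0.7, 0.0, 0.0], [0.0, 0.6, 0.0], [0.0, 0.0, 0.6]]
P, Q, K = two_stage_match(fused, iou, n_dets=3)
```

In the toy data, track 2 has a weak fused score and fails the first round, but it is recovered in the second round through IoU. Whether a track is dropped on exactly the 30th or the 31st missed frame is a boundary choice made here, not something the notes specify.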
Discussions/Experiments
How to use public detections
Similar to Tracktor and CenterTrack
SOT/specific vs. general/ReID
refer to Section 5.3 (Ablation Study) of the paper
- specific discrimination (SD)
- rather than the general discrimination (GD)
specific discrimination enhances tracking in crowded scenes.
coarse annotation 😂
Efficiency of Model-per-target
Thanks to GPU 😃
Conclusion
NFL (no free lunch)
FairMOT
- General reID
- fully offline training
- MOT17 (sparse scenes)
SOTMOT
- Neighbor ReID
- offline training
- online training
- performs well on MOT20 (crowded scenes)
+ Question: How can it benefit from the SOTA VOT trackers?
It cannot!
This article is from cnblogs (博客园), author: ZXYFrank. When reposting, please cite the original link: https://www.cnblogs.com/zxyfrank/p/16051832.html