
SOTMOT: Improving Multiple Object Tracking with Single Object Tracking — Detailed Notes in English

\[MOT \neq SOT \times N \]

takeaways


In principle, there is no doubt that a multiple object tracker can be realized by running multiple single object trackers.

Background

The spirit of our approach, learning auxiliary associative embeddings simultaneously with the main task, also shows good performance in many other vision tasks.

SOT and MOT

  • SOT
    • discriminate target from local backgrounds
  • MOT
    • discriminate targets from one another, because most backgrounds can be filtered out by the detector

If we integrate SOT into MOT directly, two problems arise:

  • inappropriate/overdone discrimination
  • multiple targets will make SOT really slow

MOT

one-shot and tracking-by-detection

  • JDE
    • YOLO(Anchor-based)
  • FairMOT
    • ResNet-34 + DLA
    • ReID branch

Model

Architecture

😂 inconsistent symbol

Backbone

DLA-34

SOT Branch

figure of SOT branch

+ Question: Can SOT really be performed with only a branch?

The SOT branch trains a separate SOT model per target in one frame and locates the targets in another frame

  • take in \(\mathbf{F}_{backbone}\)
    • \(\mathbf{F}_{SOT} \in \R^{C_{SOT} \times H \times W} = \text{3-Convs}(\mathbf{F}_{backbone})\)
    • 3x3, stride = 1, BN & ReLU
  • given center \(\mathbf{c} = \{x,y\}\)
    • \(\mathbf{F}_{object} = \mathbf{F}_{SOT}(x,y) \in \R^{C_{SOT}}\)
    • index-based entry extraction
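The index-based entry extraction above can be sketched in NumPy. The shapes and names here are my own assumptions for illustration, not the paper's code:

```python
import numpy as np

# Assumed shapes for illustration: C_SOT channels on an H x W feature map.
C_SOT, H, W = 128, 76, 136
F_sot = np.random.rand(C_SOT, H, W).astype(np.float32)

def extract_object_features(F_sot, centers):
    """Index-based entry extraction: one C_SOT-dim embedding per target center (x, y)."""
    xs = np.array([x for x, y in centers])
    ys = np.array([y for x, y in centers])
    return F_sot[:, ys, xs].T  # (N, C_SOT)

feats = extract_object_features(F_sot, [(10, 20), (55, 40)])
```

No cropping or RoI pooling is involved: each target's feature is literally one spatial entry of \(\mathbf{F}_{SOT}\).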

Train

  • given centers in a training image \(\mathbf{C}_{targets} = \{\mathbf{c}_{i}\}_{i= 1}^{N}\)
    • calculate a neighborhood matrix

\[\mathbf{A}_{i,j} = \begin{cases} 1 & \text{if } \min(|x_{i}-x_{j}|,|y_{i}-y_{j}|) \leq r_{neighbor} \\ 0 & \text{otherwise} \end{cases}\]

  • select the neighbors to construct data for CLASSIFICATION
  • \(\mathbf{X}_{i} = \{\mathbf{x}_{j} \mid \forall j:\mathbf{A}_{i,j} = 1\}\)
  • ridge regression to obtain \(\mathbf{w}^{*}\)
    • the neighborhood size is fixed, so during training the Gram matrices \(\mathbf{X}_{i}^{\top}\mathbf{X}_{i}\) of different targets can be computed in a batched manner
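The two steps above, building the neighborhood matrix and solving the per-target ridge regressions in a batch, can be sketched as follows. The neighbor rule mirrors the equation above; the regularization weight `lam` is an assumption:

```python
import numpy as np

def neighbor_matrix(centers, r):
    """A[i, j] = 1 iff min(|x_i - x_j|, |y_i - y_j|) <= r."""
    d = np.abs(centers[:, None, :] - centers[None, :, :])  # (N, N, 2) coordinate offsets
    return (d.min(axis=-1) <= r).astype(np.float32)

def batched_ridge(X, y, lam=0.1):
    """Closed-form ridge regression per target: w* = (X^T X + lam I)^{-1} X^T y.
    X: (B, K, C) neighbor features, y: (B, K) labels -> (B, C) discriminator weights."""
    B, K, C = X.shape
    G = X.transpose(0, 2, 1) @ X + lam * np.eye(C)          # batched Gram matrices
    return np.linalg.solve(G, X.transpose(0, 2, 1) @ y[..., None])[..., 0]
```

Because every target's Gram matrix has the same fixed size \(C_{SOT} \times C_{SOT}\), `np.linalg.solve` handles all targets in one batched call.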

+ Question: What to Train? 

Train the whole CNN end-to-end:

  • 2 images form a training pair
    • shared backbone and heads
    • features fused with a 1x1 Conv, as in CenterNet

We use the model pre-trained on COCO [24] to initialize the weights of backbone network and finetune them during offline training.

+ Question: If the branch is so simple, 
+ how can it benefit from the SOTA VOT Trackers?

Inference

at timestep \(t\)

  • we have living tracks \(\{\mathcal{T}_{i}\}_{i=1}^{M}\) up to time \(t-1\)

where

\[\mathcal{T}_{i} = \{\mathbf{c}_{i}^{\tau},((\mathbf{X}_{i}^{\tau},\mathbf{y}_{i}^{\tau}),\mathbf{w}_{i}^{t*})\}_{\tau = s}^{t-1} \]

  • use a Kalman filter to predict the current locations

\[\mathbf{M} = \mathbf{C}_{pred}^{t} = \{\hat{\mathbf{c}}_{i}^{t}\}_{i=1}^{M} \]
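The prediction step can be sketched with a standard constant-velocity Kalman model. The state layout and process-noise scale here are assumptions, not details taken from the paper:

```python
import numpy as np

def kf_predict(x, P, dt=1.0, q=1e-2):
    """Constant-velocity predict: x = [cx, cy, vx, vy], P: (4, 4) covariance."""
    F = np.eye(4)
    F[0, 2] = F[1, 3] = dt                 # position advances by velocity * dt
    x_pred = F @ x
    P_pred = F @ P @ F.T + q * np.eye(4)   # inflate uncertainty with process noise
    return x_pred, P_pred
```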

  • perform CenterNet-like Detection
  • and obtain SOT features

\[\mathbf{F}_{backbone} = \text{DLA-34}(\text{Image}) \]

\[\mathbf{F}_{SOT} \in \R^{128\times H \times W} = \text{SOTHead[3-Conv]}(\mathbf{F}_{backbone}) \]

\[\mathbf{C}_{det}^{t},\mathbf{S}_{det}^{t} = \text{CenterNet}(\mathbf{F}_{backbone}) \]

\[\mathbf{N} = \mathbf{C}_{det}^{t} = \{\mathbf{c}_{i}^{t}\}_{i=1}^{N} \]

  • construct neighbors

SOT features are read out at the centers; for each track \(i\), the candidate matrix \(\mathbf{Z}_{i}^{t}\) gathers the SOT features of the detections falling in the neighborhood of its predicted center, linking detections to predictions:

\[\mathbf{Z}_{i}^{t} = \mathbf{F}_{SOT} [\text{Neighbor}(\hat{\mathbf{c}}_{i}^{t},\mathbf{C}_{det}^{t})] \]

  • match to existing tracks

appearance metric

\[\mathbf{v}_{i} = \mathbf{Z}_{i}^{t}\mathbf{w}_{i}^{*} \]

\(\mathbf{w}_{i}^{*}\) is the discriminator of track \(i\);
only the neighbors of a target track are scored

motion metric (Kalman distance)

fuse the two scores
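Scoring candidates with a track's discriminator and fusing with the motion term can be sketched as below. The linear fusion form and the weight `lam` are my assumptions; the paper's exact fusion formula is not reproduced here:

```python
import numpy as np

def fused_score(Z, w, motion_dist, lam=0.7):
    """Appearance score v = Z @ w per candidate, fused with a motion distance.
    Z: (K, C) neighbor features, w: (C,) ridge weights, motion_dist: (K,)."""
    v_app = Z @ w                    # (K,) appearance responses
    v_motion = np.exp(-motion_dist)  # map a distance to a similarity in (0, 1]
    return lam * v_app + (1 - lam) * v_motion
```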

  • 1st: Hungarian matching between the living tracks \(\{\mathcal{T}_{i}\}_{i=1}^{M}\) and the detections \(\mathbf{C}_{det}^{t}\), using the fused score \(\mathbf{v}\)

\(\mathbf{P}\): matched
\(\mathbf{Q}\): unmatched tracks
\(\mathbf{K}\): unmatched detections

  • 2nd: Hungarian matching on IoU between the unmatched tracks \(\mathbf{Q}\) and the unmatched detections \(\mathbf{K}\)

newly matched pairs are inserted into \(\mathbf{P}\)
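The cascaded association can be sketched with SciPy's Hungarian solver; the gating threshold and cost values below are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(cost, thresh):
    """One Hungarian stage: matched (track, det) pairs plus the leftovers.
    Pairs whose cost exceeds thresh are rejected (threshold value is an assumption)."""
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= thresh]
    m_r = {r for r, _ in matches}
    m_c = {c for _, c in matches}
    unmatched_tracks = [r for r in range(cost.shape[0]) if r not in m_r]
    unmatched_dets = [c for c in range(cost.shape[1]) if c not in m_c]
    return matches, unmatched_tracks, unmatched_dets

# Stage 1 uses the fused score (as a cost); stage 2 reruns on (1 - IoU) for the leftovers.
cost1 = np.array([[0.1, 0.9], [0.8, 0.2], [0.7, 0.6]])
P, Q, K = associate(cost1, thresh=0.5)
```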

  • Updating \(\mathcal{T}_{i} = \{\mathbf{c}_{i}^{\tau},((\mathbf{X}_{i}^{\tau},\mathbf{y}_{i}^{\tau}),\mathbf{w}_{i}^{t*})\}_{\tau = s}^{t-1}\)

for existing tracks, append the new center and samples and refresh \(\mathbf{w}_{i}^{*}\) by re-solving the ridge regression

for new tracks, initialize \((\mathbf{X}_{i}^{\tau},\mathbf{y}_{i}^{\tau}),\mathbf{w}_{i}^{t*}\)

for unmatched tracks, keep them alive for 30 frames before removal


Discussions/Experiments

How to use public detections

Similar to Tracktor and CenterTrack

SOT/specific vs. general/ReID

refer to paper 5.3 Ablation Study

  • specific discrimination (SD)
  • rather than the general discrimination (GD)

Specific discrimination enhances tracking in crowded scenes.

coarse annotation 😂

Efficiency of Model-per-target

Thanks to GPU 😃

Conclusion

NFL (no free lunch)

  • FairMOT
    • general ReID
    • fully offline training
    • MOT17 (sparse scenes)
  • SOTMOT
    • neighbor ReID
    • offline training
    • online training (per-track ridge regression)
    • performs well on MOT20 (crowded scenes)
+ Question: How can it benefit from the SOTA VOT trackers? 

It cannot!

posted @ 2022-03-24 20:52  ZXYFrank