Physical Adversarial Textures That Fool Visual Object Tracking

2020-02-24 23:35:19

Paper: ICCV-2019

Related Works:

1). SPARK: Spatial-aware Online Incremental Attack Against Visual Tracking [Paper]

2). Defense-GAN: Protecting Classifiers against Adversarial Attacks using Generative Models [Paper] [Code]

3). STA: Adversarial Attacks on Siamese Trackers [Paper]

4). Universal Adversarial Attack on Attention and the Resulting Dataset DAmageNet [Paper]

1. Background and Motivation:

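This paper is the first to apply adversarial attack techniques to a visual tracker. It proposes Physical Adversarial Textures (PATs) and uses them to successfully attack GOTURN, a regression-based tracker. The authors argue that attacking a tracking system is more challenging than attacking classification or detection models. Because a tracker adapts to changes in the target's appearance, an adversary must be general enough to remain effective as the object moves and changes. In addition, a tracker such as GOTURN only considers a sub-region of the full frame, so only a small part of the PAT may be in view and not obstructed, yet it must still be potent. Finally, shifting the bounding box (BBox) slightly in any single frame is not enough to break the tracker; a robust adversary must make the tracking system consistently fail to follow the target object.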

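Overall, the contributions of the paper are as follows:

1) It is the first to demonstrate the effect of adversaries on sequential tracking tasks, which matters for applications such as video surveillance, drones, and autonomous driving;

2) It introduces the concept of "guided adversarial losses", which strikes a middle-ground between targeted and non-targeted adversarial objectives; experiments show that these losses improve convergence and adversarial strength;

3) It studies Expectation Over Transformation (EOT);

4) It demonstrates the sim-to-real transfer of PATs created with a non-photorealistic simulator and diffuse-only materials.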

2. Object Tracking Networks:

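Many learning-based tracking algorithms have been designed in recent years, GOTURN among them. Other tracking approaches, such as those based on feature-space cross-correlation or tracking-by-detection, are also options, but this paper focuses on the GOTURN model to validate the proposed method.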

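As shown in Figure 2, given the target's BBox location in the previous frame $f_{j-1}$, GOTURN crops a template patch from that frame. It also crops a search region from the current frame in which to localize the target. Both patches are resized to 227×227 and fed into a CNN; the resulting features are concatenated and passed through fully connected (fc) layers to predict the target location in the current frame. For more details on GOTURN, see the dedicated blog post.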
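The input preparation described above can be sketched in a few lines of NumPy. This is only an illustration, not the GOTURN code: the function names `crop_and_resize` and `goturn_inputs`, the context factor of 2 around the box, the nearest-neighbor resampling, and the border clamping are all assumptions made for this sketch.

```python
import numpy as np

def crop_and_resize(frame, box, context=2.0, out_size=227):
    """Crop a region centered on `box` and resample it to out_size x out_size.

    `box` is (x1, y1, x2, y2) in pixels. `context` enlarges the crop around
    the box (assumed value). Nearest-neighbor sampling and border clamping
    are simplifications chosen to keep the sketch short.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * context, (y2 - y1) * context
    # Sample positions of the output grid, expressed in frame coordinates.
    xs = np.linspace(cx - w / 2.0, cx + w / 2.0, out_size)
    ys = np.linspace(cy - h / 2.0, cy + h / 2.0, out_size)
    xi = np.clip(np.rint(xs).astype(int), 0, frame.shape[1] - 1)
    yi = np.clip(np.rint(ys).astype(int), 0, frame.shape[0] - 1)
    return frame[yi[:, None], xi[None, :]]

def goturn_inputs(prev_frame, cur_frame, prev_box):
    """Build the (template, search) pair, both centered on the previous box."""
    template = crop_and_resize(prev_frame, prev_box)
    search = crop_and_resize(cur_frame, prev_box)
    return template, search
```

Both crops are taken around the previous-frame box, which is why the tracker only ever sees a sub-region of each frame.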

3. Attacking Regression Networks:

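For classification tasks, an adversarial example is defined as a slightly-perturbed version of a source image that satisfies two conditions:

adversarial output: the attacked model misclassifies the example, i.e. it no longer predicts the correct label.

perceptual similarity: to the human eye, the adversarial example looks almost identical to the source image.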

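The authors discuss the adjustments needed when attacking a regression task. Earlier work has already proposed adversarial algorithms against regression tasks, but, in the paper's words, there is still a general lack of analysis on the strength and properties of adversaries as a function of different attack objectives. In this work, the authors consider different optimization objectives and formulate a new family of guided adversarial losses.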

3.1. Adversarial Strength:

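Typically, a regression output is characterized as adversarial by thresholding a task-specific error metric. This metric can also be used to measure adversarial strength. For example, adversaries for human pose estimation can be quantified by the relative error between the predictions and the ground truth. When fooling a visual tracker, the ultimate goal is to corrupt the predicted target location over time. The authors therefore consider a video F = [f1, f2, ... , fN] in which the target moves through a scene containing the adversarial texture X, and quantify adversarial strength through the overlap between the tracker's predictions and the ground truth (GT). In this paper, adversarial strength is defined by averaging the mean-Intersection-Over-Union-difference metric, $\mu IOU_d$.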
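As a reminder of the overlap measure this metric builds on, here is a minimal sketch of Intersection-Over-Union between two boxes and its mean over a sequence. It does not reproduce the paper's exact definition of $\mu IOU_d$; the function names and the (x1, y1, x2, y2) box convention are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-Over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pred_boxes, gt_boxes):
    """Average IoU between predicted and ground-truth boxes over all frames."""
    scores = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(scores) / len(scores)
```

A tracker that stays on target keeps the mean IoU high, while a strong adversary drives it down over the course of the video.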

3.2. Perceptual Similarity:

Perceptual similarity is usually measured by the distance between the source image and its perturbed variant, e.g. using the Euclidean norm in the RGB colorspace. With this work, the authors also want to raise awareness that colorful-looking art can be harmful to vision models.

3.3. Optimizing for Adversarial Behaviors:

The paper uses several kinds of loss functions to optimize the adversary:

1). the baseline non-targeted loss: maximizes the victim model's training loss, thus causing it to become generally confused;

2). targeted losses also apply the victim model's training loss, but to minimize the distance to an adversarial target output;

3). guided losses: a middle ground between non-targeted and targeted losses;

4). hybrid losses: use a weighted linear combination of the above losses to gain adversarial strength and speed up the attack.

To fool the object tracker, the authors consider several task-specific losses.

Among them, $L_{ps}$ is a Lagrangian-relaxed loss, which the authors use mainly to enforce perceptual similarity.
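To make the four loss families concrete, the following toy sketch expresses each one for a bounding-box regressor. These are not the paper's equations: the L1 distance as the training loss, the area-based guided term, the squared-Euclidean perceptual term, and all function names and weights are assumptions made for illustration only.

```python
import numpy as np

def l1(a, b):
    """L1 distance between two boxes (x1, y1, x2, y2)."""
    return float(np.abs(np.asarray(a, float) - np.asarray(b, float)).sum())

def non_targeted_loss(pred, gt):
    # Minimizing this maximizes the (assumed L1) training loss.
    return -l1(pred, gt)

def targeted_loss(pred, target):
    # Pull the prediction toward a specific adversarial target box.
    return l1(pred, target)

def guided_area_loss(pred):
    # Hypothetical guided term: penalize the predicted box area, steering the
    # prediction to shrink without prescribing an exact target box.
    x1, y1, x2, y2 = pred
    return abs(x2 - x1) * abs(y2 - y1)

def perceptual_loss(texture, source):
    # Squared Euclidean distance in RGB space, standing in for L_ps.
    diff = np.asarray(texture, float) - np.asarray(source, float)
    return float((diff ** 2).sum())

def hybrid_loss(pred, gt, target, texture, source,
                w_nt=1.0, w_t=1.0, w_g=0.5, w_ps=0.1):
    """Weighted linear combination of the terms above (weights are arbitrary)."""
    return (w_nt * non_targeted_loss(pred, gt)
            + w_t * targeted_loss(pred, target)
            + w_g * guided_area_loss(pred)
            + w_ps * perceptual_loss(texture, source))
```

The guided term sits between the other two: it specifies which attribute of the output should change, but not the exact value the output must reach.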

4. Physical Adversarial Textures:

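In this section, the authors discuss how the attacks above are used to produce Physical Adversarial Textures (PATs). A PAT is displayed on a digital poster; when the camera captures it near the tracked target, it causes the victim model to lose the target. In this work, the authors mount a white-box attack on GOTURN, meaning they have access to the GOTURN network weights and can therefore backpropagate through it. The paper focuses on tracking people and simulated robots, and assumes that the tracker has been trained on these specific categories.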

To address several challenges in fooling a temporal tracking model, the authors use the Expectation Over Transformation (EOT) algorithm, which minimizes the expected loss $E[L]$ over a minibatch of $B$ scenes imaged under diverse conditions.
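The idea of minimizing an expected loss over a minibatch of randomly transformed scenes can be shown with a deliberately tiny toy problem. Everything here is an assumption for illustration: the "victim" is a linear model `w`, the "scene" transformation is a random gain and offset standing in for lighting changes, and the loss is a squared error toward a target output. The paper's actual pipeline uses a renderer and the GOTURN network instead.

```python
import numpy as np

def eot_attack(w, target, steps=200, batch=32, lr=0.05, seed=0):
    """Optimize a texture x so that the toy victim outputs `target`
    in expectation over random gain/offset transformations."""
    rng = np.random.default_rng(seed)
    x = np.full(w.shape, 0.5)                     # start from a gray texture
    history = []
    for _ in range(steps):
        # Sample a minibatch of B random imaging conditions.
        gains = rng.uniform(0.7, 1.3, size=batch)
        offsets = rng.uniform(-0.1, 0.1, size=batch)
        imgs = gains[:, None] * x[None, :] + offsets[:, None]   # shape (B, d)
        errs = imgs @ w - target                                # shape (B,)
        history.append(float(np.mean(errs ** 2)))               # estimate of E[L]
        # Analytic gradient of the minibatch mean loss with respect to x.
        grad = np.mean(2.0 * errs[:, None] * gains[:, None] * w[None, :], axis=0)
        # Gradient step, then clip to the valid color range [0, 1].
        x = np.clip(x - lr * grad, 0.0, 1.0)
    return x, history
```

For example, `eot_attack(np.linspace(-1.0, 1.0, 16), target=2.0)` returns the optimized texture and the per-step loss estimates. Because each step averages the gradient over freshly sampled conditions, the resulting texture is tuned for the whole distribution of transformations rather than for any single view.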

4.1. Modeling rendering and lighting

To optimize the loss with respect to the texture of a physical poster, the authors differentiate through the rendering process. Rendering can be broken down into two steps:

projecting the texture onto the surface of a physical item and then onto the camera's frame, and

shading the color of each frame pixel depending on light sources and material types.
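The two rendering steps can be sketched in a forward-only form as follows. This is a simplified stand-in, not the paper's renderer: a single 3×3 homography `H` replaces the full projection chain, sampling is nearest-neighbor, and shading uses one directional light with a Lambertian (diffuse-only) term plus an ambient term. All function and parameter names are hypothetical.

```python
import numpy as np

def project_texture(texture, H, out_h, out_w):
    """Step 1: map texture pixels into the camera frame via homography H.

    Uses inverse warping: each frame pixel looks up its texture pixel
    through H^-1. Returns the frame and a mask of pixels covered by the texture.
    """
    H_inv = np.linalg.inv(H)
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])  # 3 x N
    src = H_inv @ pts
    u = np.rint(src[0] / src[2]).astype(int)     # texture column index
    v = np.rint(src[1] / src[2]).astype(int)     # texture row index
    th, tw = texture.shape[:2]
    valid = (u >= 0) & (u < tw) & (v >= 0) & (v < th)
    frame = np.zeros((out_h * out_w, 3))
    frame[valid] = texture[v[valid], u[valid]]
    return frame.reshape(out_h, out_w, 3), valid.reshape(out_h, out_w)

def shade_diffuse(albedo, normal, light_dir, light_intensity, ambient):
    """Step 2: diffuse-only shading with one directional light."""
    n = np.asarray(normal, float)
    l = np.asarray(light_dir, float)
    n = n / np.linalg.norm(n)
    l = l / np.linalg.norm(l)
    lambert = max(0.0, float(n @ l))             # cosine of the incidence angle
    return np.clip(albedo * (ambient + lambert * light_intensity), 0.0, 1.0)
```

In this sketch every covered frame pixel is a scaled copy of one texture pixel, so the frame is a linear function of the texture wherever no clipping occurs. That linearity is what lets a gradient computed on the frame be passed back to the texture.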

4.2. PAT Attack: