Abstract—We present a method for real-time 3D object instance detection that does not require a time-consuming training stage, and can handle untextured objects. At its core, our approach is a novel image representation for template matching designed to be robust to small image transformations. This robustness is based on spread image gradient orientations and allows us to test only a small subset of all possible pixel locations when parsing the image, and to represent a 3D object with a limited set of templates. In addition, we demonstrate that if a dense depth sensor is available we can extend our approach for an even better performance also taking 3D surface normal orientations into account. We show how to take advantage of the architecture of modern computers to build an efficient but very discriminant representation of the input images that can be used to consider thousands of templates in real time. We demonstrate in many experiments on real data that our method is much faster and more robust with respect to background clutter than current state-of-the-art methods.
Index Terms—Computer vision, real-time detection and object recognition, tracking, multimodality template matching.
1 INTRODUCTION
REAL-TIME object instance detection and learning are two important and challenging tasks in computer vision. Among the application fields that drive development in this area, robotics especially has a strong need for computationally efficient approaches as autonomous systems continuously have to adapt to a changing and unknown environment and to learn and recognize new objects.
For such time-critical applications, real-time template matching is an attractive solution because new objects can be easily learned and matched online, in contrast to statistical-learning techniques that require many training samples and are often too computationally intensive for real-time performance [1], [2], [3], [4], [5]. The reason for this inefficiency is that these learning approaches aim at detecting unseen objects from certain object classes rather than a priori known object instances from multiple viewpoints. Classical template matching addresses the latter task: Generalization is performed not over the object class but over the viewpoint sampling. While this is considered an easier task, it is far from trivial, as the data still exhibit significant changes in viewpoint, illumination, and occlusion between the training and the runtime sequences.
When the object is textured enough for keypoints to be found and recognized on the basis of their appearance, this difficulty has been successfully addressed by defining patch descriptors that can be computed quickly and used to characterize the object [6]. However, this kind of approach will fail on textureless objects such as those of Fig. 1, whose appearance is often dominated by their projected contours.
To overcome this problem, we propose a novel approach based on real-time template recognition for rigid 3D object instances, where the templates can be both built and matched very quickly. We will show that this makes it very easy and virtually instantaneous to learn new incoming objects by simply adding new templates to the database while maintaining reliable real-time recognition.
However, we also wish to keep the efficiency and robustness of statistical methods, as they learn how to reject unpromising image locations very quickly and tend to be very robust because they can generalize well from the training set. We therefore propose a new image representation that holds local image statistics and is fast to compute. It is designed to be invariant to small translations and deformations of the templates, which has been shown to be a key factor for generalization to different viewpoints of the same object [6]. In addition, it allows us to quickly parse the image by skipping many locations without loss of reliability.
Our approach is related to recent and efficient template matching methods [7], [8] which consider only images and their gradients to detect objects. As such, they work even when the object is not textured enough to use feature point techniques, and learn new objects virtually instantaneously. In addition, they can directly provide a coarse estimation of the object pose, which is especially important for robots which have to interact with their environment. However, similarly to previous template matching approaches [9], [10], [11], [12], they suffer severe degradation of performance or even failure in the presence of strong background clutter such as the one displayed in Fig. 1.
Fig. 1. Our method can detect textureless 3D objects in real time under different poses over heavily cluttered background using gradient orientation.
We therefore propose a new approach that addresses this issue while being much faster for larger templates. Instead of making the templates invariant to small deformations and translations by considering dominant orientations only as in [7], we build a representation of the input images which has similar invariance properties but considers all gradient orientations in local image neighborhoods. Together with a novel similarity measure, this prevents problems due to too strong gradients in the background, as illustrated by Fig. 1.
To avoid slowing down detection when using this finer method, we have to carefully consider how modern CPUs work. A naive implementation would result in many “memory cache misses,” which slow down the computations, and we thus show how to structure our image representation in memory to prevent these and to additionally exploit heavy SSE parallelization. We consider this as an important contribution: Because of the nature of the hardware improvements, it is no longer guaranteed that legacy code will run faster on the new versions of CPUs [13]. This is particularly true for computer vision, where algorithms are often computationally expensive. It is now required to take the CPU architecture into account, which is not an easy task.
For the case where a dense depth sensor is available, we describe an extension of our method where additional depth data are used to further increase the robustness by simultaneously leveraging the information of the 2D image gradients and 3D surface normals. We propose a method that robustly computes 3D surface normals from dense depth maps in real time, making sure to preserve depth discontinuities on occluding contours and to smooth out discretization noise of the sensor. The 3D normals are then used in a similar way, together with the image gradients.
In the remainder of the paper, we first discuss related work before we explain our approach. We then discuss the theoretical complexity of our approach. We finally present experiments and quantitative evaluations for challenging scenes.
2 RELATED WORK
Template matching has played an important role in tracking-by-detection applications for many years. This is due to its simplicity and its capability of handling different types of objects. It neither needs a large training set nor a time-consuming training stage, and can handle low-textured or textureless objects, which are, for example, difficult to detect with feature point-based methods [6], [14]. Unfortunately, this increased robustness often comes at the cost of an increased computational load that makes naive template matching inappropriate for real-time applications. So far, several works have attempted to reduce this complexity.
An early approach to template matching [12] and its extension [11] use the Chamfer distance between the template and the input image contours as a dissimilarity measure. For instance, Gavrila and Philomin [11] introduced a coarse-to-fine approach in shape and parameter space using Chamfer Matching [9] on the Distance Transform (DT) of a binary edge image. Chamfer Matching minimizes a generalized distance between two sets of edge points. Although it is fast when the Distance Transform is used, its disadvantage is its sensitivity to outliers, which often result from occlusions.
Another common measure on binary edge images is the Hausdorff distance [15]. It measures the maximum of all distances from each edge point in the image to its nearest neighbor in the template. However, it is sensitive to occlusions and clutter. Huttenlocher et al. [10] tried to avoid that shortcoming by introducing a generalized Hausdorff distance which only computes the maximum of the k-th largest distances between the image and the model edges and the l-th largest distances between the model and the image edges. This makes the method robust against a certain percentage of occlusions and clutter. Unfortunately, a prior estimate of the background clutter in the image is required but not always available. Additionally, computing the Hausdorff distance is computationally expensive and prevents its real-time application when many templates are used.
Both Chamfer Matching and the Hausdorff distance can easily be modified to take the orientation of edge points into account. This drastically reduces the number of false positives as shown in [12], but unfortunately also increases the computational load.
The method of [16] is also based on the Distance Transform; however, it is invariant to scale changes and robust enough against planar perspective distortions to do real-time matching. Unfortunately, it is restricted to objects with closed contours, which are not always available.
All these methods use binary edge images obtained with a contour extraction algorithm, using the Canny detector [17], for example, and they are very sensitive to illumination changes, noise, and blur. For instance, if the image contrast is lowered, the number of extracted edge pixels progressively decreases, which has the same effect as increasing the amount of occlusion.
The method proposed in [18] tries to overcome these limitations by considering the image gradients in contrast to the image contours. It relies on the dot product as a similarity measure between the template gradients and those in the image. Unfortunately, this measure rapidly declines with the distance to the object location or when the object appearance is even slightly distorted. As a result, the similarity measure must be evaluated densely and with many templates to handle appearance variations, making the method computationally costly. Using image pyramids provides some speed improvements; however, fine but important structures tend to be lost if one does not carefully sample the scale space.
Contrary to the above-mentioned methods, there are also approaches addressing the general visual recognition problem: They are based on statistical learning and aim at detecting object categories rather than a priori known object instances. While they are better at category generalization, they are usually much slower during learning and runtime, which makes them unsuitable for online applications.
For example, Amit et al. [19] proposed a coarse-to-fine approach that spreads gradient orientations in local neighborhoods. The amount of spreading is learned for each object part in an initial stage. While this approach, used for license plate reading, achieves high recognition rates, it is not real-time capable.
Histograms of Oriented Gradients (HOG) [1] is another related and very popular method. It statistically describes the distribution of intensity gradients in localized portions of the image. The descriptor is computed on a dense grid with uniform intervals and uses overlapping local histogram normalization for better performance. It has proven to give reliable results but tends to be slow due to its computational complexity.
Ferrari et al. [4] provided a learning-based method that recognizes objects via a Hough-style voting scheme with a nonrigid shape matcher on object boundaries of a binary edge image. The approach applies statistical methods to learn the model from a few images that are only constrained by a bounding box around the object. While giving very good classification results, the approach is neither appropriate for real-time object tracking, due to its expensive computation, nor precise enough to return the accurate pose of the object. Additionally, it is sensitive to the output of the binary edge detector, an issue discussed above.
Kalal et al. [20] very recently developed an online learning-based approach. They showed how a classifier can be trained online in real time, with a training set generated automatically. However, as we will see in the experiments, this approach is only suitable for smooth background transitions and not appropriate to detect known objects over unknown backgrounds.
In contrast to the above-mentioned learning-based methods, there are also approaches that are specifically trained on different viewpoints. As with our template-based approach, they can detect objects under different poses, but they typically require a large amount of training data and a long offline training phase. For example, in [5], [21], [22], one or several classifiers are trained to detect faces or cars under various views.
More recent approaches for 3D object detection are related to object class recognition. Stark et al. [23] rely on 3D CAD models and generate a training set by rendering them from different viewpoints. Liebelt and Schmid [24] combine a geometric shape and pose prior with natural images. Su et al. [25] use a dense, multiview representation of the viewing sphere combined with a part-based probabilistic representation. While these approaches are able to generalize to the object class, they are not real-time capable and require expensive training.
Among the related works that also take depth data into account, most address pedestrian detection [26], [27], [28], [29]. They use three kinds of cues: image intensity, depth, and motion (optical flow). The most recent approach, by Enzweiler et al. [26], builds part-based models of pedestrians in order to handle occlusions caused by other objects, and not only the self-occlusions modeled in other approaches [27], [29]. Besides pedestrian detection, an approach to object classification, pose estimation, and reconstruction was introduced by Sun et al. [30]. The training data set is composed of depth and image intensities, while the object classes are detected using a modified Hough transform. While quite effective in real applications, these approaches still require exhaustive training on large training data sets. This is usually prohibitive in robotic applications, where the robot has to explore an unknown environment and learn new objects online.
As mentioned in the introduction, we recently proposed a method to detect textureless 3D object instances from different viewpoints based on templates [7]. Each object is represented as a set of templates, relying on local dominant gradient orientations to build a representation of the input images and the templates. Extracting the dominant orientations is useful to tolerate small translations and deformations. It is fast to perform and, most of the time, discriminant enough to avoid generating too many false positive detections.
However, we noticed that this approach degrades significantly when the gradient orientations are disturbed by stronger gradients of different orientations coming from background clutter in the input images. In practice, this often happens in the neighborhood of the silhouette of an object, which is unfortunate as the silhouette is a very important cue especially for textureless objects. The method we propose in this paper does not suffer from this problem while running at the same speed. Additionally, we show how to extend our approach to handle 3D surface normals at the same time if a dense depth sensor like the Kinect is available. As we will see, this increases the robustness significantly.
3 PROPOSED APPROACH
In this section, we describe our template representation and show how a new representation of the input image can be built and used to parse the image to quickly find objects. We will start by deriving our similarity measure, emphasizing the contribution of each aspect of it. We also show how we implement our approach to efficiently use modern processor architectures. Additionally, we demonstrate how to integrate depth data to increase robustness if a dense depth sensor is available.
Fig. 2. A toy duck with different modalities. Left: Strong and discriminative image gradients are mainly found on the contour. The gradient location \(r_i\) is displayed in pink. Middle: If a dense depth sensor is available, we can also make use of 3D surface normals, which are mainly found on the body of the duck. The normal location \(r_k\) is displayed in pink. Right: The combination of 2D image gradients and 3D surface normals leads to an increased robustness (see Section 3.9). This is due to the complementarity of the visual cues: Gradients are usually found on the object contour, while surface normals are found on the object interior.
3.1 Similarity Measure
Our unoptimized similarity measure can be seen as the measure defined by Steger in [18] modified to be robust to small translations and deformations. Steger suggests using
\[
\varepsilon_{\text{Steger}}(\mathcal{I},\mathcal{T},c) = \sum_{r \in \mathcal{P}} \left| \cos\!\left(\operatorname{ori}(\mathcal{O}, r) - \operatorname{ori}(\mathcal{I}, c+r)\right) \right| , \qquad (1)
\]
where \(\operatorname{ori}(\mathcal{O}, r)\) is the gradient orientation in radians at location \(r\) in a reference image \(\mathcal{O}\) of an object to detect. Similarly, \(\operatorname{ori}(\mathcal{I}, c+r)\) is the gradient orientation at \(c\) shifted by \(r\) in the input image \(\mathcal{I}\). We use a list, denoted by \(\mathcal{P}\), to define the locations \(r\) to be considered in \(\mathcal{O}\). This way we can deal with arbitrarily shaped objects efficiently. A template \(\mathcal{T}\) is therefore defined as a pair \(\mathcal{T} = (\mathcal{O},\mathcal{P})\).
Each template \(\mathcal{T}\) is created by extracting a small set of its most discriminant gradient orientations from the corresponding reference image, as shown in Fig. 2, and by storing their locations. To extract the most discriminative gradients we consider the strength of their norms. In this selection process, we also take the location of the gradients into account to avoid an accumulation of gradient orientations in one local area of the object while the rest of the object is not sufficiently described. If a dense depth sensor is available, we can extend our approach with 3D surface normals, as shown on the right side of Fig. 2.
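As an illustration of this selection step, the following sketch (not the authors' implementation; a minimal numpy version with assumed defaults such as `num_features` and `min_dist`) greedily picks the strongest gradients while enforcing a minimum spacing between them, so that the template locations \(\mathcal{P}\) cover the whole object:

```python
import numpy as np

def extract_template(magnitude, ori, mask, num_features=63, min_dist=5):
    """Greedily select strong, spatially spread gradient locations on the object.

    magnitude : (H, W) gradient magnitudes of the reference image O
    ori       : (H, W) gradient orientations in radians
    mask      : (H, W) bool array, True on the object
    Returns the list P of (x, y, orientation) tuples that, together with O,
    forms the template T = (O, P).
    """
    ys, xs = np.nonzero(mask & (magnitude > 0))
    order = np.argsort(-magnitude[ys, xs])          # strongest gradients first
    features = []
    for idx in order:
        x, y = int(xs[idx]), int(ys[idx])
        # Enforce a minimum distance to the already selected locations so the
        # template is not dominated by a single local area of the object.
        if all((x - fx) ** 2 + (y - fy) ** 2 >= min_dist ** 2 for fx, fy, _ in features):
            features.append((x, y, float(ori[y, x])))
            if len(features) == num_features:
                break
    return features
```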
Considering only the gradient orientations and not their norms makes the measure robust to contrast changes, and taking the absolute value of the cosine allows it to correctly handle object occluding boundaries: It will not be affected if the object is over a dark background or a bright background.
The similarity measure of (1) is very robust to background clutter, but not to small shifts and deformations. A common solution is to first quantize the orientations and to use local histograms as in SIFT [6] or HOG [1]. However, this can be unstable when strong gradients appear in the background. In DOT [7], we kept the dominant orientations of a region. This was faster than building histograms, but suffers from the same instability. Another option is to apply Gaussian convolution to the orientations as in DAISY [31], but this would be too slow for our purpose. We therefore propose a more efficient solution. We introduce a similarity measure that, for each gradient orientation on the object, searches in a neighborhood of the associated gradient location for the most similar orientation in the input image. This can be formalized as
\[
\varepsilon(\mathcal{I},\mathcal{T},c) = \sum_{r \in \mathcal{P}} \left( \max_{t \in \mathcal{R}(c+r)} \left| \cos\!\left(\operatorname{ori}(\mathcal{O}, r) - \operatorname{ori}(\mathcal{I}, t)\right) \right| \right) , \qquad (2)
\]
where \(\mathcal{R}(c+r) = [c+r-\frac{T}{2},c+r+\frac{T}{2}] \times [c+r-\frac{T}{2},c+r+\frac{T}{2}]\) defines the neighborhood of size \(T\) centered on location \(c+r\) in the input image. Thus, for each gradient we align the local neighborhood exactly to the associated gradient location, whereas in DOT, the gradient orientation is adjusted only to some regular grid. We show below how to compute this measure efficiently.
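For reference, before any of the optimizations of the following sections, a direct (and deliberately slow) transcription of (2) could look as follows. This is only a sketch under the assumption that `template` is the list of (x, y, orientation in radians) pairs produced above and that `ori_map` holds the gradient orientations of the input image in radians:

```python
import numpy as np

def similarity_naive(ori_map, valid, template, c, T=8):
    """Direct evaluation of Eq. (2) at image location c = (cx, cy).

    ori_map  : (H, W) gradient orientations of the input image, in radians
    valid    : (H, W) bool mask, True where a reliable gradient was extracted
    template : list of (rx, ry, ori) locations/orientations relative to the anchor
    """
    H, W = ori_map.shape
    cx, cy = c
    score = 0.0
    for rx, ry, ori_t in template:
        best = 0.0
        # Search the neighborhood R(c + r) for the most similar orientation.
        for dy in range(-T // 2, T // 2 + 1):
            for dx in range(-T // 2, T // 2 + 1):
                x, y = cx + rx + dx, cy + ry + dy
                if 0 <= x < W and 0 <= y < H and valid[y, x]:
                    best = max(best, abs(np.cos(ori_t - ori_map[y, x])))
        score += best
    return score
```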
3.2 Computing the Gradient Orientations
Before we continue with our approach, we shortly discuss why we use gradient orientations and how we extract them easily.
We chose to consider image gradients because they proved to be more discriminant than other forms of representations [6], [18] and are robust to illumination change and noise. Additionally, image gradients are often the only reliable image cue when it comes to textureless objects. Considering only the orientation of the gradients and not their norms makes the measure robust to contrast changes, and taking the absolute value of the cosine between them allows it to correctly handle object occluding boundaries: It will not be affected if the object is over a dark background or a bright background.
To increase robustness, we compute the orientation of the gradients on each color channel of our input image separately and, for each image location, use the gradient orientation of the channel whose magnitude is largest, as done in [1], for example. Given an RGB color image \(\mathcal{I}\), we compute the gradient orientation map \(\mathcal{I}_{\mathcal{G}}(x)\) at location \(x\) with
\[
\mathcal{I}_{\mathcal{G}}(x) = \operatorname{ori}\!\left(\hat{\mathcal{C}}(x)\right) , \qquad (3)
\]
where
\[
\hat{\mathcal{C}}(x) = \operatorname*{arg\,max}_{\mathcal{C} \in \{R,G,B\}} \left\| \frac{\partial \mathcal{C}}{\partial x} \right\| , \qquad (4)
\]
and \(R\), \(G\), \(B\) are the RGB channels of the corresponding color image.
In order to quantize the gradient orientation map we omit the gradient direction, consider only the gradient orientation, and divide the orientation space into \(n_0\) equal spacings, as shown in Fig. 3. To make the quantization robust to noise, we assign to each location the gradient whose quantized orientation occurs most often in a \(3{\times}3\) neighborhood. We also keep only the gradients whose norms are larger than a small threshold. The whole unoptimized process takes about 31 ms on the CPU for a VGA image.
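The following sketch summarizes this preprocessing step in numpy (a simplified illustration, not the optimized implementation; the magnitude threshold and the use of `np.gradient` instead of a Sobel filter are assumptions):

```python
import numpy as np

def quantized_orientations(rgb, n_0=8, mag_threshold=10.0):
    """Per-pixel quantized gradient orientation of a color image (Section 3.2).

    For every pixel we keep the gradient of the R, G or B channel with the
    largest magnitude, discard its direction (orientations live in [0, pi)),
    quantize the angle into n_0 bins, and run a 3x3 majority vote to
    suppress quantization noise.  Pixels with weak gradients are set to -1.
    """
    rgb = rgb.astype(np.float32)
    grads = [np.gradient(rgb[..., c]) for c in range(3)]          # (gy, gx) per channel
    gy = np.stack([g[0] for g in grads], -1)
    gx = np.stack([g[1] for g in grads], -1)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    best = np.argmax(mag, axis=-1)                                # strongest channel
    take = lambda a: np.take_along_axis(a, best[..., None], -1)[..., 0]
    gx, gy, mag = take(gx), take(gy), take(mag)
    ori = np.mod(np.arctan2(gy, gx), np.pi)                       # ignore gradient direction
    q = np.minimum((ori / np.pi * n_0).astype(np.int32), n_0 - 1)

    # 3x3 majority vote over the quantized orientations of strong gradients.
    strong = mag > mag_threshold
    votes = np.zeros(q.shape + (n_0,), dtype=np.int32)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            nq = np.roll(np.roll(q, dy, 0), dx, 1)
            ns = np.roll(np.roll(strong, dy, 0), dx, 1)
            for k in range(n_0):
                votes[..., k] += (nq == k) & ns
    quantized = np.argmax(votes, axis=-1)
    quantized[~strong] = -1                                       # keep only reliable gradients
    return quantized
```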
Fig. 3. Upper left: Quantizing the gradient orientations: The pink orientation is closest to the second bin. Upper right: A toy duck with a calibration pattern. Lower left: The gradient image computed on a gray value image. The object contour is hardly visible. Lower right: Gradients computed with our method. Details of the object contours are clearly visible.
3.3 Spreading the Orientations
In order to avoid evaluating the max operator in (2) every time a new template must be evaluated against an image location, we first introduce a new binary representation, denoted by \(\mathcal{J}\), of the gradients around each image location. We will then use this representation together with lookup tables to efficiently precompute these maximal values.
Fig. 4. Spreading the gradient orientations. Left: The gradient orientations and their binary code. We do not consider the direction of the gradients. a) The gradient orientations in the input image, shown in orange, are first extracted and quantized. b) Then, the locations around each orientation are also labeled with this orientation, as shown by the blue arrows. This allows our similarity measure to be robust to small translations and deformations. c) \(\mathcal{J}\) is an efficient representation of the orientations after this operation, and can be computed very quickly. For this figure, \(T=3\) and \(n_0=5\). In practice, we use \(T=8\) and \(n_0=8\).
The computation of \(\mathcal{J}\) is depicted in Fig. 4. We first quantize orientations into a small number of \(n_0\) values as done in previous approaches [6], [1], [7]. This allows us to “spread” the gradient orientations \(\operatorname{ori}(\mathcal{I}, t)\) of the input image \(\mathcal{I}\) around their locations to obtain a new representation of the original image.
For efficiency, we encode the possible combinations of orientations spread to a given image location \(m\) using a binary string: Each individual bit of this string corresponds to one quantized orientation, and is set to 1 if this orientation is present in the neighborhood of \(m\). The strings for all the image locations form the image \(\mathcal{J}\) on the right part of Fig. 4. These strings will be used as indices to access lookup tables for fast precomputation of the similarity measure, as it is described in the next section.
\(\mathcal{J}\) can be computed very efficiently: We first compute a map for each quantized orientation, whose values are set to 1 if the corresponding pixel location in the input image has this orientation and 0 if it does not. \(\mathcal{J}\) is then obtained by shifting these maps over the range of \([-\frac{T}{2},+\frac{T}{2}] \times [-\frac{T}{2},+\frac{T}{2}]\) and merging all shifted versions with an OR operation.
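A compact numpy sketch of this step, under the same assumptions as before (quantized orientations in [0, n_0), -1 where no reliable gradient was found), could be:

```python
import numpy as np

def spread_orientations(quantized, n_0=8, T=8):
    """Compute the binarized, spread orientation image J of Section 3.3.

    Bit i of J[y, x] is set if quantized orientation i appears within the
    spreading neighborhood of pixel (y, x).
    """
    H, W = quantized.shape
    J = np.zeros((H, W), dtype=np.uint8 if n_0 <= 8 else np.uint16)
    for i in range(n_0):
        binary = (quantized == i)                 # map of orientation i
        spread = np.zeros((H, W), dtype=bool)
        # OR together all shifts in [-T/2, T/2] x [-T/2, T/2].
        for dy in range(-T // 2, T // 2 + 1):
            for dx in range(-T // 2, T // 2 + 1):
                y0, y1 = max(dy, 0), H + min(dy, 0)
                x0, x1 = max(dx, 0), W + min(dx, 0)
                spread[y0:y1, x0:x1] |= binary[y0 - dy:y1 - dy, x0 - dx:x1 - dx]
        J |= (spread.astype(J.dtype) << i)        # set bit i where orientation i was spread
    return J
```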
3.4 Precomputing Response Maps
As shown in Fig. 5, \(\mathcal{J}\) is used together with lookup tables to precompute the value of the max operation in (2) for each location and each possible orientation \(\operatorname{ori}(\mathcal{O}, r)\) in the template. We store the results into 2D maps \(\mathcal{S}_i\). Then, to evaluate the similarity function, we will just have to sum values read from these \(\mathcal{S}_i\)s.
Fig. 5. Precomputing the response maps \(\mathcal{S}_i\) . Left: There is one response map for each quantized orientation. They store the maximal similarity between their corresponding orientation and the orientations \(\operatorname{ori}_j\) already stored in the “Invariant Image.” Right: This can be done very efficiently by using the binary representation of the list of orientations in \(\mathcal{J}\) as an index to lookup tables of the maximal similarities.
Fig. 6. Restructuring the way the response images \(\mathcal{S}_i\) are stored in memory. The values of one image row that are \(T\) pixels apart on the \(x\)-axis are stored next to each other in memory. Since we have \(T^2\) such linear memories per response map and \(n_0\) quantized orientations, we end up with \(T^2{\cdot}n_0\) different linear memories.
We use a lookup table \(\tau_i\) for each of the \(n_0\) quantized orientations, computed offline as
\[
\tau_i[\mathcal{L}] = \max_{l \in \mathcal{L}} \left| \cos(i - l) \right| , \qquad (5)
\]
where
- \(i\) is the index of the quantized orientations; to keep the notations simple, we also use \(i\) to represent the corresponding angle in radians;
- \(\mathcal{L}\) is a list of orientations appearing in a local neighborhood of a gradient with orientation \(i\) as described in Section 3.3. In practice, we use the integer value corresponding to the binary representation of \(\mathcal{L}\) as an index to the element in the lookup table.
For each orientation \(i\), we can now compute the value at each location \(c\) of the response map \(\mathcal{S}_i\) as
\[
\mathcal{S}_i(c) = \tau_i\!\left[\mathcal{J}(c)\right] . \qquad (6)
\]
Finally, the similarity measure of (2) can be evaluated as
\[
\varepsilon(\mathcal{I},\mathcal{T},c) = \sum_{r \in \mathcal{P}} \mathcal{S}_{\operatorname{ori}(\mathcal{O}, r)}(c+r) . \qquad (7)
\]
Since the maps \(\mathcal{S}_i\) are shared between the templates, matching several templates against the input image can be done very fast once the maps are computed.
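The precomputation and the final evaluation can be sketched as follows (a minimal illustration; the bin-center angles and the byte scaling to [0, 100] are assumptions, and the template orientations are assumed to be quantized into the same n_0 bins):

```python
import numpy as np

def precompute_response_maps(J, n_0=8):
    """Build the lookup tables tau_i and the response maps S_i (Section 3.4)."""
    # tau[i][L]: best |cos| between orientation i and any orientation whose bit
    # is set in the binary string L, stored as a byte for SSE-friendly additions.
    angles = (np.arange(n_0) + 0.5) * np.pi / n_0
    tau = np.zeros((n_0, 1 << n_0), dtype=np.uint8)
    for L in range(1, 1 << n_0):
        present = [l for l in range(n_0) if (L >> l) & 1]
        for i in range(n_0):
            tau[i, L] = int(round(100 * max(abs(np.cos(angles[i] - angles[l]))
                                            for l in present)))
    # S_i(c) = tau_i[J(c)] is a simple per-pixel table lookup.
    return [tau[i][J] for i in range(n_0)]

def similarity(S, template, c):
    """Evaluate Eq. (7): sum the response maps at the shifted template locations.

    Assumes c + r stays inside the image for every template location r.
    """
    cx, cy = c
    return sum(int(S[ori_i][cy + ry, cx + rx]) for rx, ry, ori_i in template)
```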
3.5 Linearizing the Memory for Parallelization
Thanks to (7), we can match a template against the whole input image by only adding the values in the response maps \(\mathcal{S}_i\) . However, one of the advantages of spreading the orientations as was done in Section 3.3 is that it is sufficient to do the evaluation only every \(T\)th pixel without reducing the recognition performance. If we want to exploit this property efficiently, we have to take into account the architecture of modern computers.
Modern processors do not read only one data value at a time from the main memory but several ones simultaneously, called a cache line. Accessing the memory at random places results in a cache miss and slows down the computations. On the other hand, accessing several values from the same cache line is very cheap. As a consequence, storing data in the same order as they are read speeds up the computations significantly. In addition, this allows parallelization: For instance, if 8-bit values are used as it is the case for our \(\mathcal{S}_i\) maps, SSE instructions can perform operations on 16 values in parallel. On multicore processors or on the GPU, even more operations can be performed simultaneously. For example, the NVIDIA Quadro GTX 590 can perform 1,024 operations in parallel.
Therefore, as shown in Fig. 6, we store the precomputed response maps \(\mathcal{S}_i\) into memory in a cache-friendly way: We restructure each response map so that the values of one row that are \(T\) pixels apart on the \(x\)-axis are now stored next to each other in memory. We continue with the row which is \(T\) pixels apart on the \(y\)-axis once we finished with the current one.
Finally, as described in Fig. 7, computing the similarity measure for a given template at each sampled image location can be done by adding the linearized memories with an appropriate offset computed from the locations \(r\) in the templates.
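The following sketch mimics this layout in numpy (only an illustration of the indexing scheme, without SSE; it assumes the template locations are non-negative, i.e., the anchor is the top-left corner, and it accepts slightly wrong scores for anchors near the image border, where the flat addressing wraps across rows):

```python
import numpy as np

def linearize(S_i, T=8):
    """Restructure one response map into T*T 'linear memories' (Fig. 6).

    Linear memory (ty, tx) holds, in row-major order, the values of S_i at all
    locations congruent to (ty, tx) modulo T, so matching a template on the
    coarse T-pixel grid reads each memory contiguously.
    """
    H, W = S_i.shape
    H, W = H - H % T, W - W % T                      # crop to a multiple of T
    S_i = S_i[:H, :W]
    memories = {(ty, tx): S_i[ty::T, tx::T].ravel()
                for ty in range(T) for tx in range(T)}
    return memories, (H // T, W // T)

def match_template(linear_maps, template, T=8):
    """Accumulate Eq. (7) over every T-th pixel using the linear memories (Fig. 7)."""
    gh, gw = linear_maps[0][1]                       # coarse grid shape
    score = np.zeros(gh * gw, dtype=np.int32)
    for rx, ry, ori_i in template:
        memories, _ = linear_maps[ori_i]
        mem = memories[(ry % T, rx % T)]
        offset = (ry // T) * gw + (rx // T)          # constant shift inside the flat memory
        score[:mem.size - offset] += mem[offset:]
    return score.reshape(gh, gw)                     # similarity map at every T-th pixel

# Usage: S = precompute_response_maps(J); linear_maps = [linearize(S_i) for S_i in S]
# score = match_template(linear_maps, template)
```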
Fig. 7. Using the linear memories. We can compute the similarity measure over the input image for a given template by adding up the linear memories for the different orientations of the template (the orientations are visualized by black arrows pointing in different directions), shifted by an offset depending on the relative locations \((r_x,r_y)^T\) in the template with respect to the anchor point. Performing these additions with parallel SSE instructions further speeds up the computation. In the final similarity map \(\varepsilon\), only each \(T\)th pixel has to be parsed to find the object.
3.6 Extension to Dense Depth Sensors
In addition to color images, recent commodity hardware like the Kinect allows to capture dense depth maps in real time. If these depth maps are aligned to the color images, we can make use of them to further increase the robustness of our approach as we have recently shown in [32].
Fig. 8. Upper left: Quantizing the surface normals: The pink surface normal is closest to the precomputed surface normal \(v_4\). It is therefore put into the same bin as \(v_4\). Upper right: A person standing in an office room. Lower left: The corresponding depth image. Lower right: Surface normals computed with our approach. Details are clearly visible and depth discontinuities are handled well. We removed the background for visibility reasons.
Similarly to the image cue, we decided to use quantized surface normals computed from a dense depth map in our template representation, as shown in Fig. 8. They allow us to represent both close and far objects while fine structures are preserved.
In the following, we propose a method for the fast and robust estimation of surface normals in a dense range image. Around each pixel location \(x\), we consider the first order Taylor expansion of the depth function \(\mathcal{D}(x)\):
\[
\mathcal{D}(x + dx) - \mathcal{D}(x) = dx^{\top} \nabla \mathcal{D} + h.o.t. \qquad (8)
\]
Within a patch defined around \(x\), each pixel offset \(dx\) yields an equation that constrains the value of \({\nabla}\mathcal{D}\), allowing us to estimate an optimal gradient \({\nabla}\hat{\mathcal{D}}\) in a least-squares sense. This depth gradient corresponds to a 3D plane going through three points \(X\), \(X_1\), and \(X_2\):
\[
X = \vec{v}(x)\,\mathcal{D}(x), \quad
X_1 = \vec{v}\!\left(x + [1, 0]^{\top}\right)\left(\mathcal{D}(x) + [1, 0]\,\nabla\hat{\mathcal{D}}\right), \quad
X_2 = \vec{v}\!\left(x + [0, 1]^{\top}\right)\left(\mathcal{D}(x) + [0, 1]\,\nabla\hat{\mathcal{D}}\right), \qquad (9)
\]
where \(\vec{v}(x)\) is the vector along the line of sight that goes through pixel \(x\) and is computed from the internal parameters of the depth sensor. The normal to the surface at the 3D point that projects on \(x\) can be estimated as the normalized cross-product of \(X_1 -X\) and \(X_2 - X\).
However, this would not be robust around occluding contours, where the first order approximation of (8) no longer holds. Inspired by bilateral filtering, we ignore the contributions of pixels whose depth difference with the central pixel is above a threshold. In practice, this approach effectively smooths out quantization noise on the surface, while still providing meaningful surface normal estimates around strong depth discontinuities. Our similarity measure is then defined as the dot product of the normalized surface normals, instead of the cosine difference for the image gradients in (2). We otherwise apply the same technique we apply to the image gradients. The combined similarity measure is simply the sum of the measure for the image gradients and the one for the surface normals.
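A minimal, unoptimized sketch of this normal estimation is given below. It assumes a pinhole depth camera with intrinsics `fx`, `fy`, `cx`, `cy`, a depth map in millimeters, and an illustrative rejection threshold; a real implementation would vectorize the per-pixel least-squares fit:

```python
import numpy as np

def depth_normals(D, fx, fy, cx, cy, radius=2, depth_diff_thresh=50.0):
    """Per-pixel surface normals from a dense depth map (Section 3.6).

    The depth gradient is fitted in a least-squares sense over a small patch,
    ignoring neighbors whose depth differs too much from the center pixel
    (the bilateral-filter-like rejection that preserves occluding contours).
    """
    H, W = D.shape
    normals = np.zeros((H, W, 3), dtype=np.float32)
    for y in range(radius, H - radius):
        for x in range(radius, W - radius):
            d0 = D[y, x]
            if d0 <= 0:                                   # invalid depth reading
                continue
            A, b = [], []
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    dd = D[y + dy, x + dx] - d0
                    if (dx or dy) and abs(dd) < depth_diff_thresh:
                        A.append((dx, dy))
                        b.append(dd)
            if len(A) < 3:
                continue
            # Least-squares depth gradient (dD/dx, dD/dy) within the patch.
            gx, gy = np.linalg.lstsq(np.array(A, float), np.array(b, float),
                                     rcond=None)[0]
            # Tangent vectors of X(u, v) = D(u, v) * ((u-cx)/fx, (v-cy)/fy, 1)
            # and their cross product give the surface normal.
            ray = np.array([(x - cx) / fx, (y - cy) / fy, 1.0])
            du = gx * ray + d0 * np.array([1.0 / fx, 0.0, 0.0])
            dv = gy * ray + d0 * np.array([0.0, 1.0 / fy, 0.0])
            n = np.cross(du, dv)
            nrm = np.linalg.norm(n)
            if nrm > 0:
                if n[2] > 0:
                    n = -n                                # make the normal face the camera
                normals[y, x] = n / nrm
    return normals
```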
To make use of our framework we have to quantize the 3D surface normals into \(n_0\) bins. This is done by measuring the angles between the computed normals and a set of \(n_0\) precomputed vectors. These vectors are arranged on a circular cone whose apex points toward the camera. To make the quantization robust to noise, we assign to each location the quantized value that occurs most often in a \(3 \times 3\) neighborhood. The whole process is very efficient and needs only 14 ms on the CPU and less than 1 ms on the GPU.
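A sketch of this quantization step follows; the half-angle of the cone is an assumption, and the 3x3 majority vote can reuse the voting loop from the gradient sketch above:

```python
import numpy as np

def quantize_normals(normals, n_0=8, cone_half_angle=np.pi / 4):
    """Assign each surface normal to one of n_0 precomputed directions.

    The n_0 reference vectors lie on a cone around the viewing direction
    (the -z axis, pointing toward the camera); each valid normal gets the
    index of the reference vector it is closest to in angle.
    """
    ks = np.arange(n_0)
    refs = np.stack([np.sin(cone_half_angle) * np.cos(2 * np.pi * ks / n_0),
                     np.sin(cone_half_angle) * np.sin(2 * np.pi * ks / n_0),
                     -np.cos(cone_half_angle) * np.ones(n_0)], axis=1)   # (n_0, 3), unit length
    dots = normals @ refs.T                        # cosine of the angle to each reference
    q = np.argmax(dots, axis=-1).astype(np.int32)
    q[np.linalg.norm(normals, axis=-1) < 0.5] = -1 # pixels without a valid normal
    return q
```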
3.7 Computation Time Study
In this section, we compare the numbers of operations required by the original method from [18] and the method we propose.
The time required by \({\varepsilon}_{Steger}\) from [18] to evaluate \(R\) templates over an \(M{\times}N\) image is \(M{\cdot}N{\cdot}R{\cdot}G{\cdot}(S+A)\), with \(G\) the average number of gradients in a template, \(S\) the time to evaluate the similarity function between two gradient orientations, and \(A\) the time to add two values.
Changing \({\varepsilon}_{Steger}\) to (2) and making use of \(\mathcal{J}\) leads to a computation time of \(M{\cdot}N{\cdot}T^2{\cdot}O + \frac{M{\cdot}N}{T^2}{\cdot}R{\cdot}G(L+A)\), where \(L\) is the time needed for accessing once the lookup tables \(\tau_{i}\) and \(O\) is the time to \(OR\) two values together. The first term corresponds to the time needed to compute \(\mathcal{J}\) , the second one to the time needed to actually compute (2).
Precomputing the response maps \(\mathcal{S}_{i}\) further changes the complexity of our approach to \(M{\cdot}N{\cdot}(T^2{\cdot}O +n_0{\cdot}L)+ \frac{M{\cdot}N}{T^2}{\cdot}R{\cdot}G{\cdot}A\).
Linearizing our memory allows the additional use of parallel SSE instructions. In order to run 16 operations in parallel, we approximate the response values in the lookup tables using bytes. The final complexity of our algorithm is then \(M{\cdot}N{\cdot}(T^2{\cdot}O +(n_0+1){\cdot}L)+ \frac{M{\cdot}N}{16T^2}{\cdot}R{\cdot}G{\cdot}A\).
In practice, we use \(T=8\), \(M=480\), \(N=640\), \(R > 1,000\), \(G{\approx}100\), and \(n_0=8\). If we assume for simplicity that \(L{\approx}A{\approx}O{\approx}1\) time unit, this leads to a speed improvement compared to the original energy formulation \({\varepsilon}_{Steger}\) of a factor \(T^2{\cdot}16(1+S)\) if we assume that the number of templates \(R\) is large. Note that we did not incorporate the cache friendliness of our approach since it is very hard to model. Still, since [18] evaluates the similarity of two orientations with the normalized dot product of the two corresponding gradients, \(S\) can be set to 3 and we obtain a theoretical gain in speed of at least a factor of 4,096.
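As a quick sanity check of these numbers, dropping the shared preprocessing term (which is negligible once \(R\) is large) and taking the ratio of the two dominant terms gives
\[
\frac{M{\cdot}N{\cdot}R{\cdot}G{\cdot}(S+A)}{\frac{M{\cdot}N}{16\,T^2}{\cdot}R{\cdot}G{\cdot}A}
= 16\,T^2\,\frac{S+A}{A}
= 16{\cdot}T^2(1+S)
= 16 \cdot 64 \cdot 4 = 4{,}096
\]
for \(T=8\), \(S=3\), and \(A=1\).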
3.8 Experimental Validation
We compared our approach, which we call LINE (for LINEarizing the memory), to DOT [7], HOG [1], TLD [20], and the Steger method [18]. For these experiments we used three different variations of LINE: LINE-2D, which uses the image gradients only, LINE-3D, which uses the surface normals only, and LINE-MOD, for multimodal, which uses both.
Fig. 9. Combining many modalities results in a more discriminative response function. Here, we compare LINE-MOD against LINE-2D on the shown image. We plot the response function of both methods with respect to the true location of the monkey. One can see that the response of LINE-MOD exhibits a single and discriminative peak, whereas LINE-2D has several peaks which are of comparable height. This is one explanation why LINE-MOD works better and produces fewer false positives.
DOT is a representative for fast template matching while HOG and Steger stand for slow but very robust template matching. In contrast to them, TLD represents very recent insights in online learning.
Instead of gray value gradients, we used the color gradient method of Section 3.2 for DOT, HOG, and Steger, which resulted in improved recognition. Moreover, we used the authors' implementations of DOT and TLD. For the approach of Steger we used our own implementation with four pyramid levels. For HOG, we also used our own optimized implementation and replaced the Support Vector Machine mentioned in the original work [1] by a nearest neighbor search. In this way, we can use it as a robust representation and quickly learn new templates as with the other methods.
The experiments were performed on one processor of a standard notebook with an Intel Centrino Processor Core2Duo with 2.4 GHz and 3 GB of RAM. For obtaining the image and the depth data we used the Primesense PSDK 5.0 device.
Fig. 10. Left: Our new approach runs in real time and can parse a \(640{\times}480\) image with over 3,000 templates at about 10 fps. Middle: Our new approach is linear with respect to occlusion. Right: Average recognition score for the six objects of Section 3.9 with respect to occlusion.
3.9 Robustness
We used six sequences made of over 2,000 real images each. Each sequence presents illumination and large viewpoint changes over a heavily cluttered background. Ground truth is obtained with a calibration pattern attached to each scene that enables us to know the actual location of the object. The templates were learned over a homogeneous background.
We consider the object to be correctly detected if the location given back is within a fixed radius of the ground truth position.
Fig. 11. Comparison of LINE-2D, which is based on gradients, LINE-3D, which is based on normals, and LINE-MOD, which uses both cues, to DOT [7], HOG [1], Steger [18], and TLD [20] on real 3D objects. Each row corresponds to a different sequence (made of over 2,000 images each) on heavily cluttered background: a monkey, a duck, and a camera. The approaches were trained on a homogeneous background. Left: Percentage of true positives plotted against the average percentage of false positives. LINE-2D outperforms all other image-based approaches when considering the combination of robustness and speed. If a dense depth sensor is available, we can extend LINE to 3D surface normals, resulting in LINE-3D and LINE-MOD. LINE-MOD provides about the same recognition rates for all objects, while the other approaches have a much larger variance depending on the object type. LINE-MOD outperforms the other approaches in most cases. Middle: The distribution of true and false positives plotted against the threshold. In the case of LINE-MOD, they are better separable from each other than in the case of LINE-2D. Right: One sample image of the corresponding sequence shown with the object detected by LINE-MOD.
Fig. 12. The same experiments as shown in Fig. 11, for different objects: a cup, a toy car, and a hole punch. These values were evaluated on 2,000 images for each object.
As we can see in the left columns of Figs. 11 and 12, LINE-2D mostly outperforms all other image-based approaches. The only exception is the method of Steger, which gives similar results. This is because our approach and the one of Steger use similar score functions. However, the advantage of our method in terms of computation times is very clear from Fig. 10.
The reason for the weak detection results of TLD is that while this method works well under smooth background transition, it is not suitable to detect known objects over unknown backgrounds.
If a dense depth sensor is available, we can further increase the robustness without becoming slower at runtime. This is depicted in the left columns of Figs. 11 and 12, where LINE-MOD always outperforms all the other approaches and shows only a few false positives. We believe that this is due to the complementarity of the object features, which compensate for each other's weaknesses (see Fig. 9). The depth cue alone often does not perform very well.
TABLE 1 True and False Positive Rates for Different Thresholds on the Similarity Measure of Different Methods
Sequence | LINE-MOD | LINE-3D | LINE-2D | HOG | DOT | Steger | TLD |
---|---|---|---|---|---|---|---|
Toy-Monkey (2164 pics) | 97.9%-0.3% | 86.1%-13.8% | 50.8%-49.1% | 51.8%-48.2% | 8.6%-91.4% | 69.6%-30.3% | 0.8%-99.1% |
Camera (2173 pics) | 97.5%-0.3% | 61.9%-38.1% | 92.8%-6.7% | 18.2%-81.8% | 1.9%-98.0% | 96.9%-0.4% | 53.3%-46.6% |
Toy-Car (2162 pics) | 97.7%-0.0% | 95.6%-2.5% | 96.9%-0.4% | 44.1%-55.9% | 34.0%-66.0% | 83.6%-16.3% | 0.1%-98.9% |
Cup (2193 pics) | 96.8%-0.5% | 88.3%-10.6% | 92.8%-6.0% | 81.1%-18.8% | 64.1%-35.8% | 90.2%-8.9% | 10.4%-89.6% |
Toy-Duck (2223 pics) | 97.9%-0.0% | 89.0%-10.0% | 91.7%-8.0% | 87.6%-12.4% | 78.2%-21.8% | 92.2%-7.6% | 28.0%-71.9% |
Hole punch (2184 pics) | 97.0%-0.2% | 70.0%-30.0% | 96.4%-0.9% | 92.6%-7.4% | 87.7%-12.0% | 90.3%-9.7% | 26.5%-73.4% |
In some cases, no hypotheses were given back, so the sum of true and false positives can be lower than 100 percent. LINE-2D outperforms all other image-based approaches when taking into account the combination of performance rate and speed. If a dense depth sensor is available, our LINE-MOD approach obtains very high recognition rates at the cost of almost no false positives, and outperforms all the other approaches.
The superiority of LINE-MOD becomes more obvious in Table 1: If we set the threshold for each approach to allow for a 97 percent true positive rate and only evaluate the hypothesis with the largest response, we obtain for LINE-MOD a high detection rate with a very small false positive rate. This is in contrast to LINE-2D, where the true positive rate is often over 90 percent, but the false positive rate is not negligible. The true positive rate is computed as the ratio of correct detections to the number of images; similarly, the false positive rate is the ratio of the number of incorrect detections to the number of images.
One reason for this high robustness is the good separability of the multimodal approach as shown in the middle of Figs. 11 and 12: In contrast to LINE-2D, where we have a significant overlap between true and false positives, LINE-MOD separates at a specific threshold— about 80 in our implementation—almost all true positives well from almost all false positives. This has several advantages. First, we will detect almost all instances of the object by setting the threshold to this specific value. Second, we also know that almost every returned template with a similarity score above this specific value is a true positive. Third, the threshold is always around the same value, which supports the conclusion that it might also work well for other objects.
3.10 Speed
Learning new templates only requires extracting and storing the image features (and, if used, the depth features), which is almost instantaneous. Therefore, we concentrate on runtime performance.
The runtimes given in Fig. 10 show that the general LINE approach is real-time and can parse a VGA image with over 3,000 templates at about 10 fps on the CPU. The small difference in computation times between LINE-MOD, LINE-2D, and LINE-3D comes from the slightly slower preprocessing step of LINE-MOD, which includes the two preprocessing steps of LINE-2D and LINE-3D.
DOT is initially faster than LINE but becomes slower as the number of templates increases. This is because the runtime of LINE is independent of the template size, whereas the runtime of DOT is not. Therefore, to handle larger objects DOT has to use larger templates, which makes the approach slower once the number of templates increases.
Our implementation of the method of Steger [18] is approximately 100 times slower than our LINE-MOD method. Note that we use four pyramid levels for more efficiency, which is one of the reasons for the different speed improvement given in Section 3.7, where we assumed no image pyramid.
TLD uses a tree classifier similar to [33], which is the reason why the timings stay relatively equal with respect to the number of templates. Since this paper is concerned with detection, for this experiment we consider only the detection component of TLD and not the tracking component.
3.11 Occlusion
We also tested the robustness of LINE-2D and LINE-MOD with respect to occlusion. We added synthetic noise and illumination changes to the images, incrementally occluded the six different objects of Section 3.9 and measured the corresponding response values. As expected, the similarity measures used by LINE-2D and LINE-MOD behave linearly in the percentage of occlusion, as reported in Fig. 10. This is a desirable property since it allows detection of partly occluded templates by setting the detection threshold with respect to the tolerated percentage of occlusion.
We also experimented with real scenes where we first learned our six objects in front of a homogeneous background and then added heavy 2D and 3D background clutter. For recognition we incrementally occluded the objects. We define our object as correctly recognized if the template with the highest response is found within a fixed radius of the ground truth object location. The average recognition result is displayed in Fig. 10: With 20 percent occlusion for LINE-2D and with over 30 percent occlusion for LINE-MOD we are still able to recognize objects.
3.12 Number of Templates
Fig. 17. An arbitrary object can be detected using approximately 2,000 templates. The half sphere represents the detection range in terms of tilt and inclination rotations. Additionally, in-plane rotations of \({\pm}80\) degrees and scale changes in the range \([1.0, 2.0]\) can be handled.
We discuss here the average number of templates needed to detect an arbitrary object from a large number of viewpoints. In our implementation, approximately 2,000 templates are needed to detect an object with 360 degrees of tilt rotation, 90 degrees of inclination rotation, and in-plane rotations of \({\pm}80\) degrees (tilt and inclination cover the half-sphere of Fig. 17). With the number of templates given here, the detection works for scale changes in the range of \([1.0, 2.0]\).
3.13 Examples
Fig. 14. Different textureless 3D objects are detected with LINE-2D in real time under different poses in difficult outdoor scenes with partial occlusion, illumination changes, and strong background clutter.
Fig. 15. Different textureless 3D objects are detected with LINE-2D in real time under different poses on heavily cluttered background with partial occlusion.
Fig. 16. Different textureless 3D objects detected simultaneously in real time by our LINE-MOD method under different poses on heavily cluttered background with partial occlusion.
Figs. 14, 15, and 16 show the output of our methods on textureless objects in different heavily cluttered indoor and outdoor scenes. The objects are detected under partial occlusion, drastic pose changes, and illumination changes. In Figs. 14 and 15, we only use gradient features, whereas in Fig. 16, we also use 3D normal features. Note that we could not apply LINE-MOD outdoors since the Primesense device was not able to produce a depth map under strong sunlight.
3.14 Failure Cases
Fig. 13. Typical failure cases. Motion blur can produce (a) false negative: the red car is not detected and (b) false positive: the duck is detected on the background. Similar structures can also produce false positives: (c) the monkey statue is detected on a bowl, and (d) the templates for the hole punch seen under some viewpoints are not discriminative and one is detected here on a structure with orthogonal lines.
Fig. 13 shows the limitations of our method. It tends to produce false positives and false negatives in case of motion blur. False positives and false negatives can also be produced when some templates are not discriminative enough.
4 CONCLUSION
We presented a new method that is able to detect 3D textureless objects in real time under heavy background clutter, illumination changes, and noise. We also showed that if a dense depth sensor is available, 3D surface normals can be robustly and efficiently computed and used together with 2D gradients to further increase the recognition performance. We demonstrated how to take advantage of the architecture of modern computers to build a fast but very discriminant representation of the input images that can be used to consider thousands of arbitrarily sized and arbitrarily shaped templates in real time. Additionally, we have shown that our approach outperforms state-of-the-art methods with respect to the combination of recognition rate and speed, especially in heavily cluttered environments.
ACKNOWLEDGMENTS
The authors thank Stefan Holzer and Kurt Konolige for the useful discussions and their valuable suggestions. This project was funded by the BMBF project AVILUSplus (01IM08002). P. Sturm is grateful to the Alexander-von-Humboldt Foundation for a Research Fellowship supporting his sabbatical at TU München. Nassir Navab and Vincent Lepetit are joint senior authors of this paper.
REFERENCES
[1] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[2] R. Fergus, P. Perona, and A. Zisserman, “Weakly Supervised Scale-Invariant Learning of Models for Visual Recognition,” Int’l J. Computer Vision, 2006.
[3] A. Bosch, A. Zisserman, and X. Munoz, “Image Classification Using Random Forests,” Proc. IEEE Int’l Conf. Computer Vision, 2007.
[4] V. Ferrari, F. Jurie, and C. Schmid, “From Images to Shape Models for Object Detection,” Int’l J. Computer Vision, 2009.
[5] P. Viola and M. Jones, “Fast Multi-View Face Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2003.
[6] D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” Int’l J. Computer Vision, vol. 20, no. 2, pp. 91-110, 2004.
[7] S. Hinterstoisser, V. Lepetit, S. Ilic, P. Fua, and N. Navab, “Dominant Orientation Templates for Real-Time Detection of Texture-Less Objects,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[8] M. Muja, R. Rusu, G. Bradski, and D. Lowe, “Rein—A Fast, Robust, Scalable Recognition Infrastructure,” Proc. Int’l Conf. Robotics and Automation, 2011.
[9] G. Borgefors, “Hierarchical Chamfer Matching: A Parametric Edge Matching Algorithm,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 10, no. 6, pp. 849-865, Nov. 1988.
[10] D. Huttenlocher, G. Klanderman, and W. Rucklidge, “Comparing Images Using the Hausdorff Distance,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 9, pp. 850-863, Sept. 1993.
[11] D. Gavrila and V. Philomin, “Real-Time Object Detection for ‘Smart’ Vehicles,” Proc. IEEE Int’l Conf. Computer Vision, 1999.
[12] C.F. Olson and D.P. Huttenlocher, “Automatic Target Recognition by Matching Oriented Edge Pixels,” IEEE Trans. Image Processing, vol. 6, no. 1, pp. 103-113, Jan. 1997.
[13] G. Blake, R. Dreslinski, and T. Mudge, “A Survey of Multicore Processors,” IEEE Signal Processing Magazine, vol. 26, no. 6, pp. 26-37, Nov. 2009.
[14] S. Hinterstoisser, V. Lepetit, S. Benhimane, P. Fua, and N. Navab, “Learning Real-Time Perspective Patch Rectification,” Int’l J. Computer Vision, vol. 91, pp. 107-130, 2011.
[15] W. Rucklidge, “Efficiently Locating Objects Using the Hausdorff Distance,” Int’l J. Computer Vision, vol. 24, pp. 251-270, 1997.
[16] S. Holzer, S. Hinterstoisser, S. Ilic, and N. Navab, “Distance Transform Templates for Object Detection and Pose Estimation,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[17] J. Canny, “A Computational Approach to Edge Detection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679-698, Nov. 1986.
[18] C. Steger, “Occlusion, Clutter, and Illumination Invariant Object Recognition,” Int’l Archives of Photogrammetry and Remote Sensing, vol. 34, 2002.
[19] Y. Amit, D. Geman, and X. Fan, “A Coarse-to-Fine Strategy for Multi-Class Shape Detection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 12, pp. 1606-1621, Dec. 2004.
[20] Z. Kalal, J. Matas, and K. Mikolajczyk, “P-N Learning: Bootstrapping Binary Classifiers by Structural Constraints,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[21] C. Huang, H. Ai, Y. Li, and S. Lao, “Vector Boosting for Rotation Invariant Multi-View Face Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[22] M. Ozuysal, V. Lepetit, and P. Fua, “Pose Estimation for Category Specific Multiview Object Localization,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2009.
[23] M. Stark, M. Goesele, and B. Schiele, “Back to the Future: Learning Shape Models from 3D CAD Data,” Proc. British Machine Vision Conf., 2010.
[24] J. Liebelt and C. Schmid, “Multi-View Object Class Detection with a 3D Geometric Model,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[25] H. Su, M. Sun, L. Fei-Fei, and S. Savarese, “Learning a Dense Multi-View Representation for Detection, Viewpoint Classification and Synthesis of Object Categories,” Proc. IEEE Int’l Conf. Computer Vision, 2009.
[26] M. Enzweiler, A. Eigenstetter, B. Schiele, and D.M. Gavrila, “Multi-Cue Pedestrian Classification with Partial Occlusion Handling,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[27] A. Ess, B. Leibe, and L.J.V. Gool, “Depth and Appearance for Mobile Scene Analysis,” Proc. IEEE Int’l Conf. Computer Vision, 2007.
[28] D.M. Gavrila and S. Munder, “Multi-Cue Pedestrian Detection and Tracking from a Moving Vehicle,” Int’l J. Computer Vision, vol. 73, pp. 41-59, 2007.
[29] C. Wojek, S. Walk, and B. Schiele, “Multi-Cue Onboard Pedestrian Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[30] M. Sun, G.R. Bradski, B.-X. Xu, and S. Savarese, “Depth-Encoded Hough Voting for Joint Object Detection and Shape Recovery,” Proc. European Conf. Computer Vision, 2010.
[31] E. Tola, V. Lepetit, and P. Fua, “Daisy: An Efficient Dense Descriptor Applied to Wide Baseline Stereo,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 5, pp. 815-830, May 2010.
[32] S. Hinterstoisser, C. Cagniart, S. Holzer, S. Ilic, K. Konolige, N. Navab, and V. Lepetit, “Multimodal Templates for Real-Time Detection of Texture-Less Objects in Heavily Cluttered Scenes,” Proc. IEEE Int’l Conf. Computer Vision, 2011.
[33] M. Ozuysal, M. Calonder, V. Lepetit, and P. Fua, “Fast Keypoint Online Learning and Recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 3, pp. 448-461, Mar. 2010.