Exploiting Context Knowledge
Semantic segmentation is a problem that requires the integration of information from various spatial scales. It also implies balancing local and global information. On the one hand, fine-grained or local information is crucial to achieve good pixel-level accuracy. On the other hand, it is also important to integrate information from the global context of the image to be able to resolve local ambiguities. Vanilla CNNs struggle with this balance. Pooling layers, which allow the networks to achieve some degree of spatial invariance and keep computational cost at bay, dispose of the global context information. Even purely convolutional networks without pooling layers are limited, since the receptive field of their units can only grow linearly with the number of layers. Many approaches can be taken to make CNNs aware of that global information: refinement as a post-processing step with Conditional Random Fields (CRFs), dilated convolutions, multi-scale aggregation, or even deferring the context modeling to another kind of deep network such as RNNs.

Conditional Random Fields. As we mentioned before, the inherent invariance to spatial transformations of CNN architectures limits the spatial accuracy achievable in segmentation tasks. One possible and common approach to refine the output of a segmentation system and boost its ability to capture fine-grained details is to apply a post-processing stage using a Conditional Random Field (CRF). CRFs enable the combination of low-level image information, such as the interactions between pixels [101,102], with the output of multi-class inference systems that produce per-pixel class scores. That combination is especially important to capture long-range dependencies, which CNNs fail to consider, as well as fine local details. The DeepLab models [72,73] make use of the fully connected pairwise CRF by Krähenbühl and Koltun [103,104] as a separate post-processing step in their pipeline to refine the segmentation result. It models each pixel as a node in the field and employs one pairwise term for each pair of pixels, no matter how far apart they lie (this model is known as a dense or fully connected factor graph). By using this model, both short- and long-range interactions are taken into account, rendering the system able to recover detailed structures in the segmentation that were lost due to the spatial invariance of the CNN. Despite the fact that fully connected models are usually inefficient, this model can be efficiently approximated via probabilistic inference. Fig. 11 shows the effect of this CRF-based post-processing on the score and belief maps produced by the DeepLab model. The material recognition in the wild network by Bell et al. [46] makes use of various CNNs trained to identify patches in the MINC database. Those CNNs are used in a sliding-window fashion to classify those patches. Their weights are transferred to the same networks converted into FCNs by adding the corresponding upsampling layers. The outputs are averaged to generate a probability map. Finally, the same CRF from DeepLab, but discretely optimized, is applied to predict and refine the material at every pixel. Another significant work applying a CRF to refine the segmentation of an FCN is the CRFasRNN by Zheng et al. [74]. The main contribution of that work is the reformulation of the dense CRF with pairwise potentials as an integral part of the network. By unrolling the mean-field inference steps as RNNs, they make it possible to fully integrate the CRF with an FCN and train the whole network end-to-end. This work demonstrates the reformulation of CRFs as RNNs to form part of a deep network, in contrast with Pinheiro et al. [86], who employed RNNs to model large spatial dependencies.
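To make the refinement step more concrete, the following NumPy sketch illustrates the kind of unrolled mean-field update that lies at the heart of such CRF post-processing. It is a deliberately simplified, hypothetical example: it keeps only a spatial Gaussian smoothness kernel with a Potts compatibility term, whereas DeepLab and CRFasRNN rely on efficient high-dimensional filtering that also includes an appearance (bilateral) kernel; all function and parameter names here are illustrative, not those of the original implementations.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def meanfield_refine(unary, iters=5, theta=3.0, w=1.0):
    """Simplified mean-field refinement of per-pixel class scores.

    unary: (C, H, W) array of negative log class scores from the CNN.
    Only a spatial Gaussian smoothness kernel is used; the bilateral
    (appearance) kernel of the full dense CRF is omitted for brevity.
    """
    q = np.exp(-unary)
    q /= q.sum(axis=0, keepdims=True)          # initialize Q with the CNN softmax
    for _ in range(iters):
        # message passing: filter each class probability map with a Gaussian kernel
        msg = np.stack([gaussian_filter(q[c], sigma=theta) for c in range(q.shape[0])])
        # Potts compatibility: each class is penalized by the mass of the other classes
        pairwise = w * (msg.sum(axis=0, keepdims=True) - msg)
        q = np.exp(-(unary + pairwise))
        q /= q.sum(axis=0, keepdims=True)      # normalize
    return q.argmax(axis=0)                    # refined label map
```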
Dilated Convolutions. Dilated convolutions, also named à trous convolutions, are a generalization of Kronecker-factored convolutional filters [105] which support exponentially expanding receptive fields without losing resolution. In other words, dilated convolutions are regular ones that make use of upsampled filters. The dilation rate l controls that upsampling factor. As shown in Fig. 12, stacking l-dilated convolutions makes the receptive field grow exponentially while the number of filter parameters grows only linearly. This means that dilated convolutions allow efficient dense feature extraction at any arbitrary resolution. As a side note, it is important to remark that typical convolutions are just 1-dilated convolutions. In practice, applying a dilated convolution is equivalent to dilating the filter before doing the usual convolution, that is, expanding its size according to the dilation rate while filling the empty elements with zeros. In other words, if the dilation rate is greater than one, the filter weights are matched to distant elements which are not adjacent. Fig. 13 shows examples of dilated filters. The most important works that make use of dilated convolutions are the multi-scale context aggregation module by Yu et al. [75], the already mentioned DeepLab (in its improved version) [73], and the real-time network ENet [76]. All of them use combinations of dilated convolutions with increasing dilation rates to obtain wider receptive fields with no additional cost and without overly downsampling the feature maps. Those works also show a common trend: dilated convolutions are tightly coupled to multi-scale context aggregation, as we will explain in the following section.
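As a minimal illustration of this idea, the PyTorch sketch below stacks 3x3 convolutions with exponentially increasing dilation rates so that the receptive field grows exponentially while the spatial resolution of the feature map is preserved. The layer widths and dilation schedule are illustrative assumptions and do not reproduce the exact context module of [75].

```python
import torch
import torch.nn as nn

class ContextModule(nn.Module):
    """Stack of 3x3 dilated convolutions with exponentially growing rates.

    With padding equal to the dilation rate, each layer keeps the spatial
    resolution while enlarging the receptive field.
    """
    def __init__(self, channels=64, dilations=(1, 2, 4, 8, 16)):
        super().__init__()
        layers = []
        for d in dilations:
            layers += [nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=d, dilation=d),
                       nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

features = torch.randn(1, 64, 128, 128)   # feature map from an FCN backbone
print(ContextModule()(features).shape)    # torch.Size([1, 64, 128, 128])
```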
Multi-scale Prediction. Another possible way to deal with context knowledge integration is the use of multi-scale predictions. Almost every single parameter of a CNN affects the scale of the generated feature maps. In other words, the very same architecture determines how many pixels of the input image correspond to a pixel of the feature map. This means that the filters will implicitly learn to detect features at specific scales (presumably with a certain degree of invariance). Furthermore, those parameters are usually tightly coupled to the problem at hand, making it difficult for the models to generalize to different scales. One possible way to overcome that obstacle is to use multi-scale networks, which generally make use of multiple networks that target different scales and then merge the predictions to produce a single output. Raj et al. [77] propose a multi-scale version of a fully convolutional VGG-16. That network has two paths, one that processes the input at the original resolution and another one which doubles it. The first path goes through a shallow convolutional network. The second one goes through the fully convolutional VGG-16 and an extra convolutional layer. The result of that second path is upsampled and combined with the result of the first path. That concatenated output then goes through another set of convolutional layers to generate the final output. As a result, the network becomes more robust to scale variations. Roy et al. [79] take a different approach, using a network composed of four multi-scale CNNs. Those four networks have the same architecture introduced by Eigen et al. [78]. One of those networks is devoted to finding semantic labels for the scene. That network extracts features from a progressively coarse-to-fine sequence of scales (see Fig. 14). Another remarkable work is the network proposed by Bian et al. [80]. That network is a composition of n FCNs which operate at different scales. The features extracted from the networks are fused together (after the necessary upsampling with an appropriate padding) and then they go through an additional convolutional layer to produce the final segmentation. The main contribution of this architecture is the two-stage learning process: first, each network is trained independently; then the networks are combined and the last layer is fine-tuned. This multi-scale model allows adding an arbitrary number of newly trained networks in an efficient manner.
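The following PyTorch sketch shows the general pattern behind such multi-scale prediction: the same input is processed at two resolutions, the coarse prediction is upsampled back to the fine one, and both score maps are fused by a final convolution. The tiny single-layer backbones are placeholders and do not correspond to the VGG-16 paths of Raj et al. [77].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScaleNet(nn.Module):
    """Toy two-path multi-scale predictor with late fusion of score maps."""
    def __init__(self, in_ch=3, num_classes=21):
        super().__init__()
        self.fine = nn.Conv2d(in_ch, num_classes, 3, padding=1)    # full-resolution path
        self.coarse = nn.Conv2d(in_ch, num_classes, 3, padding=1)  # half-resolution path
        self.fuse = nn.Conv2d(2 * num_classes, num_classes, 1)     # merge both predictions

    def forward(self, x):
        fine = self.fine(x)
        small = F.interpolate(x, scale_factor=0.5, mode='bilinear',
                              align_corners=False)
        coarse = self.coarse(small)
        coarse = F.interpolate(coarse, size=fine.shape[-2:], mode='bilinear',
                               align_corners=False)
        return self.fuse(torch.cat([fine, coarse], dim=1))

scores = TwoScaleNet()(torch.randn(1, 3, 224, 224))
print(scores.shape)   # torch.Size([1, 21, 224, 224])
```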
Feature Fusion. Another way of adding context information to a fully convolutional architecture for segmentation is feature fusion. This technique consists of merging a global feature (extracted from a previous layer in a network) with a more local feature map extracted from a subsequent layer. Common architectures such as the original FCN make use of skip connections to perform a late fusion by combining the feature maps extracted from different layers (see Fig. 15). Another approach is to perform early fusion. This approach is taken by ParseNet [81] in its context module: the global feature is unpooled to the same spatial size as the local feature and then they are concatenated to generate a combined feature that is used in the next layer or to learn a classifier. Fig. 16 shows a representation of that process. This feature fusion idea was continued by Pinheiro et al. in their SharpMask network [89], which introduced a progressive refinement module to incorporate features from the previous layer into the next in a top-down architecture. This work will be reviewed later since it is mainly focused on instance segmentation. In contrast to the pooling operation performed by ParseNet to incorporate global features, and in addition to dilated FCNs [72,75], pyramid pooling empirically demonstrates the capability of extracting global features through context aggregation over different regions [82]. Fig. 17 shows the Pyramid Scene Parsing Network (PSPNet), which provides a pyramid parsing module that performs feature fusion at four different pyramid scales in order to embed global context from complex scenes. The pyramid levels and the size of each level can be arbitrarily modified. The better performance of PSPNet compared to FCN-based models is attributed to three shortcomings of the latter: (1) their limited ability to collect contextual information, (2) the absence of category relationships, and (3) the fact that they do not use sub-regions. This approach achieves state-of-the-art performance on various datasets.
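A minimal sketch of such a pyramid pooling module is shown below: the feature map is average-pooled to a few fixed grid sizes, each pooled map is reduced with a 1x1 convolution, upsampled back to the original resolution, and concatenated with the input features. The bin sizes and channel counts are illustrative assumptions in the spirit of [82], not the exact PSPNet configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool the feature map at several grid sizes and fuse with the input."""
    def __init__(self, in_ch=512, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),                  # regional/global pooling
                          nn.Conv2d(in_ch, in_ch // len(bins), 1))  # channel reduction
            for b in bins)

    def forward(self, x):
        size = x.shape[-2:]
        pooled = [F.interpolate(branch(x), size=size, mode='bilinear',
                                align_corners=False)
                  for branch in self.branches]
        return torch.cat([x] + pooled, dim=1)   # fused global and local features

fused = PyramidPooling()(torch.randn(1, 512, 32, 32))
print(fused.shape)   # torch.Size([1, 1024, 32, 32])
```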
Recurrent Neural Networks. As we have seen, CNNs have been successfully applied to multi-dimensional data such as images. Nevertheless, these networks rely on hand-specified kernels, limiting the architecture to local contexts. Taking advantage of their topological structure, Recurrent Neural Networks have been successfully applied to model short- and long-term sequences. In this way, by linking together pixel-level and local information, RNNs are able to successfully model global context and improve semantic segmentation. However, one important issue is the lack of a natural sequential structure in images and the focus of standard vanilla RNN architectures on one-dimensional inputs. Building on the ReNet model for image classification [20], Visin et al. proposed an architecture for semantic segmentation called ReSeg [83], represented in Fig. 18. In this approach, the input image is processed with the first layers of the VGG-16 network [15], and the resulting feature maps are fed into one or more ReNet layers for fine-tuning. Finally, the feature maps are resized using upsampling layers based on transposed convolutions. Gated Recurrent Units (GRUs) are used in this approach since they strike a good balance between memory usage and computational power. Vanilla RNNs have problems modeling long-term dependencies, mainly due to the vanishing gradient problem. Several derived models, such as Long Short-Term Memory (LSTM) networks [106] and GRUs [107], are the state of the art in this field for avoiding such problems. Inspired by the same ReNet architecture, a novel Long Short-Term Memorized Context Fusion (LSTM-CF) model for scene labeling was proposed in [108]. In this approach, two different data sources are used: RGB and depth. The RGB pipeline relies on a variant of the DeepLab architecture [32] that concatenates features at three different scales to enrich the feature representation (inspired by [109]). The global context is modeled vertically over both the depth and the photometric data sources, concluding with a horizontal fusion in both directions over these vertical contexts. As we noticed, modeling the global context of an image is related to 2D recurrent approaches that unfold the network vertically and horizontally over the input image. Based on the same idea, Byeon et al. [85] proposed a simple 2D LSTM-based architecture in which the input image is divided into non-overlapping windows which are fed into four separate LSTM memory blocks. This work emphasizes its low computational complexity on a single-core CPU and the simplicity of the model. Another approach for capturing global information relies on using bigger input windows in order to model larger contexts. Nevertheless, this reduces the image resolution and also implies several problems regarding window overlapping. However, Pinheiro et al. [86] introduced Recurrent Convolutional Neural Networks (rCNNs), which are recurrently trained with different input window sizes, taking into account previous predictions. In this way, the predicted labels are automatically smoothed, increasing performance. Undirected cyclic graphs (UCGs) were also adopted to model image context for semantic segmentation [87]. Nevertheless, RNNs are not directly applicable to UCGs, so the solution is to decompose them into several directed acyclic graphs (DAGs). In this approach, images are processed by three different layers: a CNN that produces the image feature map, DAG-RNNs that model image contextual dependencies, and a deconvolution layer that upsamples the feature maps. This work demonstrates how RNNs can be used together with graphs to successfully model long-range contextual dependencies, outperforming state-of-the-art approaches.
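To illustrate how recurrent layers can propagate context across a 2D feature map, the sketch below implements a ReNet-style layer: a bidirectional GRU sweeps the feature map row by row, and a second bidirectional GRU sweeps the result column by column, so that every output position can in principle depend on the whole image. It is a simplified stand-in, not the actual ReSeg implementation, and the hidden size is an arbitrary choice.

```python
import torch
import torch.nn as nn

class ReNetLayer(nn.Module):
    """Horizontal then vertical bidirectional GRU sweeps over a feature map."""
    def __init__(self, in_ch, hidden=64):
        super().__init__()
        self.row_rnn = nn.GRU(in_ch, hidden, bidirectional=True, batch_first=True)
        self.col_rnn = nn.GRU(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.row_rnn(rows)                        # left-right and right-left sweep
        rows = rows.reshape(b, h, w, -1)
        cols = rows.permute(0, 2, 1, 3).reshape(b * w, h, rows.shape[-1])
        cols, _ = self.col_rnn(cols)                        # top-down and bottom-up sweep
        return cols.reshape(b, w, h, -1).permute(0, 3, 2, 1)  # back to (B, C', H, W)

ctx = ReNetLayer(in_ch=64)(torch.randn(1, 64, 32, 32))
print(ctx.shape)   # torch.Size([1, 128, 32, 32])
```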
Instance segmentation is considered the next step after semantic segmentation and, at the same time, the most challenging problem in comparison with the rest of the low-level pixel segmentation techniques. Its main purpose is to represent objects of the same class split into different instances. The automation of this process is not straightforward, since the number of instances is initially unknown and the evaluation of the predictions is not pixel-wise as in semantic segmentation. Consequently, this problem remains partially unsolved, but the interest in this field is motivated by its potential applicability. Instance labeling provides us with extra information for reasoning about occlusion situations, for counting the number of elements belonging to the same class, and for detecting a particular object for grasping in robotics tasks, among many other applications. For this purpose, Hariharan et al. [10] proposed a Simultaneous Detection and Segmentation (SDS) method to improve performance over already existing works. Their pipeline first uses a bottom-up hierarchical image segmentation and object candidate generation process called Multiscale Combinatorial Grouping (MCG) [110] to obtain region proposals. For each region, features are extracted by using an adapted version of the Region-CNN (R-CNN) [111], which is fine-tuned using the bounding boxes provided by the MCG method instead of selective search, together with region foreground features. Then, each region proposal is classified by using a linear Support Vector Machine (SVM) on top of the CNN features. Finally, and for refinement purposes, Non-Maximum Suppression (NMS) is applied to the previous proposals. Later, Pinheiro et al. [88] presented the DeepMask model, an object proposal approach based on a single ConvNet. This model predicts a segmentation mask for an input patch and the likelihood of this patch containing an object. The two tasks are learned jointly and computed by a single network, sharing most of the layers except the last ones, which are task-specific. Using the DeepMask architecture as a starting point due to its effectiveness, the same authors presented SharpMask, a novel architecture for object instance segmentation implementing a top-down refinement process [89], which achieves better performance in terms of accuracy and speed. The goal of this process is to efficiently merge low-level features with high-level semantic information from upper network layers. The process consists of different refinement modules stacked together (one module per pooling layer), with the purpose of inverting the effect of pooling by generating new upsampled object encodings. Fig. 19 shows the refinement module of SharpMask. Another approach, based on Fast R-CNN as a starting point and using DeepMask object proposals instead of Selective Search, was presented by Zagoruyko et al. [90]. This combined system, called the MultiPath classifier, improved performance on the COCO dataset with three modifications to Fast R-CNN: improving localization with an integral loss, providing context by using foveal regions, and adding skip connections to give multi-scale features to the network. The system achieved a 66% improvement over the baseline Fast R-CNN. As we have seen, most of the methods mentioned above rely on existing object detectors, which limits model performance. Even so, the instance segmentation problem remains unresolved and the mentioned works are only a small part of this challenging research topic.
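As a rough illustration of the joint mask and objectness prediction described above, the following PyTorch sketch shows a DeepMask-style proposal head: a shared trunk computes features for an input patch and two task-specific branches predict a class-agnostic segmentation mask and an objectness score. The trunk, layer sizes, and mask resolution are placeholder assumptions, not the actual architecture of [88].

```python
import torch
import torch.nn as nn

class MaskProposalNet(nn.Module):
    """Shared trunk with a segmentation-mask branch and an objectness branch."""
    def __init__(self, mask_size=56):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(8), nn.Flatten())
        feat = 64 * 8 * 8
        self.mask_head = nn.Linear(feat, mask_size * mask_size)   # segmentation branch
        self.score_head = nn.Linear(feat, 1)                      # objectness branch
        self.mask_size = mask_size

    def forward(self, patch):                                     # patch: (B, 3, H, W)
        f = self.trunk(patch)
        mask = self.mask_head(f).view(-1, 1, self.mask_size, self.mask_size)
        score = self.score_head(f)
        return torch.sigmoid(mask), torch.sigmoid(score)

mask, score = MaskProposalNet()(torch.randn(2, 3, 224, 224))
print(mask.shape, score.shape)   # torch.Size([2, 1, 56, 56]) torch.Size([2, 1])
```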