CNNs 在图像分割中应用简史: 从R-CNN到Mask R-CNN
版权:本文版权归作者和博客园共有
转载:欢迎转载,但未经作者同意,必须保留此段声明;必须在文章中给出原文连接;否则必究法律责任 |
目标检测算法,如R-CNN输入一张图像,确定图像中主要物体的位置和类别。
- Inputs: Image 输入是图像
- Outputs: Bounding boxes + labels for each object in the image.输出是,对于图像中的每个物体给出 Bounding boxes+标签
选择性搜索Selective Search利用多尺度的窗口搜索有纹理、颜色和强度的相邻像素。 图片来源: https://www.koen.me/research/pub/uijlings-ijcv2013-draft.pdf
- Inputs: sub-regions of the image corresponding to objects.图像对应物体的子区域
- Outputs: New bounding box coordinates for the object in the sub-region.在子区域中物体的新的边界框坐标
- 生成一组候选边界框
- 将带有候选边界框的图像输入到预训练好的AlexNet中,最后用SVM判断图像中物体在哪个框中
- 如果物体被分类了(这个框确实包含物体),就把框输入到线性模型中,输出这个框更窄tighter的坐标。
2015: Fast R-CNN -加速、简化R-CNN
- 每张图像的每个候选区域都要输入到CNN(AlexNet)中(每张图像大约2000次)
- 需要分别训练三个不同的模型-生成图像特征的CNN,预测类别的分类器(SVM)和用于缩小边界框的回归模型。
这就是Fast R-CNN做的一个技巧--RoIPool (Region of Interest Pooling)感兴趣区域池化。在上图中,每个区域的CNN特征是通过从CNN的特征图中选择对应的区域得到的,然后,每个区域的特征再经过池化(通常最大池化)这样相比于之前的每张图像需要进入CNN2000次,用这个方法后每张图像就只经过CNN一次。
Fast R-CNN的第二个insight就是在一个单独模型中,联合训练CNN、分类器和边界框回归器,R-CNN需要有不同的模型提取图像特征(CNN)、分类(SVM)和缩小边界框(回归器)。
- Inputs: Images with region proposals.带有候选区域的图像
- Outputs: Object classifications of each region along with tighter bounding boxes.每个区域的物体分类和更窄的边界框
2016: Faster R-CNN - 加速候选区域
- Inputs: Images (Notice how region proposals are not needed).图像,不需要候选区域
- Outputs: Classifications and bounding box coordinates of objects in the images.图像中物体的分类和边界框坐标。
The Region Proposal Network slides a window over the features of the CNN. At each window location, the network outputs a score and a bounding box per anchor (hence 4k box coordinates where k is the number of anchors). Source: https://arxiv.org/abs/1506.01497.
We know that the bounding boxes for people tend to be rectangular and vertical. We can use this intuition to guide our Region Proposal networks through creating an anchor of such dimensions. Image Source: http://vlm1.uta.edu/~athitsos/courses/cse6367_spring2011/assignments/assignment1/bbox0062.jpg.
- Inputs: CNN Feature Map. CNN特征图
- Outputs: A bounding box per anchor. A score representing how likely the image in that bounding box will be an object. 每个anchor一个边界框,打分代表边界框内是物体的可能性
2017: Mask R-CNN - 将Faster R-CNN拓展到像素级分割
The goal of image instance segmentation is to identify, at a pixel level, what the different objets in a scene are. Source: https://arxiv.org/abs/1703.06870.
Kaiming He, a researcher at Facebook AI, is lead author of Mask R-CNN and also a coauthor of Faster R-CNN.
- Inputs: CNN Feature Map. CNN特征图
- Outputs: Matrix with 1s on all locations where the pixel belongs to the object and 0s elsewhere (this is known as a binary mask).二值矩阵,当该像素属于物体时mask值为1,否则为0
Instead of RoIPool, the image gets passed through RoIAlign so that the regions of the feature map selected by RoIPool correspond more precisely to the regions of the original image. This is needed because pixel level segmentation requires more fine-grained alignment than bounding boxes. Source: https://arxiv.org/abs/1703.06870.
How do we accurately map a region of interest from the original image onto the feature map?
在RoIPool中我们大概选择两个像素,得到了有些不对齐的结果,然而在RoIAlign中我们避免用这样的舍入,用双线性插值得到准确的2.93。这样就可以避免RoIPool产生的不对齐。
当得到了这些masks后,Mask R-CNN用Faster R-CNN生成的分类和边界框将他们结合起来,生成精确地分割。
Mask R-CNN is able to segment as well as classify the objects in an image. Source: https://arxiv.org/abs/1703.06870.