[Papers] Semantic Segmentation Papers(1)
Tags: Paper
A summary of several semantic segmentation papers I have read: FCN, DeconvNet, SegNet, and U-Net. The DeepLab papers will be summarized in a later post.
FCN
Abstract
Proposes an end-to-end FCN that takes an image of arbitrary size as input and outputs a label map of the same size. The skip architecture in FCN combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations.
Introduction
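Uses a supervised, pretrained classification network for pixel-wise prediction.
Semantic segmentation faces an inherent tension between semantic information and location information: global information resolves what is in the image, while local information resolves where it is.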
Related Work
FCN
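FCN is the pioneering work that applied deep learning to segmentation. Although its finer skip variants (FCN-16s, FCN-8s) are trained in stages rather than in a single pass, it laid the groundwork for later networks such as U-Net, ENet, and SegNet, in particular through the idea of using deconvolution to upsample the coarse map.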
Adapting classifiers for dense prediction
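A fully connected layer can be viewed as the special case of a convolution whose kernel covers the entire feature map. Once the final fully connected layers are removed (recast as such convolutions), the network outputs a label map, and adding a spatial loss makes end-to-end dense learning possible.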
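To make the "fully connected layer as convolution" view concrete, here is a minimal NumPy sketch. The sizes (`C`, `H`, `W`, `K`) and the helper `conv_valid` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: C channels, an H x W feature map, K output classes.
C, H, W, K = 3, 4, 4, 5
fc_weight = rng.standard_normal((K, C * H * W))  # fully connected layer
conv_kernel = fc_weight.reshape(K, C, H, W)      # same weights as K filters


def conv_valid(x, kernels):
    """Unpadded, stride-1 cross-correlation of x (C, H, W) with kernels (K, C, kh, kw)."""
    k, _, kh, kw = kernels.shape
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.empty((k, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out


# On an input of the original size, the convolution reproduces the FC output.
x_small = rng.standard_normal((C, H, W))
fc_out = fc_weight @ x_small.ravel()
conv_out = conv_valid(x_small, conv_kernel)   # shape (K, 1, 1)

# On a larger input, the same weights produce a spatial map of class scores.
x_large = rng.standard_normal((C, 6, 7))
score_map = conv_valid(x_large, conv_kernel)  # shape (K, 3, 4)
```

The second call is the point of the exercise: the reshaped weights no longer require a fixed input size, and each output location scores one window of the larger input.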
Shift-and-stitch is filter rarefaction
rarefaction: making something sparse (thinning out)
à trous algorithm
FCN also mentions the dilated (à trous) convolution that DeepLab later uses.
Upsampling is backwards strided convolution
In a sense, upsampling with factor \(f\) is convolution with a fractional input stride of \(1/f\). So long as \(f\) is integral, a natural way to upsample is therefore backwards convolution (sometimes called deconvolution) with an output stride of \(f\).
This passage is a bit hard to follow.
Explanations of deconvolution:
- https://datascience.stackexchange.com/questions/6107/what-are-deconvolutional-layers
- https://github.com/vdumoulin/conv_arithmetic
stride two and padding:
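It is called a transposed convolution because it typically appears in the backward pass, where backpropagation can be carried out by multiplying by the transpose of the weight matrix. Why is it called stride 2 when the filter in the figure clearly moves with stride 1? The stride of 2 is relative to the original input (before the zeros were inserted): because a zero is inserted between neighboring pixels, the current stride of 1 corresponds to a stride of 2 on the original grid.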
Shown above is a transposed convolution. 'stride two' means stride in the corresponding original convolution is two. This is precisely why you have 1 (=2-1, 2 being the original stride) layer of zeros in between rows and columns. Transposed convolution is generally used in backward pass. It is called transposed because of the analogy with fully connected layer where you multiply with the transpose of the weight matrix during a backward pass.
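The two descriptions above (multiplying by the transposed weight matrix, and inserting zeros then sliding with stride 1) can be checked against each other on a tiny 1-D case. This is a sketch under assumed sizes; `conv_matrix` and `transposed_conv_zero_insert` are hypothetical helper names.

```python
import numpy as np


def conv_matrix(n_in, kernel, stride):
    """Matrix C such that C @ x is the unpadded, strided 1-D cross-correlation of x."""
    k = len(kernel)
    n_out = (n_in - k) // stride + 1
    C = np.zeros((n_out, n_in))
    for i in range(n_out):
        C[i, i * stride:i * stride + k] = kernel
    return C


def transposed_conv_zero_insert(y, kernel, stride):
    """Insert (stride - 1) zeros between samples, pad by k - 1, slide the flipped kernel with stride 1."""
    k = len(kernel)
    dilated = np.zeros((len(y) - 1) * stride + 1)
    dilated[::stride] = y
    padded = np.pad(dilated, k - 1)
    flipped = kernel[::-1]
    n_out = len(padded) - k + 1
    return np.array([padded[i:i + k] @ flipped for i in range(n_out)])


kernel = np.array([1.0, 2.0, 3.0])
stride = 2
C = conv_matrix(5, kernel, stride)  # forward conv: length 5 -> length 2
y = np.array([10.0, 20.0])

via_transpose = C.T @ y             # length 2 -> length 5
via_zero_insert = transposed_conv_zero_insert(y, kernel, stride)
```

Both routes give the same length-5 result, which is why a filter that visibly moves one step at a time is still described as "stride two".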
Patchwise training is loss sampling
Segmentation Architecture
The authors' fully convolutional network consists mainly of in-network upsampling and a pixelwise loss, together with the skip architecture.
Learning Deconvolution Network for Semantic Segmentation
Abstract
deep deconvolution + proposal-wise prediction
The deconvolution network is composed of deconvolution layers and upsampling (unpooling) layers.
1. Introduction
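Most existing CNN-based semantic segmentation networks apply a deconvolution based on bilinear interpolation to the coarse label map (16*16 in FCN) produced by the preceding classification network. However, the input to this deconvolution is a feature map that has already passed through convolution and pooling and has lost much of its structural detail, so deconvolution alone often does not give good results.
Some methods combine FCN with a Conditional Random Field (CRF) to address this problem.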
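For reference, the bilinear-interpolation deconvolution mentioned above amounts to a transposed convolution whose filter is a fixed bilinear kernel. The sketch below builds such a kernel in NumPy; the function name `bilinear_kernel` and the choice of a 4*4 kernel (the usual size for 2x upsampling with stride 2) are assumptions made for illustration.

```python
import numpy as np


def bilinear_kernel(size):
    """2-D bilinear interpolation kernel of shape (size, size)."""
    factor = (size + 1) // 2
    # Center of the kernel in pixel coordinates (half-pixel offset for even sizes).
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return (1 - np.abs(rows - center) / factor) * (1 - np.abs(cols - center) / factor)


kernel = bilinear_kernel(4)
```

Each row of the kernel is the 1-D profile (0.25, 0.75, 0.75, 0.25) scaled by the same profile along the other axis, so the weights fall off linearly with distance from the center.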
2. Related Work
FCN:
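Because its receptive field has a fixed size, FCN cannot classify objects that are too small, and it predicts multiple classes for objects that are too large (sizes are relative to the receptive field).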
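To get a feel for how large that fixed receptive field is, the sketch below computes it for the convolution and pooling stack of a VGG-16-style network. The layer list is an assumption (thirteen 3*3 stride-1 convolutions in blocks of 2, 2, 3, 3, 3, each block followed by 2*2 stride-2 pooling), and it stops at the last pooling layer, so it ignores the converted fully connected layers.

```python
# (kernel, stride) per layer for an assumed VGG-16-style conv/pool stack.
vgg16_layers = []
for n_convs in (2, 2, 3, 3, 3):
    vgg16_layers += [(3, 1)] * n_convs + [(2, 2)]


def receptive_field(layers):
    """Return (receptive field, cumulative stride) of one output unit."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # distance in input pixels between adjacent units
    return rf, jump


rf, total_stride = receptive_field(vgg16_layers)
```

Under these assumptions each unit after the last pooling layer sees a 212*212 input window with a stride of 32 pixels, which gives a sense of which object sizes count as "too small" or "too large".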
FCN+CRF
3. System Architecture
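The encoder is the VGG classification network; the decoder is a deconvolution network that applies unpooling and deconvolution to the feature map produced by the classification network. The network outputs a probability map giving, for each pixel, the probability of belonging to each class, from which the label of each pixel is obtained. It is worth noting up front that DeconvNet does not remove the fully connected layers of the VGG classification network. These layers hold a large number of parameters, so the trained model takes up a lot of space. For an application-level product, SegNet (covered later) is the better choice: it removes the fully connected layers, so both its training time and its memory footprint are much smaller.
Unpooling and Deconvolution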
Unpooling
What is pooling?
Pooling in convolution network is designed to filter noisy activations in a lower layer by abstracting activations in a receptive field with a single representative value.
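Pooling makes activations more robust, but it also discards the spatial information inside the receptive field. This structural information can matter a great deal for segmentation, which requires dense prediction.
How is unpooling implemented?
Record the location of the maximum activation during pooling, then use the recorded location during unpooling to place each activation back at its original position.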
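A minimal NumPy sketch of this idea, for a single-channel map and non-overlapping windows. The function names `max_pool_with_indices` and `max_unpool` are hypothetical; deep learning frameworks provide their own versions of both operations.

```python
import numpy as np


def max_pool_with_indices(x, size=2):
    """Non-overlapping max pooling that also records the flat index of each maximum."""
    h, w = x.shape
    out_h, out_w = h // size, w // size
    pooled = np.empty((out_h, out_w), dtype=x.dtype)
    indices = np.empty((out_h, out_w), dtype=np.int64)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            pooled[i, j] = window[r, c]
            indices[i, j] = (i * size + r) * w + (j * size + c)
    return pooled, indices


def max_unpool(pooled, indices, out_shape):
    """Put each pooled value back at its recorded position; everything else stays zero."""
    out = np.zeros(out_shape, dtype=pooled.dtype)
    out.flat[indices.ravel()] = pooled.ravel()
    return out


x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 1, 7, 2],
              [6, 2, 3, 1]])
pooled, indices = max_pool_with_indices(x)
restored = max_unpool(pooled, indices, x.shape)
```

The restored map has the original size but only one non-zero value per pooling window, which is the sparsity that the following deconvolution (or convolution, in SegNet) has to fill in.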
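Deconvolution
The output of unpooling is sparse; deconvolution turns it into an enlarged, dense activation map. The boundary pixels of the enlarged map are then cropped so that the feature map has the same size as the output of the preceding unpooling layer.
Unpooling and deconvolution play different roles in the network: unpooling is example-specific, while deconvolution is class-specific. Example-specific means that, for whatever object is in the image, unpooling uses the location information recorded during pooling to reconstruct that object's structure. But every pixel still has to be classified, and the object structure alone is not enough: it is surrounded by noise and by information from non-target classes. Deconvolution amplifies the information of the target class and suppresses that of the non-target classes. Combining the two, the deconvolution network on the decoder side can output a fairly accurate segmentation map.
In these two respects the decoders of DeconvNet and SegNet are structurally very similar: upsampling produces a sparse activation map, and deconv/conv layers then turn it into a dense activation map.
The visualization of the activation maps also shows that the encoder progressively abstracts the features (detail to coarse), while the decoder goes from coarse to detail.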
instance wise segmentation vs. image level segmentation
I did not really understand this part.
Training
- Batch Normalization
- Two-stage Training
- ensemble with FCN
Detailed network architecture:
Inference
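At test time, before each image is fed to the network, the authors use edge-box to generate candidate proposals so that objects can be detected at different scales. For each test image, 2000 candidates are generated first, and 50 of them are then selected by objectness score and fed to the network. The instance-wise segmentation mentioned earlier is probably related to this step; I feel the authors do not describe it in much detail.
Overall, DeconvNet's idea is fairly novel (I do not know whether SegNet borrowed from it), but the network is clearly too deep and hard to train, it keeps the fully connected layers, and it needs edge-box to generate candidate proposals, so it is not an end-to-end network. For practical use I would still recommend SegNet.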
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
Abstract
The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample.
1. Introduction
Reusing the encoder's max-pooling indices in the decoder:
- improves boundary delineation
- reduces the number of parameters enabling end-to-end training
Architecture
No fully connected layers (the encoder's parameter count drops from 134M to 14.7M)
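The two figures can be reproduced with a quick count. This is a sketch under assumptions: the standard VGG-16 convolution configuration with biases, a 7*7*512 map entering the first fully connected layer (224*224 input), and only the two 4096-unit fully connected layers counted (the final classifier layer is left out). The helper names are hypothetical.

```python
# (input channels, output channels) of the 13 conv layers in a VGG-16-style encoder.
vgg16_conv_channels = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]


def conv_params(c_in, c_out, k=3):
    """Weights plus biases of a k x k convolution."""
    return k * k * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


encoder_params = sum(conv_params(c_in, c_out) for c_in, c_out in vgg16_conv_channels)
fc_layer_params = fc_params(7 * 7 * 512, 4096) + fc_params(4096, 4096)
params_with_fc = encoder_params + fc_layer_params
```

Under these assumptions the convolutional layers alone hold about 14.7M parameters and the two fully connected layers add roughly 119.5M more, which is where almost all of the 134M comes from.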
encoder
conv + batchnorm + ReLU + max pooling(2*2)
To retain the spatial information that max pooling would otherwise discard, SegNet stores the max-pooling indices.
decoder
Upsample the feature maps using the stored max-pooling indices (which gives sparse feature maps), then apply trainable filter banks and batch normalization to densify them.
Several decoder variants are compared.
Training
- median frequency balancing
- natural frequency balancing
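As a rough illustration of median frequency balancing: each class weight is the median class frequency divided by that class's frequency, where the frequency of a class is its pixel count divided by the total pixel count of the images in which it appears. The sketch below assumes integer label maps and uses a hypothetical function name and toy data.

```python
import numpy as np


def median_frequency_weights(label_maps, num_classes):
    """Class weights = median(freq) / freq, computed over classes that occur at all."""
    pixel_count = np.zeros(num_classes)   # pixels labeled with each class
    image_pixels = np.zeros(num_classes)  # total pixels of the images containing each class
    for labels in label_maps:
        counts = np.bincount(labels.ravel(), minlength=num_classes)
        pixel_count += counts
        image_pixels += (counts > 0) * labels.size
    freq = pixel_count / np.maximum(image_pixels, 1)
    present = freq > 0
    weights = np.zeros(num_classes)
    weights[present] = np.median(freq[present]) / freq[present]
    return weights


# Toy data: class 0 dominates, classes 1 and 2 are rare.
labels_a = np.array([[0, 0, 0, 1],
                     [0, 0, 0, 1]])
labels_b = np.array([[0, 0, 2, 2],
                     [0, 0, 0, 0]])
weights = median_frequency_weights([labels_a, labels_b], num_classes=3)
```

Here the dominant class gets a weight below 1 and the rare classes get a weight of 1, so the loss is no longer driven mostly by the large class.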
Analysis
BF: boundary F1 measure
SegNet and DeconvNet are similar in that both store the max-pooling indices in the encoder and use them in the decoder to upsample the feature map. The upsampled feature map is still sparse, so it is followed by convolutional (SegNet) or deconvolutional (DeconvNet) layers to produce a better, denser feature map. They differ in that SegNet has no fully connected layers, which makes it a much lighter framework.
U-Net
Abstract
- use data augmentation to train the model
- contracting path to capture context
- symmetric expanding path enables precise localization
Introduction
- High resolution features from the contracting path are combined with the upsampled output
- overlap-tile strategy (I did not really understand this part)
- elastic deformation for augmentation
- a weighted loss to handle the borders between touching objects in the multi-class setting
Network Architecture
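The left side is the contracting path and the right side is the expansive path.
The left side uses 3*3 convolution + ReLU + 2*2 max pooling; the number of feature channels doubles at each downsampling step.
The right side uses upsampling + 2*2 convolution (which halves the number of feature channels) + concatenation with the corresponding feature map from the contracting path + 3*3 convolution + ReLU.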
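The size bookkeeping of this architecture can be sketched in a few lines. The code assumes details that the notes above do not state: two unpadded 3*3 convolutions per stage (each removing 2 pixels per side length), four pooling steps, and upsampling that exactly doubles the size. `unet_output_size` is a hypothetical helper.

```python
def unet_output_size(input_size, depth=4):
    """Side length of the output map for a square input, under the stated assumptions."""
    size = input_size
    for _ in range(depth):
        size -= 4                 # two unpadded 3x3 convolutions
        if size % 2:
            raise ValueError("feature map size must be even before 2x2 pooling")
        size //= 2                # 2x2 max pooling
    size -= 4                     # two convolutions at the bottom of the "U"
    for _ in range(depth):
        size *= 2                 # upsampling + 2x2 convolution
        size -= 4                 # two unpadded 3x3 convolutions after concatenation
    return size


output_size = unet_output_size(572)
```

With a 572*572 input this gives a 388*388 output: the segmentation map is smaller than the input, and the feature maps from the contracting path have to be cropped before concatenation.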
Training
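- energy function: a pixel-wise softmax over the final feature map combined with cross entropy, \(E = \sum_{\mathbf{x} \in \Omega} w(\mathbf{x}) \log p_{\ell(\mathbf{x})}(\mathbf{x})\)
- weight map: \(w(\mathbf{x}) = w_c(\mathbf{x}) + w_0 \exp\left(-\frac{(d_1(\mathbf{x}) + d_2(\mathbf{x}))^2}{2\sigma^2}\right)\), where \(w_c\) balances the class frequencies and \(d_1\), \(d_2\) are the distances to the border of the nearest and second-nearest cell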
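- weight initialization for every layer: Gaussian distribution with standard deviation \(\sqrt{2/N}\), where \(N\) is the number of incoming nodes of one neuron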
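As a small worked example of the initialization rule: for a 3*3 convolution whose previous layer has 64 feature channels, N = 9 * 64 = 576. The sketch below computes the standard deviation and draws one filter's worth of weights; the helper names and the fixed seed are assumptions made for illustration.

```python
import math
import random


def conv_fan_in(kernel_size, in_channels):
    """Number of incoming nodes N of one neuron in a square convolution."""
    return kernel_size * kernel_size * in_channels


def init_std(fan_in):
    """Standard deviation sqrt(2 / N) of the Gaussian initializer."""
    return math.sqrt(2.0 / fan_in)


fan_in = conv_fan_in(3, 64)
std = init_std(fan_in)

# Draw the weights of a single output filter (seed fixed only for repeatability).
rng = random.Random(0)
filter_weights = [rng.gauss(0.0, std) for _ in range(fan_in)]
```

The standard deviation shrinks as the fan-in grows, which keeps the variance of each layer's output roughly constant from layer to layer.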
Experiments
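Good results were obtained on both medical datasets.
Overall, the U-Net architecture is fairly simple and, according to the authors, well suited to small datasets: the first dataset, from the EM segmentation challenge, contains only 30 images (512*512).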