DCNN models


Region based
  • RCNN
  • Fast RCNN
  • Faster RCNN
  • R-FCN



  1. Faster RCNN

The first five layers are the same as in the ZF network.



The size of the input image is 224*224*3. After the first convolutional layer, the size of the feature map is 110*110*96: the convolutional kernel is 7*7*3*96, where 7 and 7 are the width and height of the kernel, 3 is the number of input channels, and 96 is the number of output channels. In the Caffe framework, all data is represented as blobs, which are 4-D arrays (number * channels * height * width). The output size follows 110 = (224-7+pad)/stride+1. The first pooling layer uses a 3*3 window, and the feature map after the pooling layer is 55*55*96, and so on for the later layers. Finally, the model takes the output of conv5 (13*13*256); this feature map serves as the input of the RPN.
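The size arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula quoted in the text: the helper name `output_size` is hypothetical, the pooling stride of 2 is taken from the prototxt later in these notes, and `total_pad=1` is an assumption chosen so that the formula reproduces the 110 and 55 stated above.

```python
def output_size(in_size, kernel, stride, total_pad=0):
    """Spatial output size: (in - kernel + pad) / stride + 1, rounded down."""
    return (in_size - kernel + total_pad) // stride + 1


# conv1: 7x7 kernel, stride 2 (total_pad=1 is an assumption, see above)
conv1 = output_size(224, kernel=7, stride=2, total_pad=1)

# pool1: 3x3 window, stride 2 (total_pad=1 is again an assumption)
pool1 = output_size(conv1, kernel=3, stride=2, total_pad=1)

print(conv1, pool1)  # 110 55
```

The number of channels (96) is set by the number of kernels and does not enter this calculation.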

RPN (region proposal network)




In the paper, a 3*3 sliding window is chosen. A 3*3*256*256 convolutional kernel produces a 256-d vector for each window (the output size per window is ((3-3)+1)*((3-3)+1)*256, that is 1*1*256). Between the 256-d layer and the cls layer, a 1*1*256*18 convolutional kernel is used, which serves as a fully connected layer (a convolution whose kernel has the same size as its input is equivalent to a fully connected layer). For the reg layer, a 1*1*256*36 kernel is used. The network defined in Caffe is:

name: "ZF"
layer {
  name: 'input-data'
  type: 'Python'
  top: 'data' # "top" names an output of the layer; this layer outputs three blobs: data, the ground-truth boxes gt_boxes, and the image metadata im_info
  top: 'im_info' # each of these is stored as a matrix
  top: 'gt_boxes'
  python_param {
    module: 'roi_data_layer.layer'
    layer: 'RoIDataLayer'
    param_str: "'num_classes': 21"
  }
}
#========= conv1-conv5 ============
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data" # input: data
  top: "conv1" # output: conv1; "conv1" is the name of this layer's output data, stored in the corresponding matrix
  param { lr_mult: 1.0 }
  param { lr_mult: 2.0 }
  convolution_param {
    num_output: 96
    kernel_size: 7
    pad: 3 # conv1 pads the input by 3 pixels
    stride: 2
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "conv1"
  top: "conv1"
}
layer {
  name: "norm1"
  type: "LRN"
  bottom: "conv1"
  top: "norm1" # normalization; loosely speaking, a division
  lrn_param {
    local_size: 3
    alpha: 0.00005
    beta: 0.75
    norm_region: WITHIN_CHANNEL
    engine: CAFFE
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "norm1"
  top: "pool1"
  pooling_param {
    kernel_size: 3
    stride: 2
    pad: 1 # the pooling layer pads its input as well
    pool: MAX
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  param { lr_mult: 1.0 }
  param { lr_mult: 2.0 }
  convolution_param {
    num_output: 256
    kernel_size: 5
    pad: 2
    stride: 2
  }
}
layer {
  name: "relu2"
  type: "ReLU"
  bottom: "conv2"
  top: "conv2"
}
layer {
  name: "norm2"
  type: "LRN"
  bottom: "conv2"
  top: "norm2"
  lrn_param {
    local_size: 3
    alpha: 0.00005
    beta: 0.75
    norm_region: WITHIN_CHANNEL
    engine: CAFFE
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "norm2"
  top: "pool2"
  pooling_param {
    kernel_size: 3
    stride: 2
    pad: 1
    pool: MAX
  }
}
layer {
  name: "conv3"
  type: "Convolution"
  bottom: "pool2"
  top: "conv3"
  param { lr_mult: 1.0 }
  param { lr_mult: 2.0 }
  convolution_param {
    num_output: 384
    kernel_size: 3
    pad: 1
    stride: 1
  }
}
layer {
  name: "relu3"
  type: "ReLU"
  bottom: "conv3"
  top: "conv3"
}
layer {
  name: "conv4"
  type: "Convolution"
  bottom: "conv3"
  top: "conv4"
  param { lr_mult: 1.0 }
  param { lr_mult: 2.0 }
  convolution_param {
    num_output: 384
    kernel_size: 3
    pad: 1
    stride: 1
  }
}
layer {
  name: "relu4"
  type: "ReLU"
  bottom: "conv4"
  top: "conv4"
}
layer {
  name: "conv5"
  type: "Convolution"
  bottom: "conv4"
  top: "conv5"
  param { lr_mult: 1.0 }
  param { lr_mult: 2.0 }
  convolution_param {
    num_output: 256
    kernel_size: 3
    pad: 1
    stride: 1
  }
}
layer {
  name: "relu5"
  type: "ReLU"
  bottom: "conv5"
  top: "conv5"
}
#========= RPN ============
# This is the RPN part; everything above is the five shared convolutional layers
layer {
  name: "rpn_conv1"
  type: "Convolution"
  bottom: "conv5"
  top: "rpn_conv1"
  param { lr_mult: 1.0 }
  param { lr_mult: 2.0 }
  convolution_param {
    num_output: 256
    kernel_size: 3 pad: 1 stride: 1 # each 3*3 sliding window is mapped to a 256-d vector by a 3*3*256*256 kernel; the full output is 12*12*256
    weight_filler { type: "gaussian" std: 0.01 }
    bias_filler { type: "constant" value: 0 }
  }
}
layer {
  name: "rpn_relu1"
  type: "ReLU"
  bottom: "rpn_conv1"
  top: "rpn_conv1"
}
layer {
  name: "rpn_cls_score"
  type: "Convolution"
  bottom: "rpn_conv1"
  top: "rpn_cls_score"
  param { lr_mult: 1.0 }
  param { lr_mult: 2.0 }
  convolution_param {
    num_output: 18 # 2(bg/fg) * 9(anchors)
    kernel_size: 1 pad: 0 stride: 1 # a 1*1*256*18 kernel turns the 256-d vector into 18 outputs
    weight_filler { type: "gaussian" std: 0.01 }
    bias_filler { type: "constant" value: 0 }
  }
}
layer {
  name: "rpn_bbox_pred"
  type: "Convolution"
  bottom: "rpn_conv1"
  top: "rpn_bbox_pred"
  param { lr_mult: 1.0 }
  param { lr_mult: 2.0 }
  convolution_param {
    num_output: 36 # 4 * 9(anchors)
    kernel_size: 1 pad: 0 stride: 1 # a 1*1*256*36 kernel turns the 256-d vector into 36 outputs
    weight_filler { type: "gaussian" std: 0.01 }
    bias_filler { type: "constant" value: 0 }
  }
}
layer {
  bottom: "rpn_cls_score"
  top: "rpn_cls_score_reshape" # as noted above, this layer is 12*12*256, so the matrix is reshaped before it goes to the loss function: we need 144 sliding windows, each with its 256-d vector
  name: "rpn_cls_score_reshape"
  type: "Reshape"
  reshape_param { shape { dim: 0 dim: 2 dim: -1 dim: 0 } }
}
layer {
  name: 'rpn-data'
  type: 'Python'
  bottom: 'rpn_cls_score'
  bottom: 'gt_boxes'
  bottom: 'im_info'
  bottom: 'data'
  top: 'rpn_labels'
  top: 'rpn_bbox_targets'
  top: 'rpn_bbox_inside_weights'
  top: 'rpn_bbox_outside_weights'
  python_param {
    module: 'rpn.anchor_target_layer'
    layer: 'AnchorTargetLayer'
    param_str: "'feat_stride': 16"
  }
}
layer {
  name: "rpn_loss_cls"
  type: "SoftmaxWithLoss" # softmax loss: takes the labels and the 18 outputs of the cls layer (after the reshape) and outputs the loss value
  bottom: "rpn_cls_score_reshape"
  bottom: "rpn_labels"
  propagate_down: 1
  propagate_down: 0
  top: "rpn_cls_loss"
  loss_weight: 1
  loss_param {
    ignore_label: -1
    normalize: true
  }
}
layer {
  name: "rpn_loss_bbox"
  type: "SmoothL1Loss" # computes the value of the box regression loss
  bottom: "rpn_bbox_pred"
  bottom: "rpn_bbox_targets"
  bottom: "rpn_bbox_inside_weights"
  bottom: "rpn_bbox_outside_weights"
  top: "rpn_loss_bbox"
  loss_weight: 1
  smooth_l1_loss_param { sigma: 3.0 }
}
#========= RCNN ============
# Dummy layers so that initial parameters are saved into the output net
layer {
  name: "dummy_roi_pool_conv5"
  type: "DummyData"
  top: "dummy_roi_pool_conv5"
  dummy_data_param {
    shape { dim: 1 dim: 9216 }
    data_filler { type: "gaussian" std: 0.01 }
  }
}
layer {
  name: "fc6"
  type: "InnerProduct"
  bottom: "dummy_roi_pool_conv5"
  top: "fc6"
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  inner_product_param {
    num_output: 4096
  }
}
layer {
  name: "relu6"
  type: "ReLU"
  bottom: "fc6"
  top: "fc6"
}
layer {
  name: "fc7"
  type: "InnerProduct"
  bottom: "fc6"
  top: "fc7"
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  inner_product_param {
    num_output: 4096
  }
}
layer {
  name: "silence_fc7"
  type: "Silence"
  bottom: "fc7"
}
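
The rpn_cls_score and rpn_bbox_pred layers in the network above are 1*1 convolutions with 18 and 36 outputs. The sketch below illustrates, in plain Python, why such a layer behaves like a fully connected layer applied at every position of the feature map. The helper names (`conv1x1`, `fully_connected`) and the toy sizes are assumptions made for illustration (8 channels on a 3*4 grid stand in for 256 channels on the real feature map), and biases are left out.

```python
import random


def conv1x1(feature, weight):
    """Apply a 1x1 convolution without bias.

    feature: nested list indexed [c_in][y][x]
    weight:  nested list indexed [c_out][c_in]
    Returns a nested list indexed [c_out][y][x].
    """
    c_in, h, w = len(feature), len(feature[0]), len(feature[0][0])
    return [[[sum(weight[o][c] * feature[c][y][x] for c in range(c_in))
              for x in range(w)]
             for y in range(h)]
            for o in range(len(weight))]


def fully_connected(vector, weight):
    """Matrix-vector product: the same weights applied to a single position."""
    return [sum(w_oc * v for w_oc, v in zip(row, vector)) for row in weight]


random.seed(0)
C_IN, H, W = 8, 3, 4   # toy sizes (assumption), standing in for 256 channels
NUM_ANCHORS = 9

feature = [[[random.random() for _ in range(W)] for _ in range(H)]
           for _ in range(C_IN)]
cls_weight = [[random.random() for _ in range(C_IN)]
              for _ in range(2 * NUM_ANCHORS)]   # 2 (bg/fg) * 9 anchors = 18
reg_weight = [[random.random() for _ in range(C_IN)]
              for _ in range(4 * NUM_ANCHORS)]   # 4 coordinates * 9 anchors = 36

cls_score = conv1x1(feature, cls_weight)   # indexed [18][H][W]
bbox_pred = conv1x1(feature, reg_weight)   # indexed [36][H][W]

# At any single position, the 1x1 convolution equals a fully connected layer
# applied to the channel vector at that position.
y, x = 1, 2
column = [feature[c][y][x] for c in range(C_IN)]
fc_out = fully_connected(column, cls_weight)
assert all(abs(cls_score[o][y][x] - fc_out[o]) < 1e-12
           for o in range(2 * NUM_ANCHORS))
```

Because the same weights are reused at every position, the 18 scores and 36 box offsets are produced for all sliding windows in one pass over the feature map.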






  1. R-FCN (region-based fully convolutional network)

R-FCN is faster than Faster RCNN because of the layers that follow the ROI pooling. In Faster RCNN, ROI pooling is followed by fully connected layers that are run once per ROI. In R-FCN, no convolutional or fully connected layer follows the ROI pooling. R-FCN also uses ResNet in place of ZF. Most layers in ResNet are convolutional, and R-FCN uses it without the final pooling and fully connected layers, so the network is categorized as a fully convolutional network.

The intuition behind R-FCN is to speed up Fast RCNN by sharing computation. R-FCN uses the first 100 layers of ResNet to extract the feature map. The feature map has 2048 channels; to reduce the dimension, a 1*1*2048*1024 kernel is added. One convolutional layer is then added to produce the score maps for classification, and another convolutional layer is added to produce the bounding box regression.







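Besides the main ResNet network, there is also an RPN that generates the ROIs (region proposals). During training, the authors therefore alternate between training the RPN and R-FCN so that the two share features.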



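To introduce translation sensitivity into the fully convolutional network, the authors add a set of dedicated convolutional layers at the output of the fully convolutional network to generate position-sensitive score maps; each score map stores spatial position information about the object. An ROI pooling layer is then added, and no convolutional or fully connected layer follows it. In this way the whole network can be trained end-to-end, and the computation of all layers is shared over the entire image.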





The Caffe code: first comes the data input. Assume the output data is 1*3*600*1000, im_info is 1*3, and gt_boxes is 1*4; all dimensions below are based on this assumption.









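Next come the classification layer and the regression layer. The classification layer is a convolutional layer with 18 (2 (background/foreground) * 9 (anchors)) kernels of size 1*1, pad=0, stride=1; its output is 1*18*38*63. The regression layer is a convolutional layer with 36 (4 * 9 (anchors)) kernels of size 1*1, pad=0, stride=1; its output is 1*36*38*63.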
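
The 38*63 spatial size follows from the 600*1000 input and a feature stride of 16 (the `feat_stride` value that appears in the Faster RCNN prototxt earlier in these notes). The sketch below reproduces the shapes quoted above. The function name `rpn_output_shapes` is hypothetical, and rounding up with `ceil` is an assumption that happens to match 38 and 63; the exact size in a real network depends on how each layer rounds.

```python
import math


def rpn_output_shapes(height, width, feat_stride=16, num_anchors=9):
    """Return the blob shapes of the RPN cls/reg outputs and the anchor count."""
    # Assumption: the feature map is the input size divided by the stride, rounded up.
    feat_h = math.ceil(height / feat_stride)
    feat_w = math.ceil(width / feat_stride)
    return {
        "cls": (1, 2 * num_anchors, feat_h, feat_w),   # bg/fg score per anchor
        "reg": (1, 4 * num_anchors, feat_h, feat_w),   # 4 box offsets per anchor
        "anchors": feat_h * feat_w * num_anchors,      # total anchors on the image
    }


shapes = rpn_output_shapes(600, 1000)
print(shapes["cls"])      # (1, 18, 38, 63)
print(shapes["reg"])      # (1, 36, 38, 63)
print(shapes["anchors"])  # 21546
```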












Next comes the ROI proposal step: a softmax layer first computes the probabilities (1*2*342*63), which are then reshaped back to 1*18*38*63.
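
The 1*2*342*63 shape comes from a Caffe Reshape layer like the one in the Faster RCNN prototxt earlier in these notes (`dim: 0 dim: 2 dim: -1 dim: 0`). In a Caffe reshape spec, 0 copies the corresponding input dimension and -1 is inferred from the remaining element count. The helper below, `caffe_reshape`, is a hypothetical pure-Python imitation of that rule, not Caffe code, and the spec used for the reverse reshape, (0, 18, -1, 0), is an assumption.

```python
def caffe_reshape(bottom_shape, dims):
    """Compute the output shape of a Caffe-style reshape spec.

    0  -> copy the dimension at the same index from the input shape
    -1 -> infer this dimension so the total element count is preserved
    """
    out = [bottom_shape[i] if d == 0 else d for i, d in enumerate(dims)]
    if -1 in out:
        total = 1
        for size in bottom_shape:
            total *= size
        known = 1
        for d in out:
            if d != -1:
                known *= d
        out[out.index(-1)] = total // known
    return tuple(out)


# 18 = 2 * 9 channels are folded into 2 channels, so softmax runs over bg/fg.
softmax_in = caffe_reshape((1, 18, 38, 63), (0, 2, -1, 0))
print(softmax_in)  # (1, 2, 342, 63)

# Reverse reshape back to one channel per (class, anchor) pair.
restored = caffe_reshape(softmax_in, (0, 18, -1, 0))
print(restored)    # (1, 18, 38, 63)
```

Folding the 9 anchors into the height axis (342 = 9 * 38) leaves exactly two channels, which lets a standard softmax normalize background against foreground for each anchor.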






This layer generates rois (1*5*1*1), labels (1*1*1*1), bbox_targets (1*8*1*1), bbox_inside_weights (1*8*1*1), and bbox_outside_weights (1*8*1*1).
















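Now the ROI pooling operation begins. The upper layer has two inputs: rfcn_cls (1*1029*38*63), which is the prediction, and rois (1*5*1*1), which are the ROIs; it produces a 1*21*7*7 result. The next layer is average pooling, which gives 1*21*1*1 (cls_score); this is the voting step described in the paper.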
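
The numbers above fit together as 1029 = 21 * 7 * 7: one score map per class and per bin of a 7*7 grid. The sketch below only illustrates the final voting step, that is, averaging the 7*7 pooled bins of each class into a single score. The function name `vote` and the dummy input are assumptions, and the position-sensitive pooling that produces the 21*7*7 input is not reproduced here.

```python
NUM_CLASSES = 21   # 20 object classes + background, as in the prototxt's num_classes
K = 7              # the ROI is divided into a K x K grid of bins

# One score map per (class, bin) pair.
assert NUM_CLASSES * K * K == 1029


def vote(pooled):
    """Average the K x K bins of each class.

    pooled: nested list indexed [class][bin_y][bin_x]
    Returns one score per class.
    """
    return [sum(sum(row) for row in bins) / (len(bins) * len(bins[0]))
            for bins in pooled]


# Dummy pooled output (assumption): every bin of class c holds the value c.
pooled = [[[float(c)] * K for _ in range(K)] for c in range(NUM_CLASSES)]
cls_score = vote(pooled)

print(len(cls_score))  # 21
print(cls_score[5])    # 5.0
```

Since voting is a plain average with no learned weights, nothing after ROI pooling has to be computed by a per-ROI convolutional or fully connected layer.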












As can be seen, no convolutional or fully connected layer follows the ROI pooling layer.

  1. Regression based

  • YOLO
  • SSD


