
We present a residual learning framework to ease the training

of networks that are substantially deeper than those used

previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual

networks are easier to optimize, and can gain accuracy from

considerably increased depth.




An obstacle to answering this question was the notorious

problem of vanishing/exploding gradients [1, 9], which

hamper convergence from the beginning. This problem,

however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers, which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].

When many layers are stacked, vanishing/exploding gradients hamper convergence. This problem has largely been addressed by normalized initialization and intermediate normalization layers.


The degradation (of training accuracy) indicates that not

all systems are similarly easy to optimize. Let us consider a

shallower architecture and its deeper counterpart that adds

more layers onto it. There exists a solution by construction

to the deeper model: the added layers are identity mapping,

and the other layers are copied from the learned shallower

model. The existence of this constructed solution indicates

that a deeper model should produce no higher training error

than its shallower counterpart.
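The construction argument above can be checked on a toy example. This is only a minimal sketch: the two "learned" layers are hypothetical stand-ins, and `identity`, `run`, `shallow`, and `deeper` are names chosen here for illustration, not anything from the paper.

```python
# Sketch of the "solution by construction": a deeper model built by copying
# a shallow model and appending identity layers computes the same function.

def identity(x):
    """An added layer that passes its input through unchanged."""
    return x

def run(layers, x):
    """Apply a stack of layers in order."""
    for layer in layers:
        x = layer(x)
    return x

# Hypothetical stand-in for a learned shallow model: an affine layer and a ReLU.
shallow = [lambda x: 2.0 * x + 1.0, lambda x: max(x, 0.0)]

# Deeper counterpart: the copied shallow layers plus three identity layers.
deeper = shallow + [identity] * 3

inputs = [-2.0, -0.5, 0.0, 1.5]
shallow_out = [run(shallow, x) for x in inputs]
deeper_out = [run(deeper, x) for x in inputs]

# Identical outputs mean identical training error for this constructed solution.
print(shallow_out == deeper_out)
```

Since the constructed deeper model reproduces the shallow one exactly, any higher training error in a deeper plain net points to an optimization difficulty rather than a lack of capacity.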




In this paper, we address the degradation problem by

introducing a deep residual learning framework.




Identity shortcut connections add neither extra parameter nor computational

complexity. The entire network can still be trained

end-to-end by SGD with backpropagation, and can be easily

implemented using common libraries (e.g., Caffe [19])

without modifying the solvers.

Shortcut connections skip one or more layers.



We show that: 1) Our extremely deep residual nets

are easy to optimize, but the counterpart "plain" nets (that

simply stack layers) exhibit higher training error when the

depth increases; 2) Our deep residual nets can easily enjoy

accuracy gains from greatly increased depth, producing results

substantially better than previous networks.


1  Easier to optimize

2  Can gain accuracy from increased depth, i.e., the deeper the network, the higher the accuracy




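1  When F and x have the same dimensions, they are added directly (element-wise addition): $y = F(x, \{W_i\}) + x$ (Eqn. 1)


2  When F and x have different dimensions, x is first transformed by a linear projection $W_s$ and then added: $y = F(x, \{W_i\}) + W_s x$ (Eqn. 2)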
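The two shortcut cases above can be sketched in a few lines. This is an illustration under assumptions, not the paper's implementation: it uses fully connected layers on plain Python lists instead of convolutions, omits biases and the nonlinearity applied after the addition, and the names `matvec`, `residual_fn`, and `residual_block` are made up for this note.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    """Element-wise ReLU."""
    return [max(a, 0.0) for a in v]

def residual_fn(x, W1, W2):
    """F(x) = W2 * relu(W1 * x), a two-layer residual function (biases omitted)."""
    return matvec(W2, relu(matvec(W1, x)))

def residual_block(x, W1, W2, Ws=None):
    """Return F(x) + x when Ws is None, otherwise F(x) + Ws * x."""
    fx = residual_fn(x, W1, W2)
    shortcut = x if Ws is None else matvec(Ws, x)
    if len(shortcut) != len(fx):
        raise ValueError("shortcut and F(x) must have the same dimension")
    return [f + s for f, s in zip(fx, shortcut)]

x = [1.0, -2.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]

# Case 1: F(x) and x are both 2-dimensional, so x is added directly.
W2_same = [[0.5, 0.0], [0.0, 0.5]]
y_same = residual_block(x, W1, W2_same)

# Case 2: F(x) is 3-dimensional, so x goes through a linear projection Ws first.
W2_wide = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Ws = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
y_proj = residual_block(x, W1, W2_wide, Ws)

# With all-zero weights in F, the block reduces to the identity mapping.
zeros = [[0.0, 0.0], [0.0, 0.0]]
y_identity = residual_block(x, zeros, zeros)
```

The last call shows why identity mappings are easy to represent here: driving the weights of F to zero is enough, and the shortcut itself has no parameters unless a projection is needed.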





The convolutional layers mostly have 3×3 filters and

follow two simple design rules: (i) for the same output

feature map size, the layers have the same number of filters;

and (ii) if the feature map size is halved, the number

of filters is doubled so as to preserve the time complexity

per layer. We perform downsampling directly by

convolutional layers that have a stride of 2.

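The convolutional layers mostly use 3×3 filters and follow two rules: (i) layers with the same output feature map size have the same number of filters; (ii) if the feature map size is halved, the number of filters is doubled so that the time complexity per layer stays the same. Downsampling is performed directly by convolutional layers with a stride of 2. The network ends with a global average pooling layer and a 1000-way fully connected layer with softmax. The total number of weighted layers is 34.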
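Rule (ii) can be verified with a quick count of multiply-accumulate operations. The sketch below assumes a convolution whose input and output feature maps have the same spatial size; the helper `conv_macs` and the concrete sizes (56×56 with 64 filters, 28×28 with 128 filters) are illustrative choices for this note.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-accumulates of one conv layer producing a height x width map."""
    return height * width * c_in * c_out * kernel * kernel

# A 3x3 layer on a 56x56 map with 64 input and 64 output channels.
before = conv_macs(56, 56, 64, 64)

# After halving the map to 28x28 and doubling the filters to 128.
after = conv_macs(28, 28, 128, 128)

# Halving both spatial sides divides the cost by 4; doubling both the input
# and output channels multiplies it by 4, so the two effects cancel.
print(before == after)
```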


When the dimensions increase (dotted line shortcuts

in Fig. 3), we consider two options: (A) The shortcut still

performs identity mapping, with extra zero entries padded

for increasing dimensions. This option introduces no extra

parameter; (B) The projection shortcut in Eqn.(2) is used to

match dimensions (done by 1×1 convolutions). For both

options, when the shortcuts go across feature maps of two

sizes, they are performed with a stride of 2.

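(A) Identity mapping is still used, with zeros padded onto the added dimensions; this introduces no extra parameters;
(B) The projection shortcut of Eqn. (2) is used to match the dimensions (via 1×1 convolutions).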
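The difference between the two options can be made concrete with a small sketch. It rests on assumptions not spelled out in the text: option A is implemented here as stride-2 spatial subsampling followed by appending all-zero channels, feature maps are nested lists indexed `[channel][row][col]`, and `shortcut_option_a` and `projection_params` are hypothetical helper names.

```python
def shortcut_option_a(x, out_channels, stride=2):
    """Option A sketch: subsample spatially, then zero-pad the extra channels."""
    sub = [[row[::stride] for row in channel[::stride]] for channel in x]
    h, w = len(sub[0]), len(sub[0][0])
    extra = out_channels - len(sub)
    zeros = [[[0.0] * w for _ in range(h)] for _ in range(extra)]
    return sub + zeros

def projection_params(in_channels, out_channels):
    """Option B sketch: weight count of a 1x1 convolution (bias omitted)."""
    return in_channels * out_channels

# A toy input with 2 channels of size 4x4, filled with distinct values.
x = [[[float(c * 16 + r * 4 + col) for col in range(4)] for r in range(4)]
     for c in range(2)]

# Option A: 2 channels at 4x4 become 4 channels at 2x2 with no parameters.
y = shortcut_option_a(x, out_channels=4)

# Option B: going from 64 to 128 channels needs a 64 x 128 weight matrix.
params_b = projection_params(64, 128)
```

Option A keeps the shortcut parameter-free, while option B adds a learned weight for every input/output channel pair.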




posted on 2018-12-06 16:14  JP000