
Understanding the Effective Receptive Field in Deep Convolutional Neural Networks


Abstract

We study characteristics of receptive fields of units in deep convolutional networks. The receptive field size is a crucial issue in many visual tasks, as the output must respond to large enough areas in the image to capture information about large objects. We introduce the notion of an effective receptive field, and show that it both has a Gaussian distribution and only occupies a fraction of the full theoretical receptive field. We analyze the effective receptive field in several architecture designs, and the effect of nonlinear activations, dropout, sub-sampling and skip connections on it. This leads to suggestions for ways to address its tendency to be too small.

1 Introduction

Deep convolutional neural networks (CNNs) have achieved great success in a wide range of problems in the last few years. In this paper we focus on their application to computer vision, where they have been the driving force behind recent significant improvements of the state of the art for many tasks, including image recognition [10, 8], object detection [17, 2], semantic segmentation [12, 1], image captioning [20], and many more.

One of the basic concepts in deep CNNs is the receptive field, or field of view, of a unit in a certain layer in the network. Unlike in fully connected networks, where the value of each unit depends on the entire input to the network, a unit in convolutional networks only depends on a region of the input. This region in the input is the receptive field for that unit.


The concept of receptive field is important for understanding and diagnosing how deep CNNs work. Since anywhere in an input image outside the receptive field of a unit does not affect the value of that unit, it is necessary to carefully control the receptive field, to ensure that it covers the entire relevant image region. In many tasks, especially dense prediction tasks like semantic image segmentation, stereo and optical flow estimation, where we make a prediction for each single pixel in the input image, it is critical for each output pixel to have a big receptive field, such that no important information is left out when making the prediction.


The receptive field size of a unit can be increased in a number of ways. One option is to stack more layers to make the network deeper, which in theory increases the receptive field size linearly, as each extra layer increases the receptive field size by the kernel size. Sub-sampling, on the other hand, increases the receptive field size multiplicatively. Modern deep CNN architectures like the VGG networks [18] and Residual Networks [8, 6] use a combination of these techniques.
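For concreteness, here is a small illustrative helper (not from the paper; the layer configurations are assumed) that computes the theoretical receptive field of a stack of layers from their kernel sizes and strides, showing the linear growth from stacking and the multiplicative effect of sub-sampling:

```python
# Illustrative helper: theoretical receptive field of a stack of conv layers.
def theoretical_rf(layers):
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer adds (k - 1) * current jump to the RF
        jump *= s              # sub-sampling (stride > 1) multiplies the jump
    return rf

print(theoretical_rf([(3, 1)] * 10))   # ten stride-1 3x3 convs -> 21 (linear growth)
print(theoretical_rf([(3, 2)] * 10))   # ten stride-2 3x3 convs -> 2047 (multiplicative growth)
```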

In this paper, we carefully study the receptive field of deep CNNs, focusing on problems in which there are many output units. In particular, we discover that not all pixels in a receptive field contribute equally to an output unit's response. Intuitively it is easy to see that pixels at the center of a receptive field have a much larger impact on an output. In the forward pass, central pixels can propagate information to the output through many different paths, while the pixels in the outer area of the receptive field have very few paths to propagate their impact. In the backward pass, gradients from an output unit are propagated across all the paths, and therefore the central pixels have a much larger magnitude for the gradient from that output.

This observation leads us to study further the distribution of impact within a receptive field on the output. Surprisingly, we can prove that in many cases the distribution of impact in a receptive field distributes as a Gaussian. Note that in earlier work [20] this Gaussian assumption about a receptive field is used without justification. This result further leads to some intriguing findings, in particular that the effective area in the receptive field, which we call the effective receptive field, only occupies a fraction of the theoretical receptive field, since Gaussian distributions generally decay quickly from the center.


The theory we develop for effective receptive field also correlates well with some empirical observations. One such empirical observation is that the currently commonly used random initializations lead some deep CNNs to start with a small effective receptive field, which then grows during training. This potentially indicates a bad initialization bias.


Below we present the theory in Section 2 and some empirical observations in Section 3, which aim at understanding the effective receptive field for deep CNNs. We discuss a few potential ways to increase the effective receptive field size in Section 4.


2 Properties of Effective Receptive Fields

We want to mathematically characterize how much each input pixel in a receptive field can impact the output of a unit n layers up the network, and study how the impact distributes within the receptive field of that output unit. To simplify notation we consider only a single channel on each layer, but similar results can be easily derived for convolutional layers with more input and output channels.


Assume the pixels on each layer are indexed by (i, j), with their center at (0, 0). Denote the (i, j)-th pixel on the p-th layer as x^p_{i,j}, with x^0_{i,j} as the input to the network, and y_{i,j} = x^n_{i,j} as the output on the n-th layer. We want to measure how much each x^0_{i,j} contributes to y_{0,0}. We define the effective receptive field (ERF) of this central output unit as the region containing any input pixel with a non-negligible impact on that unit.

The measure of impact we use in this paper is the partial derivative ∂y_{0,0}/∂x^0_{i,j}. It measures how much y_{0,0} changes as x^0_{i,j} changes by a small amount; it is therefore a natural measure of the importance of x^0_{i,j} with respect to y_{0,0}. However, this measure depends not only on the weights of the network, but is in most cases also input-dependent, so most of our results will be presented in terms of expectations over the input distribution.

The partial derivative ∂y_{0,0}/∂x^0_{i,j} can be computed with back-propagation. In the standard setting, back-propagation propagates the error gradient with respect to a certain loss function. Assuming we have an arbitrary loss l, by the chain rule we have ∂l/∂x^0_{i,j} = Σ_{i',j'} (∂l/∂y_{i',j'}) · (∂y_{i',j'}/∂x^0_{i,j}).

Then to get the quantity ∂y_{0,0}/∂x^0_{i,j}, we can set the error gradient ∂l/∂y_{0,0} = 1 and ∂l/∂y_{i,j} = 0 for all i ≠ 0 and j ≠ 0, then propagate this gradient from there back down the network. The resulting ∂l/∂x^0_{i,j} equals the desired ∂y_{0,0}/∂x^0_{i,j}. Here we use the back-propagation process without an explicit loss function, and the process can be easily implemented with standard neural network tools.
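A minimal PyTorch sketch of this procedure, assuming a toy stack of random 3 × 3 convolutions with ReLU (the layer count, input size and helper name are illustrative, not the paper's models): a gradient of 1 is placed at the central output unit and back-propagated to the input without any explicit loss.

```python
import torch
import torch.nn as nn

def effective_receptive_field(n_layers=20, size=101):
    # Toy single-channel network: n_layers of 3x3 conv (padding keeps the size) + ReLU.
    layers = []
    for _ in range(n_layers):
        layers += [nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False), nn.ReLU()]
    net = nn.Sequential(*layers)

    x = torch.randn(1, 1, size, size, requires_grad=True)
    y = net(x)
    grad = torch.zeros_like(y)
    grad[0, 0, size // 2, size // 2] = 1.0   # dl/dy = 1 at the center unit, 0 elsewhere
    y.backward(grad)                          # back-propagate without an explicit loss
    return x.grad[0, 0].abs()                 # |impact| of each input pixel on the center output

erf = effective_receptive_field()
print(erf.max().item(), (erf > 0).float().mean().item())
```

Note that with ReLU the input gradient can occasionally be all zero for an unlucky random draw, which is exactly the effect discussed in the experiments of Section 3.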

In the following we first consider linear networks, where this derivative does not depend on the input and is purely a function of the network weights and (i, j), which clearly shows how the impact of the pixels in the receptive field distributes. Then we move forward to consider more modern architecture designs and discuss the effect of nonlinear activations, dropout, sub-sampling, dilation convolution and skip connections on the ERF.


2.1 The simplest case: a stack of convolutional layers of weights all equal to one


Consider the case of n convolutional layers using k × k kernels with stride one, one single channel on each layer and no nonlinearity, stacked into a deep linear CNN. In this analysis we ignore the biases on all layers. We begin by analyzing convolution kernels with weights all equal to one.


Denote g(i, j, p) = ∂l/∂x^p_{i,j} as the gradient on the p-th layer, and let g(i, j, n) = ∂l/∂y_{i,j}. Then g(·,·,0) is the desired gradient image of the input. The back-propagation process effectively convolves g(·,·,p) with the k × k kernel to get g(·,·,p−1) for each p.

In this special case, the kernel is a k × k matrix of 1's, so the 2D convolution can be decomposed into the product of two 1D convolutions. We therefore focus exclusively on the 1D case. We have the initial gradient signal u(t) and kernel v(t) formally defined as

u(t) = δ(t),    v(t) = Σ_{m=0}^{k−1} δ(t − m),    where δ(t) = 1 if t = 0 and 0 otherwise,        (1)

and t = 0, 1, -1, 2, -2, … indexes the pixels.


The gradient signal on the input pixels is simply o = u * v *···* v, convolving u with n such v’s. To compute this convolution, we can use the Discrete Time Fourier Transform to convert the signals into the Fourier domain, and obtain


F(u)(ω) = Σ_{t=−∞}^{∞} u(t) e^{−jωt} = 1,    F(v)(ω) = Σ_{t=−∞}^{∞} v(t) e^{−jωt} = Σ_{m=0}^{k−1} e^{−jωm}        (2)

Applying the convolution theorem, the Fourier transform of o is

F(o)(ω) = F(u ∗ v ∗ ··· ∗ v)(ω) = F(u)(ω) · F(v)(ω)^n = ( Σ_{m=0}^{k−1} e^{−jωm} )^n        (3)

Next, we need to apply the inverse Fourier transform to get back o(t):


o(t) = (1/2π) ∫_{−π}^{π} F(o)(ω) e^{jωt} dω = (1/2π) ∫_{−π}^{π} ( Σ_{m=0}^{k−1} e^{−jωm} )^n e^{jωt} dω        (4)

using the orthogonality relation

(1/2π) ∫_{−π}^{π} e^{−jωs} e^{jωt} dω = 1 if s = t, and 0 otherwise.        (5)

We can see that o(t) is simply the coefficient of e^{−jωt} in the expansion of ( Σ_{m=0}^{k−1} e^{−jωm} )^n.

Case k = 2: Now let's consider the simplest nontrivial case of k = 2, where F(o)(ω) = (1 + e^{−jω})^n. The coefficient for e^{−jωt} is then the standard binomial coefficient (n choose t), so o(t) = (n choose t). It is quite well known that binomial coefficients distribute with respect to t like a Gaussian as n becomes large (see for example [13]), which means the scale of the coefficients decays as a squared exponential as t deviates from the center. When multiplying two 1D Gaussians together, we get a 2D Gaussian, therefore in this case, the gradient on the input plane is distributed like a 2D Gaussian.
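A quick numerical check of this claim, as a toy 1D sketch with assumed values of n and k: repeatedly convolving the impulse u with the all-ones kernel v yields the (extended) binomial coefficients, which are then compared against a Gaussian with the same mean and variance.

```python
import numpy as np

n, k = 20, 2                        # assumed toy depth and kernel size
o = np.array([1.0])                 # u(t) = delta(t), the initial gradient signal
v = np.ones(k)                      # all-ones kernel
for _ in range(n):
    o = np.convolve(o, v)           # o = u * v * ... * v (n times) -> binomial coefficients

t = np.arange(len(o))
mean = (o * t).sum() / o.sum()
var = (o * (t - mean) ** 2).sum() / o.sum()
gauss = o.sum() / np.sqrt(2 * np.pi * var) * np.exp(-(t - mean) ** 2 / (2 * var))
print(np.abs(o - gauss).max() / o.max())   # relative deviation; shrinks as n grows
```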

Case k > 2: In this case the coefficients are known as “extended binomial coefficients” or “polynomial coefficients”, and they too distribute like Gaussian, see for example [3, 16]. This is included as a special case for the more general case presented later in Section 2.3.


2.2 Random weights


Now let’s consider the case of random weights. In general, we have


g(i, j, p−1) = Σ_{a=0}^{k−1} Σ_{b=0}^{k−1} w^p_{a,b} · g(i + a, j + b, p)        (6)

with pixel indices properly shifted for clarity, and w^p_{a,b} is the convolution weight at (a, b) in the convolution kernel on layer p. At each layer, the initial weights are independently drawn from a fixed distribution with zero mean and variance C. We assume that the gradients g are independent from the weights. This assumption is in general not true if the network contains nonlinearities, but for linear networks these assumptions hold. As E_w[w^p_{a,b}] = 0, we can then compute the expectation

E_{w,input}[g(i, j, p−1)] = Σ_{a=0}^{k−1} Σ_{b=0}^{k−1} E_w[w^p_{a,b}] · E_{input}[g(i + a, j + b, p)] = 0,    ∀p        (7)

Here the expectation is taken over w distribution as well as the input data distribution. The variance is more interesting, as


Var[g(i, j, p−1)] = Σ_{a=0}^{k−1} Σ_{b=0}^{k−1} Var[w^p_{a,b}] · Var[g(i + a, j + b, p)] = C Σ_{a=0}^{k−1} Σ_{b=0}^{k−1} Var[g(i + a, j + b, p)]        (8)

This is equivalent to convolving the gradient variance image Var[g(·,·,p)] with a k × k convolution kernel full of 1's, and then multiplying by C to get Var[g(·,·,p−1)].

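A small Monte Carlo sketch of Eq. 8 in 1D (the width, variance C and sample count are assumed for illustration): with i.i.d. zero-mean weights of variance C and a unit-variance upstream gradient, the back-propagated gradient variance at every position comes out close to C times the sum of the k upstream variances.

```python
import numpy as np

rng = np.random.default_rng(0)
k, C, trials, width = 3, 0.1, 200_000, 10
g_up = rng.normal(0.0, 1.0, size=(trials, width))    # upstream gradient g(., p), unit variance
w = rng.normal(0.0, np.sqrt(C), size=(trials, k))    # i.i.d. zero-mean weights with variance C
# One 1D step of the backward pass: g(i, p-1) = sum_a w_a * g(i+a, p)
g_down = np.stack([(w * g_up[:, i:i + k]).sum(axis=1) for i in range(width - k + 1)], axis=1)
print(g_down.var(axis=0))   # each entry is close to C * k = 0.3, i.e. C times the ones-kernel sum
```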

Based on this we can apply exactly the same analysis as in Section 2.1 on the gradient variance images. The conclusions carry over easily that Var[g(·,·,0)] has a Gaussian shape, with only a slight change of having an extra C^n constant factor multiplier on the gradient variance images, which does not affect the relative distribution within a receptive field.


2.3 Non-uniform kernels


More generally, each pixel in the kernel window can have different weights, or as in the random weight case, they may have different variances. Let's again consider the 1D case, u(t) = δ(t) as before, and the kernel signal v(t) = Σ_{m=0}^{k−1} w(m) δ(t − m), where w(m) is the weight for the m-th pixel in the kernel. Without loss of generality, we can assume the weights are normalized, i.e. Σ_{m=0}^{k−1} w(m) = 1.

Applying the Fourier transform and convolution theorem as before, we get


F(o)(ω) = F(u ∗ v ∗ ··· ∗ v)(ω) = ( Σ_{m=0}^{k−1} w(m) e^{−jωm} )^n        (9)

The space-domain signal o(t) is again the coefficient of e^{−jωt} in the expansion; the only difference is that the e^{−jωt} terms are weighted by w(m).

These coefficients turn out to be well studied in the combinatorics literature, see for example [3] and the references therein for more details. In [3], it was shown that if the w(m) are normalized, then o(t) exactly equals the probability p(S_n = t), where S_n = Σ_{i=1}^{n} X_i and the X_i's are i.i.d. multinomial variables distributed according to the w(m)'s, i.e. p(X_i = m) = w(m). Notice the analysis there requires that w(m) > 0. But we can reduce to variance analysis for the random weight case, where the variances are always nonnegative while the weights can be negative. The analysis for negative w(m) is more difficult and is left to future work. However, empirically we found the implications of the analysis in this section still apply reasonably well to networks with negative weights.

From the central limit theorem point of view, as n → ∞ the distribution of (1/√n)(S_n − n E[X]) converges to a Gaussian N(0, Var[X]) in distribution. This means, for a given n large enough, S_n is going to be roughly Gaussian with mean n E[X] and variance n Var[X]. As o(t) = p(S_n = t), this further implies that o(t) also has a Gaussian shape. When the w(m)'s are normalized, this Gaussian has the following mean and variance:

E[S_n] = n E[X_i] = n Σ_{m=0}^{k−1} m · w(m),    Var[S_n] = n Var[X_i] = n ( Σ_{m=0}^{k−1} m² · w(m) − ( Σ_{m=0}^{k−1} m · w(m) )² )        (10)

This indicates that o(t) decays from the center of the receptive field squared-exponentially according to the Gaussian distribution. The rate of decay is related to the variance of this Gaussian. If we take one standard deviation as the effective receptive field (ERF) size, which is roughly the radius of the ERF, then this size is √(n Var[X_i]) = O(√n).

On the other hand, as we stack more convolutional layers, the theoretical receptive field grows linearly, therefore relative to the theoretical receptive field, the ERF actually shrinks at a rate of O(1/√n), which we found surprising.

In the simple case of uniform weighting, we can further see that the ERF size grows linearly with kernel size k. As w(m) = 1/k, we have


√(n Var[X_i]) = √( n ( (1/k) Σ_{m=0}^{k−1} m² − ( (1/k) Σ_{m=0}^{k−1} m )² ) ) = √( n (k² − 1) / 12 ) = O(k √n)        (11)
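A quick numerical sanity check of Eq. 10 and Eq. 11, as a toy 1D sketch with assumed n and k: convolving the impulse n times with the normalized uniform kernel w(m) = 1/k gives a distribution whose mean and standard deviation match n E[X] and √(n(k² − 1)/12).

```python
import numpy as np

k, n = 3, 25                           # assumed toy kernel size and depth
w = np.ones(k) / k                     # normalized uniform weights w(m) = 1/k
o = np.array([1.0])
for _ in range(n):
    o = np.convolve(o, w)              # o(t) = p(S_n = t)

t = np.arange(len(o))
mean = (o * t).sum()
std = np.sqrt((o * (t - mean) ** 2).sum())
print(mean, n * (k - 1) / 2)                   # measured mean vs. n * E[X]
print(std, np.sqrt(n * (k ** 2 - 1) / 12))     # measured std vs. sqrt(n (k^2 - 1) / 12)
```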

Remarks: The result derived in this section, i.e., that the distribution of impact within a receptive field in deep CNNs converges to a Gaussian, holds under the following conditions: (1) all layers in the CNN use the same set of convolution weights. This is in general not true, however, when we apply the analysis of variance, the weight variance on all layers is usually the same up to a constant factor. (2) The convergence derived is convergence "in distribution", as implied by the central limit theorem. This means that the cumulative probability distribution function converges to that of a Gaussian, but at any single point in space the probability can deviate from the Gaussian. (3) The convergence result states that (1/√n)(S_n − n E[X]) → N(0, Var[X]), hence S_n approaches N(n E[X], n Var[X]); however, the convergence of S_n here is not well defined, as N(n E[X], n Var[X]) is not a fixed distribution but instead changes with n. Additionally, the distribution of S_n can deviate from a Gaussian on a finite set. But the overall shape of the distribution is still roughly Gaussian.

2.4 Nonlinear activation functions


Nonlinear activation functions are an integral part of every neural network. We use σ to represent an arbitrary nonlinear activation function. During the forward pass, on each layer the pixels are first passed through σ and then convolved with the convolution kernel to compute the next layer. This ordering of operations is a little non-standard but equivalent to the more usual ordering of convolving first and then passing through the nonlinearity, and it makes the analysis slightly easier. The backward pass in this case becomes

g(i, j, p−1) = σ'^p_{i,j} · Σ_{a=0}^{k−1} Σ_{b=0}^{k−1} w^p_{a,b} · g(i + a, j + b, p)        (12)

where we abused notation a bit and use σ'^p_{i,j} to represent the gradient of the activation function for pixel (i, j) on layer p.

For ReLU nonlinearities, σ'^p_{i,j} = I[x^p_{i,j} > 0], where I[·] is the indicator function. We have to make some extra assumptions about the activations x^p_{i,j} to advance the analysis, in addition to the assumption that they have zero mean and unit variance. A standard assumption is that x^p_{i,j} has a symmetric distribution around 0 [7]. If we make the extra simplifying assumption that the gradients σ' are independent from the weights and g in the upper layers, we can simplify the variance as Var[g(i, j, p−1)] = E[(σ'^p_{i,j})²] Σ_{a,b} Var[w^p_{a,b}] Var[g(i + a, j + b, p)], where E[(σ'^p_{i,j})²] is a constant factor. Following the variance analysis we can again reduce this case to the uniform weight case.

Sigmoid and Tanh nonlinearities are harder to analyze. Here we only use the observation that when the network is initialized the weights are usually small and therefore these nonlinearities will be in the linear region, and the linear analysis applies. However, as the weights grow bigger during training their effect becomes hard to analyze.


2.5 Dropout, Subsampling, Dilated Convolution and Skip-Connections


Here we consider the effect of some standard CNN approaches on the effective receptive field. Dropout is a popular technique to prevent overfitting; we show that dropout does not change the Gaussian ERF shape. Subsampling and dilated convolutions turn out to be effective ways to increase receptive field size quickly. Skip-connections on the other hand make ERFs smaller. We present the analysis for all these cases in the Appendix.


3 Experiments


In this section, we empirically study the ERF for various deep CNN architectures. We first use artificially constructed CNN models to verify the theoretical results in our analysis. We then present our observations on how the ERF changes during the training of deep CNNs on real datasets. For all ERF studies, we place a gradient signal of 1 at the center of the output plane and 0 everywhere else, and then back-propagate this gradient through the network to get input gradients.


3.1 Verifying theoretical results


We first verify our theoretical results in artificially constructed deep CNNs. For computing the ERF we use random inputs, and for all the random weight networks we followed [7, 5] for proper random initialization. In this section, we verify the following results:


 

Figure 1: Comparing the effect of number of layers, random weight initialization and nonlinear activation on the ERF. Kernel size is fixed at 3 × 3 for all the networks here. Uniform: convolutional kernel weights are all ones, no nonlinearity; Random: random kernel weights, no nonlinearity; Random + ReLU: random kernel weights, ReLU nonlinearity.


ERFs are Gaussian distributed: As shown in Fig. 1, we can observe perfect Gaussian shapes for uniformly and randomly weighted convolution kernels without nonlinear activations, and near-Gaussian shapes for randomly weighted kernels with nonlinearity. Adding the ReLU nonlinearity makes the distribution a bit less Gaussian, as the ERF distribution depends on the input as well. Another reason is that ReLU units output exactly zero for half of their inputs, and it is very easy to get a zero output for the center pixel on the output plane, which means no path from the receptive field can reach the output, hence the gradient is all zero. Here the ERFs are averaged over 20 runs with different random seeds. The figures on the right show the ERF for networks with 20 layers of random weights, with different nonlinearities. Here the results are averaged both across 100 runs with different random weights as well as different random inputs. In this setting the receptive fields are a lot more Gaussian-like.

 

√n absolute growth and 1/√n relative shrinkage: In Fig. 2, we show the change of the ERF size and the relative ratio of ERF over theoretical RF w.r.t. the number of convolution layers. The best fitting line for ERF size gives a slope of 0.56 in the log domain, while the line for the ERF ratio gives a slope of -0.43. This indicates the ERF size is growing linearly w.r.t. √N and the ERF ratio is shrinking linearly w.r.t. 1/√N. Note here we use 2 standard deviations as our measurement for the ERF size, i.e. any pixel with a value greater than 1 - 95.45% of the center point is considered to be in the ERF. The ERF size is represented by the square root of the number of pixels within the ERF, while the theoretical RF size is the side length of the square in which every pixel has a non-zero impact on the output pixel, no matter how small. All experiments here are averaged over 20 runs.

Subsampling & dilated convolution increases receptive field: The figure on the right shows the effect of subsampling and dilated convolution. The reference baseline is a convnet with 15 dense convolution layers. Its ERF is shown in the left-most figure. We then replace 3 of the 15 convolutional layers with stride-2 convolution to get the ERF for the ‘Subsample’ figure, and replace them with dilated convolution with factor 2,4 and 8 for the ‘Dilation’ figure. As we see, both of them are able to increase the effect receptive field significantly. Note the ‘Dilation’ figure shows a rectangular ERF shape typical for dilated convolutions.

子采样&扩张卷积增加感受野:右侧的图显示子采样和扩展卷积的影响。参考的基线是一个带有15密度的卷积层的卷积网。它的ERF显示在最左边。我们然后在第2条的卷积中使用15卷积层取代了3,从而得到对于’子采样’图表的ERF,然后使用因子为2,4和8的扩张卷积取代它们得到’扩张’图。如我们看到的,两个都能显著地增加ERF。注意’扩张’图表显示了一个平方的ERF形状。
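The sketch below (toy single-channel stacks, not the 15-layer models used in the experiments) illustrates the same effect numerically: swapping a few plain 3 × 3 convolutions for dilated ones makes the region of non-zero input gradient much larger.

```python
import torch
import torch.nn as nn

def rf_side(layers, size=129):
    # Measure the width/height of the non-zero input-gradient region for the central output unit.
    net = nn.Sequential(*layers)
    x = torch.randn(1, 1, size, size, requires_grad=True)
    y = net(x)
    g = torch.zeros_like(y)
    g[0, 0, y.shape[2] // 2, y.shape[3] // 2] = 1.0   # gradient of 1 at the central output unit
    y.backward(g)
    nonzero = x.grad[0, 0].abs() > 0
    return int(nonzero.any(dim=0).sum()), int(nonzero.any(dim=1).sum())

dense   = [nn.Conv2d(1, 1, 3, padding=1, bias=False) for _ in range(6)]
dilated = [nn.Conv2d(1, 1, 3, padding=d, dilation=d, bias=False) for d in (1, 1, 1, 2, 4, 8)]
print(rf_side(dense), rf_side(dilated))   # the dilated stack covers a much larger input area
```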

 

3.2 How the ERF evolves during training


In this part, we take a look at how the ERF of units in the top-most convolutional layers of a classification CNN and a semantic segmentation CNN evolves during training. For both tasks, we adopt the ResNet architecture, which makes extensive use of skip-connections. As the analysis shows, the ERF of this network should be significantly smaller than the theoretical receptive field. This is indeed what we observed initially. Intriguingly, as the network learns, the ERF gets bigger, and at the end of training it is significantly larger than the initial ERF.

 

Figure 2: Absolute growth (left) and relative shrink (right) for ERF


 

Figure 3: Comparison of ERF before and after training for models trained on CIFAR-10 classification and CamVid semantic segmentation tasks. CIFAR-10 receptive fields are visualized in the image space of 32 × 32.


For the classification task we trained a ResNet with 17 residual blocks on the CIFAR-10 dataset. At the end of training this network reached a test accuracy of 89%. Note that in this experiment we did not use pooling or downsampling, and exclusively focus on architectures with skip-connections. The accuracy of the network is not state-of-the-art but still quite high. In Fig. 3 we show the effective receptive field on the 32 × 32 image space at the beginning of training (with randomly initialized weights) and at the end of training when it reaches the best validation accuracy. Note that the theoretical receptive field of our network is actually 74 × 74, bigger than the image size, but the ERF is still not able to fully fill the image. Comparing the results before and after training, we see that the effective receptive field has grown significantly.

For the semantic segmentation task we used the CamVid dataset for urban scene segmentation. We trained a "front-end" model [21], which is a purely convolutional network that predicts the output at a slightly lower resolution. This network plays the same role as the VGG network does in many previous works [12]. We trained a ResNet with 16 residual blocks interleaved with 4 subsampling operations, each with a factor of 2. Due to these subsampling operations the output is 1/16 of the input size. For this model, the theoretical receptive field of the top convolutional layer units is quite big at 505 × 505. However, as shown in Fig. 3, the ERF only gets a fraction of that, with a diameter of 100 at the beginning of training. Again we observe that during training the ERF size increases, and at the end it reaches a diameter of around 150.

4 Reduce the Gaussian Damage


The above analysis shows that the ERF only takes a small portion of the theoretical receptive field, which is undesirable for tasks that require a large receptive field.


New Initialization. One simple way to increase the effective receptive field is to manipulate the initial weights. We propose a new random weight initialization scheme that makes the weights at the center of the convolution kernel to have a smaller scale, and the weights on the outside to be larger; this diffuses the concentration on the center out to the periphery. Practically, we can initialize the network with any initialization method, then scale the weights according to a distribution that has a lower scale at the center and higher scale on the outside.


In the extreme case, we can optimize the w(m)’s to maximize the ERF size or equivalently the variance in Eq. 10. Solving this optimization problem leads to the solution that put weights equally at the 4 corners of the convolution kernel while leaving everywhere else 0. However, using this solution to do random weight initialization is too aggressive, and leaving a lot of weights to 0 makes learning slow. A softer version of this idea usually works better.
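A hedged PyTorch sketch of this idea (the radial scaling profile, its range and the helper name are illustrative choices, not the authors' exact scheme): after any standard initialization, each kernel entry is multiplied by a factor that is small at the kernel center and grows toward the periphery.

```python
import torch
import torch.nn as nn

def rescale_center_low(conv, low=0.5, high=1.5):
    """Illustrative re-scaling: kernel center gets `low`, corners get `high`."""
    k = conv.kernel_size[0]
    c = (k - 1) / 2.0
    yy, xx = torch.meshgrid(torch.arange(k, dtype=torch.float32),
                            torch.arange(k, dtype=torch.float32), indexing="ij")
    dist = torch.sqrt((yy - c) ** 2 + (xx - c) ** 2)
    scale = low + (high - low) * dist / dist.max()   # radially increasing multiplier
    with torch.no_grad():
        conv.weight.mul_(scale)                      # broadcasts over (out, in, k, k)

conv = nn.Conv2d(16, 16, kernel_size=3, padding=1)
nn.init.kaiming_normal_(conv.weight)                 # any standard initialization, e.g. [7]
rescale_center_low(conv)
```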

 

We have trained a CNN for the CIFAR-10 classification task with this initialization method, with several random seeds. In a few cases we get a 30% speed-up of training compared to the more standard initializations [5, 7]. But overall the benefit of this method is not always significant.

 

We note that no matter what we do to change w(m), the effective receptive field is still distributed like a Gaussian so the above proposal only solves the problem partially.

 

Architectural changes. A potentially better approach is to make architectural changes to the CNNs, which may change the ERF in more fundamental ways. For example, instead of connecting each unit in a CNN to a local rectangular convolution window, we can sparsely connect each unit to a larger area in the lower layer using the same number of connections. Dilated convolution [21] belongs to this category, but we may push even further and use sparse connections that are not grid-like.

 
