[Converge] Training Neural Networks
CS231n Winter 2016: Lecture 5: Neural Networks Part 2
CS231n Winter 2016: Lecture 6: Neural Networks Part 3
This section mainly covers activation functions, parameter initialization, and the surrounding topics.
Ref: Deep Learning, Chapter 8 - Optimization for Training Deep Models
Overview
1. One time setup
- activation functions,
- preprocessing,
- weight initialization,
- regularization,
- gradient checking
2. Training dynamics
- babysitting the learning process,
- parameter updates,
- hyperparameter optimization
3. Evaluation
- model ensembles
Activation Functions
1. Some commonly used activation functions
2. Discussion
Sigmoid's three problems:
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive
Tanh suffers from problem 1 (saturation) just like sigmoid, although its output is zero-centered.
Why do saturated neurons “kill” the gradients? When the input lands far into the flat regions of sigmoid/tanh, the local gradient is nearly zero, so whatever gradient flows in from above gets multiplied by roughly zero during backprop and vanishes.
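As a quick numerical check, here is a minimal numpy sketch (the helper names are my own) showing how small sigmoid's local gradient becomes once the input saturates:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_local_grad(x):
    # d(sigma)/dx = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 5.0, 10.0, -10.0]:
    print(x, sigmoid_local_grad(x))
# 0.0   -> 0.25      (the largest possible local gradient)
# 5.0   -> ~6.6e-3
# 10.0  -> ~4.5e-5   (saturated: upstream gradients are multiplied by ~0)
# -10.0 -> ~4.5e-5
```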
ReLU
Computes f(x) = max(0,x)
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Not zero-centered output
- An annoyance: what is the gradient when x < 0? Zero, so the gradients get “killed” (a ReLU stuck in that regime stops updating, i.e. it “dies”). At x = 0 the gradient is undefined.
Leaky ReLU
Computes f(x) = max(0.01x, x)
- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
- will not “die”.
Parametric Rectifier (PReLU)
Computes f(x) = max(alpha*x, x)
backprop into alpha (a learned parameter)
Exponential Linear Units (ELU)
- All benefits of ReLU
- Does not die
- Closer to zero mean outputs
- Computation requires exp()
Maxout “Neuron”
- Does not have the basic form of dot product -> nonlinearity; instead it computes max(w1^T x + b1, w2^T x + b2)
- Generalizes ReLU and Leaky ReLU
- Linear Regime! Does not saturate! Does not die!
Problem: doubles the number of parameters/neuron
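A minimal numpy sketch of the activations discussed above (the helper names are my own; the maxout weights are hypothetical placeholders):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def prelu(x, alpha):
    # same form as leaky ReLU, but alpha is learned via backprop
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    # linear for x > 0, saturates smoothly toward -alpha for x << 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, W1, b1, W2, b2):
    # max of two affine maps: generalizes ReLU / leaky ReLU,
    # but doubles the number of parameters per neuron
    return np.maximum(x @ W1 + b1, x @ W2 + b2)
```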
3. Summary - TLDR: In practice
- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don’t expect much
- Don’t use sigmoid
Preprocessing
In practice, you may also see PCA and Whitening of the data.
You can see the power of whitening!
TLDR: In practice for Images: center only
Unlike generic data, images don't have separate features on very different scales or units; everything is just pixels, all bounded between 0 and 255.
So it is not common to normalize the data, but it is very common to zero-center it.
(The subtraction below has something of the flavor of "reducing the effect of overall brightness".)
- Subtract the mean image (e.g. AlexNet)
(mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet)
(mean along each channel = 3 numbers)
Not common to normalize variance, or to do PCA or whitening.
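A sketch of the two centering schemes above, assuming `X_train` is an (N, 32, 32, 3) array of training images (the array here is just a stand-in):

```python
import numpy as np

X_train = np.random.rand(100, 32, 32, 3) * 255   # stand-in for real images

# AlexNet-style: subtract the mean image
mean_image = X_train.mean(axis=0)                # shape (32, 32, 3)
X_centered = X_train - mean_image

# VGGNet-style: subtract the per-channel mean
channel_mean = X_train.mean(axis=(0, 1, 2))      # 3 numbers
X_centered = X_train - channel_mean
```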
Weight Initialization
First idea: Small random numbers
(gaussian with zero mean and 1e-2 standard deviation)
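In code, this first idea is roughly the following (D and H standing for the layer's fan-in and fan-out):

```python
import numpy as np

D, H = 500, 500                      # example fan-in / fan-out
W = 0.01 * np.random.randn(D, H)     # zero-mean Gaussian, std 0.01
```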
An experiment with a 10-layer network: with this initialization, the activation statistics at deeper layers shrink toward 0.
First of all, this experiment is worth running yourself (a sketch follows below). As the figure below shows, if the initial values are scaled up by 100x instead, almost all neurons saturate, which is bad for convergence.
Tuning deep networks is a skill that mixes technique and experience; the theory can be found in the papers below.
Different activation functions give different results; run the experiment and get a feel for the underlying principles.
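A sketch of that experiment, in the spirit of the lecture demo (shapes and variable names are my own):

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 500)                   # unit-Gaussian input data
hidden_sizes = [500] * 10                        # 10 hidden layers

h = X
for i, size in enumerate(hidden_sizes):
    fan_in = h.shape[1]
    W = 0.01 * np.random.randn(fan_in, size)     # small random init
    # W = np.random.randn(fan_in, size) / np.sqrt(fan_in)   # Xavier init, for comparison
    h = np.tanh(h.dot(W))
    print('layer %2d: mean % .6f, std %.6f' % (i + 1, h.mean(), h.std()))

# With the 0.01 init the std collapses toward 0 layer by layer (so gradients vanish);
# multiplying the init by 100 instead drives tanh into saturation at +/-1.
```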
Proper initialization is an active area of research…
- Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio, 2010
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al, 2013
- Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015
- Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015
- All you need is a good init, Mishkin and Matas, 2015
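For ReLU networks, the He et al. (2015) paper listed above suggests scaling the Gaussian by sqrt(2 / fan_in); a one-line sketch:

```python
import numpy as np

fan_in, fan_out = 500, 500
W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)   # He initialization for ReLU
```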
Batch Normalization
From: http://blog.csdn.net/hjimce/article/details/50866313
That blog post mainly explains a 2015 deep learning paper that is very much worth studying: "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".
Another reference: http://blog.csdn.net/elaine_bao/article/details/50890491 (has more figures)
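A minimal sketch of the batch-norm transform from the paper (training-time path only; at test time the running mean and variance are used instead):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) mini-batch of activations; gamma, beta: (D,) learned scale and shift
    mu = x.mean(axis=0)                      # per-dimension mini-batch mean
    var = x.var(axis=0)                      # per-dimension mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to zero mean, unit variance
    return gamma * x_hat + beta              # let the network undo the normalization if needed
```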
First try your training strategy on a small subset of the data; ideally you should be able to overfit it (acc = 100%).
In the run below, the loss barely moves, probably because the learning rate is too small;
but then why does the accuracy suddenly get much better at the end? Think about it.
Hyperparameter Optimization
First stage: only a few epochs to get rough idea of what params work
Second stage: longer running time, finer search
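A sketch of this coarse-to-fine random search, sampling in log space (`train_and_eval` is a hypothetical helper that trains for a few epochs and returns validation accuracy):

```python
import numpy as np

results = []
for _ in range(100):
    lr = 10 ** np.random.uniform(-6, -3)    # sample learning rate log-uniformly
    reg = 10 ** np.random.uniform(-5, 5)    # sample regularization strength log-uniformly
    # train_and_eval is a hypothetical helper: short training run, returns val accuracy
    val_acc = train_and_eval(lr=lr, reg=reg, num_epochs=1)
    results.append((val_acc, lr, reg))

# second stage: narrow the ranges around the best results and train longer
for val_acc, lr, reg in sorted(results, reverse=True)[:5]:
    print('val_acc %.4f  lr %e  reg %e' % (val_acc, lr, reg))
```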
Parameter Updates
As you can see, vanilla SGD is always the slowest.
Note the effect of adding momentum, and the effect of the improved momentum, the purple curve (NAG, Nesterov Accelerated Gradient).
The only difference lies in where the gradient is evaluated (see the sketch below):
- classical momentum: the gradient is evaluated at the current position x
- improved (Nesterov) momentum: the gradient is evaluated at the look-ahead position x + mu * v
Then a "change of variables" makes the update convenient to compute (this gives the usual NAG form).
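A minimal sketch of the two momentum updates (function names are mine; `mu` is the momentum coefficient and `grad_fn` returns the gradient of a hypothetical objective):

```python
import numpy as np

def momentum_update(x, v, grad_fn, learning_rate=1e-2, mu=0.9):
    # classical momentum: gradient evaluated at the current position x
    v = mu * v - learning_rate * grad_fn(x)
    return x + v, v

def nesterov_update(x, v, grad_fn, learning_rate=1e-2, mu=0.9):
    # Nesterov momentum: gradient evaluated at the look-ahead position x + mu*v
    v = mu * v - learning_rate * grad_fn(x + mu * v)
    return x + v, v

# toy usage on f(x) = x^2, whose gradient is 2x
grad = lambda x: 2.0 * x
x, v = 5.0, 0.0
for _ in range(200):
    x, v = nesterov_update(x, v, grad)
print(x)   # close to 0
```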
AdaGrad and RMSProp
Adam seems to be the best bet and the usual first choice.
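Sketches of the per-parameter adaptive updates (my own function signatures, following the standard formulas):

```python
import numpy as np

def rmsprop_update(x, dx, cache, learning_rate=1e-3, decay_rate=0.99, eps=1e-8):
    # leaky running average of squared gradients scales the step per parameter
    cache = decay_rate * cache + (1 - decay_rate) * dx ** 2
    x = x - learning_rate * dx / (np.sqrt(cache) + eps)
    return x, cache

def adam_update(x, dx, m, v, t, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # RMSProp-like second moment plus a momentum-like first moment, with bias correction
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx ** 2
    m_hat = m / (1 - beta1 ** t)    # bias-corrected first moment (t starts at 1)
    v_hat = v / (1 - beta2 ** t)    # bias-corrected second moment
    x = x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```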
Choosing the Learning Rate
SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
Start with a faster (larger) learning rate, then shrink it over time. Common strategies, sketched below, are step decay, exponential decay (alpha = alpha0 * e^{-kt}), and 1/t decay (alpha = alpha0 / (1 + kt)).
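A small sketch of those decay schedules (`lr0` and `k` are hypothetical hyperparameters):

```python
import numpy as np

lr0, k = 1e-2, 0.1
for epoch in range(0, 31, 10):
    step_decay = lr0 * (0.5 ** (epoch // 10))   # e.g. halve the rate every 10 epochs
    exp_decay = lr0 * np.exp(-k * epoch)        # alpha = alpha0 * exp(-k t)
    one_over_t = lr0 / (1.0 + k * epoch)        # alpha = alpha0 / (1 + k t)
    print(epoch, step_decay, exp_decay, one_over_t)
```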
Second order optimization methods
- Newton's method: no learning rate hyperparameter to worry about!
- Quasi-Newton methods (BFGS most popular): instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank-1 updates over time (O(n^2) each).
- L-BFGS (Limited memory BFGS):
Does not form/store the full inverse Hessian. Usually works very well in full batch, deterministic mode; not for mini-batch setting.
In practice:
- Adam is a good default choice in most cases
- If you can afford to do full batch updates then try out L-BFGS (and don’t forget to disable all sources of noise)
Regularization (dropout)
Ref: http://www.jianshu.com/p/ba9ca3b07922
Summary
- Dropout comes in two forms: direct and inverted.
- For a single neuron, dropout can be modeled with a Bernoulli random variable.
- For a whole layer of neurons, dropout can likewise be modeled with Bernoulli random variables.
- It is unlikely that exactly np neurons are dropped, but in a layer with n neurons the number of dropped neurons is np on average.
- Inverted dropout gives rise to an effective learning rate (see the sketch below).
- Inverted dropout should be used together with other techniques that normalize the parameters, which helps simplify the choice of learning rate.
- Dropout helps prevent overfitting in deep neural networks.
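A minimal sketch of the inverted-dropout forward pass (variable names are my own; `p` is the keep probability):

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def dropout_train(H):
    # Bernoulli mask, scaled by 1/p at training time ("inverted" dropout),
    # so the expected activation matches test time and no test-time rescaling is needed
    mask = (np.random.rand(*H.shape) < p) / p
    return H * mask

def dropout_test(H):
    # test time: use all units, no scaling required
    return H

H = np.maximum(0, np.random.randn(4, 10))      # some hypothetical ReLU activations
print(dropout_train(H).mean(), dropout_test(H).mean())   # similar in expectation
```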