[Xavier] Understanding the difficulty of training deep feedforward neural networks

Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks[C]. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2010: 249-256.

@inproceedings{glorot2010understanding,
  title={Understanding the difficulty of training deep feedforward neural networks},
  author={Glorot, Xavier and Bengio, Yoshua},
  booktitle={Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics},
  pages={249--256},
  year={2010}}

This paper proposes the Xavier parameter-initialization method.

Main content

In layers \(i=0, 1, \ldots, d\), with the input denoted \(\mathbf{z}^0 = x\):

\[\mathbf{s}^i=\mathbf{z}^i W^i+\mathbf{b}^i \\ \mathbf{z}^{i+1}= f(\mathbf{s}^i), \]

where \(\mathbf{z}^i\) is the input of layer \(i\), \(\mathbf{s}^i\) is the pre-activation value, and \(f(\cdot)\) is the activation function (assumed symmetric about 0 with \(f'(0)=1\), e.g. tanh).
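
For concreteness, here is a minimal NumPy sketch of this forward pass (the helper name `forward` and the row-vector shapes are my own choices, not from the paper):

```python
import numpy as np

def forward(x, weights, biases):
    # Forward pass: s^i = z^i W^i + b^i, z^{i+1} = f(s^i), with f = tanh
    z = x
    for W, b in zip(weights, biases):
        s = z @ W + b      # pre-activation s^i (row-vector convention, as above)
        z = np.tanh(s)     # tanh is symmetric about 0 with f'(0) = 1
    return z
```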

\[\mathrm{Var}(z^{i+1}) = n_i\mathrm{Var}(w^iz^i), \]

which holds approximately near \(0\) (since \(f'(0)=1\), we have \(z^{i+1}=f(s^i)\approx s^i\) there); here \(z^i, w^i\) denote single elements of \(\mathbf{z}^i, W^i\). Assume the elements \(\{w^i\}\) are i.i.d., that \(w^i\) and \(z^i\) are mutually independent, and further that \(\mathbb{E}(w^i)=0\) and \(\mathbb{E}(x)=0\) for the input \(x\); then

\[\mathrm{Var}(z^{i+1}) = n_i\mathrm{Var}(w^i)\mathrm{Var}(z^i), \]

again approximately near \(0\). Unrolling this recursion down to the input gives

\[\mathrm{Var}(z^i) = \mathrm{Var}(x) \prod_{i'=0}^{i-1} n_{i'} \mathrm{Var}(w^{i'}), \]

where \(n_i\) denotes the number of input units of layer \(i\).
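
This product formula is easy to check numerically under the stated assumptions; the width, weight scale, and input scale below are arbitrary choices of mine, picked so that activations stay in the near-linear regime of tanh:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 5                          # assumed: equal fan-in n_i = n at every layer
sigma_w = 0.9 / np.sqrt(n)                 # each factor n * Var(w) = 0.81
x = rng.normal(0.0, 0.1, size=(20000, n))  # small inputs keep tanh(s) ≈ s

z = x
for _ in range(depth):
    W = rng.normal(0.0, sigma_w, size=(n, n))
    z = np.tanh(z @ W)                     # biases set to 0 for simplicity

# Var(z^i) = Var(x) * prod_{i'} n_{i'} Var(w^{i'})
print("empirical:", z.var(), "predicted:", x.var() * (n * sigma_w**2) ** depth)
```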

From the backpropagation of gradients we have:

\[\tag{2} \frac{\partial Cost}{\partial s_k^i} = f'(s_k^i) W_{k, \cdot}^{i+1} \frac{\partial Cost}{\partial \mathbf{s}^{i+1}} \]

\[\tag{3} \frac{\partial Cost}{\partial w_{l,k}^i} = z_l^i \frac{\partial Cost}{\partial s_k^i}. \]

Hence

\[\tag{6} \mathrm{Var}[\frac{\partial Cost}{\partial s_k^i}] = \mathrm{Var}[\frac{\partial Cost}{\partial s^d}] \prod_{i'=i}^d n_{i'+1} \mathrm{Var} [w^{i'}], \]

\[\mathrm{Var}[\frac{\partial Cost}{\partial w^i}] = \prod_{i'=0}^{i-1} n_{i'} \mathrm{Var}[w^{i'}] \prod_{i'=i}^d n_{i'+1} \mathrm{Var} [w^{i'}] \times \mathrm{Var}(x) \mathrm{Var}[\frac{\partial Cost}{\partial s^d}], \]
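
Equation (6) can be checked the same way by pushing a random top-layer gradient backwards in the linear regime \(f'(s) \approx 1\); again the sizes are arbitrary assumptions of mine, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(1)
n, depth = 512, 5
sigma_w = 0.9 / np.sqrt(n)
Ws = [rng.normal(0.0, sigma_w, size=(n, n)) for _ in range(depth)]

g = rng.normal(0.0, 1.0, size=(20000, n))  # dCost/ds^d at the top layer
for W in reversed(Ws):
    g = g @ W.T                            # Eq. (2) with f'(s) = 1

# Eq. (6): the variance picks up one factor n_{i'+1} Var(w^{i'}) per layer
print("empirical:", g.var(), "predicted:", 1.0 * (n * sigma_w**2) ** depth)
```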

If we require the variance of the activations \(z^i\) to be the same across layers in the forward pass, then

\[\tag{10} \forall i, \quad n_i \mathrm{Var} [w^i]=1. \]

If we require the variance of the back-propagated gradients \(\frac{\partial Cost}{\partial \mathbf{s}^i}\) to be the same across layers, then

\[\tag{11} \forall i, \quad n_{i+1} \mathrm{Var} [w^i]=1. \]

These two conditions conflict whenever \(n_i \neq n_{i+1}\), so the paper adopts a compromise (the harmonic mean of \(1/n_i\) and \(1/n_{i+1}\)):

\[\mathrm{Var} [w^i] = \frac{2}{n_{i+1}+n_{i}}, \]

and draws \(w^i\) from a uniform distribution with exactly this variance: since \(\mathrm{Var}(U[-a,a])=a^2/3\), setting \(a=\sqrt{6/(n_{i+1}+n_i)}\) yields

\[w^i \sim U[-\frac{\sqrt{6}}{\sqrt{n_{i+1}+n_{i}}},\frac{\sqrt{6}}{\sqrt{n_{i+1}+n_{i}}}]. \]
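
A sketch of the resulting initializer (the helper name `xavier_uniform` is mine; deep learning frameworks ship equivalents, e.g. `torch.nn.init.xavier_uniform_` in PyTorch):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    # Var(U[-a, a]) = a^2 / 3, so a = sqrt(6/(n_in+n_out)) gives Var(w) = 2/(n_in+n_out)
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

rng = np.random.default_rng(0)
W = xavier_uniform(400, 600, rng)
print(W.var(), 2.0 / (400 + 600))  # both ≈ 0.002
```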

The paper also contains a good deal of analysis of different activation functions (sigmoid, tanh, softsign, ...); since that is not the focus here, it is not recorded.

posted @ 2020-04-23 10:51 馒头and花卷