1. Introduction to the Backpropagation Algorithm
The Error Back Propagation algorithm, BP for short, consists of a forward pass of signals and a backward pass of errors. Its main idea is to compute the error of each layer from the error of the layer behind it, which greatly reduces the amount of computation.
Let the training data be \(\{(\bm x^{(1)},\bm y^{(1)}),\cdots,(\bm x^{(N)},\bm y^{(N)})\}\), \(N\) samples in total. The output is \(n_L\)-dimensional, i.e. \(\bm y^{(i)} = (y_1^{(i)},\cdots,y_{n_L}^{(i)})\).
2. Forward Propagation
Taking layer 2 as an example:
\[z_1^{(2)} = w_{11}^{(2)} x_1 + w_{12}^{(2)} x_2 + w_{13}^{(2)} x_3 + b_1^{(2)} \\
z_2^{(2)} = w_{21}^{(2)} x_1 + w_{22}^{(2)} x_2 + w_{23}^{(2)} x_3 + b_2^{(2)} \\
z_3^{(2)} = w_{31}^{(2)} x_1 + w_{32}^{(2)} x_2 + w_{33}^{(2)} x_3 + b_3^{(2)} \\
a_1^{(2)} = f(z_1^{(2)}) \\
a_2^{(2)} = f(z_2^{(2)}) \\
a_3^{(2)} = f(z_3^{(2)}) \\
\]
In vectorized form, the equations above become
\[\bm z^{(2)} = \bm W^{(2)} \cdot \bm a^{(1)} + \bm b^{(2)} \\
\bm a^{(2)} = f(\bm z^{(2)})
\]
Similarly, for a general layer \(l\):
\[\bm z^{(l)} = \bm W^{(l)} \cdot \bm a^{(l-1)} + \bm b^{(l)} \quad (2 \leq l \leq L) \\
\bm a^{(l)} = f(\bm z^{(l)})
\]
For an \(L\)-layer network, the final output is \(\bm a^{(L)}\).
From the input layer to the output layer, information propagates forward along
\[\bm {x = a^{(1)} \to z^{(2)} \to \cdots \to a^{(L-1)} \to z^{(L)} \to a^{(L)} = y}
\]
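As a concrete illustration, below is a minimal NumPy sketch of this forward pass. The function names (`sigmoid`, `forward`) and the list-of-matrices parameter layout are conventions chosen here for illustration, assuming sigmoid activations:

```python
import numpy as np

def sigmoid(z):
    """Logistic activation f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass x = a^(1) -> z^(2) -> ... -> a^(L).

    weights[k] and biases[k] hold W^(l) and b^(l) for layer l = k + 2,
    with W^(l) of shape (n_l, n_(l-1)).
    """
    a = x                      # a^(1) = x
    activations = [a]          # a^(l) for l = 1..L
    zs = []                    # z^(l) for l = 2..L
    for W, b in zip(weights, biases):
        z = W @ a + b          # z^(l) = W^(l) a^(l-1) + b^(l)
        a = sigmoid(z)         # a^(l) = f(z^(l))
        zs.append(z)
        activations.append(a)
    return zs, activations
```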
3. Error Back Propagation
For a single training sample \((\bm x^{(i)},\bm y^{(i)})\), the cost function is
\[E^{(i)} = \frac 1 2 ||\bm{y^{(i)}-a^{(i)}}||^2 = \frac 1 2 \sum _{k=1}^{n_L} (y_k^{(i)}-a_k^{(i)})^2
\]
To keep the notation light (typing all those superscripts is tiring enough), we drop the superscript \(^{(i)}\) and write the cost function simply as \(E\).
The total loss function is
\[E_{total} = \frac 1 N \sum _{i=1}^N E^{(i)}
\]
We use gradient descent to update the parameters \(w_{ij}^{(l)}\) and \(b_i^{(l)}\) for \(2 \leq l \leq L\). The update formulas are:
\[\bm W^{(l)} = \bm W^{(l)} - \mu \frac {\partial E_{total}}{\partial \bm W^{(l)}} = \bm W^{(l)} - \frac \mu N \sum _{i=1}^N \frac{\partial E^{(i)}}{\partial \bm W^{(l)}} \\
\bm b^{(l)} = \bm b^{(l)} - \mu \frac {\partial E_{total}}{\partial \bm b^{(l)}} = \bm b^{(l)} - \frac \mu N \sum _{i=1}^N \frac{\partial E^{(i)}}{\partial \bm b^{(l)}}
\]
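A minimal sketch of this update step, assuming the per-sample gradients have already been computed by the backward pass derived below (the argument layout here, `grads_W[i][k]` for sample \(i\) and layer \(k+2\), is a hypothetical convention):

```python
def gradient_step(weights, biases, grads_W, grads_b, mu):
    """One gradient-descent step over all layers.

    grads_W[i][k] is dE^(i)/dW^(l) for sample i and layer l = k + 2
    (hypothetical layout); mu is the learning rate.
    """
    N = len(grads_W)
    for k in range(len(weights)):
        weights[k] = weights[k] - (mu / N) * sum(g[k] for g in grads_W)
        biases[k] = biases[k] - (mu / N) * sum(g[k] for g in grads_b)
    return weights, biases
```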
3.1 Weight Updates for the Output Layer
Expanding \(E\) at the hidden layer (here the output layer is layer 3, with \(n_3 = 2\) output units and \(n_2 = 3\) hidden units):
\[E = \frac 1 2 ||\bm y - \bm a^{(3)}||^2 = \frac 1 2 [(y_1-a_1^{(3)})^2+(y_2-a_2^{(3)})^2]
= \frac 1 2 [(y_1-f(z_1^{(3)}))^2+(y_2-f(z_2^{(3)}))^2] \\
= \frac 1 2 [(y_1-f(w_{11}^{(3)} a_1^{(2)} + w_{12}^{(3)} a_2^{(2)} + w_{13}^{(3)} a_3^{(2)} + b_1^{(3)}))^2+(y_2-f(w_{21}^{(3)} a_1^{(2)} + w_{22}^{(3)} a_2^{(2)} + w_{23}^{(3)} a_3^{(2)} + b_2^{(3)}))^2]
\]
By the chain rule, the partial derivative with respect to a hidden-to-output weight is:
\[\frac {\partial E}{\partial w_{11}^{(3)}} = \frac 1 2 \cdot 2 \cdot (y_1-a_1^{(3)})(-\frac{\partial a_1^{(3)}}{\partial w_{11}^{(3)}})=-(y_1-a_1^{(3)})f'(z_1^{(3)})a_1^{(2)}
\]
Write \(\delta _i^{(l)} = \frac{\partial E}{\partial z_i^{(l)}}\). This quantity is called the error term (or sensitivity) of neuron \(i\) in layer \(l\); it measures how much that neuron contributes to the final error.
With this notation, \(\frac {\partial E}{\partial w_{11}^{(3)}}\) can be written as:
\[\frac {\partial E}{\partial w_{11}^{(3)}} = \frac {\partial E}{\partial z_1^{(3)}} \cdot \frac {\partial z_1^{(3)}}{\partial w_{11}^{(3)}} = \delta _1^{(3)} a_1^{(2)}
\]
Similarly,
\[\frac {\partial E}{\partial w_{12}^{(3)}} = \delta _1^{(3)} a_2^{(2)}, \quad \frac {\partial E}{\partial w_{13}^{(3)}} = \delta _1^{(3)} a_3^{(2)}, \quad \frac {\partial E}{\partial w_{21}^{(3)}} = \delta _2^{(3)} a_1^{(2)}, \quad \frac {\partial E}{\partial w_{22}^{(3)}} = \delta _2^{(3)} a_2^{(2)}, \quad \frac {\partial E}{\partial w_{23}^{(3)}} = \delta _2^{(3)} a_3^{(2)}
\]
A key reason for introducing \(\delta _i^{(l)}\) is that \(\bm \delta ^{(l)}\) can be computed from \(\bm \delta ^{(l+1)}\), so earlier results are reused to speed up the whole computation. This is the core idea of the backpropagation algorithm.
Generalizing:
\[\delta _i^{(L)} = -(y_i-a_i^{(L)})f'(z_i^{(L)})\ (1 \leq i \leq n_L) \\
\frac {\partial E}{\partial w_{ij}^{(L)}} = \delta _i^{(L)} \cdot \ a_j^{(L-1)} \ (1 \leq i \leq n_L, 1 \leq j \leq n_{L-1})
\]
In vector form:
\[\bm \delta ^{(L)} = -(\bm y-\bm a^{(L)}) \odot f'(\bm z^{(L)}) \\
\triangledown _{\bm W^{(L)}} E = \bm \delta ^{(L)} \cdot (\bm a^{(L-1)})^T
\]
Here \(\odot\) denotes the Hadamard product (element-wise product), i.e. multiplying two matrices entry by entry. \(\triangledown _{\bm W^{(L)}} E\) is a matrix whose entry in row \(i\), column \(j\) is the partial derivative of \(E\) with respect to the element \(w_{ij}^{(L)}\) of \(\bm W^{(L)}\).
We first compute the error of the last layer, then propagate it backward layer by layer to obtain the error terms of the earlier layers.
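In code, these two vector formulas are a one-liner each. A toy example with \(n_{L-1}=3\) hidden units and \(n_L=2\) outputs (the random inputs are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)        # f'(z) = f(z)(1 - f(z)) for the sigmoid

rng = np.random.default_rng(0)
a_prev = rng.random(3)          # a^(L-1): activations of the last hidden layer
z_out = rng.random(2)           # z^(L):   pre-activations of the output layer
y = np.array([0.0, 1.0])        # target vector

delta_out = -(y - sigmoid(z_out)) * sigmoid_prime(z_out)  # delta^(L)
grad_W = np.outer(delta_out, a_prev)  # delta^(L) (a^(L-1))^T, shape (2, 3)
```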
3.2 Weight Updates for the Hidden Layers
\[\frac {\partial E}{\partial w_{ij}^{(l)}} = \delta _i^{(l)} \cdot \ a_j^{(l-1)}
\]
where the relation between \(\delta _i^{(l)}\) and \(\bm \delta ^{(l+1)}\) (note that it involves all error terms of the next layer, hence the vector notation) is derived as follows:
\[\delta _i^{(l)} = \frac {\partial E}{\partial z_i^{(l)}} = \sum _{j=1}^{n_{l+1}}\frac {\partial E}{\partial z_j^{(l+1)}} \frac{\partial z_j^{(l+1)}}{\partial z_i^{(l)}} = \sum _{j=1}^{n_{l+1}} \delta _j^{(l+1)}\frac{\partial z_j^{(l+1)}}{\partial z_i^{(l)}} \\
z_j^{(l+1)} = \sum _{i=1}^{n_{l}} w_{ji}^{(l+1)} a_i^{(l)} + b_j^{(l+1)} = \sum _{i=1}^{n_{l}} w_{ji}^{(l+1)} f(z_i^{(l)}) + b_j^{(l+1)}\\
\therefore \frac {\partial z_j^{(l+1)}}{\partial z_i^{(l)}} = \frac{\partial z_j^{(l+1)}}{\partial a_i^{(l)}} \frac{\partial a_i^{(l)}}{\partial z_i^{(l)}} = w_{ji}^{(l+1)} f'(z_i^{(l)})
\]
Substituting back:
\[\delta _i^{(l)} = \sum _{j=1}^{n_{l+1}} \delta _j^{(l+1)} w_{ji}^{(l+1)} f'(z_i^{(l)}) = (\sum _{j=1}^{n_{l+1}} \delta _j^{(l+1)} w_{ji}^{(l+1)}) \cdot f'(z_i^{(l)})
\]
In matrix (vector) form:
\[\bm \delta^{(l)} = ((\bm W^{(l+1)})^T \bm \delta^{(l+1)}) \odot \bm f'(\bm z^{(l)})
\]
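Continuing the toy example above, one backward step of this recursion looks like (shapes chosen to match the 3-unit hidden layer):

```python
W_next = rng.random((2, 3))     # plays the role of W^(l+1), shape (n_(l+1), n_l)
z_hidden = rng.random(3)        # z^(l)

# delta^(l) = ((W^(l+1))^T delta^(l+1)) (element-wise *) f'(z^(l))
delta_hidden = (W_next.T @ delta_out) * sigmoid_prime(z_hidden)
```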
When the activation \(f\) is the sigmoid function \(f(x) = \frac{1}{1+e^{-x}}\), an important property is
\[f'(x) = f(x)(1-f(x))
\]
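Using the `sigmoid` defined in the sketches above, this identity can be checked numerically against a central-difference approximation of the derivative:

```python
z = np.linspace(-5.0, 5.0, 11)
analytic = sigmoid(z) * (1.0 - sigmoid(z))                  # f(z)(1 - f(z))
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6    # central difference
assert np.allclose(analytic, numeric, atol=1e-8)
```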
3.3 Bias Updates for the Output and Hidden Layers
\[\frac{\partial E}{\partial b_i^{(l)}} = \frac{\partial E}{\partial z_i^{(l)}} \frac{\partial z_i^{(l)}}{\partial b_i^{(l)}} = \delta _i^{(l)}
\]
In matrix form:
\[\triangledown _{\bm b^{(l)}} E = \bm \delta^{(l)}
\]
3.4 The Four Core Formulas
\[\begin{aligned}
& \delta _i^{(L)} = -(y_i-a_i^{(L)})f'(z_i^{(L)}) \\
& \delta _i^{(l)} = (\sum _{j=1}^{n_{l+1}} \delta _j^{(l+1)}w_{ji}^{(l+1)})f'(z_i^{(l)}) \\
& \frac{\partial E}{\partial w_{ij}^{(l)}} = \delta _i^{(l)} a_j^{(l-1)} \\
& \frac{\partial E}{\partial b_i^{(l)}} = \delta _i^{(l)}
\end{aligned}
\]
In matrix form:
\[\begin{aligned}
& \bm \delta^{(L)} = -(\bm y - \bm a^{(L)}) \odot \bm f'(\bm z^{(L)}) \\
& \bm \delta^{(l)} = ((\bm W^{(l+1)})^T \bm \delta^{(l+1)}) \odot \bm f'(\bm z^{(l)}) \\
& \triangledown_{\bm W^{(l)}} E = \frac{\partial E}{\partial \bm W^{(l)}} = \bm \delta^{(l)}(\bm a^{(l-1)})^T \\
& \triangledown_{\bm b^{(l)}} E = \frac{\partial E}{\partial \bm b^{(l)}} = \bm \delta^{(l)}
\end{aligned}
\]
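To close, here is a compact sketch that strings the four formulas together into a full backward pass for a single training sample, again assuming sigmoid activations; all names are illustrative conventions, not a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Gradients of E = 0.5 * ||y - a^(L)||^2 for one sample.

    weights[k] holds W^(l) for layer l = k + 2, as in the forward pass.
    """
    # Forward pass: cache every z^(l) and a^(l).
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    # Formula 1: delta^(L) = -(y - a^(L)) (element-wise *) f'(z^(L)).
    delta = -(y - activations[-1]) * sigmoid_prime(zs[-1])
    # Formulas 3 and 4 at the output layer.
    grads_W = [np.outer(delta, activations[-2])]
    grads_b = [delta]
    # Formula 2: propagate the error term backward layer by layer.
    for k in range(len(weights) - 2, -1, -1):
        delta = (weights[k + 1].T @ delta) * sigmoid_prime(zs[k])
        grads_W.insert(0, np.outer(delta, activations[k]))
        grads_b.insert(0, delta)
    return grads_W, grads_b
```

The returned lists plug directly into the averaged gradient-descent update given earlier: for the 3-3-2 network used in the examples, `backprop(x, y, [W2, W3], [b2, b3])` yields one gradient pair per layer.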