Derivation of the Neural Network Backpropagation (BP) Algorithm

Cost Function

  • \(L\): the total number of layers in the network.

  • \(S_l\): the number of units (neurons) in layer \(l\), not counting that layer's bias unit.

  • \(K\): the number of units in the output layer.

  • \(\left( h_\Theta \left( x^{\left( i\right)} \right) \right)_k\): the \(k\)-th element of the network's output vector.

  • If a network has \(S_j\) units in layer \(j\) and \(S_{j+1}\) units in layer \(j+1\), then \(\Theta^{\left(j\right)}\), the matrix that controls the mapping from layer \(j\) to layer \(j+1\), has dimension \(S_{j+1}\times(S_j+1)\).

  • Neural network cost function (with regularization):

\[J(\Theta)=\frac1m \sum_{i=1}^m \sum_{k=1}^K \left[ - y^{\left(i \right)}_k \log \left( h_\Theta \left( x^{\left( i\right)} \right) \right)_k - \left( 1-y^{\left( i\right)}_k \right) \log \left( 1- \left( h_\Theta \left( x^{\left( i\right)} \right) \right)_k \right) \right] + \frac {\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{S_l} \sum_{j=1}^{S_{l+1}} \left( \Theta_{ji}^ { \left( l \right) } \right)^2 \]

  • Although this cost function looks much more complicated, the idea behind it is the same: the cost function measures how far the algorithm's predictions are from the true values.
  • The only difference is that for each training example the network now makes \(K\) predictions. We pick the most likely of the \(K\) predictions and compare it with the actual value in \(y\).
  • The regularization term sums the squared entries of every layer's \(\Theta\) matrix, excluding each layer's bias weights \(\Theta_0\). The innermost loop over \(j\) runs over the rows, whose count is the number of activation units in layer \(l+1\) (\(S_{l+1}\)); the loop over \(i\) runs over the columns, whose count is the number of activation units in layer \(l\) (\(S_l\)). A short numpy sketch of this cost function follows the list.
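
As a concrete illustration, here is a minimal numpy sketch of this regularized cost. The names `nn_cost`, `h`, `Y`, and `thetas` are assumptions made for illustration; `h` is taken as already computed by forward propagation (see the next section).

```python
import numpy as np

def nn_cost(h, Y, thetas, lam):
    """Regularized cross-entropy cost J(Theta).

    h:      (m, K) network outputs h_Theta(x^(i)) for every example.
    Y:      (m, K) labels recoded as one-of-K indicator vectors.
    thetas: list of weight matrices Theta^(l), each of shape (S_{l+1}, S_l + 1).
    lam:    regularization parameter lambda.
    """
    m = Y.shape[0]
    # Cross-entropy term, summed over all examples i and output units k.
    J = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m
    # Regularization: squared weights of every layer, excluding the bias column j = 0.
    J += lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in thetas)
    return J
```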

Forward Propagation

Suppose we have a three-layer neural network.
(Figure: a three-layer network with an input layer, one hidden layer, and an output layer.)

Notation:

  • \(\theta^{\left(i\right)}\): the parameter matrix of layer \(i\) (i.e. the matrix that controls the mapping from layer \(i\) to layer \(i+1\))
  • \(z^{\left(l\right)}\): the input of layer \(l\)
  • \(a^{\left(l\right)}\): the output (activation) of layer \(l\)

Forward propagation proceeds as follows (where \(g\) is the sigmoid activation function); a numpy sketch follows the list.

  • \(a^{\left(1\right)}=x \quad (\text{add } a_0^{\left(1\right)})\)
  • \(z^{\left(2\right)}=\theta^{\left(1\right)}a^{\left(1\right)}\)
  • \(a^{\left(2\right)}=g\left(z^{\left(2\right)}\right) \quad (\text{add } a_0^{\left(2\right)})\)
  • \(z^{\left(3\right)}=\theta^{\left(2\right)}a^{\left(2\right)}\)
  • \(a^{\left(3\right)}=g\left(z^{\left(3\right)}\right)=h_\theta(x)\)
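
A minimal sketch of these steps for a single example, assuming weight matrices `theta1` of shape \(S_2\times(n+1)\) and `theta2` of shape \(K\times(S_2+1)\); the function name and return layout are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, theta1, theta2):
    """Forward pass of the three-layer network for one example x (a length-n vector)."""
    a1 = np.concatenate(([1.0], x))             # a^(1) = x, add a_0^(1) = 1
    z2 = theta1 @ a1                            # z^(2) = theta^(1) a^(1)
    a2 = np.concatenate(([1.0], sigmoid(z2)))   # a^(2) = g(z^(2)), add a_0^(2) = 1
    z3 = theta2 @ a2                            # z^(3) = theta^(2) a^(2)
    a3 = sigmoid(z3)                            # a^(3) = g(z^(3)) = h_theta(x)
    return a1, z2, a2, z3, a3
```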

Backpropagation Algorithm

Notation:

  • \(\delta^{\left(l\right)}\): the "error" of layer \(l\)
  • \(y\): the label of each sample (recoded as a one-of-\(K\) indicator vector)
  • \(h\): the prediction for each sample

We first compute each layer's "error" from back to front. Note that "error" is in quotation marks here because it is not a true error term.

  • \(\delta^{\left(3\right)}=h-y=a^{\left(3\right)}-y\)
  • \(\delta^{\left(2\right)}=\left(\theta^{\left(2\right)}\right)^T·\delta^{\left(3\right)}.*{g^\prime (z^{\left(2\right)}) }\)
  • There is no \(\delta^{\left(1\right)}\), because the first layer holds the input variables, which carry no error.

Next we compute the gradient of each layer's parameter matrix, denoted \(\Delta^{(l)}\).
When \(\lambda=0\), i.e. without regularization:

  • \(\Delta^{(2)}=a^{\left(2\right)}·\delta^{\left(3\right)}\)
  • \(\Delta^{(1)}=a^{\left(1\right)}·\delta^{\left(2\right)}\)

Backpropagation then gives all the partial derivatives (without regularization); a code sketch follows the list.

  • \(\frac{\partial}{\partial \Theta^{\left( 2 \right)}} J\left(\Theta \right) = \Delta^{\left( 2 \right)}=a^{\left(2\right)}·\delta^{\left(3\right)}\)
  • \(\frac{\partial}{\partial \Theta^{\left( 1 \right)}} J\left(\Theta \right) = \Delta^{\left( 1 \right)}=a^{\left(1\right)}·\delta^{\left(2\right)}\)
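
Continuing the numpy sketch above, a minimal backward pass for one example might look as follows. It reuses `forward` and `sigmoid` from the previous sketch; dropping the bias row of \((\theta^{(2)})^T\delta^{(3)}\) is an implementation detail assumed here, since the bias unit receives no error.

```python
def backprop(x, y, theta1, theta2):
    """Backward pass for one example; returns the unregularized gradients Delta^(1), Delta^(2)."""
    a1, z2, a2, z3, a3 = forward(x, theta1, theta2)
    delta3 = a3 - y                                        # delta^(3) = h - y
    # delta^(2) = (theta^(2))^T delta^(3) .* g'(z^(2)); drop the bias row (index 0).
    g_prime = sigmoid(z2) * (1 - sigmoid(z2))
    delta2 = (theta2.T @ delta3)[1:] * g_prime
    # Gradients: outer product of the next layer's delta with this layer's activation.
    Delta2 = np.outer(delta3, a2)                          # same shape as theta2
    Delta1 = np.outer(delta2, a1)                          # same shape as theta1
    return Delta1, Delta2
```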

Derivation:

  • The parameters to optimize are \(\theta^{\left( 1 \right)}\) and \(\theta^{\left( 2 \right)}\).
  • Following the idea of gradient descent, we only need to compute the gradient of the cost function with respect to these parameters.
  • Suppose there is only one training example; the cost function is then:

\[J\left(\theta \right)=-y \log h(x)-(1-y)\log (1-h(x)) \]

  • Recall the forward propagation steps, and note how the functions are nested inside one another:
    • \(a^{\left(1\right)}=x \quad (\text{add } a_0^{\left(1\right)})\)
    • \(z^{\left(2\right)}=\theta^{\left(1\right)}a^{\left(1\right)}\)
    • \(a^{\left(2\right)}=g\left(z^{\left(2\right)}\right) \quad (\text{add } a_0^{\left(2\right)})\)
    • \(z^{\left(3\right)}=\theta^{\left(2\right)}a^{\left(2\right)}\)
    • \(a^{\left(3\right)}=g\left(z^{\left(3\right)}\right)=h_\theta(x)\)
  • We now compute the gradients of the cost function with respect to the parameters: \(\frac{\partial}{\partial \theta^{\left( 2 \right)}} J\left(\theta \right)\) and \(\frac{\partial}{\partial \theta^{\left( 1 \right)}} J\left(\theta \right)\).
  • By the chain rule, we obtain:

\[\frac{\partial}{\partial \theta^{\left( 2 \right)}} J\left(\theta \right)=\frac{\partial J\left(\theta \right)}{\partial a^{\left( 3 \right)}}·\frac{\partial a^{\left( 3 \right)}}{\partial z^{\left( 3 \right)}}·\frac {\partial z^{\left( 3 \right)}}{\partial \theta^{\left( 2 \right)}} \]

- where \(\frac{\partial J\left(\theta \right)}{\partial a^{\left( 3 \right)}}=\frac{\partial J\left(\theta \right)}{\partial h(x)}=-\frac{y}{h}+\frac{1-y}{1-h}=\frac{h-y}{h\left(1-h\right)}\)
- \(\frac{\partial a^{\left( 3 \right)}}{\partial z^{\left( 3 \right)}}=\frac{ e^{-z^{(3)}} }{ \left(1+e^{-z^{(3)}}\right)^2}=a^{\left( 3 \right)}\left(1-a^{\left( 3 \right)}\right)=h\left(1-h\right)\) (the sigmoid derivative; a short verification is given after this derivation)
- \(\frac {\partial z^{\left( 3 \right)}} {\partial \theta^{\left( 2 \right)}}=a^{(2)}\)
- So we obtain:

\[\frac{\partial}{\partial \theta^{\left( 2 \right)}} J\left(\theta \right)=(h-y)·a^{(2)}=\delta^{\left(3\right)}·a^{(2)} \]

  • Similarly, applying the chain rule again:

\[\frac{\partial}{\partial \theta^{\left( 1 \right)}} J\left(\theta \right)=\frac{\partial J\left(\theta \right)}{\partial a^{\left( 3 \right)}}·\frac{\partial a^{\left( 3 \right)}}{\partial z^{\left( 3 \right)}}·\frac {\partial z^{\left( 3 \right)}}{\partial a^{\left( 2 \right)}}·\frac{\partial a^{\left( 2 \right)}}{\partial z^{\left( 2 \right)}}·\frac {\partial z^{\left( 2 \right)}}{\partial \theta^{\left( 1 \right)}} \]

- where, from the above, \(\frac{\partial J\left(\theta \right)}{\partial a^{\left( 3 \right)}}·\frac{\partial a^{\left( 3 \right)}}{\partial z^{\left( 3 \right)}}=h-y=\delta^{\left(3\right)}\)
- \(\frac {\partial z^{\left( 3 \right)}}{\partial a^{\left( 2 \right)}}=\theta^{(2)}\)
- \(\frac{\partial a^{\left( 2 \right)}}{\partial z^{\left( 2 \right)}}=g^\prime\left(z^{\left(2\right)}\right)\)
- \(\frac {\partial z^{\left( 2 \right)}}{\partial \theta^{\left( 1 \right)}}=a^{(1)}\)
- So we obtain:

\[\frac{\partial}{\partial \theta^{\left( 1 \right)}} J\left(\theta \right)=\delta^{\left(3\right)} \theta^{(2)} g^\prime\left(z^{\left(2\right)}\right) a^{(1)}=\delta^{\left(2\right)}·a^{(1)} \]
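
For completeness, the sigmoid derivative \(g^\prime(z)=g(z)\left(1-g(z)\right)\) used in both chains above can be verified directly:

\[
g(z)=\frac{1}{1+e^{-z}}, \qquad
g^\prime(z)=\frac{e^{-z}}{\left(1+e^{-z}\right)^2}
=\frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}}
=g(z)\left(1-g(z)\right).
\]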

When \(\lambda \neq 0\), i.e. with regularization:

\[\frac{\partial}{\partial \Theta_{ij}^{\left( l \right)}} J\left(\Theta \right) = D_{ij}^{\left( l \right)}= \frac1m \Delta_{ij}^{\left( l \right)} \quad \text{for } j=0 \]

\[\frac{\partial}{\partial \Theta_{ij}^{\left( l \right)}} J\left(\Theta \right) = D_{ij}^{\left( l \right)}= \frac1m \Delta_{ij}^{\left( l \right)}+\frac{\lambda}{m} \Theta_{ij}^{\left( l \right)} \quad \text{for } j \geq 1 \]
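
Putting the pieces together, here is a minimal sketch of the full regularized gradient: it averages the per-example `backprop` results from the earlier sketch and then adds the \(\frac{\lambda}{m}\Theta\) term to every non-bias column. The names `gradients`, `X`, `Y`, and `lam` are illustrative.

```python
def gradients(X, Y, theta1, theta2, lam):
    """Accumulate Delta^(l) over all m examples, then form the regularized D^(l)."""
    m = X.shape[0]
    Acc1 = np.zeros_like(theta1)
    Acc2 = np.zeros_like(theta2)
    for i in range(m):
        d1, d2 = backprop(X[i], Y[i], theta1, theta2)
        Acc1 += d1
        Acc2 += d2
    D1, D2 = Acc1 / m, Acc2 / m
    # Regularize every column except the bias column j = 0.
    D1[:, 1:] += lam / m * theta1[:, 1:]
    D2[:, 1:] += lam / m * theta2[:, 1:]
    return D1, D2
```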

