Derivation of the back propagation formulas in a neural network (incomplete; to be continued)
I have recently been taking Andrew Ng's machine learning course on Coursera and felt unclear about the details of back propagation. After reading http://neuralnetworksanddeeplearning.com/chap2.html, the formulas made much more sense, so I am sharing my notes here.
Review:
sigmoid function: $g(x)=\frac{1}{1+e^{-x}}$.
The first layer is the input layer, the $L$th (last) layer ($L=3$ in the figure above) is the output layer, and the layers in between are the hidden layers.
$L$: Total number of layers.
$m$: Number of training examples.
$x^{(i)}$: The $i$th training example.
$s_j$: Number of units (neurons) in layer $j$ (not counting the bias unit $a_0^{(j)}$).
$a_i^{(j)}$: Activation of unit $i$ in layer $j$; $a_0^{(j)}=1$ is added as the bias (constant) unit.
$\Theta^{(j)}$: Matrix of weights controlling the mapping from layer $j$ to layer $j+1$; thus the dimension (size) of $\Theta^{(j)}$ is $s_{j+1}\times(s_j+1)$.
$z_i^{(j)}$: The argument of $g(\cdot)$ (the sigmoid function) used to compute unit $i$ when mapping from layer $j-1$ to layer $j$.
In vectorized form: $$z^{(j)}=\Theta^{(j-1)}a^{(j-1)},$$ $$a^{(j)}=g(z^{(j)}).$$
for example:
$a_1^{(2)}=g(\Theta_{10}^{(1)}x_0 + ... + \Theta_{13}^{(1)}x_3)=g(z_1^{(2)})$
$a_2^{(2)}=g(\Theta_{20}^{(1)}x_0 + ... + \Theta_{23}^{(1)}x_3)=g(z_2^{(2)})$
$a_3^{(2)}=g(\Theta_{30}^{(1)}x_0 + ... + \Theta_{33}^{(1)}x_3)=g(z_3^{(2)})$
$h_\Theta(x)=a_1^{(3)}=g(\Theta_{10}^{(2)}a_0^{(2)} + ... + \Theta_{13}^{(2)}a_3^{(2)})=g(z_1^{(3)})$
Here the dimension (size) of $\Theta^{(2)}$ is $1\times4$, and the dimension of $\Theta^{(1)}$ is $3\times4$.
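A minimal NumPy sketch of this small forward pass, just to make the shapes concrete (the weights below are random placeholders of my own, not values from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for the 3-3-1 network above:
# Theta1 maps layer 1 -> layer 2 (3 x 4), Theta2 maps layer 2 -> layer 3 (1 x 4).
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))
Theta2 = rng.normal(size=(1, 4))

x = np.array([0.5, -1.2, 3.0])             # x_1, x_2, x_3
a1 = np.concatenate(([1.0], x))            # add bias unit x_0 = 1
z2 = Theta1 @ a1                           # z^(2) = Theta^(1) a^(1)
a2 = np.concatenate(([1.0], sigmoid(z2)))  # add bias unit a_0^(2) = 1
z3 = Theta2 @ a2                           # z^(3) = Theta^(2) a^(2)
h = sigmoid(z3)                            # h_Theta(x) = a^(3)
```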
We now consider the multi-class problem directly (the figure above shows binary classification):
$K$: Number of output units (classes).
That is, an output $y\in\{1,2,\dots,K\}$ is converted into a $K\times1$ vector whose $k$th entry is 1 and all other entries are 0.
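For example, with $K=4$ the label $y=3$ becomes $(0,0,1,0)^T$. A tiny sketch (the function name `to_one_hot` is mine):

```python
import numpy as np

def to_one_hot(y, K):
    """Convert a label y in {1, ..., K} to a K-dimensional indicator vector."""
    v = np.zeros(K)
    v[y - 1] = 1.0   # labels are 1-based, array indices are 0-based
    return v

to_one_hot(3, 4)     # -> array([0., 0., 1., 0.])
```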
Cost function (the same form as for logistic regression): $$J(\Theta)=-\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^{K}y_k^{(i)}\log(h_{\Theta}(x^{(i)}))_k+(1-y_k^{(i)})\log(1-(h_{\Theta}(x^{(i)}))_k)\right]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2$$
Note: (1) $(h_{\Theta}(x^{(i)}))_k$ is the $k$th component of the output for the $i$th example.
(2) In the regularization term, $l$ runs from $1$ to $L-1$.
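A sketch of computing this cost, assuming `H` stores the outputs $h_\Theta(x^{(i)})$ row by row and `Y` the one-hot labels (both $m\times K$), and `Thetas` is the list of weight matrices (all names here are mine, not from the course):

```python
import numpy as np

def nn_cost(H, Y, Thetas, lam):
    """Regularized cross-entropy cost J(Theta).

    H, Y   : m x K arrays of outputs h_Theta(x^(i)) and one-hot labels y^(i).
    Thetas : list of weight matrices Theta^(1), ..., Theta^(L-1).
    lam    : regularization parameter lambda.
    """
    m = Y.shape[0]
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Regularize every weight except the first column (bias weights, index 0).
    J += lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return J
```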
Gradient descent review:
$$\theta_j:=\theta_j -\alpha \frac{\partial J_{\theta}}{\partial \theta_j}$$
The goal of back propagation is therefore to compute $\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}}$.
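In code, one iteration of the update is just the following (a sketch; `alpha` is the learning rate and `Grads` holds the gradients that back propagation will produce):

```python
# One gradient descent step; Thetas and Grads are lists of same-shaped arrays.
for l in range(len(Thetas)):
    Thetas[l] -= alpha * Grads[l]   # Theta^(l) := Theta^(l) - alpha * dJ/dTheta^(l)
```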
Define $\delta_j^{(l)}$: the error of node $j$ in layer $l$. In fact, $\delta_j^{(l)}=\frac{\partial J(\Theta)}{\partial z_j^{(l)}}$.
For the $i$th training example $x^{(i)}$:
step0 preparation:
set $a^{(1)}=x^{(i)}$.
set $\Delta_{ij}^{(l)}=0$ (this initialization is done once, before looping over the training examples; the $\Delta$'s accumulate over all examples).
step1 Forward propagation:
compute $a^{(l)}, l=2...L$.
$$a^{(1)}=x,\quad z^{(2)}=\Theta^{(1)}a^{(1)}$$ $$(\text{add } a_0^{(2)})\quad a^{(2)}=g(z^{(2)}),\quad z^{(3)}=\Theta^{(2)}a^{(2)}$$ $$\dots$$ $$a^{(L)}=g(z^{(L)})=h_\Theta(x)$$ (a bias unit $a_0^{(l)}=1$ is added to every hidden layer, but not to the output layer).
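A sketch of this step that keeps every $a^{(l)}$ and $z^{(l)}$, since they are needed again in step 2 (`Thetas` holds $\Theta^{(1)},\dots,\Theta^{(L-1)}$; the names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas):
    """Forward propagation for one example; returns the lists a^(1..L), z^(2..L)."""
    a = np.concatenate(([1.0], x))      # a^(1) = x with bias unit
    a_list, z_list = [a], [None]        # z^(1) does not exist; keep indices aligned
    for l, Theta in enumerate(Thetas):
        z = Theta @ a                   # z^(l+1) = Theta^(l) a^(l)
        a = sigmoid(z)
        if l < len(Thetas) - 1:         # add a bias unit to every hidden layer
            a = np.concatenate(([1.0], a))
        a_list.append(a)
        z_list.append(z)
    return a_list, z_list
```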
step2 Backward propagation:
The specific steps are:
(1) First compute $\delta^{(L)}$:
set $\delta_j^{(L)}=\frac{\partial J(\Theta)}{\partial z_j^{(L)}}=a_j^{(L)}-y_j=(h_{\Theta}(x))_j-y_j$
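One way to see this: the cost contributed by a single example (ignoring regularization) is $J=-\sum_k[y_k\log a_k^{(L)}+(1-y_k)\log(1-a_k^{(L)})]$, and the sigmoid satisfies $g'(z)=g(z)(1-g(z))$, so
$$\delta_j^{(L)}=\frac{\partial J}{\partial z_j^{(L)}}=\frac{\partial J}{\partial a_j^{(L)}}\,g'(z_j^{(L)})=\frac{a_j^{(L)}-y_j}{a_j^{(L)}(1-a_j^{(L)})}\cdot a_j^{(L)}(1-a_j^{(L)})=a_j^{(L)}-y_j.$$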
(2) Then compute $\delta^{(l)}$ for $l=L-1,\dots,2$, and update $\Delta$:
$\delta^{(l)}=(\Theta^{(l)})^T\delta^{(l+1)}.*g'(z^{(l)})$, where $.*$ is element-wise multiplication and $g'(z^{(l)})=a^{(l)}.*(1-a^{(l)})$; hence $\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}}=a_j^{(l)}\delta_i^{(l+1)}$ (ignoring regularization).
That is, $\Delta_{ij}^{(l)}:=\Delta_{ij}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}$, or in vectorized form $\Delta^{(l)}:=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T.$
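Both formulas in (2) come from the chain rule. Since $z_k^{(l+1)}=\sum_j\Theta_{kj}^{(l)}a_j^{(l)}$ and $a_j^{(l)}=g(z_j^{(l)})$,
$$\delta_j^{(l)}=\frac{\partial J}{\partial z_j^{(l)}}=\sum_k\frac{\partial J}{\partial z_k^{(l+1)}}\frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}=\sum_k\delta_k^{(l+1)}\Theta_{kj}^{(l)}g'(z_j^{(l)}),$$
which is exactly the vectorized $\delta^{(l)}=(\Theta^{(l)})^T\delta^{(l+1)}.*g'(z^{(l)})$ (in practice the component corresponding to the bias unit $a_0^{(l)}$ is dropped, since there is no $z_0^{(l)}$), and
$$\frac{\partial J}{\partial \Theta_{ij}^{(l)}}=\frac{\partial J}{\partial z_i^{(l+1)}}\frac{\partial z_i^{(l+1)}}{\partial \Theta_{ij}^{(l)}}=\delta_i^{(l+1)}a_j^{(l)}.$$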
End of the for loop over the training examples.
step3 Compute $\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}}$:
$D_{ij}^{(l)}:=\frac{1}{m}\Delta_{ij}^{(l)}+\frac{\lambda}{m}\Theta_{ij}^{(l)}$ if $j\neq0$,
$D_{ij}^{(l)}:=\frac{1}{m}\Delta_{ij}^{(l)}$ if $j=0$,
$\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}}=D_{ij}^{(l)}$
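Putting steps 0-3 together, a minimal sketch under the same assumptions as the earlier snippets (`forward`, `to_one_hot`, and `sigmoid` as sketched above; `X` is $m\times n$, `y` holds labels in $\{1,\dots,K\}$; this is my own reading of the algorithm, not code from the course):

```python
import numpy as np

def backprop_gradients(X, y, Thetas, lam, K):
    """Compute D^(l) = dJ/dTheta^(l) for every layer via back propagation."""
    m = X.shape[0]
    Deltas = [np.zeros_like(T) for T in Thetas]               # step 0: accumulators
    for i in range(m):
        a_list, _ = forward(X[i], Thetas)                     # step 1: forward pass
        delta = a_list[-1] - to_one_hot(y[i], K)              # step 2 (1): delta^(L) = a^(L) - y
        for l in range(len(Thetas) - 1, -1, -1):              # Theta index L-1, ..., 1 (0-based here)
            Deltas[l] += np.outer(delta, a_list[l])           # Delta^(l) += delta^(l+1) (a^(l))^T
            if l > 0:                                         # step 2 (2): propagate the error back
                g_prime = a_list[l][1:] * (1.0 - a_list[l][1:])   # g'(z^(l)) = a^(l) .* (1 - a^(l))
                delta = (Thetas[l].T @ delta)[1:] * g_prime       # bias component dropped
    Ds = []                                                   # step 3: average and regularize
    for T, Delta in zip(Thetas, Deltas):
        D = Delta / m
        D[:, 1:] += (lam / m) * T[:, 1:]                      # bias column (j = 0) not regularized
        Ds.append(D)
    return Ds
```

The returned `Ds` are the $D^{(l)}$ above and can be plugged into the gradient descent update sketched earlier.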
To summarize, for each training example:
(1) set $a^{(1)}=x^{(i)}$.
(2) compute $a^{(l)}$ for $l=2,\dots,L$ (forward propagation).
(3) compute $\delta^{(L)}, \delta^{(L-1)},\dots,\delta^{(2)}$: