Deriving the Backpropagation Formulas for One- and Two-Layer Neural Networks (from a Matrix Calculus Perspective)
I have recently been following Andrew Ng's deep neural network course. While studying shallow (two-layer) networks, I ran into some confusion when deriving the backpropagation formulas and could not find a systematic derivation online. After picking up some matrix-calculus techniques, it finally became clear. Let us start from the simplest case: logistic regression (a single-layer neural network).
Gradient descent in logistic regression
Logistic regression with a single training sample
The input training sample is \(x\), and the network parameters are \(w\) and \(b\), where \(x\) is a column vector of dimension \((n_0,1)\), \(w\) is a row vector of dimension \((1,n_0)\), and \(b\) is a scalar. The network output is $$a = \sigma(z),\quad z = wx + b$$ where \(\sigma()\) is the sigmoid function, defined as $$\sigma(x) = \frac{1}{1+e^{-x}}$$
The loss function of the network is defined as $$l(a) = -(y\log a+(1-y)\log(1-a))$$
where \(y\) is the training label; for logistic regression, \(y \in \{0,1\}\).
1. First, solve for \(\frac{\partial l}{\partial z}\):
Since \(\frac{\partial l}{\partial a}=-(\frac{y}{a}-\frac{1-y}{1-a})\) and \(\sigma^{'}(z)=\sigma(z)(1-\sigma(z))=a(1-a)\), the scalar chain rule gives $$\frac{\partial l}{\partial z} = \frac{\partial l}{\partial a}\frac{\partial a}{\partial z} = a-y$$
2. Next, solve for \(\frac{\partial l}{\partial w}\):
Since \(w\) is a row vector, this is the derivative of a scalar with respect to a vector. It can be computed component by component from the definition, i.e. $$\frac{\partial l}{\partial w} = [\frac{\partial l}{\partial w_1},\frac{\partial l}{\partial w_2},...,\frac{\partial l}{\partial w_{n_0}}]$$ Of course, one could apply the scalar chain rule here, substituting \(\frac{\partial l}{\partial w_i} = \frac{\partial l}{\partial z}\frac{\partial z}{\partial w_i}\) for each component.
However, to stay consistent with the vectorized implementation and the two-layer derivation later, I will compute it with matrix-calculus rules, even though that is using a sledgehammer to crack a nut here. One point must be made clear first: the scalar chain rule does not apply to vectors and cannot simply be carried over by analogy. I made exactly this mistake and was completely stuck when deriving the formulas on my own. Matrix calculus does, however, have its own analogue of the chain rule; the formula is given directly below:
$$dl = tr(\frac{\partial l}{\partial W}^{T}dW)$$
Here \(dl\) is the differential of the scalar \(l\), \(W\) is a matrix, and \(tr\) is the trace operator. If \(dl\) can be brought into the form \(dl = tr(A^{T}dW)\), then the factor \(A\) in front of \(dW\) is precisely the derivative of the scalar \(l\) with respect to the matrix \(W\). A simple example:
\(f = a^{T}Xb\), where \(f\) is a scalar, \(a,b\) are column vectors, and \(X\) is a matrix; find \(\frac{\partial f}{\partial X}\). The solution proceeds as follows: $$df = d(a^{T}Xb) = a^{T}dXb = tr(a^{T}dXb) = tr(ba^{T}dX) = tr((ab^{T})^{T}dX)$$
Comparing with the formula above gives \(\frac{\partial f}{\partial X} = ab^{T}\). The derivation uses matrix-differential identities such as \(d(XY) = dXY+XdY\) as well as trace tricks such as the cyclic property \(tr(ABC) = tr(CAB) = tr(BCA)\); for a more detailed treatment of matrix calculus, see the posts by the blogger 叠加态的猫. Applying the same procedure to our problem: since \(dz = dw\,x\), we get \(dl = \frac{\partial l}{\partial z}dz = tr(x\frac{\partial l}{\partial z}dw) = tr((\frac{\partial l}{\partial z}x^{T})^{T}dw)\), hence \(\frac{\partial l}{\partial w} = \frac{\partial l}{\partial z}x^{T} = (a-y)x^{T}\). A numerical check follows below.
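To make the example concrete, here is a small numerical check in NumPy (a minimal sketch; the dimensions 4 and 3 and the random data are arbitrary choices) that the finite-difference gradient of \(f = a^{T}Xb\) indeed matches \(ab^{T}\):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 1))
b = rng.standard_normal((3, 1))
X = rng.standard_normal((4, 3))

def f(X):
    return (a.T @ X @ b).item()  # f = a^T X b, a scalar

# Central finite differences with respect to each entry of X
eps = 1e-6
num_grad = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        num_grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

assert np.allclose(num_grad, a @ b.T, atol=1e-5)  # matches df/dX = a b^T
```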
3. Next, solve for \(\frac{\partial l}{\partial b}\):
Since \(b\) is a scalar, it follows immediately that \(\frac{\partial l}{\partial b} = \frac{\partial l}{\partial z}\). A numerical check of all three results is sketched after this list.
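The following NumPy sketch verifies the three derived gradients by finite differences; the dimension \(n_0=5\), the random parameters, and the label \(y=1\) are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n0 = 5
x = rng.standard_normal((n0, 1))   # column vector, shape (n0, 1)
w = rng.standard_normal((1, n0))   # row vector, shape (1, n0)
b, y = 0.0, 1.0

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(w, b):
    a = sigmoid((w @ x).item() + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# analytic gradients from the derivation
z = (w @ x).item() + b
a = sigmoid(z)
dz = a - y          # dl/dz = a - y
dw = dz * x.T       # dl/dw = dl/dz * x^T, same shape as w
db = dz             # dl/db = dl/dz

# finite-difference check on the first weight and on b
eps = 1e-6
w_pert = w.copy(); w_pert[0, 0] += eps
assert abs((loss(w_pert, b) - loss(w, b)) / eps - dw[0, 0]) < 1e-4
assert abs((loss(w, b + eps) - loss(w, b)) / eps - db) < 1e-4
```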
Vectorized logistic regression with m training samples
The input is now a batch of training samples \(X\), a matrix of dimension \((n_{0},m)\). The network parameters are \(\boldsymbol{w}\) and \(b\), where \(\boldsymbol{w}\) is a row vector of dimension \((1,n_0)\) and \(\boldsymbol{b}=\overrightarrow{1}^{T}b\). The network output is $$\boldsymbol{a} = \sigma(\boldsymbol{z}),\quad\boldsymbol{z} = \boldsymbol{w}X + \boldsymbol{b}$$
\(\boldsymbol{z},\boldsymbol{a}\) are both row vectors of dimension \((1,m)\). The cost function is defined as: $$J = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})\right)$$
It can also be written in matrix form: $$J = -\frac{1}{m}\left(\boldsymbol{y}(\log\boldsymbol{a})^{T}+(\overrightarrow{1}^{T}-\boldsymbol{y})(\log(\overrightarrow{1}^{T}-\boldsymbol{a}))^{T}\right)$$
where \(\boldsymbol{y}\) is the \((1,m)\) row vector of labels, \(\overrightarrow{1}\) is the all-ones column vector (so \(\overrightarrow{1}^{T}\) is the all-ones row vector), and \(\log\) is applied elementwise.
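As a sanity check, the summation form and the matrix form can be compared numerically (a sketch with made-up predictions and labels; \(m=8\) is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 8
a = rng.uniform(0.05, 0.95, size=(1, m))           # predictions in (0, 1)
y = rng.integers(0, 2, size=(1, m)).astype(float)  # 0/1 labels
ones = np.ones((1, m))                             # the all-ones row vector

J_sum = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
J_mat = (-(y @ np.log(a).T + (ones - y) @ np.log(ones - a).T) / m).item()
assert np.isclose(J_sum, J_mat)
```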
1. First, solve for \(\frac{\partial J}{\partial \boldsymbol{z}}\):
Whether from the definition of the derivative of a scalar with respect to a vector, or from the matrix "chain rule" above, one obtains: $$\frac{\partial J}{\partial \boldsymbol{z}} = \frac{1}{m}(\boldsymbol{a}-\boldsymbol{y})$$
Note that this derivative of \(J\) with respect to \(\boldsymbol{z}\) differs slightly from Andrew Ng's result: it carries an extra factor of \(\frac{1}{m}\). In my view, following the differentiation rules strictly, the \(\frac{1}{m}\) should be there; since Andrew Ng attaches the \(\frac{1}{m}\) to \(dw\) and \(db\) instead, the final iteration is unaffected.
2. Next, solve for \(\frac{\partial J}{\partial \boldsymbol{w}}\):
Given that \(dJ=tr(\frac{\partial J}{\partial \boldsymbol{z}}^{T}d\boldsymbol{z})\), substituting \(d\boldsymbol{z}=d\boldsymbol{w}X\) yields: $$dJ = tr(\frac{\partial J}{\partial \boldsymbol{z}}^{T}d\boldsymbol{w}X) = tr(X\frac{\partial J}{\partial \boldsymbol{z}}^{T}d\boldsymbol{w}) = tr((\frac{\partial J}{\partial \boldsymbol{z}}X^{T})^{T}d\boldsymbol{w})$$
Therefore, \(\frac{\partial J}{\partial \boldsymbol{w}}=\frac{\partial J}{\partial \boldsymbol{z}}X^{T}\).
3. Next, solve for \(\frac{\partial J}{\partial b}\):
Since \(d\boldsymbol{z}=\overrightarrow{1}^{T}db\), we get \(dJ = tr(\frac{\partial J}{\partial \boldsymbol{z}}^{T}\overrightarrow{1}^{T}db) = (\frac{\partial J}{\partial \boldsymbol{z}}\overrightarrow{1})db\).
Therefore, \(\frac{\partial J}{\partial b}=\frac{\partial J}{\partial \boldsymbol{z}}\overrightarrow{1}\). The sketch below puts the three vectorized gradients together in one gradient-descent step.
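A minimal vectorized sketch in NumPy, assuming random data and an arbitrary learning rate of 0.1:

```python
import numpy as np

rng = np.random.default_rng(3)
n0, m = 5, 100
X = rng.standard_normal((n0, m))              # inputs, shape (n0, m)
Y = (rng.random((1, m)) > 0.5).astype(float)  # labels, shape (1, m)
w = np.zeros((1, n0))
b = 0.0
lr = 0.1

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

Z = w @ X + b        # (1, m); the scalar b broadcasts like 1^T b
A = sigmoid(Z)
dZ = (A - Y) / m     # dJ/dz = (a - y)/m, the 1/m kept here as derived
dw = dZ @ X.T        # dJ/dw = dJ/dz X^T, shape (1, n0)
db = dZ.sum()        # dJ/db = dJ/dz 1

w -= lr * dw
b -= lr * db
```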
Gradient descent in a two-layer neural network
The network's input, hidden, and output layers have \(n_0\), \(n_1\), and \(n_2=1\) units respectively. The hidden-layer activation is \(g()\), with parameters \(W_1,\boldsymbol{b_1}\), where \(W_1\) is a matrix of dimension \((n_1,n_0)\) and \(\boldsymbol{b_1}\) is a column vector of dimension \((n_1,1)\). The output-layer activation is the sigmoid function, with parameters \(\boldsymbol{w_2},b_2\), where \(\boldsymbol{w_2}\) is a row vector of dimension \((1,n_1)\) and \(b_2\) is a scalar.
Derivation for a single training sample
Given input \(\boldsymbol{x}\), the forward pass of the network is: $$\boldsymbol{z_1} = W_1\boldsymbol{x} + \boldsymbol{b_1},\quad\boldsymbol{a_1} = g(\boldsymbol{z_1})$$ $$z_2 = \boldsymbol{w_2}\boldsymbol{a_1} + b_2,\quad a_2 = \sigma(z_2)$$
The loss function is defined as in logistic regression.
1. First, solve for \(\frac{\partial l}{\partial z_2}\):
Following the same steps as in logistic regression, \(\frac{\partial l}{\partial z_2}=a_2-y\).
2. Next, solve for \(\frac{\partial l}{\partial \boldsymbol{w_2}}\):
Following the same steps as in logistic regression, \(\frac{\partial l}{\partial \boldsymbol{w_2}}=\frac{\partial l}{\partial z_2}\boldsymbol{a_1}^{T}\).
3. In the same way, \(\frac{\partial l}{\partial b_2}=\frac{\partial l}{\partial z_2}\).
4. Solve for \(\frac{\partial l}{\partial \boldsymbol{z_1}}\):
Since \(dz_2 = \boldsymbol{w_2}d\boldsymbol{a_1}\) and \(d\boldsymbol{a_1} = g^{'}(\boldsymbol{z_1})*d\boldsymbol{z_1}\), we get $$dl = \frac{\partial l}{\partial z_2}dz_2 = tr((\boldsymbol{w_2}^{T}\frac{\partial l}{\partial z_2})^{T}(g^{'}(\boldsymbol{z_1})*d\boldsymbol{z_1})) = tr((\boldsymbol{w_2}^{T}\frac{\partial l}{\partial z_2}*g^{'}(\boldsymbol{z_1}))^{T}d\boldsymbol{z_1})$$
Therefore, \(\frac{\partial l}{\partial \boldsymbol{z_1}}=\boldsymbol{w_2}^{T}\frac{\partial l}{\partial z_2}*g^{'}(\boldsymbol{z_1})\), where \(*\) denotes elementwise multiplication. The last step uses the trace identity \(tr(A^{T}(B*C))=tr((A*B)^{T}C)\), which holds because both sides equal \(\sum_{i,j}A_{ij}B_{ij}C_{ij}\).
5. Solve for \(\frac{\partial l}{\partial W_1}\):
Since \(d\boldsymbol{z_1} = dW_1\boldsymbol{x}\), we get \(dl = tr(\frac{\partial l}{\partial \boldsymbol{z_1}}^{T}dW_1\boldsymbol{x}) = tr((\frac{\partial l}{\partial \boldsymbol{z_1}}\boldsymbol{x}^{T})^{T}dW_1)\).
Therefore, \(\frac{\partial l}{\partial W_1}=\frac{\partial l}{\partial \boldsymbol{z_1}}\boldsymbol{x}^{T}\).
6. Solve for \(\frac{\partial l}{\partial \boldsymbol{b_1}}\):
Since \(d\boldsymbol{z_1} = d\boldsymbol{b_1}\) when only \(\boldsymbol{b_1}\) varies, \(\frac{\partial l}{\partial \boldsymbol{b_1}}=\frac{\partial l}{\partial \boldsymbol{z_1}}\).
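These six single-sample results translate directly into code. Below is a sketch that assumes tanh as the hidden activation \(g\) (one possible choice, so that \(g^{'}(\boldsymbol{z_1}) = 1-\tanh^{2}(\boldsymbol{z_1}) = 1-\boldsymbol{a_1}^{2}\)); the dimensions and random parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
n0, n1 = 4, 3
x = rng.standard_normal((n0, 1))
y = 1.0
W1 = rng.standard_normal((n1, n0))
b1 = np.zeros((n1, 1))
w2 = rng.standard_normal((1, n1))
b2 = 0.0

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# forward pass
z1 = W1 @ x + b1              # (n1, 1)
a1 = np.tanh(z1)              # a1 = g(z1), with g = tanh (an assumption)
z2 = (w2 @ a1).item() + b2    # scalar
a2 = sigmoid(z2)

# backward pass, mirroring steps 1-6
dz2 = a2 - y                        # dl/dz2
dw2 = dz2 * a1.T                    # dl/dw2 = dl/dz2 * a1^T
db2 = dz2                           # dl/db2
dz1 = (w2.T * dz2) * (1 - a1**2)    # dl/dz1 = w2^T dl/dz2 * g'(z1)
dW1 = dz1 @ x.T                     # dl/dW1 = dl/dz1 x^T
db1 = dz1                           # dl/db1
```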
Derivation of the vectorized implementation for m training samples
Given input \(X\), a matrix of dimension \((n_0,m)\), the forward pass of the network is: $$Z_1 = W_1X + \boldsymbol{b_1}\overrightarrow{1}^{T},\quad A_1 = g(Z_1)$$ $$\boldsymbol{z_2} = \boldsymbol{w_2}A_1 + b_2\overrightarrow{1}^{T},\quad\boldsymbol{a_2} = \sigma(\boldsymbol{z_2})$$
1. First, solve for \(\frac{\partial J}{\partial \boldsymbol{z_2}}\):
Following the same method as in logistic regression, \(\frac{\partial J}{\partial \boldsymbol{z_2}}=\frac{1}{m}(\boldsymbol{a_2}-\boldsymbol{Y})\).
2. Next, solve for \(\frac{\partial J}{\partial \boldsymbol{w_2}}\): as in logistic regression, \(\frac{\partial J}{\partial \boldsymbol{w_2}}=\frac{\partial J}{\partial \boldsymbol{z_2}}A_1^{T}\).
3. Next, solve for \(\frac{\partial J}{\partial b_2}\): as in logistic regression, \(\frac{\partial J}{\partial b_2}=\frac{\partial J}{\partial \boldsymbol{z_2}}\overrightarrow{1}\).
4. Next, solve for \(\frac{\partial J}{\partial Z_1}\):
Following the single-sample derivation, \(\frac{\partial J}{\partial Z_1}=\boldsymbol{w_2}^{T}\frac{\partial J}{\partial \boldsymbol{z_2}}*g^{'}(Z_1)\).
5. Solve for \(\frac{\partial J}{\partial W_1}\):
Following the single-sample derivation, \(\frac{\partial J}{\partial W_1}=\frac{\partial J}{\partial Z_1}X^{T}\).
6. Solve for \(\frac{\partial J}{\partial \boldsymbol{b_1}}\):
Therefore, \(\frac{\partial J}{\partial \boldsymbol{b_1}}=\frac{\partial J}{\partial Z_1}\overrightarrow{1}\). These six results assemble into one full training step, sketched below.
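A vectorized sketch of one gradient-descent step for the two-layer network, again assuming tanh for \(g\) and using arbitrary dimensions, random data, and learning rate:

```python
import numpy as np

rng = np.random.default_rng(5)
n0, n1, m = 4, 3, 200
X = rng.standard_normal((n0, m))
Y = (rng.random((1, m)) > 0.5).astype(float)
W1 = 0.01 * rng.standard_normal((n1, n0))
b1 = np.zeros((n1, 1))
w2 = 0.01 * rng.standard_normal((1, n1))
b2 = 0.0
lr = 0.5

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# forward pass; b1 and b2 broadcast across the m columns
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)                     # A1 = g(Z1), with g = tanh (an assumption)
Z2 = w2 @ A1 + b2                    # (1, m)
A2 = sigmoid(Z2)

# backward pass, mirroring the six results above
dZ2 = (A2 - Y) / m                   # dJ/dz2
dw2 = dZ2 @ A1.T                     # dJ/dw2 = dJ/dz2 A1^T
db2 = dZ2.sum()                      # dJ/db2 = dJ/dz2 1
dZ1 = (w2.T @ dZ2) * (1 - A1**2)     # dJ/dZ1 = w2^T dJ/dz2 * g'(Z1)
dW1 = dZ1 @ X.T                      # dJ/dW1 = dJ/dZ1 X^T
db1 = dZ1 @ np.ones((m, 1))          # dJ/db1 = dJ/dZ1 1, shape (n1, 1)

# gradient-descent update
W1 -= lr * dW1; b1 -= lr * db1
w2 -= lr * dw2; b2 -= lr * db2
```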