Machine Learning Open Course Notes, Week 4: Neural Networks
When we express a logistic-regression hypothesis as a nonlinear function of the features, a degree-k polynomial in n variables, the cost of the algorithm grows as \( O\left ( n^{k} \right ) \).
When n and k are large, logistic regression therefore becomes very inefficient; neural networks were introduced precisely to handle such nonlinear classification problems.
Neural networks are inspired by the neurons of the human brain: the same kind of neural tissue, placed in different regions of the brain, can learn different abilities such as hearing, touch, vision, and so on.
- A neural network behaves like logistic regression applied repeatedly: the output of one round of logistic units becomes the input of the next round.
- A neural network has three kinds of layers: the first layer is the input layer, the middle layers are hidden layers, and the last layer is the output layer.
- There can be more than one hidden layer; the number is chosen as needed. More hidden layers give a more accurate fit, but also more computation and lower efficiency.
- The i-th unit of layer j is denoted \(a_{i}^{(j)}\); the parameter matrix \(\Theta^{(j)}\) holds the weights of the mapping from layer j to layer j+1.
- If layer j has \(s_j\) units and layer j+1 has \(s_{j+1}\) units, then the weight matrix \(\Theta^{(j)}\) has dimension \(s_{j+1}\times (s_{j} + 1)\) (see the sketch below).
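As an illustration of this dimension rule, here is a minimal numpy sketch; the layer sizes and the small random-initialization range are arbitrary choices for illustration, not from the notes:

```python
import numpy as np

# Hypothetical layer sizes: 3 inputs, one hidden layer of 5 units, 4 outputs.
layer_sizes = [3, 5, 4]                       # s_1, s_2, s_3

# Theta^(j) maps layer j to layer j+1 and has shape s_{j+1} x (s_j + 1);
# the extra column holds the weights for the bias unit a_0 = 1.
rng = np.random.default_rng(0)
Theta = [rng.uniform(-0.12, 0.12, size=(layer_sizes[j + 1], layer_sizes[j] + 1))
         for j in range(len(layer_sizes) - 1)]

for j, T in enumerate(Theta, start=1):
    print(f"Theta({j}) shape: {T.shape}")     # (5, 4) and (4, 6)
```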
Neural networks for logical operations
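As a concrete example of such a network, a single sigmoid unit with hand-picked weights can compute the logical AND of two binary inputs. The weights below are one classic choice; this is a sketch for illustration, not reproduced from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# g(-30 + 20*x1 + 20*x2) is close to 0 unless both inputs are 1,
# so this single unit behaves like x1 AND x2.
Theta = np.array([-30.0, 20.0, 20.0])

for x1 in (0, 1):
    for x2 in (0, 1):
        a = np.array([1.0, x1, x2])                       # bias unit a_0 = 1
        print(x1, x2, round(float(sigmoid(Theta @ a))))   # prints the AND truth table
```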
For multi-class classification, the number of units in the output layer equals the number of classes (e.g. 4 classes need 4 output units, with the label y encoded as a one-hot vector).
2. How to understand neural networks
The cost function of the neural network, a generalization of the logistic-regression cost to K output units, is
\(J(\Theta) = - \frac{1}{m} \sum_{t=1}^m \sum_{k=1}^K \left[ y^{(t)}_k \ \log (h_\Theta (x^{(t)}))_k + (1 - y^{(t)}_k)\ \log (1 - h_\Theta(x^{(t)})_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2 \)
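Below is a minimal numpy sketch of this cost function. The function name `nn_cost`, the list-of-matrices representation of \(\Theta\), and the one-hot label matrix Y are my own choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Thetas, X, Y, lam):
    """Regularized neural-network cost J(Theta).

    Thetas : list of weight matrices, Thetas[l] has shape (s_{l+1}, s_l + 1)
    X      : (m, n) inputs;  Y : (m, K) one-hot labels;  lam : lambda
    """
    m = X.shape[0]
    A = X
    for Theta in Thetas:                        # forward propagation
        A = np.hstack([np.ones((m, 1)), A])     # add bias unit a_0 = 1
        A = sigmoid(A @ Theta.T)
    H = A                                       # h_Theta(x), shape (m, K)

    # Cross-entropy term, summed over examples t and output units k
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Regularization term: every weight except the bias column (j = 0)
    reg = (lam / (2 * m)) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return cost + reg
```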
We want to find the \(\Theta\) that minimizes the cost function \( J(\Theta) \).
Since the cost function depends on every parameter \(\Theta\), this is very similar to logistic regression, so we can use the same gradient-descent method (a minimal code sketch follows the steps below):
1. Initialize \(\Theta\).
2. Compute the partial derivative of the cost \(J(\Theta)\) with respect to every parameter: \( \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} \).
3. Update each \(\Theta_{ij}^{(l)}\) by subtracting \( \alpha \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} \), where \(\alpha\) is the learning rate.
4. Repeat steps 2-3 until \( \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} \) is approximately 0.
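These four steps are just the generic gradient-descent loop. A minimal sketch, where the gradient function `grad_J`, the learning rate `alpha`, and the stopping tolerance are hypothetical placeholders:

```python
import numpy as np

def gradient_descent(grad_J, theta_init, alpha=0.1, tol=1e-6, max_iter=10000):
    theta = theta_init.copy()              # step 1: initialize Theta
    for _ in range(max_iter):
        grad = grad_J(theta)               # step 2: compute the partial derivatives
        theta -= alpha * grad              # step 3: step against the gradient
        if np.max(np.abs(grad)) < tol:     # step 4: stop when the gradient is ~0
            break
    return theta
```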
Suppose we have a neural network with 2 units in the input layer, a single hidden layer of 3 units, and 3 units in the output layer.
In a neural network, computing \(h_{\Theta}(X)\) is called forward propagation, and computing \( \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} \) is called backpropagation.
By the chain rule, \( \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} = \frac{\partial J(\Theta)}{\partial h_{\Theta}(X)} \frac{\partial h_{\Theta}(X)}{\partial \Theta_{ij}^{(l)}} \)
For now, ignore the regularization term and consider a single training example. Then
\( \frac{\partial J(\Theta)}{\partial h_{\Theta}(X)} = \frac{\partial \left(- \sum_{k=1}^K \left[ y_k \log h_\Theta (x)_k + (1 - y_k) \log (1 - h_\Theta(x)_k)\right]\right) }{\partial h_{\Theta}(X)} = -\sum_{k=1}^K \frac{\partial \left[y_k \log h_\Theta (x)_k + (1 - y_k) \log (1 - h_\Theta(x)_k)\right]}{\partial h_\Theta (x)_k} = \sum_{k=1}^K \left[\frac{ -y_k }{h_\Theta (x)_k} + \frac{1 - y_k} {1 - h_\Theta(x)_k}\right] = \sum_{k=1}^K \left[\frac{h_\Theta (x)_k - y_k}{h_\Theta (x)_k (1 - h_\Theta (x)_k)}\right]\)
To compute \( \frac{\partial h_{\Theta}(X)}{\partial \Theta_{ij}^{(l)}}\), we first need to know how \( h_{\Theta}(X)\) depends on \(\Theta_{ij}^{(l)}\).
Apply forward propagation to this network; the computation goes as follows (a runnable sketch appears after the equations):
\( a^{(1)} = X\)
\(z^{(2)} = \Theta^{\left ( 1 \right )} a^{(1)}\)
- \(z_{1}^{(2)} = \Theta_{10}^{\left ( 1 \right )} a_{0}^{(1)} + \Theta_{11}^{\left ( 1 \right )} a_{1}^{(1)} + \Theta_{12}^{\left ( 1 \right )} a_{2}^{(1)} \)
- \(z_{2}^{(2)} = \Theta_{20}^{\left ( 1 \right )} a_{0}^{(1)} + \Theta_{21}^{\left ( 1 \right )} a_{1}^{(1)} + \Theta_{22}^{\left ( 1 \right )} a_{2}^{(1)} \)
- \(z_{3}^{(2)} = \Theta_{30}^{\left ( 1 \right )} a_{0}^{(1)} + \Theta_{31}^{\left ( 1 \right )} a_{1}^{(1)} + \Theta_{32}^{\left ( 1 \right )} a_{2}^{(1)} \)
\(a^{(2)} = g(z^{(2)})\)
\(z^{(3)} = \Theta^{\left ( 2 \right )} a^{(2)} \)
- \(z_{1}^{(3)} = \Theta_{10}^{\left ( 2 \right )} a_{0}^{(2)} + \Theta_{11}^{\left ( 2 \right )} a_{1}^{(2)} + \Theta_{12}^{\left ( 2 \right )} a_{2}^{(2)} + \Theta_{13}^{\left ( 2 \right )} a_{3}^{(2)} \)
- \(z_{2}^{(3)} = \Theta_{20}^{\left ( 2 \right )} a_{0}^{(2)} + \Theta_{21}^{\left ( 2 \right )} a_{1}^{(2)} + \Theta_{22}^{\left ( 2 \right )} a_{2}^{(2)} + \Theta_{23}^{\left ( 2 \right )} a_{3}^{(2)} \)
- \(z_{3}^{(3)} = \Theta_{30}^{\left ( 2 \right )} a_{0}^{(2)} + \Theta_{31}^{\left ( 2 \right )} a_{1}^{(2)} + \Theta_{32}^{\left ( 2 \right )} a_{2}^{(2)} + \Theta_{33}^{\left ( 2 \right )} a_{3}^{(2)} \)
\(h_{\Theta}(X) = a^{(3)} = g(z^{(3)})\)
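The same computation in a minimal numpy sketch for this 2-3-3 network (the function name and the convention of returning every intermediate quantity are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(Theta1, Theta2, x):
    """Forward propagation for the 2-3-3 network above.

    Theta1 : (3, 3) weights from layer 1 to layer 2 (2 inputs + bias)
    Theta2 : (3, 4) weights from layer 2 to layer 3 (3 hidden units + bias)
    x      : (2,) input vector
    """
    a1 = np.concatenate(([1.0], x))              # a^(1), with bias unit a_0 = 1
    z2 = Theta1 @ a1                             # z^(2) = Theta^(1) a^(1)
    a2 = np.concatenate(([1.0], sigmoid(z2)))    # a^(2) = g(z^(2)), plus bias
    z3 = Theta2 @ a2                             # z^(3) = Theta^(2) a^(2)
    a3 = sigmoid(z3)                             # h_Theta(x) = a^(3) = g(z^(3))
    return a1, z2, a2, z3, a3
```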
Forward propagation shows that the formula differs depending on the layer l, so we first work out the case l = 2.
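The chain-rule steps below repeatedly use the derivative of the sigmoid activation \(g(z) = \frac{1}{1+e^{-z}}\) (the activation assumed throughout, as in logistic regression):
\( g'(z) = \frac{d}{dz}\,\frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^{2}} = g(z)\,(1 - g(z)) \)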
\( \sum_{k = 1}^{K} \frac{\partial h_{\Theta}(X)_k}{\partial \Theta_{ij}^{(2)}} = \sum_{k = 1}^{K} \left[\frac{\partial h_{\Theta}(X)_k}{\partial z_k^{(3)}} \frac{\partial z_k^{(3)}}{\partial \Theta_{ij}^{(2)}}\right] = \sum_{k = 1}^{K} \frac{\partial g(z_k^{(3)})}{\partial z_k^{(3)}} \frac{\partial [\Theta_{k0}^{\left ( 2 \right )} a_{0}^{(2)} + \Theta_{k1}^{\left ( 2 \right )} a_{1}^{(2)} + \Theta_{k2}^{\left ( 2 \right )} a_{2}^{(2)} + \Theta_{k3}^{\left ( 2 \right )} a_{3}^{(2)}]}{\partial \Theta_{ij}^{(2)}} = (1 - g(z_i^{(3)}))\,g(z_i^{(3)})\,a_j^{(2)} \)
When k ≠ i, \( \frac{\partial [\Theta_{k0}^{\left ( 2 \right )} a_{0}^{(2)} + \Theta_{k1}^{\left ( 2 \right )} a_{1}^{(2)} + \Theta_{k2}^{\left ( 2 \right )} a_{2}^{(2)} + \Theta_{k3}^{\left ( 2 \right )} a_{3}^{(2)}]}{\partial \Theta_{ij}^{(2)}} = 0\), so only the k = i term survives in the sum above.
Combining the two factors, \( \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(2)}} = \sum_{k = 1}^{K} \left[\frac{\partial J(\Theta)}{\partial h_{\Theta}(X)_k} \frac{\partial h_{\Theta}(X)_k}{\partial z_k^{(3)}} \frac{\partial z_k^{(3)}}{\partial \Theta_{ij}^{(2)}}\right] = \sum_{k=1}^K \left[\frac{h_\Theta (x)_k - y_k}{h_\Theta (x)_k (1 - h_\Theta (x)_k)} \frac{\partial g(z_k^{(3)})}{\partial z_k^{(3)}} \frac{\partial [\Theta_{k0}^{\left ( 2 \right )} a_{0}^{(2)} + \Theta_{k1}^{\left ( 2 \right )} a_{1}^{(2)} + \Theta_{k2}^{\left ( 2 \right )} a_{2}^{(2)} + \Theta_{k3}^{\left ( 2 \right )} a_{3}^{(2)}]}{\partial \Theta_{ij}^{(2)}}\right] = \sum_{k=1}^K \left[\frac{g(z_k^{(3)}) - y_k}{g(z_k^{(3)}) (1 - g(z_k^{(3)}))}(1 - g(z_k^{(3)}))g(z_k^{(3)}) \frac{\partial [\Theta_{k0}^{\left ( 2 \right )} a_{0}^{(2)} + \Theta_{k1}^{\left ( 2 \right )} a_{1}^{(2)} + \Theta_{k2}^{\left ( 2 \right )} a_{2}^{(2)} + \Theta_{k3}^{\left ( 2 \right )} a_{3}^{(2)}]}{\partial \Theta_{ij}^{(2)}} \right] = (g(z_i^{(3)}) - y_i)\,a_j^{(2)} \)
Define \( \Delta^{(l)} = \frac{\partial J(\Theta)}{\partial \Theta^{(l)}} \).
Then \(\Delta_{ij}^{(2)} = \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(2)}} = (g(z_i^{(3)}) - y_i)a_j^{(2)} = (a_i^{(3)} - y_i)a_j^{(2)}\)
The entry \(\Delta_{ij}^{(2)}\) in row i, column j of \(\Delta^{(2)}\) is the i-th entry of \(a^{(3)} - y\) times the j-th entry of \(a^{(2)}\), so \(\Delta^{(2)} = (a^{(3)} - y) * (a^{(2)})^T\) (where * denotes matrix multiplication).
Next we compute \( \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(1)}} \).
To avoid repeating work, define \(\delta^{(3)} = a^{(3)} - y\).
\(\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(1)}} = \frac{\partial J(\Theta)}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial \Theta_{ij}^{(1)}} = \sum_{k=1}^{K}\left[\frac{\partial J(\Theta)}{\partial a_k^{(3)}}\frac{\partial a_k^{(3)}}{\partial z_k^{(3)}}\frac{\partial z_k^{(3)}}{\partial a_i^{(2)}} \frac{\partial a_i^{(2)}}{\partial z_i^{(2)}} \frac{\partial z_i^{(2)}}{\partial \Theta_{ij}^{(1)}}\right] = \sum_{k=1}^{K}\left[(a_k^{(3)}-y_k)\frac{ \partial [\Theta_{k0}^{\left ( 2 \right )} a_{0}^{(2)} + \Theta_{k1}^{\left ( 2 \right )} a_{1}^{(2)} + \Theta_{k2}^{\left ( 2 \right )} a_{2}^{(2)} + \Theta_{k3}^{\left ( 2 \right )} a_{3}^{(2)}]}{\partial a_{i}^{(2)}} \frac{\partial a_{i}^{(2)}}{\partial z_{i}^{(2)}} \frac{\partial z_{i}^{(2)}}{\partial \Theta_{ij}^{(1)}}\right] = \sum_{k=1}^{K}\left[(a_k^{(3)}-y_k)\Theta_{ki}^{(2)}\, g(z_i^{(2)})(1 - g(z_i^{(2)}))\,a_j^{(1)}\right] = g(z_i^{(2)})(1 - g(z_i^{(2)}))\,a_j^{(1)} \sum_{k=1}^{K}\left[(a_k^{(3)}-y_k)\Theta_{ki}^{(2)}\right] = \left[((\Theta^{(2)})^T)_{i} * \delta^{(3)}\right]g(z_i^{(2)})(1 - g(z_i^{(2)}))\,a_j^{(1)}\)
\(\Delta_{ij}^{(1)} = (((\Theta^{(2)})^T)_{i} * \delta^{(3)})g(z_i^{(2)})(1 - g(z_i^{(2)}))a_j^{(1)} = (((\Theta^{(2)})^T)_{i} * \delta^{(3)})a_i^{(2)}(1 - a_i^{(2)})a_j^{(1)}\)
The entry \(\Delta_{ij}^{(1)}\) in row i, column j of \(\Delta^{(1)}\) is the i-th entry of \([(\Theta^{(2)})^T * \delta^{(3)}]\,a^{(2)}(1 - a^{(2)})\) times the j-th entry of \(a^{(1)}\) (the multiplication by \(a^{(2)}(1 - a^{(2)})\) is element-wise).
To avoid repeated computation, define \(\delta^{(2)} = ((\Theta^{(2)})^T * \delta^{(3)})\,a^{(2)}(1 - a^{(2)}) \), with the same element-wise convention.
\(\Delta^{(1)} = \delta^{(2)} * (a^{(1)})^T \)
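Putting these results together for the example network, here is a minimal sketch of backpropagation for a single training example without regularization (reusing `np`, `sigmoid`, and `forward` from the forward-propagation sketch above):

```python
def backward(Theta1, Theta2, x, y):
    """delta and Delta terms for the 2-3-3 network, one example, no regularization."""
    a1, z2, a2, z3, a3 = forward(Theta1, Theta2, x)

    delta3 = a3 - y                                # delta^(3) = a^(3) - y
    # delta^(2) = ((Theta^(2))^T delta^(3)) .* a^(2) .* (1 - a^(2));
    # the bias component is dropped so the shape matches z^(2)
    delta2 = ((Theta2.T @ delta3) * a2 * (1 - a2))[1:]

    Delta2 = np.outer(delta3, a2)                  # Delta^(2) = delta^(3) (a^(2))^T
    Delta1 = np.outer(delta2, a1)                  # Delta^(1) = delta^(2) (a^(1))^T
    return Delta1, Delta2
```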
If there were one more layer in front of the input, so that a \(\Delta^{(0)}\) existed:
Looking back at the derivation of \(\Delta^{(1)}\), note that the factor \(\frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial \Theta_{ij}^{(1)}}\) does not depend on k.
\(\Delta_{ij}^{(0)} = \frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(0)}}=\sum_{k=1}^{K}\left[\frac{\partial J(\Theta)}{\partial a_k^{(3)}}\frac{\partial a_k^{(3)}}{\partial z_k^{(3)}}\frac{\partial z_k^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial z^{(1)}}\frac{\partial z^{(1)}}{\partial \Theta_{ij}^{(0)}}\right] = (((\Theta^{(1)})^T)_{i} * \delta^{(2)})\,a_i^{(1)}(1 - a_i^{(1)})\,a_j^{(0)} \)
(The derivation parallels that of \(\Delta^{(1)}\) and is omitted here.) Defining \(\delta^{(1)} = ((\Theta^{(1)})^T * \delta^{(2)})\,a^{(1)}(1 - a^{(1)})\) in the same element-wise sense, we get
\(\Delta^{(0)} = \delta^{(1)} * (a^{(0)})^T \)
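The pattern that emerges, and that the summary at the end applies to a general L-layer network, is, for each layer l:
\( \delta^{(l)} = ((\Theta^{(l)})^T * \delta^{(l+1)})\,a^{(l)}(1 - a^{(l)}), \qquad \Delta^{(l)} = \delta^{(l+1)} * (a^{(l)})^T \)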
We can also add gradient checking to verify that the computation of \(\frac \partial {\partial \Theta_{ij}^{(l)}} J(\Theta)\) is correct.
Gradient checking approximates each derivative directly from the definition of the derivative. With \(\theta\) denoting the parameters unrolled into a single vector, define
\(\theta^{(i+)}=\theta + \begin{bmatrix} 0\\ 0\\ ...\\ \epsilon \\ ...\\ 0 \end{bmatrix} \) ,\(\theta^{(i-)}=\theta - \begin{bmatrix} 0\\ 0\\ ...\\ \epsilon \\ ...\\ 0 \end{bmatrix} \)
By the definition of the partial derivative, \(f_i(\theta) \approx \frac{J(\theta^{(i+)}\ ) - J(\theta^{(i-)}\ )}{2\epsilon }\), with \(\epsilon \approx 10^{-4}\).
If \(f_i(\theta) \approx \frac \partial {\partial \Theta_{ij}^{(l)}} J(\Theta)\) for the corresponding parameter, the implementation is correct.
In practice, pick a small data set, run forward and backward propagation once and the definition-based computation once; if the two results are approximately equal, the neural network implementation is not buggy and training can proceed (a sketch follows).
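A minimal numpy sketch of gradient checking; here `J` is assumed to be a cost function taking the unrolled parameter vector, and `backprop_grad` in the usage comment is a hypothetical backpropagation gradient to compare against:

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Approximate dJ/dtheta_i by (J(theta + eps*e_i) - J(theta - eps*e_i)) / (2*eps)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

# Usage sketch: accept the backpropagation gradient if it matches the
# numerical approximation to within a small tolerance.
# assert np.allclose(numerical_gradient(J, theta), backprop_grad(theta), atol=1e-6)
```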
Summary (a runnable sketch follows the steps):
First run gradient checking against one pass of forward and backward propagation, to check that the forward/backward implementation is correct.
for i = 1 to num_iterations (the number of gradient-descent iterations, usually more than 10,000):
1. Set \(\Delta^{(l)} = 0\) for every layer l; then for each training example t = 1 to m:
1) Set \( a^{(1)} = x^{(t)} \)
2) Use forward propagation to compute \(a^{(l)}\) for \(l = 2,3,\ldots,L\)
3) Set \( \delta^{(L)} = a^{(L)} - y^{(t)} \)
4) Use backpropagation to compute \(\delta^{(L-1)},\delta^{(L-2)},\ldots,\delta^{(2)}\), using \(\delta^{(l)} =((\Theta^{(l)})^T * \delta^{(l + 1)})\,a^{(l)}(1 - a^{(l)}) \) (the product with \(a^{(l)}(1 - a^{(l)})\) is element-wise)
5) Accumulate \(\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + \delta_{i}^{(l + 1)} a_j^{(l)}\), i.e. \(\Delta^{(l)} := \Delta^{(l)} + \delta^{(l + 1)} * (a^{(l)})^T \)
2. Add regularization:
\(D^{(l)}_{i,j} := \dfrac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right) (j \neq 0)\)
\(D^{(l)}_{i,j} := \dfrac{1}{m} \Delta^{(l)}_{i,j} (j = 0)\)
3. \(\frac \partial {\partial \Theta_{ij}^{(l)}} J(\Theta) = D^{(l)}_{i,j}\)
4. \(\Theta_{i,j}^{(l)} := \Theta_{i,j}^{(l)} - \alpha D^{(l)}_{i,j}\), where \(\alpha\) is the learning rate
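Putting the whole summary together, here is one way the procedure could look in numpy, vectorized over the m training examples; the function names, the list-of-matrices layout, and the fixed learning rate `alpha` are my own choices, not from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(Thetas, X, Y, lam):
    """One forward + backward pass over all m examples; returns the gradients D^(l)."""
    m = X.shape[0]
    # Forward propagation, keeping every activation a^(l) (with bias units)
    activations, A = [], X
    for Theta in Thetas:
        A = np.hstack([np.ones((m, 1)), A])
        activations.append(A)
        A = sigmoid(A @ Theta.T)
    activations.append(A)                            # a^(L) = h_Theta(x)

    # Backward propagation: delta^(L) = a^(L) - y, then recurse down to delta^(2)
    deltas = [activations[-1] - Y]
    for l in range(len(Thetas) - 1, 0, -1):
        d = (deltas[0] @ Thetas[l]) * activations[l] * (1 - activations[l])
        deltas.insert(0, d[:, 1:])                   # drop the bias component

    # Accumulate Delta^(l) = delta^(l+1) (a^(l))^T and add regularization
    D = []
    for l, Theta in enumerate(Thetas):
        Delta = deltas[l].T @ activations[l]
        Delta /= m
        Delta[:, 1:] += (lam / m) * Theta[:, 1:]     # skip the bias column (j = 0)
        D.append(Delta)
    return D

def train(Thetas, X, Y, lam=1.0, alpha=0.5, iterations=10000):
    """Gradient descent using the gradients from backprop_step."""
    for _ in range(iterations):
        D = backprop_step(Thetas, X, Y, lam)
        Thetas = [Theta - alpha * Di for Theta, Di in zip(Thetas, D)]
    return Thetas
```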