title: Week 3 Shallow Neural Networks
date: 2017-10-18 09:58:33
tags:
- deep learning
- deeplearning.ai

*Neural Networks and Deep Learning -- Neural Network Basics*, by Andrew Ng

This week covers shallow neural networks, i.e. networks with a single hidden layer.
It goes through the representation of a neural network, vectorization, activation functions (the common ones and their derivatives), and finally gradient descent for neural networks (forward and backward propagation).
For the most part it builds on week 2, with some extensions.

3_2 Neural Network Representation

The simplest neural network consists of an input layer, one hidden layer, and an output layer. Since the input layer is conventionally not counted, this is called a 2-layer neural network.
\(a^{[0]}\) denotes the input layer, \(a^{[1]}\) the hidden layer, and \(a^{[2]}\) the output layer.
\(a_2^{[1]}\) denotes the 2nd unit of layer 1.
Here the input layer has 3 features, so \(\omega^{[1]}\) is (4, 3): the first dimension of \(\omega\) is the number of hidden units in the current layer, and the second is the number of inputs coming from the previous layer.

The same reasoning applies to \(\omega\) and \(b\) in every layer. Take \(\omega^{[1]}\) as an example: it is a 4 x 3 matrix. The 4 means there are 4 hidden units (\(a^{[1]}_1, a^{[1]}_2, a^{[1]}_3, a^{[1]}_4\)), each with its own set of weights, and the 3 means the input layer has 3 features, so each hidden unit needs 3 parameters.
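
As a concrete illustration of these shapes, here is a minimal sketch (the 3-4-1 layer sizes follow the lecture example; the variable names W1, b1, W2, b2 are my own):

```python
import numpy as np

# Layer sizes from the lecture example: 3 input features, 4 hidden units, 1 output
n_x, n_h, n_y = 3, 4, 1

W1 = np.random.randn(n_h, n_x) * 0.01  # (4, 3): one row of 3 weights per hidden unit
b1 = np.zeros((n_h, 1))                # (4, 1): one bias per hidden unit
W2 = np.random.randn(n_y, n_h) * 0.01  # (1, 4)
b2 = np.zeros((n_y, 1))                # (1, 1)

print(W1.shape, b1.shape, W2.shape, b2.shape)  # (4, 3) (4, 1) (1, 4) (1, 1)
```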

3_3 The Output of a Neural Network

Here you can see that although \(\omega^{[1]}\) is (4, 3), \(\omega_1^{[1]}\) is (3, 1). So what we actually compute is \(\omega_1^{[1]T}x\), i.e. a (1, 3) times a (3, 1).
The vectorized notation is essentially the same as for logistic regression earlier.

3_5 Explanation of the Vectorized Implementation

This lecture gives a detailed and intuitive explanation of the vectorized implementation.
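
As a sketch of that vectorized forward pass (assuming, as in the course, a tanh hidden layer and a sigmoid output, with the m training examples stacked as columns of X; the function names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, W1, b1, W2, b2):
    """X has shape (n_x, m): each column is one training example."""
    Z1 = W1 @ X + b1   # (n_h, m); b1 is broadcast across the m columns
    A1 = np.tanh(Z1)   # (n_h, m)
    Z2 = W2 @ A1 + b2  # (n_y, m)
    A2 = sigmoid(Z2)   # (n_y, m): one prediction per column / example
    return Z1, A1, Z2, A2
```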

3_6 Activation Functions

This lecture introduces the common activation functions. Aside from the final output, which still uses sigmoid as in logistic regression, tanh works better than sigmoid in most cases.
ReLU (rectified linear unit) is simple and effective.
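
The quiz below attributes tanh's advantage to its output being roughly zero-centred, while sigmoid's is not. A small numerical check of that point (my own sketch, not from the course):

```python
import numpy as np

z = np.linspace(-4, 4, 1001)       # a symmetric range of pre-activation values
sigmoid = 1.0 / (1.0 + np.exp(-z))
print(np.tanh(z).mean())           # ~0.0: tanh output is centred around zero
print(sigmoid.mean())              # ~0.5: sigmoid output is always positive
```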

3_7 Why Do We Need Non-linear Activation Functions?

\(\color{maroon}{\text{Why do we need non-linear activation functions?}}\)

Because each layer of a neural network is just a linear combination of the previous layer's outputs. If we also used a linear activation function, there would be no point in stacking layers: the whole network could be replaced by a single linear layer. The right-hand side of the slide explains this: with linear activations, the effective weight of the network is just \(\omega^{[2]} \times \omega^{[1]}\), so a one-layer network with \(\omega = \omega^{[2]} \times \omega^{[1]}\) computes exactly the same thing.
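
A quick numerical check of this collapse (a sketch with arbitrary shapes and values; the single linear layer uses \(W = W^{[2]}W^{[1]}\) and \(b = W^{[2]}b^{[1]} + b^{[2]}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

# Two layers with the identity (linear) "activation" ...
a2 = W2 @ (W1 @ x + b1) + b2

# ... collapse into a single linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
print(np.allclose(a2, W @ x + b))  # True
```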

3_8 Derivatives of Activation Functions (parts 1-3)
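
The standard derivatives from these lectures, written out as code (a sketch; each derivative is expressed in terms of the activation value where convenient):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# g(z) = sigmoid(z):  g'(z) = g(z) * (1 - g(z))
def sigmoid_derivative(z):
    a = sigmoid(z)
    return a * (1 - a)

# g(z) = tanh(z):  g'(z) = 1 - tanh(z)^2
def tanh_derivative(z):
    return 1 - np.tanh(z) ** 2

# g(z) = max(0, z) (ReLU):  g'(z) = 0 for z < 0 and 1 for z > 0
# (the derivative at z == 0 is undefined; using 0 or 1 there works fine in practice)
def relu_derivative(z):
    return (z > 0).astype(float)
```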

3_9 Gradient Descent for Neural Networks

Summary of backpropagation:

\(\frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } = \frac{1}{m} (a^{[2](i)} - y^{(i)})\)

\(\frac{\partial \mathcal{J} }{ \partial W_2 } = \frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } a^{[1] (i) T} \)

\(\frac{\partial \mathcal{J} }{ \partial b_2 } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)}}}\)

\(\frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)} } = W_2^T \frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } * ( 1 - (a^{[1] (i)})^{2}) \)

\(\frac{\partial \mathcal{J} }{ \partial W_1 } = \frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)} } X^T \)

\(\frac{\partial \mathcal{J} }{ \partial b_1 } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)}}}\)

  • Note that \(*\) denotes elementwise multiplication.
  • The notation you will use is common in deep learning coding:
    • dW1 = \(\frac{\partial \mathcal{J} }{ \partial W_1 }\)
    • db1 = \(\frac{\partial \mathcal{J} }{ \partial b_1 }\)
    • dW2 = \(\frac{\partial \mathcal{J} }{ \partial W_2 }\)
    • db2 = \(\frac{\partial \mathcal{J} }{ \partial b_2 }\)
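
A sketch of these formulas in code (assuming the single-hidden-layer network from this week: tanh hidden units, sigmoid output, X of shape (n_x, m) and Y of shape (1, m); the names follow the dW1/db1 convention above):

```python
import numpy as np

def backward_propagation(X, Y, W2, A1, A2):
    """Gradients for a 2-layer network with tanh hidden units and sigmoid output."""
    m = X.shape[1]

    dZ2 = (A2 - Y) / m                        # dJ/dZ2, with the 1/m factor folded in as above
    dW2 = dZ2 @ A1.T                          # (1, n_h)
    db2 = np.sum(dZ2, axis=1, keepdims=True)  # (1, 1)

    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)        # (n_h, m); 1 - A1**2 is the tanh derivative
    dW1 = dZ1 @ X.T                           # (n_h, n_x)
    db1 = np.sum(dZ1, axis=1, keepdims=True)  # (n_h, 1)

    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
```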

Quiz:

  1. Q: Why does tanh work better than sigmoid in most cases?
    A: Because the mean of tanh's output is closer to zero, so the data it passes to the next layer is better centred around zero.

  2. Q: Suppose you have built a neural network. You decide to initialize the weights and biases to be zero. Which of the following statements is true?
    A: Each neuron in the first hidden layer will perform the same computation. So even after multiple iterations of gradient descent each neuron in the layer will be computing the same thing as other neurons. (I didn't immediately see why this is true; presumably identical weights give identical activations and therefore identical gradients, so the updates never break the symmetry -- see the sketch at the end of these notes.)

  3. Logistic regression’s weights w should be initialized randomly rather than to all zeros, because if
    you initialize to all zeros, then logistic regression will fail to learn a useful decision boundary
    because it will fail to “break symmetry”. True or False?
    A: False. Logistic regression doesn't have a hidden layer. If you initialize the weights to zeros,
    the first example x fed into logistic regression will output zero, but the derivatives of
    logistic regression depend on the input x (because there is no hidden layer), which is not
    zero. So at the second iteration, the weight values follow x's distribution and are
    different from each other if x is not a constant vector.

  4. You have built a network using the tanh activation for all the hidden units. You initialize the
    weights to relatively large values, using np.random.randn(..,..)*1000. What will happen?
    A: This will cause the inputs of the tanh to also be very large, thus causing gradients to be
    close to zero. The optimization algorithm will thus become slow. (Yes. tanh becomes flat for
    large values; this leads its gradient to be close to zero, which slows down the optimization
    algorithm.)
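
As a quick check of quiz item 2, here is a minimal sketch with made-up data (it just reuses the forward and backward formulas from section 3_9): with all parameters initialized to zero, every hidden unit keeps identical parameters after any number of gradient-descent steps, so they all keep computing the same thing.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 5))                    # 5 made-up examples, 3 features each
Y = rng.integers(0, 2, size=(1, 5)).astype(float)  # made-up binary labels

n_h = 4
W1, b1 = np.zeros((n_h, 3)), np.zeros((n_h, 1))    # zero initialization
W2, b2 = np.zeros((1, n_h)), np.zeros((1, 1))

for _ in range(10):
    # Forward pass: all rows of A1 are identical, since all hidden units share the same parameters.
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))

    # Backward pass (formulas from section 3_9).
    dZ2 = (A2 - Y) / X.shape[1]
    dW2 = dZ2 @ A1.T
    db2 = np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    dW1 = dZ1 @ X.T
    db1 = np.sum(dZ1, axis=1, keepdims=True)

    # Gradient-descent update.
    lr = 1.0
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

# Every row of W1 (and every column of W2) is still identical: symmetry was never broken.
print(np.allclose(W1, W1[0]), np.allclose(W2, W2[:, :1]))  # True True
```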