Machine learning basics
Machine learning ≈ looking for a function
This course focuses on deep learning
In deep learning, that function is a neural network
Step 1: function with unknown parameters
Step 2: define loss from training data
Step 3: optimization
After making the function more flexible, we get $y=b+\sum \limits_{i} c_{i} \text { sigmoid }\left(b_{i}+\sum \limits_{j} w_{i j} x_{j}\right)$
where $\alpha_i = \text{sigmoid}(r_i)=\frac{1}{1+e^{-r_i}}$ with $r_i = b_i + \sum\limits_{j} w_{ij} x_j$
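A minimal NumPy sketch of this function (the names `sigmoid`, `model` and the parameter shapes are assumptions for illustration):

```python
import numpy as np

def sigmoid(r):
    # sigmoid(r) = 1 / (1 + e^(-r))
    return 1.0 / (1.0 + np.exp(-r))

def model(x, b, c, B, W):
    # y = b + sum_i c_i * sigmoid(b_i + sum_j w_ij * x_j)
    # x: (num_features,)   c, B: (num_units,)   W: (num_units, num_features)
    r = B + W @ x        # r_i = b_i + sum_j w_ij * x_j
    a = sigmoid(r)       # alpha_i = sigmoid(r_i)
    return b + c @ a     # y = b + sum_i c_i * alpha_i
```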
$\theta$ denotes all the unknown parameters, and $L(\theta)$ is the loss
$\boldsymbol{\theta}^{*}=\arg \min \limits_{\theta} L$
$\boldsymbol{\theta}=\left[\begin{array}{c}
\theta_{1} \\
\theta_{2} \\
\theta_{3} \\
\vdots
\end{array}\right]$
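As a sketch of how $L(\theta)$ could be evaluated on the training data, building on the `model` sketch above (the flat-vector `unpack` helper and the mean-absolute-error choice are assumptions):

```python
def loss(theta, xs, ys, unpack):
    # L(theta): average error of the model over the training pairs (x, y)
    # `unpack` maps the flat parameter vector theta back to (b, c, B, W)
    b, c, B, W = unpack(theta)
    errors = [abs(y - model(x, b, c, B, W)) for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)
```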
Optimization procedure (gradient descent):
(Randomly) pick initial values $\boldsymbol{\theta}^{0}$
Compute the gradient $\boldsymbol{g}=\nabla L\left(\boldsymbol{\theta}^{0}\right)$
Then update the parameters:
$\left[\begin{array}{c}
\theta_{1}^{1} \\
\theta_{2}^{1} \\
\vdots
\end{array}\right] \leftarrow\left[\begin{array}{c}
\theta_{1}^{0} \\
\theta_{2}^{0} \\
\vdots
\end{array}\right]-\left[\begin{array}{c}
\left.\eta \frac{\partial L}{\partial \theta_{1}}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{0}} \\
\left.\eta \frac{\partial L}{\partial \theta_{2}}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{0}} \\
\vdots
\end{array}\right]$
This is written compactly as $\boldsymbol{\theta}^{1} \leftarrow \boldsymbol{\theta}^{0}-\eta \boldsymbol{g}$
$\eta$ is the learning rate
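A minimal sketch of this update loop, assuming a `grad_fn` that returns $\nabla L(\boldsymbol{\theta})$ (e.g. from autodiff or numerical differentiation):

```python
def gradient_descent(grad_fn, theta0, eta, steps):
    # Repeat theta <- theta - eta * g, with g = grad L(theta)
    theta = theta0.copy()
    for _ in range(steps):
        g = grad_fn(theta)        # g = gradient of L at the current theta
        theta = theta - eta * g   # update with learning rate eta
    return theta
```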
In practice, however, we update batch by batch; one pass over all the batches is called one epoch, as sketched below.
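A minimal sketch of such a mini-batch loop (the `grad_on_batch` function, the per-epoch reshuffling, and the data layout are assumptions):

```python
import random

def train(grad_on_batch, theta0, data, eta, batch_size, num_epochs):
    # One parameter update per batch; seeing every batch once is one epoch.
    theta = theta0.copy()
    for epoch in range(num_epochs):
        random.shuffle(data)                        # reshuffle the examples each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            g = grad_on_batch(theta, batch)         # gradient of the loss on this batch only
            theta = theta - eta * g                 # one update per batch
    return theta
```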
ReLU can also be used as the activation function; two ReLUs can be combined into one hard sigmoid
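A sketch of that composition (the unit slope and offset here are an assumption; in general the weights and biases of the two ReLUs are learned):

```python
import numpy as np

def relu(r):
    # ReLU activation: max(0, r)
    return np.maximum(0.0, r)

def hard_sigmoid_from_two_relus(r):
    # Piecewise linear: 0 for r < 0, r for 0 <= r <= 1, 1 for r > 1 --
    # one ReLU provides the rising ramp, the second (shifted) one flattens the top.
    return relu(r) - relu(r - 1.0)
```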
We can also use multiple layers, i.e., feed the $\alpha$ values through the same kind of computation again
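A sketch of stacking several such layers, reusing the `sigmoid` sketch above (the `layers` list of per-layer biases and weights is an assumed structure):

```python
def deep_model(x, layers, c, b_out):
    # Feed the activations through the same kind of computation, layer by layer.
    a = x
    for B, W in layers:              # each layer has its own biases B and weights W
        a = sigmoid(B + W @ a)       # alpha <- sigmoid(b_i + sum_j w_ij * alpha_j)
    return b_out + c @ a             # final output, as in the single-layer case
```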
Regularization
Regularization helps prevent overfitting: it makes the function smoother and more robust to noise
The larger $\lambda$ is, the smoother the function and the larger the training error; this is reasonable, because a larger $\lambda$ means we put more emphasis on keeping the values of $w$ themselves small. The testing error, however, first decreases and then increases as $\lambda$ grows (a function that is too smooth degenerates toward a horizontal line)
When applying regularization we do not need to include the bias term, because it does not affect smoothness
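A sketch of adding an L2 penalty to the earlier `loss` sketch (the split of $\theta$ into weights and biases via `unpack` is an assumption; only the weights are penalized, the biases are left out):

```python
def regularized_loss(theta, xs, ys, unpack, lam):
    # L'(theta) = L(theta) + lambda * (sum of squared weights)
    b, c, B, W = unpack(theta)
    penalty = lam * float((W ** 2).sum() + (c ** 2).sum())   # weights only,
    return loss(theta, xs, ys, unpack) + penalty             # biases b, B excluded
```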