Machine Learning--week1 Supervised Learning, the Hypothesis Function, the Cost Function, and Gradient Descent

  • Supervised Learning
    • trained on labelled data and used to make predictions
    • covers regression problems and classification problems
  • Unsupervised Learning
    • derives structure from data where we don't necessarily know the effect of the variables
    • no feedback based on the prediction results
    • the Clustering Algorithm is just one type of Unsupervised Learning
    • the Cocktail Party Algorithm is a non-clustering example
  • The difference: whether learning is supervised comes down to whether the input data carry labels. If the input data are labelled, it is supervised learning; if not, it is unsupervised learning.

The difference between classification and regression lies in the type of the output variable.

A quantitative output is called regression, i.e. continuous-variable prediction;
a qualitative output is called classification, i.e. discrete-variable prediction.

For example:

predicting tomorrow's temperature in degrees is a regression task;
predicting whether tomorrow will be overcast, sunny, or rainy is a classification task.

The learned prediction function is conventionally named h (for hypothesis).

How do we represent h? For example: \(h_{\theta}(x) = \theta_{0} + \theta_{1}x\), where the \(\theta_{i}\) are the parameters of the model.

We adjust the \(\theta_{i}\) so that \(\sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^{2}\) is as small as possible, where \(m\) is the number of training examples. To work with the average error instead of the raw sum (the extra factor of \(\frac{1}{2}\) simply makes the derivative cleaner later), the summation is written as \(\frac{1}{2m}\sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^{2}\).

\(J(\theta_{0}, \theta_{1}) = \frac{1}{2m}\sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^{2}\) is the so-called cost function, also known as the squared error function.

To summarize:

Hypothesis:

\[h_{\theta}(x) = \theta _{0} + \theta _{1}x \]

Parameters:

\[\theta_{0},\theta_{1} \]

Cost Function:

\[J(\theta_{0}, \theta_{1}) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^{2} \]

Goal:

\[\underbrace{\rm minimize}_{\theta_{0},\theta_{1}}\, J(\theta_{0},\theta_{1}) \]
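As a concrete illustration, here is a minimal NumPy sketch of the hypothesis and the cost function defined above (the names `hypothesis` and `compute_cost` and the tiny data set are my own, chosen only for this example):

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x, applied element-wise to a vector x."""
    return theta0 + theta1 * x

def compute_cost(theta0, theta1, x, y):
    """J(theta0, theta1) = 1/(2m) * sum over i of (h_theta(x^(i)) - y^(i))^2."""
    m = len(y)                                   # number of training examples
    errors = hypothesis(theta0, theta1, x) - y   # h_theta(x^(i)) - y^(i)
    return np.sum(errors ** 2) / (2 * m)

# Points lying exactly on y = 1 + 2x give zero cost; other parameters do not.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(compute_cost(1.0, 2.0, x, y))  # 0.0
print(compute_cost(0.0, 0.0, x, y))  # (9 + 25 + 49) / 6 ≈ 13.83
```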

We use the Gradient Descent Algorithm to minimize the cost function.

Gradient Descent Algorithm (pseudocode):

\[\text{repeat until convergence}\;\{\\ \quad \theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial \theta_{j}}J(\theta_{0},\theta_{1}) \qquad (j = 0, 1)\\ \}\]

Here \(\alpha\) is a number called the learning rate, which controls the step size of gradient descent (a small numeric sketch follows the list below):

  • if \(\alpha\) is too small: gradient descent can be too slow
  • if \(\alpha\) is too large: gradient descent can overshoot the minimum, fail to converge, or even diverge
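These two failure modes can be seen on a toy one-dimensional cost. The sketch below assumes \(J(\theta)=\theta^2\) (so the derivative is \(2\theta\)); the particular values of \(\alpha\) are chosen only for illustration:

```python
def descend(alpha, theta=1.0, steps=5):
    """Take a few gradient descent steps on J(theta) = theta^2 (derivative 2*theta)."""
    trace = [theta]
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
        trace.append(round(theta, 4))
    return trace

print(descend(alpha=0.01))  # too small: creeps toward 0 very slowly
print(descend(alpha=0.5))   # reasonable: reaches the minimum at 0 in one step
print(descend(alpha=1.1))   # too large: |theta| grows each step -> diverges
```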

To explain the algorithm:

When increasing \(\theta_{j}\) increases \(J(\theta_0,\theta_1)\), the partial derivative is \(>0\), so the update decreases \(\theta_j\); when increasing \(\theta_{j}\) decreases \(J(\theta_0,\theta_1)\), the partial derivative is \(<0\), so the update increases \(\theta_j\).

In this way \(\theta_j\) gradually slides down toward a point where the gradient is 0.

Moreover, even with a fixed \(\alpha\), the size of each update to \(\theta_{j}\) shrinks as gradient descent proceeds, because the partial derivative tends toward 0; so there is no need to further decrease \(\alpha\) as we approach a local optimum.

One thing to note: all of the \(\theta_{j}\) must be updated simultaneously, so we cannot simply apply an expression like \(\theta_{0} := \theta_{0} - \alpha\frac{\partial}{\partial \theta_{0}}J(\theta_{0},\theta_{1})\) one parameter at a time; instead we should write:

\[\begin{aligned} temp_0 &:= \theta_0 - \alpha\frac{\partial}{\partial \theta_{0}}J(\theta_{0},\theta_{1})\\ temp_1 &:= \theta_1 - \alpha\frac{\partial}{\partial \theta_{1}}J(\theta_{0},\theta_{1})\\ \theta_{0} &:= temp_0\\ \theta_{1} &:= temp_1 \end{aligned}\]

Only in this way do the first and second update lines use a consistent \(J(\theta_0,\theta_1)\).

The gradient descent algorithm requires all \(\theta_{j}\) to be updated synchronously.
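A minimal Python sketch of one such simultaneous update (the arguments `dJ_dtheta0` and `dJ_dtheta1` are hypothetical callables standing in for the two partial derivatives of \(J\)):

```python
def gradient_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    """One simultaneous gradient descent update of (theta0, theta1)."""
    # Both temporaries are computed from the SAME old (theta0, theta1) ...
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # ... and only then are the parameters overwritten.
    return temp0, temp1

# The incorrect, non-simultaneous version would assign theta0 first and then
# evaluate dJ_dtheta1 with the NEW theta0, so the two updates would be based
# on inconsistent values of J(theta0, theta1).
```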

Note: different starting points may lead to different local minima.

Substituting the expression for \(J(\theta_0,\theta_1)\) into the Gradient Descent Algorithm gives (pseudocode):

\[\text{repeat until convergence}\;\{\\ \quad \theta_{0} := \theta_{0} - \alpha \frac{1}{m} \sum_{i=1}^{m} \bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)\\ \quad \theta_{1} := \theta_{1} - \alpha \frac{1}{m} \sum_{i=1}^{m} \bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)\cdot x^{(i)}\\ \}\]
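For completeness, the two partial derivatives used above follow from applying the chain rule to the cost function:

\[\begin{aligned} \frac{\partial}{\partial \theta_{0}}J(\theta_{0},\theta_{1}) &= \frac{\partial}{\partial \theta_{0}}\frac{1}{2m}\sum_{i=1}^{m}\bigl(\theta_{0}+\theta_{1}x^{(i)}-y^{(i)}\bigr)^{2} = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)\\ \frac{\partial}{\partial \theta_{1}}J(\theta_{0},\theta_{1}) &= \frac{\partial}{\partial \theta_{1}}\frac{1}{2m}\sum_{i=1}^{m}\bigl(\theta_{0}+\theta_{1}x^{(i)}-y^{(i)}\bigr)^{2} = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)\cdot x^{(i)} \end{aligned}\]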

This form is called the "Batch" Gradient Descent Algorithm:

​ “Batch”:Each step of gradient descent uses all the training examples
