Machine Learning--week1 Supervised Learning, the Hypothesis Function, the Cost Function, and Gradient Descent

  • Supervised Learning
    • trained on labelled data and used to make predictions
    • covers regression problems and classification problems
  • Unsupervised Learning
    • derives structure from data where we don't necessarily know the effect of the variables
    • no feedback based on the prediction results
    • the Clustering Algorithm is just one type of Unsupervised Learning
    • the Cocktail Party Algorithm is a non-clustering example
  • The difference: whether learning is supervised comes down to whether the input data carry labels. If the input data are labelled, it is supervised learning; if not, it is unsupervised learning.

The difference between classification and regression lies in the type of the output variable.

A quantitative output is called regression, i.e. continuous-variable prediction;
a qualitative output is called classification, i.e. discrete-variable prediction.

For example:

predicting tomorrow's temperature in degrees is a regression task;
predicting whether tomorrow will be overcast, sunny, or rainy is a classification task.

The learned prediction function is conventionally named h (for hypothesis).

How do we represent h? For example: \(h_{\theta}(x) = \theta_{0} + \theta_{1}x\), where the \(\theta_{i}\) are the parameters of the model.

We adjust the \(\theta_{i}\) so that \(\sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^{2}\) is as small as possible, where \(m\) is the number of training examples. To work with the average error instead of the raw sum (the extra factor of \(\frac{1}{2}\) simply makes the derivative cleaner later), the summation is written as \(\frac{1}{2m}\sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^{2}\).

\(J(\theta_{0}, \theta_{1}) = \frac{1}{2m}\sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^{2}\) is the so-called cost function, also known as the squared error function.

To summarize:

Hypothesis:

\[h_{\theta}(x) = \theta _{0} + \theta _{1}x \]

Parameters:

\[\theta_{0},\theta_{1} \]

Cost Function:

\[J(\theta_{0}, \theta_{1}) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^{2} \]

Goal:

\[\underbrace{\rm minimize}_{\theta_{0},\theta_{1}}\, J(\theta_{0},\theta_{1}) \]
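As a concrete illustration, here is a minimal NumPy sketch of the hypothesis and the cost function defined above (the names `hypothesis` and `compute_cost` and the tiny data set are my own, chosen only for this example):

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x, applied element-wise to a vector x."""
    return theta0 + theta1 * x

def compute_cost(theta0, theta1, x, y):
    """J(theta0, theta1) = 1/(2m) * sum over i of (h_theta(x^(i)) - y^(i))^2."""
    m = len(y)                                   # number of training examples
    errors = hypothesis(theta0, theta1, x) - y   # h_theta(x^(i)) - y^(i)
    return np.sum(errors ** 2) / (2 * m)

# Points lying exactly on y = 1 + 2x give zero cost; other parameters do not.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(compute_cost(1.0, 2.0, x, y))  # 0.0
print(compute_cost(0.0, 0.0, x, y))  # (9 + 25 + 49) / 6 ≈ 13.83
```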

We use the Gradient Descent Algorithm to minimize the cost function.

Gradient Descent Algorithm (pseudocode):

\[\text{repeat until convergence}\;\{\\ \quad \theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial \theta_{j}}J(\theta_{0},\theta_{1}) \qquad (j = 0, 1)\\ \}\]

Here \(\alpha\) is a number called the learning rate, which controls the step size of gradient descent (a small numeric sketch follows the list below):

  • if \(\alpha\) is too small: gradient descent can be too slow
  • if \(\alpha\) is too large: gradient descent can overshoot the minimum, fail to converge, or even diverge
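These two failure modes can be seen on a toy one-dimensional cost. The sketch below assumes \(J(\theta)=\theta^2\) (so the derivative is \(2\theta\)); the particular values of \(\alpha\) are chosen only for illustration:

```python
def descend(alpha, theta=1.0, steps=5):
    """Take a few gradient descent steps on J(theta) = theta^2 (derivative 2*theta)."""
    trace = [theta]
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
        trace.append(round(theta, 4))
    return trace

print(descend(alpha=0.01))  # too small: creeps toward 0 very slowly
print(descend(alpha=0.5))   # reasonable: reaches the minimum at 0 in one step
print(descend(alpha=1.1))   # too large: |theta| grows each step -> diverges
```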

To explain the algorithm:

When increasing \(\theta_{j}\) increases \(J(\theta_0,\theta_1)\), the partial derivative is \(>0\), so the update decreases \(\theta_j\); when increasing \(\theta_{j}\) decreases \(J(\theta_0,\theta_1)\), the partial derivative is \(<0\), so the update increases \(\theta_j\).

In this way \(\theta_j\) gradually slides down toward a point where the gradient is 0.

Moreover, even with a fixed \(\alpha\), the size of each update to \(\theta_{j}\) shrinks as gradient descent proceeds, because the partial derivative tends toward 0; so there is no need to further decrease \(\alpha\) as we approach a local optimum.

One thing to note: all of the \(\theta_{j}\) must be updated simultaneously, so we cannot simply apply an expression like \(\theta_{0} := \theta_{0} - \alpha\frac{\partial}{\partial \theta_{0}}J(\theta_{0},\theta_{1})\) one parameter at a time; instead we should write:

\[\begin{aligned} temp_0 &:= \theta_0 - \alpha\frac{\partial}{\partial \theta_{0}}J(\theta_{0},\theta_{1})\\ temp_1 &:= \theta_1 - \alpha\frac{\partial}{\partial \theta_{1}}J(\theta_{0},\theta_{1})\\ \theta_{0} &:= temp_0\\ \theta_{1} &:= temp_1 \end{aligned}\]

Only in this way do the first and second update lines use a consistent \(J(\theta_0,\theta_1)\).

The gradient descent algorithm requires all \(\theta_{j}\) to be updated synchronously.
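A minimal Python sketch of one such simultaneous update (the arguments `dJ_dtheta0` and `dJ_dtheta1` are hypothetical callables standing in for the two partial derivatives of \(J\)):

```python
def gradient_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    """One simultaneous gradient descent update of (theta0, theta1)."""
    # Both temporaries are computed from the SAME old (theta0, theta1) ...
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # ... and only then are the parameters overwritten.
    return temp0, temp1

# The incorrect, non-simultaneous version would assign theta0 first and then
# evaluate dJ_dtheta1 with the NEW theta0, so the two updates would be based
# on inconsistent values of J(theta0, theta1).
```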

Note: different starting points may lead to different local minima.

Substituting the expression for \(J(\theta_0,\theta_1)\) into the Gradient Descent Algorithm gives (pseudocode):

\[\text{repeat until convergence}\;\{\\ \quad \theta_{0} := \theta_{0} - \alpha \frac{1}{m} \sum_{i=1}^{m} \bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)\\ \quad \theta_{1} := \theta_{1} - \alpha \frac{1}{m} \sum_{i=1}^{m} \bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)\cdot x^{(i)}\\ \}\]
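For completeness, the two partial derivatives used above follow from applying the chain rule to the cost function:

\[\begin{aligned} \frac{\partial}{\partial \theta_{0}}J(\theta_{0},\theta_{1}) &= \frac{\partial}{\partial \theta_{0}}\frac{1}{2m}\sum_{i=1}^{m}\bigl(\theta_{0}+\theta_{1}x^{(i)}-y^{(i)}\bigr)^{2} = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)\\ \frac{\partial}{\partial \theta_{1}}J(\theta_{0},\theta_{1}) &= \frac{\partial}{\partial \theta_{1}}\frac{1}{2m}\sum_{i=1}^{m}\bigl(\theta_{0}+\theta_{1}x^{(i)}-y^{(i)}\bigr)^{2} = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)\cdot x^{(i)} \end{aligned}\]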

This form is called the "Batch" Gradient Descent Algorithm:

​ “Batch”:Each step of gradient descent uses all the training examples
