[Andrew Ng] Introduction, Linear Regression, Polynomial Regression

Machine Learning

Chapter 1 Introduction

What is machine learning?

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

In other words: for some task T with performance measure P, a computer program can learn from experience E on T, improving its performance P.

Machine learning algorithms can be divided into Supervised learning and Unsupervised learning.

Supervised learning

The labels are given.

  • Classification: discrete-valued output
  • Regression: continuous-valued output

Unsupervised learning

The labels are not given.

  • Clustering
  • Cocktail party problem

Chapter 2 Univariate Linear Regression

Model representation

Hypothesis function

\[h_\theta(x) = \theta_0+\theta_1 x \]

"Hypothesis" is just a piece of terminology. Here we assume the model is linear (the simplest case; it could also be quadratic or higher order).

Cost function

Squared error function: the most commonly used cost function for regression problems.

\[J(\theta_0,\theta_1)=\frac{1}{2m} \sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]^2 \]

m is the number of training examples, i.e. the number of data points in the plot.

The factor \(\frac{1}{2}\) only makes the derivative cleaner; it has no effect on the optimization result.
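To see why, differentiate \(J\) with respect to \(\theta_1\): the 2 brought down by the square cancels the \(\frac{1}{2}\) (the \(\theta_0\) case is the same with the factor \(x^{(i)}\) replaced by 1):

\[\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1) =\frac{1}{2m}\sum_{i=1}^m 2\,[h_\theta(x^{(i)})-y^{(i)}]\,x^{(i)} =\frac{1}{m}\sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]\,x^{(i)} \]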

Goal: Choose \(\theta_0,\theta_1\) so that \(J(\theta_0,\theta_1)\) is minimum

\[\min_{\theta_0,\theta_1}J(\theta_0,\theta_1) \]

Each pair of values \((\theta_0,\theta_1)\) corresponds to a different hypothesis function and a different cost value.

\(J\) is a function of \(\theta\), while \(h_\theta\) is a function of \(x\). There is always one more \(\theta\) parameter than there are \(x\) variables, so \(J\) always lives in a higher dimension than \(h_\theta\).

We use the hypothesis function and the training data to build the cost function, then find the minimum of the cost function; the minimizing parameters give the hypothesis function that best fits the data.
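As a concrete illustration, here is a minimal NumPy sketch of the squared error cost (the function name and the toy data are made up for this example):

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared error cost J(theta0, theta1) for h(x) = theta0 + theta1 * x."""
    m = len(y)                               # number of training examples
    predictions = theta0 + theta1 * x        # h_theta(x^(i)) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# toy data that roughly follows y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 9.0])
print(compute_cost(0.0, 0.0, x, y))   # poor fit -> large cost
print(compute_cost(1.0, 2.0, x, y))   # good fit -> small cost
```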

Gradient descent

repeat until convergence{

\[\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1) \]

(for \(j=0\) and \(j=1\))

}

\(\alpha\) is called the learning rate. If \(\alpha\) is too small, gradient descent can be slow; if \(\alpha\) is too large, gradient descent can overshoot the minimum, fail to converge, or even diverge.

Note: the update must be simultaneous; implementing it requires two temporary variables (see the sketch below).

The cost function for linear regression is always convex, so we do not need to worry about local optima.

Batch Gradient Descent: each step of gradient descent uses all the training examples.
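A minimal sketch of one possible implementation, on the same toy data as above; note how temp0/temp1 keep the update simultaneous and how every step uses all m examples (batch):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, iterations=2000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y              # uses all m examples (batch)
        # simultaneous update: compute both new values before overwriting either
        temp0 = theta0 - alpha * np.sum(error) / m
        temp1 = theta1 - alpha * np.sum(error * x) / m
        theta0, theta1 = temp0, temp1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 9.0])
print(gradient_descent(x, y))   # converges to roughly theta0 = 1.05, theta1 = 2.0
```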

Multivariate Linear Regression

Hypothesis function

\[\begin{aligned} h_\theta(x) &= \theta_0x_0+\theta_1x_1+\theta_2x_2+\cdots+\theta_nx_n \\&=\theta^Tx \end{aligned} \\ \theta= \left[ \begin{matrix} \theta_0\\ \theta_1\\ \vdots\\ \theta_n \end{matrix} \right] ,x= \left[ \begin{matrix} x_0\\ x_1\\ \vdots\\ x_n \end{matrix} \right] (x_0=1) \]

n is the number of features, i.e. the number of axes in the plot. In univariate linear regression, \(n=1\).
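A quick sketch of the vectorized form \(h_\theta(x)=\theta^Tx\) with made-up numbers (here \(n=2\)); the leading 1 plays the role of \(x_0\):

```python
import numpy as np

theta = np.array([1.0, 2.0, 3.0])   # theta_0, theta_1, theta_2
x = np.array([1.0, 5.0, 6.0])       # x_0 = 1, x_1, x_2
h = theta @ x                       # theta^T x = 1*1 + 2*5 + 3*6 = 29
print(h)                            # 29.0
```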

Cost function

\[J(\theta)=\frac{1}{2m} \sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]^2 \]

Unlike univariate linear regression, here \(\theta\) and \(x\) are vectors rather than scalars.

Gradient descent

repeat until convergence{

\[\begin{aligned} \theta_j&:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\\ &=\theta_j-\frac{\alpha}{m} \sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]x_j^{(i)} \end{aligned} \]

(simultaneously update for every \(j=0,\cdots,n\))

}
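A minimal vectorized sketch, assuming X already has a leading column of ones and using made-up data generated from theta = [1, 2, 3] so the result is easy to check:

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.1, iterations=5000):
    """Vectorized batch gradient descent; X is m x (n+1) with x_0 = 1 in the first column."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        error = X @ theta - y                         # h_theta(x^(i)) - y^(i) for all i at once
        theta = theta - (alpha / m) * (X.T @ error)   # updates every theta_j simultaneously
    return theta

# toy data generated from y = 1 + 2*x1 + 3*x2
X = np.array([[1, 1, 2], [1, 2, 1], [1, 3, 3], [1, 4, 2]], dtype=float)
y = X @ np.array([1.0, 2.0, 3.0])
print(gradient_descent_multi(X, y))   # approximately [1. 2. 3.]
```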

In order to make gradient descent run much faster and converge in a lot fewer iterations, here are some tricks.

Feature scaling

If one feature takes values far larger than another, the contours of the cost function become a bunch of skinny ovals, and gradient descent is very slow.

Get every feature into approximately a \(-1\le x_i\le1\) range.

Mean normalization

Replace \(x_i\) with \(x_i-\mu_i\) so that the features have approximately zero mean. Do not apply this to \(x_0\), because \(x_0 \equiv 1\) regardless of any transformation.
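A sketch of both tricks together, assuming X has its all-ones column first; dividing by the standard deviation is one common way to do the scaling (dividing by the range also works):

```python
import numpy as np

def normalize_features(X):
    """Mean-normalize and scale every feature column except column 0 (which is all ones)."""
    X = X.astype(float).copy()
    mu = X[:, 1:].mean(axis=0)             # per-feature mean
    sigma = X[:, 1:].std(axis=0)           # per-feature spread
    X[:, 1:] = (X[:, 1:] - mu) / sigma     # roughly zero mean, roughly unit scale
    return X, mu, sigma
```

The same \(\mu\) and \(\sigma\) must be reused when scaling new inputs at prediction time.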

Learning rate

Plot the cost function against the number of iterations to check for convergence (a minimal sketch follows the list below).

  • If the curve is increasing or oscillating, use smaller \(\alpha\)
  • If the curve decreases too slowly, use larger \(\alpha\)
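A sketch of this diagnostic with made-up data and two candidate learning rates; the plot is the J-versus-iterations curve described above:

```python
import numpy as np
import matplotlib.pyplot as plt

# toy data: y = 1 + 2*x, with the x_0 = 1 column prepended
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
y = np.array([3.0, 5.0, 7.0, 9.0])
m = len(y)

for alpha in (0.01, 0.1):                            # try a couple of learning rates
    theta = np.zeros(2)
    costs = []
    for _ in range(200):
        error = X @ theta - y
        costs.append(np.sum(error ** 2) / (2 * m))   # J at the current theta
        theta -= (alpha / m) * (X.T @ error)
    plt.plot(costs, label=f"alpha = {alpha}")        # J should fall on every iteration

plt.xlabel("iterations")
plt.ylabel("J(theta)")
plt.legend()
plt.show()
```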

Normal equation

m examples, n features.

\[x^{(i)}= \left[ \begin{matrix} x_0^{(i)}\\ x_1^{(i)}\\ x_2^{(i)}\\ \vdots\\ x_n^{(i)}\\ \end{matrix} \right]_{(n+1)\times1} , X= \left[ \begin{matrix} (x^{(1)})^T\\ (x^{(2)})^T\\ (x^{(3)})^T\\ \vdots\\ (x^{(m)})^T\\ \end{matrix} \right]_{m\times(n+1)} , y= \left[ \begin{matrix} y^{(1)}\\ y^{(2)}\\ y^{(3)}\\ \vdots\\ y^{(m)}\\ \end{matrix} \right]_{m\times1} \]

\[\theta=(X^TX)^{-1}X^Ty \]

No need to do feature scaling.

What if \(X^TX\) is non-invertible (singular or degenerate)? This happens pretty rarely; the usual causes are listed below (a workaround is sketched after the list).

  • Redundant features (linearly dependent): delete the redundant features.
  • Too many features (\(m\le n\)): delete some features, or use regularization.
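A sketch of the normal equation in NumPy; using np.linalg.pinv (the pseudo-inverse) instead of a plain inverse means the formula still returns a sensible answer even when \(X^TX\) is singular:

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, computed with the pseudo-inverse."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# same toy data as in the gradient descent sketch: y = 1 + 2*x1 + 3*x2
X = np.array([[1, 1, 2], [1, 2, 1], [1, 3, 3], [1, 4, 2]], dtype=float)
y = X @ np.array([1.0, 2.0, 3.0])
print(normal_equation(X, y))   # [1. 2. 3.]
```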

Gradient descent vs Normal equation

Gradient descent

  • Needs a choice of \(\alpha\): this means running it a few times with different \(\alpha\) and seeing which works best.
  • Needs many iterations: this may be very slow in some cases (elongated contours), and we need to plot \(J\) against the number of iterations to check convergence.
  • Works well even when \(n\) is large.

Normal equation

  • No need to choose \(\alpha\)
  • No need to iterate.
  • Slow when \(n\) is large (say \(n>10000\)): we need to compute \((X^TX)^{-1}\), an \((n+1)\times(n+1)\) matrix, and inverting a large matrix is expensive; the time complexity is \(O(n^3)\).

Chapter 3 Polynomial Regression

Hypothesis Function

\[\begin{aligned} h_\theta(x) &= \theta_0+\theta_1x+\theta_2x^2+\theta_3x^3 \\&= \theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3 \end{aligned} \]

\[x_1=x,x_2=x^2,x_3=x^3 \]

Transform the polynomial regression into a multivariate linear regression. Remember to use feature scaling, since \(x\), \(x^2\), and \(x^3\) have very different ranges.
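A short sketch of the feature construction, with made-up data; note how different the raw ranges of \(x\), \(x^2\), \(x^3\) are before scaling:

```python
import numpy as np

x = np.linspace(1, 10, 50)
X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])   # features 1, x, x^2, x^3
print(X[:, 1:].max(axis=0))          # maxima around 10, 100, 1000 -> scaling is needed
X[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)
```

After this transformation, the multivariate gradient descent or normal equation sketches above apply unchanged.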

Words and expressions

pseudo 伪

resort 求助

sophisticated 复杂的

quadratic 二次 cubic 三次

skinny 瘦的

ellipse oval 椭圆

skewed 歪曲的

malignant tumor 恶性肿瘤

benign tumor 良性肿瘤

lump 肿块

tragical 悲剧性的

slightly 轻微地

clump 簇、群

inventory 库存清单

cohesive 有结合力的

diabetes 糖尿病

hypothesis 假设

intuition 直觉

contour 等高线

colon 冒号

comma 逗号

whereas 鉴于、然而

susceptible 易受影响的

iterative 迭代的

commutative property 交换律

associative property 结合律

sloppy 马虎的,草率的
