机器学习 - 第二讲

多元线性回归（Linear Regression with Multiple Variables）

多元（Multiple Features）

多元线性回归又称“multivariate linear regression”。

现在介绍一种可以表示任意输入变量的符号：

\(x_j^{(i)} = j \ 在 \ i^{th} \ 的训练样本的值。\)
\(x^{(i)} = 所有\ i^{th} \ 训练样本的值组成的列向量。 \)
\(m = 训练样本的数目 \)
\(n = |x^{(i)}|\ 特性的数量。\)

现在定义假设函数的多变量形式，包含以下各种参数：

\(h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n\)

为了便于理解，我们可以将 \(\theta_0\) 当作房子的基本价格，\(\theta_1\) 当作每平方米的价格，\(\theta_2\) 当作每层的价格等等。则 \(x_1\) 代表房子的平方米数，\(x_2\) 代表房子的层数等等。

使用矩阵乘法定义，我们的多元假设函数可以表示为：

\[\begin{align*} h_\theta(x) = \begin{bmatrix} \theta_0 \hspace{2em} \theta_1 \hspace{2em} ... \hspace{2em} \theta_n \end{bmatrix} \begin{bmatrix} x_0 \newline x_1 \newline \vdots \newline x_n \end{bmatrix} = \theta^T x \end{align*} \]

这是一个向量化的假设函数的一个训练组。

训练样本逐行储存在 \(X\) 中，例如：

\[\begin{align*} X = \begin{bmatrix} x^{(1)}_0 & x^{(1)}_1 \newline x^{(2)}_0 & x^{(2)}_1 \newline x^{(3)}_0 & x^{(3)}_1 \end{bmatrix} & ,\theta = \begin{bmatrix} \theta_0 \newline \theta_1 \newline \end{bmatrix} \end{align*} \]

你可以将假设作为一个 \(（m \times 1）\) 的列向量计算：

\(h_\theta(X) = X \theta\)

成本函数（Cost function）

\(对于向量\ \theta （ \mathbb{R}^{n+1} 或者 \mathbb{R}^{(n+1) \times 1} ）\)，成本函数为：

\(J(\theta) = \dfrac {1}{2m} \displaystyle \sum_{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right)^2\)

使用向量计算：

\(J(\theta) = \frac{1}{2m} (X\theta - \vec{y})^{T} (X\theta - \vec{y})\)

\(其中\ \vec{y}\ 表示所有\ y\ 的值。\)

多元梯度下降（Gradient Desent for Mulitple Variables）

梯度下降等式和之前的形式类似，我们只需要重复计算'n'个即可：

\[\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}\newline \; & \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)} \newline \; & \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)} \newline & \cdots \newline \rbrace \end{align*} \]

也可以这么表示：

\[\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \; & \text{for j := 0..n} \newline \rbrace \end{align*} \]

矩阵符号（Matrix Notation）

梯度下降规则可以表示为：

\[\large \theta := \theta - \alpha \nabla J(\theta) \]

其中 \(\nabla J(\theta)\) 是列向量的表示形式：

\[\begin{align*} \; &\frac{\partial J(\theta)}{\partial \theta_j} &=& \frac{1}{m} \sum\limits_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)} \newline \; & &=& \frac{1}{m} \sum\limits_{i=1}^{m} x_j^{(i)} \cdot \left(h_\theta(x^{(i)}) - y^{(i)} \right) \end{align*} \]

第 j 个梯度元素可以由两部分和构成：

有时，其中的两部分可以由两个向量表示。

\(其中的，x_j^{(i)} ，for i = 1，…，m 代表第 \ j列\ m\ 个元素，向量\ \vec{x_j}\ 取自训练组矩阵\ X\ 。\)

另一部分 \(\left(h_\theta(x^{(i)}) - y^{(i)} \right)是预测值 \ h_\theta(x^{(i)}) \ 和真实值 \ y^{(i)}\ 的偏差量的向量。重写\ \frac{\partial J(\theta)}{\partial \theta_j}\ ，我们得到：\)

\[\begin{align*} \; &\frac{\partial J(\theta)}{\partial \theta_j} &=& \frac1m \vec{x_j}^{T} (X\theta - \vec{y}) \newline \newline \newline \; &\nabla J(\theta) & = & \frac 1m X^{T} (X\theta - \vec{y}) \newline \end{align*} \]

最后，梯度下降规则的矩阵形式（向量化）表示为：

\(\theta := \theta - \frac{\alpha}{m} X^{T} (X\theta - \vec{y})\)

数值一般化（Feature Normalization）

我们可以通过将每个输入值粗略的控制到一个范围加速梯度下降。其中的原理是因为 \(\theta\) 在小的范围内下降速度很快，大的范围内下降速度很慢，因此当数据不均匀的时候可能会一直摆动不会降到最佳点。

防止这种情况发生的方法就是修正输入变量的范围让他们粗略的相同：

\(-1 \le x_i \le 1\)

或

\(-0.5 \le x_i \le 0.5\)

这些没有额外的需求，我们仅仅尝试加速计算。目标是让所有输入变量粗略的限制在这个范围。

有两种方法可以帮助我们：数值缩放（feature scaling）和均一化（mean normalization）。数值缩放通过输入变量的范围分割输入值，结果是产生一个新的范围为1的值。均一化则是减去输入变量的平均值，得到一个新的为0的平均值。可以通过一下公式使用这两种方法调整你的输入值：

\(x_i := \dfrac{x_i - \mu_i}{s_i}\)

其中 \(\mu_i\) 是所有值的平均值，\(s_i\) 是范围（最大值-最小值），或者 \(s_i\) 也可以是标准差（standard deviation）。

特性和多项式回归（Features and Polynomial Regression）

我们可以通过两种不同的方式改进假设函数的特性和形式。

我们可以将多个特性组合成一个。例如，我们可以将 \(x_1\ 和\ x_2\ 通过\ x_1 \cdot x_2\ 组合成一个新特性\ x_3。\)

多项式回归（Polynomial Regression）

我们的假设函数不必是线性的（一条直线）如果无法与数据匹配的很好。

我们可以改变行为或通过平方，立方或者平方根函数或者其他形式绘成曲线。

例如，如果我们的假设函数为\(h_\theta(x) = \theta_0 + \theta_1 x_1\) ，我们可以根据 \(x_1\) 创建一个新特性，得到二次函数 \(h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2\) 或者三次函数 \(h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3\)

在三次函数中，我们创建了新特性 \(x_2\ 和\ x_3\ ，其中\ x_2=x_1^2，x_3 = x_1^3\)

平方根函数：

\(h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}\)

一般等式（Normal Equation）

一般等式是一种不需要迭代就可以寻找最适值的方法。

\(\theta = (X^T X)^{-1}XT y\)

使用一般等式时不需要数值缩放（feature scaling）。

梯度下降和一般等式的对比：

使用一般等式计算 \(\mathcal{O}(n^3)\) 逆矩阵会很复杂。所以如果我们由很多数量的特性时，一般等式会很慢。实践证明，n 小于 10,000 时用一般等式。

一般等式不可逆（Normal Equation Noninvertibility）

当使用一般等式时使用 'pinv' 函数，不使用 'inv' 。

posted @ 2016-09-06 20:08 ASH975 阅读(551) 评论(0) 收藏举报

刷新页面返回顶部