linear regression

LMS (least mean squares algorithm)

BGD vs SGD

\[\min_{\theta} J(\theta)=\frac{1}{2}\sum_{i=1}^{m}\left ( h_{\theta}(x^{(i)})-y^{(i)} \right )^{2} \]

With only a single sample:

\[\frac{\partial }{\partial \theta_j}J(\theta)=(h_\theta(x)-y)x_{j} \]
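
The single-sample gradient follows from the chain rule. The step below assumes the usual linear hypothesis $h_\theta(x)=\sum_{k}\theta_k x_k$, which the notes use implicitly but do not write out:

\[\frac{\partial }{\partial \theta_j}J(\theta)=\frac{\partial }{\partial \theta_j}\frac{1}{2}(h_\theta(x)-y)^{2}=(h_\theta(x)-y)\frac{\partial }{\partial \theta_j}\left ( \sum_{k}\theta_k x_k-y \right )=(h_\theta(x)-y)x_{j} \]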

Update rules with multiple samples:

BGD

Repeat until convergence:{
for every j:

\[\theta_{j}=\theta_{j}-\alpha\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_{j}^{(i)} \]

}
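
The BGD loop above could be sketched in plain Python roughly as follows. The toy data, the learning rate `alpha=0.05`, the stopping tolerance, and the names `predict` and `bgd` are all illustrative assumptions, not part of the original notes.

```python
def predict(theta, x):
    """Linear hypothesis h_theta(x) = sum_j theta_j * x_j."""
    return sum(t * xj for t, xj in zip(theta, x))


def bgd(X, y, alpha=0.05, tol=1e-10, max_iters=100_000):
    """Batch gradient descent on J(theta) = 1/2 * sum_i (h(x_i) - y_i)^2."""
    theta = [0.0] * len(X[0])
    for _ in range(max_iters):
        # Residuals for all m samples, computed with the current theta.
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        # Each component of the gradient sums over every sample.
        grad = [sum(e * xi[j] for e, xi in zip(errors, X))
                for j in range(len(theta))]
        # Simultaneous update of every theta_j.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
        if max(abs(g) for g in grad) < tol:  # assumed convergence test
            break
    return theta


# Toy data from y = 1 + 2*x; the first feature is a constant 1 (intercept).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = bgd(X, y)
print(theta)  # close to [1.0, 2.0]
```

Because the cost here has no 1/m factor, the gradient grows with the number of samples, so a usable `alpha` shrinks as the data set grows.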

SGD

Repeat until convergence:{
for i=1 to m {
for every j:

\[\theta_{j}=\theta_{j}-\alpha(h_\theta(x^{(i)})-y^{(i)})x_{j}^{(i)} \]

}
}
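
A minimal SGD sketch in plain Python, for comparison: each update uses a single sample and there is no sum over i. The fixed number of epochs (instead of a convergence test), the learning rate, the toy data, and the names `predict` and `sgd` are assumptions made for illustration.

```python
def predict(theta, x):
    """Linear hypothesis h_theta(x) = sum_j theta_j * x_j."""
    return sum(t * xj for t, xj in zip(theta, x))


def sgd(X, y, alpha=0.05, epochs=1000):
    """Stochastic gradient descent: one parameter update per sample."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):          # stands in for "repeat until convergence"
        for xi, yi in zip(X, y):     # for i = 1 to m
            error = predict(theta, xi) - yi
            # Update every theta_j from this one sample only.
            theta = [t - alpha * error * xj for t, xj in zip(theta, xi)]
    return theta


# Toy data from y = 1 + 2*x; the first feature is a constant 1 (intercept).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = sgd(X, y)
print(theta)  # close to [1.0, 2.0]
```

On this noise-free toy data the iterates settle on the exact solution; with noisy data and a fixed `alpha`, SGD typically keeps oscillating around the minimum instead.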

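Comparison

method	principle	performance
BGD	each parameter update uses all samples	slow, memory-hungry
SGD	every single sample updates all parameters	fast, commonly used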

Newton's method

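For minimizing a convex cost function, there is another commonly used algorithm besides SGD and BGD: Newton's method.
Its main idea is that the step taken at each iteration is $\Delta$ (computed from the gradient and second derivative at the current point), rather than being set by a fixed learning rate $\alpha$.
The procedure is illustrated on the convex LMS cost function:
[Figure: Newton's method]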
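
As a rough one-dimensional sketch of the idea, the step can be taken as $\Delta = J'(\theta)/J''(\theta)$ and subtracted from $\theta$. The function name `newton_minimize`, the toy data, and the stopping rule are assumptions for illustration.

```python
def newton_minimize(grad, hess, theta=0.0, tol=1e-12, max_iters=50):
    """Minimise a 1-D convex function given its first and second derivatives."""
    for _ in range(max_iters):
        delta = grad(theta) / hess(theta)  # step Delta, recomputed every iteration
        theta -= delta
        if abs(delta) < tol:
            break
    return theta


# 1-D LMS cost J(theta) = 1/2 * sum_i (theta * x_i - y_i)^2 on toy data y = 3x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]


def grad(t):
    """J'(theta) = sum_i (theta * x_i - y_i) * x_i."""
    return sum((t * x - y) * x for x, y in zip(xs, ys))


def hess(t):
    """J''(theta) = sum_i x_i^2, constant because J is quadratic."""
    return sum(x * x for x in xs)


theta_hat = newton_minimize(grad, hess)
print(theta_hat)  # 3.0
```

Since the LMS cost is quadratic, the first Newton step already lands on the minimum; the second iteration only confirms that the step has shrunk to zero.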

When the samples have multidimensional features, $\theta$ is also a vector, and the update becomes:

\[\theta = \theta - H^{-1}\bigtriangledown_{\theta}l(\theta) \]

\[H_{ij}=\frac{\partial^2 l(\theta)}{{\partial \theta_i}{\partial \theta_j}} \]
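
For a concrete feel, here is a sketch of one such update for the LMS cost with exactly two parameters, so the Hessian inverse can be written in closed form. It takes $l(\theta)$ to be the cost $J(\theta)$ from earlier, which is an assumption, and `newton_step_2d` is a hypothetical helper name.

```python
def newton_step_2d(theta, X, y):
    """One update theta - H^{-1} grad for the LMS cost with two parameters."""
    errors = [theta[0] * a + theta[1] * b - yi for (a, b), yi in zip(X, y)]
    # Gradient: g_j = sum_i (h(x_i) - y_i) * x_ij
    g0 = sum(e * a for e, (a, _) in zip(errors, X))
    g1 = sum(e * b for e, (_, b) in zip(errors, X))
    # Hessian: H_jk = sum_i x_ij * x_ik (does not depend on theta for LMS)
    h00 = sum(a * a for a, _ in X)
    h01 = sum(a * b for a, b in X)
    h11 = sum(b * b for _, b in X)
    det = h00 * h11 - h01 * h01
    if det == 0:
        raise ValueError("Hessian is singular; the Newton step is undefined")
    # Closed-form inverse of a symmetric 2x2 matrix, applied to the gradient.
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (-h01 * g0 + h00 * g1) / det
    return [theta[0] - d0, theta[1] - d1]


# Toy data from y = 1 + 2*x; the first feature is a constant 1 (intercept).
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
y = [1.0, 3.0, 5.0, 7.0]
result = newton_step_2d([0.0, 0.0], X, y)
print(result)  # [1.0, 2.0]
```

The explicit `det == 0` check corresponds to the case where the Hessian has no inverse.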

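The drawbacks of Newton's method are:

- The inverse of the Hessian matrix may not exist, and even when it does, computing it is expensive.
- When n (the number of features) is very large, the algorithm is not necessarily faster than SGD.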

the normal equation

Meaning of $\bigtriangledown _{A}f(A)$

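$\bigtriangledown _{A}f(A)$ denotes the derivative of f with respect to the matrix A.
f itself is a function of a matrix:
- its independent variable is the matrix A
- its dependent variable is a real number
$\bigtriangledown _{A}f(A)$ is a matrix whose entry in row i, column j is the partial derivative of f(A) with respect to $A_{ij}$.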
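
A small worked example may help. Take a 2-by-2 matrix A and an illustrative function f (chosen here only as an example):

\[f(A)=\frac{3}{2}A_{11}+5A_{12}^{2}+A_{21}A_{22} \]

Differentiating with respect to each entry $A_{ij}$ and placing the result in row i, column j gives:

\[\bigtriangledown _{A}f(A)=\begin{bmatrix}\frac{3}{2} & 10A_{12}\\ A_{22} & A_{21}\end{bmatrix} \]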

Derivation of the normal equation

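1. tr operator:

\[\mathrm{tr}A=\sum_{i=1}^{n}A_{ii} \]

That is, the trace of a matrix A is the sum of its diagonal entries, which is a real number.
2. Preliminary identities:

\[\mathrm{tr}A = \mathrm{tr}A^{T} \]

\[\mathrm{tr}(aA)=a\,\mathrm{tr}A \]

\[\bigtriangledown_{A^{T}}\mathrm{tr}(ABA^{T}C)=B^{T}A^{T}C^{T}+BA^{T}C \]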

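Derivation

In matrix form, with the rows of $X$ being the $(x^{(i)})^{T}$, the cost is $J(\theta)=\frac{1}{2}(X\theta-\vec{y})^{T}(X\theta-\vec{y})$. Applying the identities above gives $\bigtriangledown_{\theta}J(\theta)=X^{T}X\theta-X^{T}\vec{y}$, and setting this to zero yields:

\[\bigtriangledown_{\theta}J(\theta)=0\rightarrow\theta=(X^{T}X)^{-1}X^{T}\vec{y} \]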
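
The closed form can be tried out with a short pure-Python sketch that builds $X^{T}X$ and $X^{T}\vec{y}$ and solves the linear system instead of forming the inverse explicitly. The helper names `solve` and `normal_equation` and the toy data are assumptions; in practice a library routine such as `numpy.linalg.solve` would normally replace the hand-written elimination.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b], copied so the inputs stay untouched.
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def normal_equation(X, y):
    """Return theta satisfying (X^T X) theta = X^T y."""
    n = len(X[0])
    XtX = [[sum(row[j] * row[k] for row in X) for k in range(n)]
           for j in range(n)]
    Xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(n)]
    return solve(XtX, Xty)


# Toy data from y = 1 + 2*x; the first feature is a constant 1 (intercept).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = normal_equation(X, y)
print(theta)  # close to [1.0, 2.0]
```

Unlike BGD and SGD, this needs no learning rate and no iteration, but it requires $X^{T}X$ to be invertible.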

Probabilistic interpretation of the cost function

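Assume the errors $\epsilon^{(i)}=y^{(i)}-\theta^{T}x^{(i)}$ are independent and identically distributed Gaussian variables with zero mean and variance $\delta^{2}$.
Then

\[p(\vec{y}|X;\theta)=L(\theta|X,\vec{y})=L(\theta)=\prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\delta}e^{-\frac{(y^{(i)}-\theta^{T}x^{(i)})^2}{2\delta^{2}}} \]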

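function	expression	meaning
probability function	$p(\vec{y}|X;\theta)$	viewed as a function of $\vec{y}$, with $\theta$ fixed
likelihood function	$L(\theta|X,\vec{y})$	the same quantity viewed as a function of $\theta$, with the data $X,\vec{y}$ fixed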
Maximizing the likelihood function $\leftrightarrow$ maximizing the probability function, which in turn $\leftrightarrow$ minimizing $\frac{1}{2}\sum_{i=1}^{m}(y^{(i)}-\theta^{T}x^{(i)})^{2}$ (this can be derived).
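
A sketch of that derivation: the logarithm is monotonic, so maximizing $L(\theta)$ is the same as maximizing $\log L(\theta)$, which for the Gaussian model above expands to

\[\log L(\theta)=\sum_{i=1}^{m}\log\left ( \frac{1}{\sqrt{2\pi}\delta}e^{-\frac{(y^{(i)}-\theta^{T}x^{(i)})^2}{2\delta^{2}}} \right )=m\log\frac{1}{\sqrt{2\pi}\delta}-\frac{1}{\delta^{2}}\cdot\frac{1}{2}\sum_{i=1}^{m}(y^{(i)}-\theta^{T}x^{(i)})^{2} \]

The first term and the factor $\frac{1}{\delta^{2}}$ do not depend on $\theta$, so the maximum is reached exactly where the sum of squares is smallest, whatever the value of $\delta$.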
But why maximize the probability function?

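Predicting every sample as accurately as possible $\leftrightarrow$ making every $p(y^{(i)}|x^{(i)};\theta)$ as large as possible.
It can also be understood as making every $\epsilon^{(i)}$ as close to 0 as possible.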

posted @ 2017-02-26 22:01 fariver