Machine Learning Algorithm Summary: Regression
(1) Linear Regression
With two variables: \({h_\theta }(x) = {\theta _0} + {\theta _1}{x_1} + {\theta _2}{x_2}\)
With multiple variables, set \({x_0} = 1\), so \({h_\theta }(x) = \sum\limits_{i = 0}^n {{\theta _i}{x_i}} = {\theta ^T}x\) (n is the number of variables/features)
a. Interpreting least squares via maximum likelihood estimation
\({\varepsilon ^{(i)}}\) denotes the difference between the estimate \({\theta ^T}{x^{(i)}}\) and the true value \({y^{(i)}}\) (i indexes the i-th sample).
The errors \({\varepsilon ^{(i)}}\), \(1 \le i \le m\) (m is the number of samples), are independent and identically distributed, following a Gaussian with mean 0 and some fixed variance \({\sigma ^2}\): \({\varepsilon ^{(i)}} \sim N(0,{\sigma ^2})\). Therefore the true value satisfies \({y^{(i)}} \sim N({\theta ^T}{x^{(i)}},{\sigma ^2})\).
If the combined factors give \({\varepsilon ^{(i)}}\) a nonzero mean, adjusting \({\theta _0}\) shifts that mean back to 0.
Why the errors follow a normal distribution: the central limit theorem. In nature and industry, some phenomena are influenced by many mutually independent random factors; if each factor's individual influence is tiny, their total influence can be treated as normally distributed. (Note: this presumes a sum of many random variables; if the effects combine multiplicatively, check the assumption or take logarithms before applying it.)
\(\begin{array}{l}
p({\varepsilon ^{(i)}}) = \frac{1}{{\sqrt {2\pi } \sigma }}\exp ( - \frac{{{{({\varepsilon ^{(i)}})}^2}}}{{2{\sigma ^2}}})\\
p({y^{(i)}}|{x^{(i)}};\theta ) = \frac{1}{{\sqrt {2\pi } \sigma }}\exp ( - \frac{{{{({y^{(i)}} - {\theta ^T}{x^{(i)}})}^2}}}{{2{\sigma ^2}}})\\
L(\theta ) = \prod\limits_{i = 1}^m {p({y^{(i)}}|{x^{(i)}};\theta )}
\end{array}\)
\(l(\theta ) = \log L(\theta )\)
\({\rm{ = }}\sum\limits_{i = 1}^m {\log } \frac{1}{{\sqrt {2\pi } \sigma }}\exp ( - \frac{{{{({y^{(i)}} - {\theta ^T}{x^{(i)}})}^2}}}{{2{\sigma ^2}}})\)
\( = m\log \frac{1}{{\sqrt {2\pi } \sigma }} - \frac{1}{{{\sigma ^2}}} \cdot \frac{1}{2}\sum\limits_{i = 1}^m {{{({y^{(i)}} - {\theta ^T}{x^{(i)}})}^2}} \)
Maximizing \(l(\theta )\) is therefore equivalent to minimizing the least-squares cost \(J(\theta ) = \frac{1}{2}\sum\limits_{i = 1}^m {{{({\theta ^T}{x^{(i)}} - {y^{(i)}})}^2}} \)
b. Solving for \(\theta \) in closed form
Let \({X_{M \times N}}\) be the design matrix: each row of \(X\) is one sample (M samples in total), and each column is one feature dimension (N dimensions, plus an extra constant column \({x_0}\), all ones). Correspondingly, \(\theta \) is \({N \times 1}\) and \(y\) is \({M \times 1}\).
\(J(\theta ) = \frac{1}{2}\sum\limits_{i = 1}^m {{{({h_\theta }({x^{(i)}}) - {y^{(i)}})}^2}} \)
\( = \frac{1}{2}{(X\theta - y)^T}(X\theta - y)\)
Compute the gradient:
\({\nabla _\theta }J(\theta ) = \frac{\partial }{{\partial \theta }}[\frac{1}{2}{(X\theta - y)^T}(X\theta - y)]\)
\( = \frac{\partial }{{\partial \theta }}[\frac{1}{2}({\theta ^T}{X^T} - {y^T})(X\theta - y)]\)
\( = \frac{\partial }{{\partial \theta }}[\frac{1}{2}({\theta ^T}{X^T}X\theta - {\theta ^T}{X^T}y - {y^T}X\theta + {y^T}y)]\)
\( = \frac{1}{2}(2{X^T}X\theta - {X^T}y - {({y^T}X)^T})\)
\( = {X^T}X\theta - {X^T}y\)
Setting this gradient to zero and solving for the stationary point gives \(\theta = {({X^T}X)^{ - 1}}{X^T}y\)
The matrix-derivative rules used: \(\frac{{\partial ({\theta ^T}A\theta )}}{{\partial \theta }} = 2A\theta \) (A symmetric), and \(\frac{{\partial (A\theta )}}{{\partial \theta }} = {A^T} \Rightarrow \frac{{\partial ({\theta ^T}A)}}{{\partial \theta }} = A\)
If \({X^T}X\) is not invertible, or to guard against overfitting, add a \(\lambda \) perturbation: \(\theta = {({X^T}X + \lambda I)^{ - 1}}{X^T}y\)
This works because \({X^T}X\) is positive semidefinite: for any vector \(u\), \({u^T}{X^T}Xu = {(Xu)^T}Xu \ge 0\) (let \(v = Xu\); then \({v^T}v \ge 0\)). Hence for any real \(\lambda > 0\), \({X^T}X + \lambda I\) is positive definite and therefore invertible.
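A minimal numpy sketch of this closed-form solution on synthetic data follows; the data shapes, the noise scale, and the choice \(\lambda = 10^{-3}\) are illustrative assumptions, not values from the text.

```python
import numpy as np

# Synthetic data: m samples, n features, plus the constant column x_0 = 1.
rng = np.random.default_rng(0)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=m)  # Gaussian noise epsilon

# theta = (X^T X + lambda I)^{-1} X^T y, computed via a linear solve
# rather than an explicit matrix inverse (cheaper and numerically safer).
lam = 1e-3
theta = np.linalg.solve(X.T @ X + lam * np.eye(n + 1), X.T @ y)
print(theta)  # should be close to true_theta
```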
c. Solving for \(\theta \) via gradient descent
Steps: 1. Initialize \(\theta \) (randomly);
2. Iterate along the negative gradient direction, so each update makes \(J(\theta )\) smaller: \({\theta _j}: = {\theta _j} - \alpha \frac{{\partial J(\theta )}}{{\partial {\theta _j}}}\), where \(\alpha \) is the learning rate (step size)
\(\frac{{\partial J(\theta )}}{{\partial {\theta _j}}} = \frac{\partial }{{\partial {\theta _j}}}\frac{1}{2}{({h_\theta }(x) - y)^2}\)
\( = 2 \times \frac{1}{2}({h_\theta }(x) - y)\frac{\partial }{{\partial {\theta _j}}}({h_\theta }(x) - y)\)
\( = ({h_\theta }(x) - y)\frac{\partial }{{\partial {\theta _j}}}(\sum\limits_{i = 0}^n {{\theta _i}{x_i} - y} )\)
\( = ({h_\theta }(x) - y){x_j}\)
Batch gradient descent: \({\theta _j}: = {\theta _j} + \alpha \sum\limits_{i = 1}^m {({y^{(i)}} - {h_\theta }({x^{(i)}}))x_j^{(i)}} \)
Stochastic gradient descent: \({\theta _j}: = {\theta _j} + \alpha ({y^{(i)}} - {h_\theta }({x^{(i)}}))x_j^{(i)}\), for \(i = 1, \ldots ,m\): update as each sample arrives.
Mini-batch gradient descent: accumulate a small batch of samples, then update. A sketch of all three variants is given below.
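Below is a sketch of the three update schemes for the linear-regression cost; `X`, `y`, `alpha`, and the epoch counts are assumptions for illustration.

```python
import numpy as np

def batch_gd(X, y, alpha=0.01, iters=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - X @ theta) / len(y)  # full-sample gradient
    return theta

def sgd(X, y, alpha=0.01, epochs=20):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):          # one sample per update
            theta += alpha * (y[i] - X[i] @ theta) * X[i]
    return theta

def minibatch_gd(X, y, alpha=0.01, batch=32, epochs=20):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(y))
        for s in range(0, len(y), batch):                # a few samples per update
            b = idx[s:s + batch]
            theta += alpha * X[b].T @ (y[b] - X[b] @ theta) / len(b)
    return theta
```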
The "linear" in linear regression means linear in the parameters \(\theta \); the model can still be nonlinear in the inputs, e.g. \(y = {\theta _0} + {\theta _1}x + {\theta _2}{x^2}\)
(2) Locally Weighted Regression
\(\sum\limits_{i = 1}^m {{w^{(i)}}{{({y^{(i)}} - {\theta ^T}{x^{(i)}})}^2}} \)
A common choice is \({w^{(i)}} = \exp \left( { - \frac{{{{({x^{(i)}} - x)}^2}}}{{2{\tau ^2}}}} \right)\), where \(\tau \) is the bandwidth controlling how quickly the weights decay.
The loss function includes the weights w: training samples close to the query point receive weights near 1, distant samples receive small weights, and the weights take values in (0, 1] (see the sketch after the comparison table below).
| Linear regression | Locally weighted regression |
| --- | --- |
| Parametric learning method | Non-parametric learning method |
| Has a fixed, explicit set of parameters; once \(\theta \) is determined it never changes, so the training samples need not be retained | Each prediction learns a new \(\theta \); since \(\theta \) changes with every query, the training samples must always be retained |
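A sketch of locally weighted regression: each query solves its own weighted least-squares problem \(\theta = ({X^T}WX)^{-1}{X^T}Wy\), the minimizer of the weighted loss above. The bandwidth value and the assumption that `x_query` already contains the constant term are illustrative.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    # Gaussian weights: w_i = exp(-||x_i - x||^2 / (2 tau^2)), in (0, 1]
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equation: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta  # theta is re-learned for every query point
```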
(3) Logistic Regression: for classification problems
The logistic (sigmoid) function: \(g(z) = \frac{1}{{1 + {e^{ - z}}}}\)
\({h_\theta }(x) = g({\theta ^T}x) = \frac{1}{{1 + {e^{ - {\theta ^T}x}}}}\)
Assume \(p(y = 1|x;\theta ) = {h_\theta }(x)\) and \(p(y = 0|x;\theta ) = 1 - {h_\theta }(x)\)
Combined, \(p(y|x;\theta ) = {\left( {{h_\theta }(x)} \right)^y}{\left( {1 - {h_\theta }(x)} \right)^{1 - y}}\)
\(L(\theta ) = P(\overrightarrow y |X;\theta )\)
\( = \prod\limits_{i = 1}^m {p({y^{(i)}}|{x^{(i)}};\theta )} \)
\( = \prod\limits_{i = 1}^m {{{\left( {{h_\theta }({x^{(i)}})} \right)}^{{y^{(i)}}}}{{\left( {1 - {h_\theta }({x^{(i)}})} \right)}^{1 - {y^{(i)}}}}} \)
\(l(\theta ) = log L(\theta ) = \sum\limits_{i = 1}^m {{y^{(i)}}\log {h_\theta }({x^{(i)}}) + (1 - {y^{(i)}})\log \left( {1 - {h_\theta }({x^{(i)}})} \right)} \)
\(\frac{\partial }{{\partial {\theta _j}}}l(\theta ) = \left( {y\frac{1}{{g({\theta ^T}x)}} - (1 - y)\frac{1}{{1 - g({\theta ^T}x)}}} \right)\frac{\partial }{{\partial {\theta _j}}}g({\theta ^T}x)\)
\( = \left( {y\frac{1}{{g({\theta ^T}x)}} - (1 - y)\frac{1}{{1 - g({\theta ^T}x)}}} \right)g({\theta ^T}x)(1 - g({\theta ^T}x))\frac{\partial }{{\partial {\theta _j}}}{\theta ^T}x\)
\( = \left( {y(1 - g({\theta ^T}x)) - (1 - y)g({\theta ^T}x)} \right){x_j}\)
\( = \left( {y - yg({\theta ^T}x) - g({\theta ^T}x) + yg({\theta ^T}x)} \right){x_j}\)
\( = \left( {y - {h_\theta }(x)} \right){x_j}\)
where \(g'(x) = \frac{{{e^{ - x}}}}{{{{(1 + {e^{ - x}})}^2}}}\)
\( = \frac{1}{{1 + {e^{ - x}}}} \times \frac{{{e^{ - x}}}}{{1 + {e^{ - x}}}}\)
\( = \frac{1}{{1 + {e^{ - x}}}}\left( {1 - \frac{1}{{1 + {e^{ - x}}}}} \right)\)
\( = g(x)\left( {1 - g(x)} \right)\)
Gradient ascent then gives \({\theta _j}: = {\theta _j} + \alpha (y - {h_\theta }(x)){x_j}\), the same form as the linear-regression update. A training sketch follows.
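A sketch of logistic regression trained by gradient ascent on \(l(\theta )\); labels in {0,1} and a design matrix with a constant column are assumed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_fit(X, y, alpha=0.1, iters=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        # Gradient of the log-likelihood: X^T (y - h_theta(X)),
        # the same form as the linear-regression update.
        theta += alpha * X.T @ (y - sigmoid(X @ theta)) / len(y)
    return theta
```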
Odds: the ratio of the probability that an event occurs to the probability that it does not occur.
\(p(y = 1|x;\theta ) = {h_\theta }(x)\) \(p(y = 0|x;\theta ) = 1 - {h_\theta }(x)\)
Log-odds: \(\mathrm{logit}(p) = \log \frac{p}{{1 - p}}\)
\( = \log \frac{{{h_\theta }(x)}}{{1 - {h_\theta }(x)}}\)
\( = \log \left( {\frac{{\frac{1}{{1 + {e^{ - {\theta ^T}x}}}}}}{{\frac{{{e^{ - {\theta ^T}x}}}}{{1 + {e^{ - {\theta ^T}x}}}}}}} \right)\)
\( = {\theta ^T}x\)
Logistic regression is therefore a log-linear model.
The two loss functions of logistic regression (negative log-likelihood) correspond to the two label encodings \({y_i} \in \{ 0,1\} \) and \({y_i} \in \{ - 1,1\} \).
When \({y_i} \in \{ 0,1\} \):
\(L(\theta ) = \prod\limits_{i = 1}^m {p_i^{{y_i}}{{(1 - {p_i})}^{1 - {y_i}}}} \)
\(l(\theta ) = \sum\limits_{i = 1}^m {\ln \left[ {p_i^{{y_i}}{{(1 - {p_i})}^{1 - {y_i}}}} \right]} \)
where \({p_i} = \frac{1}{{1 + {e^{ - {f_i}}}}}\), \(1 - {p_i} = \frac{1}{{1 + {e^{{f_i}}}}}\), with \({f_i} = {\theta ^T}{x^{(i)}}\)
\(l(\theta ) = \sum\limits_{i = 1}^m {\ln \left[ {{{\left( {\frac{1}{{1 + {e^{ - {f_i}}}}}} \right)}^{{y_i}}}{{\left( {\frac{1}{{1 + {e^{{f_i}}}}}} \right)}^{1 - {y_i}}}} \right]} \)
\(loss({y_i},{{\hat y}_i}) = - l(\theta )\)
\( = \sum\limits_{i = 1}^m {\left[ {{y_i}\ln \left( {1 + {e^{ - {f_i}}}} \right) + (1 - {y_i})\ln \left( {1 + {e^{{f_i}}}} \right)} \right]} \)
When \({y_i} \in \{ -1,1\} \):
\(L(\theta ) = \prod\limits_{i = 1}^m {p_i^{\frac{{1 + {y_i}}}{2}}{{(1 - {p_i})}^{\frac{{1 - {y_i}}}{2}}}} \)
\(l(\theta ) = \sum\limits_{i = 1}^m {\ln \left[ {{{\left( {\frac{1}{{1 + {e^{ - {f_i}}}}}} \right)}^{\frac{{1 + {y_i}}}{2}}}{{\left( {\frac{1}{{1 + {e^{{f_i}}}}}} \right)}^{\frac{{1 - {y_i}}}{2}}}} \right]} \)
\(loss({y_i},{{\hat y}_i}) = - l(\theta )\)
\( = \sum\limits_{i = 1}^m {\left[ {\frac{1}{2}(1 + {y_i})\ln (1 + {e^{ - {f_i}}}) - \frac{1}{2}({y_i} - 1)\ln (1 + {e^{{f_i}}})} \right]} \)
When \({y_i} = 1\): \(\sum\limits_{i = 1}^m {\ln (1 + {e^{ - {f_i}}})} \)
When \({y_i} = -1\): \(\sum\limits_{i = 1}^m {\ln (1 + {e^{{f_i}}})} \)
Merging the two cases: \(\sum\limits_{i = 1}^m {\ln (1 + {e^{ - {y_i}{f_i}}})} \). A quick numerical check of this equivalence follows.
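The equivalence of the two encodings can be checked numerically by mapping \(y' = 2y - 1\) from {0,1} to {-1,1}; this tiny test is an illustrative addition.

```python
import numpy as np

f = np.linspace(-3.0, 3.0, 7)              # arbitrary scores f_i = theta^T x_i
for y01 in (0, 1):
    ypm = 2 * y01 - 1                      # map {0,1} -> {-1,1}
    loss01 = y01 * np.log1p(np.exp(-f)) + (1 - y01) * np.log1p(np.exp(f))
    losspm = np.log1p(np.exp(-ypm * f))
    assert np.allclose(loss01, losspm)     # the two loss forms agree
```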
Exponential family: probability distributions that can be written in exponential form.
\(p(y;\eta ) = b(y)\exp ({\eta ^T}T(y) - a(\eta ))\)
Here \(y\) is the random variable; \(\eta \) is the natural (canonical) parameter of the distribution; \(T(y)\) is the sufficient statistic, usually \(T(y) = y\); and \(a(\eta )\) is the log-partition function.
\({e^{ - a(\eta )}}\) is essentially a normalization constant, ensuring that \(p(y;\eta )\) sums (or integrates) to 1.
Once a, b, and T are fixed, this defines a family of distributions parameterized by \(\eta \).
The Bernoulli distribution belongs to the exponential family:
\(p(y;\varphi ) = {\varphi ^y}{(1 - \varphi )^{1 - y}}\) \(y \in \{ 0,1\} \)
\( = \exp (y\log \varphi + (1 - y)\log (1 - \varphi ))\)
\( = \exp \left( {\log \left( {\frac{\varphi }{{1 - \varphi }}} \right)y + \log \left( {1 - \varphi } \right)} \right)\)
where \(b(y) = 1\), \(T(y) = y\), \(\eta = \log \left( {\frac{\varphi }{{1 - \varphi }}} \right) \Rightarrow \varphi = \frac{1}{{1 + {e^{ - \eta }}}}\), and \(a(\eta ) = - \log \left( {1 - \varphi } \right) = \log (1 + {e^\eta })\). A quick check of this link appears below.
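A one-line numerical check of this link between \(\eta \) and \(\varphi \) (an illustrative addition):

```python
import numpy as np

phi = 0.3
eta = np.log(phi / (1 - phi))                   # eta = log(phi / (1 - phi))
assert np.isclose(1 / (1 + np.exp(-eta)), phi)  # inverts to the logistic function
```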
The Gaussian distribution also belongs to the exponential family (taking unit variance, \(N(\mu ,1)\)):
\(p(y;\mu ) = \frac{1}{{\sqrt {2\pi } }}\exp \left( { - \frac{1}{2}{{(y - \mu )}^2}} \right)\)
\( = \frac{1}{{\sqrt {2\pi } }}\exp \left( { - \frac{1}{2}{y^2} - \frac{1}{2}{\mu ^2} + \mu y} \right)\)
\( = \frac{1}{{\sqrt {2\pi } }}\exp \left( { - \frac{1}{2}{y^2}} \right)\exp \left( {\mu y - \frac{1}{2}{\mu ^2}} \right)\)
where \(b(y) = \frac{1}{{\sqrt {2\pi } }}\exp \left( { - \frac{1}{2}{y^2}} \right)\), \(T(y) = y\), \(\eta = \mu \), \(a(\eta ) = \frac{1}{2}{\mu ^2} = \frac{1}{2}{\eta ^2}\)
Any random variable whose distribution belongs to the exponential family can be analyzed with a generalized linear model (GLM).
In the examples above, \(\eta \) relates to the Bernoulli parameter \(\varphi \) through the logistic function, while \(\eta \) simply equals the Gaussian parameter \(\mu \). In general, \(\eta \) connects to the parameters of other distributions through different mapping functions, each yielding a different model. GLMs treat all members of the exponential family as extensions of the linear model: nonlinear link functions map the linear function into other spaces, greatly broadening the range of problems linear models can solve.
Input \(x \to g(x) \to y\), where \(g(x)\) is the link function.
Constructing a generalized linear model requires three assumptions:
1. \(y|x;\theta \sim \text{ExpFamily}(\eta )\): given the sample \(x\) and parameters \(\theta \), the response \(y\) follows some member of the exponential family.
2. Given \(x\), the goal is to predict the expectation of \(T(y)\), i.e. \(h(x) = E\left[ {T(y)|x} \right]\). In most cases \(T(y) = y\), so what we actually need is \(h(x) = E\left[ {y|x} \right]\).
3. The natural parameter \(\eta \) is linear in \(x\): \(\eta = {\theta ^T}x\).
From these three assumptions both the logistic model and the least-squares model can be derived.
Deriving the logistic model:
\({h_\theta }(x) = E\left[ {T(y)|x} \right] = E\left[ {y|x} \right]\)
\( = p(y = 1|x;\theta ) = \varphi \)
\( = \frac{1}{{1 + {e^{ - \eta }}}} = \frac{1}{{1 + {e^{ - {\theta ^T}x}}}}\)
Deriving the least-squares model:
\({h_\theta }(x) = E\left[ {T(y)|x} \right] = E\left[ {y|x} \right]\)
\( = \mu = \eta = {\theta ^T}x\)
Thus a generalized linear model produces different models from different assumed distributions; \(\theta \) is then solved for with gradient descent or Newton's method.
(4) Softmax Regression
The GLM derived from the multinomial distribution handles multi-class problems and extends the logistic model. Applications: email classification, predicting a patient's disease, and so on.
\(y \in \{ 1,2,3, \cdots ,k\} \)
\(p(y = i) = {\varphi _i}\), subject to \(\sum\limits_{i = 1}^k {{\varphi _i}} = 1\)
\({\varphi _k}\) can be expressed through the first \(k - 1\) parameters: \({\varphi _k} = 1 - \sum\limits_{i = 1}^{k - 1} {{\varphi _i}} \)
To write the multinomial distribution in exponential-family form, first define \(T(y)\):
\(T(1) = \left[ \begin{array}{l}
1\\
0\\
0\\
\vdots \\
0
\end{array} \right]{\rm{, }}T(2) = \left[ \begin{array}{l}
0\\
1\\
0\\
\vdots \\
0
\end{array} \right]{\rm{,}} \cdots {\rm{ }}T(k - 1) = \left[ \begin{array}{l}
0\\
0\\
0\\
\vdots \\
1
\end{array} \right],T(k) = \left[ \begin{array}{l}
0\\
0\\
0\\
\vdots \\
0
\end{array} \right]\)
Also introduce the indicator function \(I\), with \(I(\text{True}) = 1\) and \(I(\text{False}) = 0\).
An element of the vector \(T(y)\) can then be written as \(T{(y)_i} = I(y = i)\)
For example, when \(y = 2\): \(T{(2)_2} = I(2 = 2) = 1\), \(T{(2)_3} = I(2 = 3) = 0\)
Therefore, \(E[T{(y)_i}] = \sum\limits_{y = 1}^k {T{{(y)}_i}{\varphi _y}} = \sum\limits_{y = 1}^k {I(y = i){\varphi _y}} = {\varphi _i}\)
That is, \(E[T{(y)_i}] = p(y = i) = {\varphi _i}\)
Converting the multinomial distribution to the exponential-family form:
\(p(y;\varphi ) = \varphi _1^{I(y = 1)}\varphi _2^{I(y = 2)} \cdots \varphi _k^{I(y = k)}\)
\( = \varphi _1^{I(y = 1)}\varphi _2^{I(y = 2)} \cdots \varphi _k^{1 - \sum\limits_{i = 1}^{k - 1} {I(y = i)} }\)
\( = \varphi _1^{T{{(y)}_1}}\varphi _2^{T{{(y)}_2}} \cdots \varphi _k^{1 - \sum\limits_{i = 1}^{k - 1} {T{{(y)}_i}} }\)
\( = \exp \left( {T{{(y)}_1}\log {\varphi _1} + T{{(y)}_2}\log {\varphi _2} + \cdots + (1 - \sum\limits_{i = 1}^{k - 1} {T{{(y)}_i}} )\log {\varphi _k}} \right)\)
\( = \exp \left( {T{{(y)}_1}\log \frac{{{\varphi _1}}}{{{\varphi _k}}} + T{{(y)}_2}\log \frac{{{\varphi _2}}}{{{\varphi _k}}} + \cdots + T{{(y)}_{k - 1}}\log \frac{{{\varphi _{k - 1}}}}{{{\varphi _k}}} + \log {\varphi _k}} \right)\)
\( = b(y)\exp \left( {{\eta ^T}T(y) - a(\eta )} \right)\)
where \(b(y) = 1\), \(a(\eta ) = - \log {\varphi _k}\), and \(\eta = \left[ \begin{array}{l}
\log \frac{{{\varphi _1}}}{{{\varphi _k}}}\\
\log \frac{{{\varphi _2}}}{{{\varphi _k}}}\\
\vdots \\
\log \frac{{{\varphi _{k - 1}}}}{{{\varphi _k}}}
\end{array} \right]\)
Since the multinomial distribution can be written in exponential-family form, it belongs to the exponential family, and a generalized linear model can be used to fit it.
\({\eta _i} = \log \frac{{{\varphi _i}}}{{{\varphi _k}}}\), and define \({\eta _k} = \log \frac{{{\varphi _k}}}{{{\varphi _k}}} = 0\)
Then \({e^{{\eta _i}}} = \frac{{{\varphi _i}}}{{{\varphi _k}}} \Rightarrow {\varphi _k}{e^{{\eta _i}}} = {\varphi _i} \Rightarrow {\varphi _k}\sum\limits_{i = 1}^k {{e^{{\eta _i}}}} = \sum\limits_{i = 1}^k {{\varphi _i}} = 1\)
Hence \({\varphi _k} = \frac{1}{{\sum\limits_{d = 1}^k {{e^{{\eta _d}}}} }}\), and therefore \({\varphi _i} = \frac{{{e^{{\eta _i}}}}}{{\sum\limits_{d = 1}^k {{e^{{\eta _d}}}} }}\)
This function giving \({\varphi _i}\) in terms of \({\eta _i}\) is called the softmax function; a numerically stable sketch follows.
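A small sketch of the softmax function; subtracting max(eta) before exponentiating is a standard stability trick (the shift cancels between numerator and denominator, so the result is unchanged).

```python
import numpy as np

def softmax(eta):
    z = eta - np.max(eta)      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()         # phi_i = e^{eta_i} / sum_d e^{eta_d}

print(softmax(np.array([1.0, 2.0, 3.0])))  # components sum to 1
```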
By assumption 3 of the generalized linear model:
\({\eta _i} = \theta _i^Tx\)(for \(i = 1, \cdots ,k - 1\))
Define \({\theta _k} = 0\), so that \({\eta _k} = \theta _k^Tx = 0\)
The distribution of \(y\) given \(x\), \(p(y = i|x;\theta )\), is then:
\(p(y = i|x;\theta ) = {\varphi _i}\)
\( = \frac{{{e^{{\eta _i}}}}}{{\sum\limits_{d = 1}^k {{e^{{\eta _d}}}} }}\)
\( = \frac{{{e^{\theta _i^Tx}}}}{{\sum\limits_{d = 1}^k {{e^{\theta _d^Tx}}} }}\)
\({h_\theta }(x) = E[T(y)|x;\theta ]\)
\( = \left[ \begin{array}{l}
{\varphi _1}\\
{\varphi _2}\\
\vdots \\
{\varphi _{k - 1}}
\end{array} \right]\)
Substituting the expression for \({\varphi _i}\) above yields the explicit form of \({h_\theta }(x)\).
Next, the parameters \(\theta \) are obtained by maximum likelihood.
\(L(\theta ) = \prod\limits_{i = 1}^m {p({y^{(i)}}|{x^{(i)}};\theta )} \), where m is the number of samples and k the number of classes
\( = \prod\limits_{i = 1}^m {\prod\limits_{j = 1}^k {\varphi _j^{I({y^{(i)}} = j)}} } \)
\(l(\theta ) = \sum\limits_{i = 1}^m {\sum\limits_{j = 1}^k {I({y^{(i)}} = j)\log } } {\varphi _j}\)
Isolating the class-j term, \(l(\theta ) = \sum\limits_{i = 1}^m {\left( {I({y^{(i)}} = j)\log {\varphi _j} + \sum\limits_{s \ne j}^k {I({y^{(i)}} = s)\log } {\varphi _s}} \right)} \)
\(\frac{\partial }{{\partial {\theta _j}}}l(\theta ) = \sum\limits_{i = 1}^m {\left( {\frac{{I({y^{(i)}} = j)}}{{{\varphi _j}}}\frac{\partial }{{\partial {\theta _j}}}{\varphi _j} + \sum\limits_{s \ne j}^k {\frac{{I({y^{(i)}} = s)}}{{{\varphi _s}}}\frac{\partial }{{\partial {\theta _j}}}} {\varphi _s}} \right)} \)
\(\frac{\partial }{{\partial {\theta _j}}}{\varphi _j} = \frac{{{e^{\theta _j^Tx}}\sum\limits_{d = 1}^k {{e^{\theta _d^Tx}}} - {{\left( {{e^{\theta _j^Tx}}} \right)}^2}}}{{{{\left( {\sum\limits_{d = 1}^k {{e^{\theta _d^Tx}}} } \right)}^2}}}x\)
\( = \left( {\frac{{{e^{\theta _j^Tx}}}}{{\sum\limits_{d = 1}^k {{e^{\theta _d^Tx}}} }} - \frac{{{{\left( {{e^{\theta _j^Tx}}} \right)}^2}}}{{{{\left( {\sum\limits_{d = 1}^k {{e^{\theta _d^Tx}}} } \right)}^2}}}} \right)x\)
\( = {\varphi _j}(1 - {\varphi _j})x\)
and for \(s \ne j\):
\(\frac{\partial }{{\partial {\theta _j}}}{\varphi _s} = - \frac{{{e^{\theta _s^Tx}}{e^{\theta _j^Tx}}}}{{{{\left( {\sum\limits_{d = 1}^k {{e^{\theta _d^Tx}}} } \right)}^2}}}x = - {\varphi _s}{\varphi _j}x\)
Therefore, \(\frac{\partial }{{\partial {\theta _j}}}l(\theta ) = \sum\limits_{i = 1}^m {\left( {I({y^{(i)}} = j)(1 - {\varphi _j}) - \sum\limits_{s \ne j}^k {I({y^{(i)}} = s){\varphi _j}} } \right)} {x^{(i)}}\)
\( = \sum\limits_{i = 1}^m {{x^{(i)}}\left( {I({y^{(i)}} = j)(1 - {\varphi _j}) - (1 - I({y^{(i)}} = j)){\varphi _j}} \right)} \)
\( = \sum\limits_{i = 1}^m {{x^{(i)}}\left( {I({y^{(i)}} = j) - {\varphi _j}} \right)} \)
Gradient ascent then yields the parameters \(\theta \), and the hypothesis \({h_\theta }(x)\) can be used to predict new examples, completing the multi-class task. A training sketch follows.
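A sketch of softmax regression trained by gradient ascent using the gradient \(\sum\nolimits_i {{x^{(i)}}(I({y^{(i)}} = j) - {\varphi _j})} \) derived above; data shapes, the learning rate, and the iteration count are illustrative assumptions.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)       # row-wise stable softmax
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def softmax_fit(X, y, k, alpha=0.1, iters=1000):
    m, n = X.shape
    Theta = np.zeros((n, k))                   # one theta_j per class
    Y = np.eye(k)[y]                           # one-hot rows: I(y^{(i)} = j)
    for _ in range(iters):
        Phi = softmax_rows(X @ Theta)          # phi_j for every sample
        Theta += alpha * X.T @ (Y - Phi) / m   # gradient-ascent step
    return Theta

def softmax_predict(X, Theta):
    return np.argmax(X @ Theta, axis=1)        # most probable class
```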
Note: logistic/softmax regression is an important practical method for classification problems; it is simple, easy to implement, effective, and easy to interpret. It is also used in recommender systems.
Feature selection matters: besides manual selection, other machine-learning methods can be used, such as random forests, PCA, or LDA.
Gradient descent is a key tool for parameter optimization, especially SGD, which suits online learning and can help escape local minima.
Choosing the learning rate \(\alpha \): empirically, try values from the sequence 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ... Start with \(\alpha = 0.001\); if the result is not as expected, multiply by roughly 3 to get 0.003, then 0.01, and so on. For each candidate \(\alpha \), plot \(J(\theta )\) against the number of iterations and pick an \(\alpha \) whose curve decreases quickly. In other words, scan a series of \(\alpha \) values spaced by alternating factors of about 3 and 10 until you find one value that is clearly too small and another that is clearly too large; the largest workable \(\alpha \), or one slightly smaller, is the final choice. A sketch of this recipe follows.
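A sketch of this recipe, recording the \(J(\theta )\) curve for each candidate \(\alpha \) (the candidate list follows the text; plotting and the data `X`, `y` are left to the caller):

```python
import numpy as np

def cost_curve(X, y, alpha, iters=200):
    theta = np.zeros(X.shape[1])
    costs = []
    for _ in range(iters):
        r = X @ theta - y
        costs.append(0.5 * np.dot(r, r))       # J(theta)
        theta -= alpha * X.T @ r / len(y)      # batch gradient-descent step
    return costs

alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
# curves = {a: cost_curve(X, y, a) for a in alphas}  # plot and compare, given X, y
```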