拟牛顿法

1 拟牛顿法

牛顿法:$$x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k)$$
牛顿法的缺陷在于 \(\nabla^2 f(x_k)\) 难以求取和存储,因此有拟牛顿法(Quasi-Newton methods),通过在牛顿法的迭代中加入近似求取每一步\(Hessian\)矩阵的迭代步,仅通过迭代点处的梯度信息来求取\(Hessian\)矩阵的近似值。

我们把\(Hessian\)矩阵在第\(k\)步的迭代的近似值记作\(B_k\),通过\(B_k\)代替\(\nabla^2 f(x_k)\),得到新的迭代式:$$x_{k+1} = x_k - B_k^{-1} \nabla f(x_k)$$

为了保证\(B_{k+1}\)能较好地近似第\(k+1\)步的\(Hessian\)矩阵,令\(H_k = B_k^{-1}\),可以得到以下两个式子:

\[\begin{aligned} x_{k+1} = x_k - H_k \nabla f(x_k) \\ H_{k+1} = f(H_k) \end{aligned}\]

注意: 拟牛顿法是一种思想而不是具体方法,只要能满足上述迭代式并且收敛,那就是拟牛顿法

2 算法框架

\(H_0 = I\),给出迭代初始值\(x_0\)和迭代停止阈值\(\epsilon > 0\)

while(1) {
计算梯度:\(\nabla f(x_k)\);
if(\(\| \nabla f(x_k) \|\) < \(\epsilon\))
break;
计算搜索方向:\(d_k = -H_k \nabla f(x_k)\);
更新迭代点:\(x_{k+1} = x_k - d_k\);
根据\(x_k\)处的信息更新\(H_k\);
}

为了观察\(f(x_{k+1})和f(x_k)\)的差值,使用泰勒展开:$$f(x_{k+1}) = f(x_k) + g_k^T d + o(|d|^2)$$
这里默认\(g_k^T<0\),但因为这里\(x_k\)\(g_k\)(迭代点\(x_k\)处的梯度)固定,只能改变 d,因此问题转化为$$\max_d |g_k^Td|$$
\(\cos<g_k, d>\)固定时,只要\(\|d\|^2\)够大,\(|g_k^Td|\)会变成\(\infty\),需要进行约束。规定\(\|d\|_{B_k}\), 定义为\(\|d^TB_kd\|\),则:$$\max_d |g_k^Td| \qquad s.t. |d|_{B_k} = 1$$
\(Cauchy-Schwartz\)不等式,

\[|g_k^Td| = \sqrt{(g_k^Td)^2} \leq \sqrt{(g_k^TB_k^{-1}g_k)(d^TB_kd)} = \sqrt{g_k^TB_k^{-1}g_k} \]

当且仅当\(d = -B_k^{-1}g_k\), "="

3 如何确定\(H_k\)的迭代更新?

\(s_k = x_{k+1} - x_k, y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\),则$$B_{k+1}s_k = y_k$$

\[H_{k+1}y_k = s_k \]

SR1

根据\(x_k\)处的信息得到一个修正量\(\Delta H_k\)来更新:$$H_{k+1} = H_k + \Delta H_k$$

由于我们希望\(H_k\)接近\(\nabla^2 f(x_k)^{-1}\), \(H_{k+1}\)接近\(\nabla^2 f(x_{k+1})^{-1}\),有 $$\Delta H_k = \nabla2f(x_{k+1}) - \nabla2f(x_k)$$
由于\(\nabla^2f(x_{k+1})^{-1} 和 \nabla^2f(x_k)^{-1}\)对称,所以\(\Delta H_k\)对称。

该算法设\(\delta H_k = \beta uu^T\),那么迭代式为$$H_{k+1} = H_k + \beta uu^T \tag{1}$$
其中\(\beta \in \mathbb{R}, u \in \mathbb{R}^n\)

\((1)\)两边乘上\(y_k\),再根据\(H_{k+1}y_k = s_k\),有$$H_{k+1}y_k = H_ky_k + \beta uu^T y_k = s_k$$

\[\rightarrow s_k - H_ky_k = \beta (u^Ty_k)u \tag{2} \]

\(\beta (u^T y_k)\)是实数,有 $$u = \frac{1}{\beta (u^T y_k)} (s_k - H_ky_k)$$
\(\gamma = \frac{1}{\beta (u^T y_k)}\),所以$$u = \gamma (s_k - H_ky_k) \tag{3}$$
(2)可以表示成\(s_k - H_ky_k = \beta uu^T y_k\),将(3)代入,有

\[\begin{aligned} s_k - H_ky_k & = \beta \gamma^2 (s_k - H_ky_k)(s_k - H_ky_k)^Ty_k \\ & = \beta \gamma^2 (s_k - H_ky_k)[(s_k - H_ky_k)^Ty_k] \end{aligned} \tag{4} \]

注意到\(\beta \gamma^2 \in \mathbb{R}\)\((s_k - H_ky_k)^Ty_k \in \mathbb{R}\),所以$$s_k - H_ky_k = [\beta \gamma^2(s_k - H_k y_k)^T y_k](s_k - H_ky_k)$$
显然当\(\beta \gamma^2(s_k - H_k y_k)^T y_k = 1\)时成立,则$$\beta \gamma^2 = \frac{1}{(s_k-H_ky_k)^Ty_k} \tag{5}$$
将(3)(5)代入(1),有

\[\begin{aligned} H_{k+1} & = H_k +\beta \gamma^2(s_k - H_ky_k)(s_k - H_ky_k)^T \\ & = H_k + \frac{(s_k-H_ky_k)(s_k-H_ky_k)^T}{(s_k-H_ky_k)^Ty_k} \end{aligned} \]

\(H_0 = I\),给出迭代初始值\(x_0\)和迭代停止阈值\(\epsilon > 0\)

while(1) {
计算梯度:\(\nabla f(x_k)\);
if(\(\| \nabla f(x_k) \|\) < \(\epsilon\))
计算搜索方向:\(d_k = -H_k \nabla f(x_k)\);
更新迭代点:\(x_{k+1} = x_k - d_k\);
更新\(H_{k+1} = H_k + \frac{(s_k-H_ky_k)(s_k-H_ky_k)^T}{(s_k-H_ky_k)^Ty_k}\);
}

DFP

其基本思路同\(SR1\)类似,我们令\(\Delta H_k = \beta uu^T + \gamma vv^T\),有$$H_{k+1} = H_k + \beta uu^T + \gamma vv^T \tag{1} $$
\(\beta\)\(\gamma\)是待定系数

在上式两边右乘\(y_k\): $$s_k = H_{k+1}y_k = H_ky_k + \beta uu^Ty_k + \gamma vv^T y_k \tag{2}$$
\(u = s_k, v = H_ky_k\)(取的过程我不知道┭┮﹏┭┮),改写(2): $$s_k - H_ky_k = \beta uu^T y_k + \gamma vv^T y_k$$
\(\beta uu^T y_k = s_k, \gamma vv^T y_k = - H_ky_k\),所以 $$u = s_k = \beta uu^Ty_k = (\beta u^Ty_k)u$$
为了上式恒成立,有\(\beta u^Ty_k = 1\),因此$$\beta = \frac{1}{u^Ty_k} = \frac{1}{s_k^Ty_k}$$
同理, \(\gamma = - \frac{1}{v^Ty_k} = - \frac{1}{y^T_kH_ky_k}\)

\(\beta, \gamma, u, v\)代入(1),我们可以得到迭代式 $$H_{k+1} = H_k + \frac{s_ks_kT}{s_KTy_k} - \frac{H_ky_ky_kTH_k}{y_kTH_ky_k}$$

\(H_0 = I\),给出迭代初始值\(x_0\)和迭代停止阈值\(\epsilon > 0\)

while(1) {
计算梯度:\(\nabla f(x_k)\);
if(\(\| \nabla f(x_k) \|\) < \(\epsilon\))
计算搜索方向:\(d_k = -H_k \nabla f(x_k)\);
更新迭代点:\(x_{k+1} = x_k - d_k\);
更新\(H_{k+1} = H_k + \frac{s_ks_k^T}{s_K^Ty_k} - \frac{H_ky_ky_k^TH_k}{y_k^TH_ky_k}\);
}

BFGS

考虑对\(B_k\) (对\(\nabla^2 f(x_k)\)的近似) 的修正,同DFP得到$$B_{k+1} = B_{k} + \frac{y_ky_kT}{y_kTs_k} - \frac{B_ks_ks_kTB_k}{s_kTB_ks_k} \tag{1} $$

引理1(矩阵求逆引理)

\[\begin{aligned} (A^{-1}+BD^{-1}C)^{-1} & \equiv A - AB(D+CAB)^{-1}CA \\ (D+CAB)^{-1} & \equiv D^{-1} - D^{-1}C(A^{-1}+BD^{-1}C)^{-1}BD^{-1} \\ AB(D+CAB)^{-1} & \equiv (A^{-1} + BD^{-1}C)^{-1}BD^{-1} \\ (D+CAB)^{-1}CA & \equiv D^{-1}C(A^{-1}+BD^{-1}C)^{-1}\end{aligned} \]

证明

对一个矩阵分别进行LDU和UDL分解:

\[\left[ \begin{matrix} A^{-1} & -B \\ C & D \end{matrix} \right] = \left[ \begin{matrix} 1 & 0 \\ CA & 1 \end{matrix} \right] \left[ \begin{matrix} A^{-1} & 0 \\ 0 & D+CAB \end{matrix} \right] \left[ \begin{matrix} 1 & -AB \\ 0 & 1 \end{matrix} \right] \tag{LDU} \]

\[\left[ \begin{matrix} A^{-1} & -B \\ C & D \end{matrix} \right] = \left[ \begin{matrix} 1 & -BD^{-1} \\ 0 & 1 \end{matrix} \right] \left[ \begin{matrix} A^{-1}+BD^{-1}C & 0 \\ 0 & D \end{matrix} \right] \left[ \begin{matrix} 1 & 0 \\ D^{-1}C & 1 \end{matrix} \right] \tag{ULD} \]

两边同时求逆

\[\begin{aligned} (\left[ \begin{matrix} A^{-1} & -B \\ C & D \end{matrix} \right])^{-1} & = \left[ \begin{matrix} 1 & AB \\ 0 & 1 \end{matrix} \right] \left[ \begin{matrix} A & 0 \\ 0 & (D+CAB)^{-1} \end{matrix} \right] \left[ \begin{matrix} 1 & 0 \\ -CA & 1 \end{matrix} \right] \\ & = \left[ \begin{matrix} A-AB(D+CAB)^{-1}CA & AB(D+CAB)^{-1} \\ -(D+CAB)^{-1}CA & (D+CAB)^{-1} \end{matrix} \right] \end{aligned} \]

\[\begin{aligned} (\left[ \begin{matrix} A^{-1} & -B \\ C & D \end{matrix} \right])^{-1} & = \left[ \begin{matrix} 1 & 0 \\ -D^{-1}C & 1 \end{matrix} \right] \left[ \begin{matrix} (A^{-1}+BD^{-1}C)^{-1} & 0 \\ 0 & D^{-1} \end{matrix} \right] \left[ \begin{matrix} 1 & BD^{-1} \\ 0 & 1 \end{matrix} \right] \\ & = \left[ \begin{matrix} (A^{-1}+BD^{-1}C)^{-1} & (A^{-1}+BD^{-1}C)^{-1}BD^{-1} \\ -D^{-1}C(A^{-1}+BD^{-1}C)^{-1} & D^{-1}-D^{-1}C(A^{-1}+BD^{-1}C)^{-1}BD^{-1} \end{matrix} \right] \end{aligned} \]

参考 《机器人学中的状态估计》 @深蓝学院

递归使用上面的引理1,令\(A = B_k + \frac{y_ky_k^T}{y_k^Ts_k},u = - \frac{B_ks_k}{s_k^TB_ks_k}, v = B_ks_k\)代入SMW公式,其中\(A^{-1}\)的计算再用一次SMW,可以得到

\[\begin{aligned} H_{k+1} = B^{-1}_{k+1} & = (B_k + \frac{y_ky_k^T}{y_k^Ts_k} - \frac{B_ks_ks_k^TB_k}{s_k^TB_ks_k})^{-1} \\ & = H_k + (1+\frac{y_k^TH_ky_k}{s_k^Ty_k}) \frac{s_ks_k^T}{s_k^Ty_k}-(\frac{s_ky_k^TH_k+H_ky_ks_k^T}{s_k^Ty_k}) \end{aligned} \]

\(H_0 = I\),给出迭代初始值\(x_0\)和迭代停止阈值\(\epsilon > 0\)

while(1) {
计算梯度:\(\nabla f(x_k)\);
if(\(\| \nabla f(x_k) \|\) < \(\epsilon\))
计算搜索方向:\(d_k = -H_k \nabla f(x_k)\);
更新迭代点:\(x_{k+1} = x_k - d_k\);
更新\(H_{k+1} = H_k + (1+\frac{y_k^TH_ky_k}{s_k^Ty_k}) \frac{s_ks_k^T}{s_k^Ty_k}-(\frac{s_ky_k^TH_k+H_ky_ks_k^T}{s_k^Ty_k})\);
}

posted @ 2023-12-15 23:07  诩言Wan  阅读(127)  评论(0编辑  收藏  举报