Hands-On PyTorch: Gradient Descent

Gradient Descent

(Reference: Boyd & Vandenberghe, 2004)

Gradient Descent in One Dimension

Claim: moving the variable in the direction opposite to the gradient decreases the function value.

Taylor expansion:

\[f(x+\epsilon)=f(x)+\epsilon f^{\prime}(x)+\mathcal{O}\left(\epsilon^{2}\right) \]

Substitute a step of \(-\eta f^{\prime}(x)\), i.e. a move of size \(\eta f^{\prime}(x)\) against the gradient direction (with learning rate \(\eta > 0\)):

\[f\left(x-\eta f^{\prime}(x)\right)=f(x)-\eta f^{\prime 2}(x)+\mathcal{O}\left(\eta^{2} f^{\prime 2}(x)\right) \]

For small enough \(\eta > 0\), the nonpositive first-order term \(-\eta f^{\prime 2}(x)\) dominates the higher-order remainder, hence

\[f\left(x-\eta f^{\prime}(x)\right) \lesssim f(x) \]

which motivates the update rule

\[x \leftarrow x-\eta f^{\prime}(x) \]

Example:

\[f(x) = x^2 \]
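As a minimal sketch (the helper `gd` and all constants are illustrative, not the original post's code), gradient descent on \(f(x) = x^2\) with \(f'(x) = 2x\):

```python
def gd(eta, x0=10.0, steps=10):
    """Gradient descent on f(x) = x**2, whose derivative is 2 * x."""
    x, trajectory = x0, [x0]
    for _ in range(steps):
        x -= eta * 2 * x          # x <- x - eta * f'(x)
        trajectory.append(x)
    return trajectory

print(gd(0.2))  # the iterates shrink geometrically toward x* = 0
```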

Learning Rate
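If the learning rate is too small, progress per step is tiny and many iterations are needed; if it is too large, each update overshoots the minimum and the iterates can diverge. Reusing the hypothetical `gd` helper above:

```python
print(gd(0.05))  # too small: after 10 steps x is still far from 0
print(gd(1.1))   # too large: |x| grows every step, so the iterates diverge
```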

Local Minima

Example:

\[f(x) = x\cos(cx) \]
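A sketch of how gradient descent can stall in a local minimum of this function; the constant \(c\), the learning rate, and the starting point are illustrative choices, not from the original:

```python
import math

c = 0.15 * math.pi

def f_grad(x):
    # d/dx [x * cos(c*x)] = cos(c*x) - c*x*sin(c*x)
    return math.cos(c * x) - c * x * math.sin(c * x)

x, eta = 10.0, 2.0
for _ in range(100):
    x -= eta * f_grad(x)
print(x)  # typically stalls at a local minimum; which one depends on eta and x0
```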

Gradient Descent in Multiple Dimensions

\[\nabla f(\mathbf{x})=\left[\frac{\partial f(\mathbf{x})}{\partial x_{1}}, \frac{\partial f(\mathbf{x})}{\partial x_{2}}, \dots, \frac{\partial f(\mathbf{x})}{\partial x_{d}}\right]^{\top} \]

\[f(\mathbf{x}+\epsilon)=f(\mathbf{x})+\epsilon^{\top} \nabla f(\mathbf{x})+\mathcal{O}\left(\|\epsilon\|^{2}\right) \]

\[\mathbf{x} \leftarrow \mathbf{x}-\eta \nabla f(\mathbf{x}) \]

Example:

\[f(\mathbf{x}) = x_1^2 + 2x_2^2 \]
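A minimal sketch of the multivariate update on this objective, whose gradient is \(\nabla f(\mathbf{x}) = (2x_1, 4x_2)^{\top}\):

```python
eta = 0.1
x1, x2 = -5.0, -2.0
for _ in range(20):
    g1, g2 = 2 * x1, 4 * x2                  # gradient of x1^2 + 2*x2^2
    x1, x2 = x1 - eta * g1, x2 - eta * g2    # x <- x - eta * grad f(x)
print(x1, x2)  # both coordinates approach the minimizer (0, 0)
```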

Adaptive Methods

Newton's Method

Taylor expansion of \(f\) at \(\mathbf{x} + \boldsymbol{\epsilon}\):

\[f(\mathbf{x}+\epsilon)=f(\mathbf{x})+\epsilon^{\top} \nabla f(\mathbf{x})+\frac{1}{2} \epsilon^{\top} \nabla \nabla^{\top} f(\mathbf{x}) \epsilon+\mathcal{O}\left(\|\epsilon\|^{3}\right) \]

At a minimum, \(\nabla f(\mathbf{x})=0\) holds, so we want to choose \(\boldsymbol{\epsilon}\) such that \(\nabla f(\mathbf{x} + \boldsymbol{\epsilon})=0\). Differentiating the expansion above with respect to \(\boldsymbol{\epsilon}\) and dropping the higher-order terms gives:

\[\nabla f(\mathbf{x})+\boldsymbol{H}_{f} \boldsymbol{\epsilon}=0 \text { and hence } \epsilon=-\boldsymbol{H}_{f}^{-1} \nabla f(\mathbf{x}) \]
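In one dimension this update reduces to \(x \leftarrow x - f'(x)/f''(x)\). A sketch on an illustrative convex function (my choice, not from the original):

```python
import math

# f(x) = x**2 + exp(x) is convex: f'(x) = 2x + exp(x), f''(x) = 2 + exp(x)
x = 2.0
for _ in range(6):
    x -= (2 * x + math.exp(x)) / (2 + math.exp(x))  # Newton step
print(x)  # converges rapidly to the root of f'(x) = 0, i.e. the minimizer
```

Because each step is rescaled by the local curvature, no learning rate needs to be tuned.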

Convergence Analysis

We consider only the convergence rate in the case where the function is convex and \(f''(x^*) > 0\) at the minimizer:

Let \(x_k\) be the value of \(x\) after the \(k\)-th iteration, and let \(e_{k}:=x_{k}-x^{*}\) denote the distance from \(x_k\) to the minimizer \(x^{*}\). Expanding \(f'\) around \(x_k\) and using \(f'(x^{*}) = 0\):

\[0=f^{\prime}\left(x_{k}-e_{k}\right)=f^{\prime}\left(x_{k}\right)-e_{k} f^{\prime \prime}\left(x_{k}\right)+\frac{1}{2} e_{k}^{2} f^{\prime \prime \prime}\left(\xi_{k}\right) \quad \text{for some } \xi_{k} \in\left[x_{k}-e_{k}, x_{k}\right] \]

Dividing both sides by \(f''(x_k)\) and rearranging:

\[e_{k}-f^{\prime}\left(x_{k}\right) / f^{\prime \prime}\left(x_{k}\right)=\frac{1}{2} e_{k}^{2} f^{\prime \prime \prime}\left(\xi_{k}\right) / f^{\prime \prime}\left(x_{k}\right) \]

Substituting the Newton update \(x_{k+1} = x_{k} - f^{\prime}\left(x_{k}\right) / f^{\prime \prime}\left(x_{k}\right)\) gives:

\[x_k - x^{*} - f^{\prime}\left(x_{k}\right) / f^{\prime \prime}\left(x_{k}\right) =\frac{1}{2} e_{k}^{2} f^{\prime \prime \prime}\left(\xi_{k}\right) / f^{\prime \prime}\left(x_{k}\right) \]

\[x_{k+1} - x^{*} = e_{k+1} = \frac{1}{2} e_{k}^{2} f^{\prime \prime \prime}\left(\xi_{k}\right) / f^{\prime \prime}\left(x_{k}\right) \]

When \(\frac{1}{2}\left|f^{\prime \prime \prime}\left(\xi_{k}\right) / f^{\prime \prime}\left(x_{k}\right)\right| \leq c\) for some constant \(c\), we have:

\[\left|e_{k+1}\right| \leq c e_{k}^{2} \]

That is, the error shrinks quadratically once the iterates are close to \(x^{*}\): Newton's method converges quadratically.

Preconditioning (gradient descent aided by the Hessian)

\[\mathbf{x} \leftarrow \mathbf{x}-\eta \operatorname{diag}\left(H_{f}\right)^{-1} \nabla f(\mathbf{x}) \]
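For \(f(\mathbf{x}) = x_1^2 + 2x_2^2\) the Hessian is constant with \(\operatorname{diag}(H_f) = (2, 4)\), so the preconditioner simply gives each coordinate its own effective step size; a minimal sketch:

```python
eta = 0.5
x1, x2 = -5.0, -2.0
h11, h22 = 2.0, 4.0                # diagonal of the Hessian of x1^2 + 2*x2^2
for _ in range(10):
    x1 -= eta * (2 * x1) / h11     # each gradient component is divided
    x2 -= eta * (4 * x2) / h22     # by the curvature of its own coordinate
print(x1, x2)  # both coordinates now shrink at the same rate
```

(On this quadratic, \(\eta = 1\) would coincide with a full Newton step and converge in one iteration.)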

Gradient Descent with Line Search (Conjugate Gradient Method)
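Rather than fixing \(\eta\), a line search chooses the step size at each iteration by (approximately) minimizing \(f(\mathbf{x} - \eta \nabla f(\mathbf{x}))\) over \(\eta\); the conjugate gradient method combines such searches with directions that are mutually conjugate with respect to the Hessian.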

Stochastic Gradient Descent

Stochastic Gradient Descent Parameter Updates

For a training set with \(n\) examples, let \(f_i(\mathbf{x})\) be the loss function of the \(i\)-th example. The objective function is:

\[f(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} f_{i}(\mathbf{x}) \]

Its gradient is:

\[\nabla f(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} \nabla f_{i}(\mathbf{x}) \]

A single update using this full gradient has time complexity \(\mathcal{O}(n)\).

Stochastic gradient descent instead samples an index \(i\) uniformly at random and updates at \(\mathcal{O}(1)\) cost per step:

\[\mathbf{x} \leftarrow \mathbf{x}-\eta \nabla f_{i}(\mathbf{x}) \]

Moreover, the sampled gradient is an unbiased estimate of the full gradient:

\[\mathbb{E}_{i} \nabla f_{i}(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} \nabla f_{i}(\mathbf{x})=\nabla f(\mathbf{x}) \]

Example:

\[f(x_1, x_2) = x_1^2 + 2 x_2^2 \]
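A sketch of SGD on this objective, simulating per-example gradients by adding zero-mean noise to the full gradient (a common textbook construction; the noise scale is an illustrative choice):

```python
import random

eta = 0.1
x1, x2 = -5.0, -2.0
for _ in range(50):
    # unbiased noisy estimate of the true gradient (2*x1, 4*x2)
    g1 = 2 * x1 + random.gauss(0, 0.1)
    g2 = 4 * x2 + random.gauss(0, 0.1)
    x1, x2 = x1 - eta * g1, x2 - eta * g2
print(x1, x2)  # hovers near (0, 0); the noise keeps it from settling exactly
```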

Dynamic Learning Rates

\[\begin{array}{ll}{\eta(t)=\eta_{i} \text { if } t_{i} \leq t \leq t_{i+1}} & {\text { piecewise constant }} \\ {\eta(t)=\eta_{0} \cdot e^{-\lambda t}} & {\text { exponential }} \\ {\eta(t)=\eta_{0} \cdot(\beta t+1)^{-\alpha}} & {\text { polynomial }}\end{array} \]
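A sketch of the exponential and polynomial schedules (the constants \(\eta_0\), \(\lambda\), \(\alpha\), \(\beta\) are illustrative):

```python
import math

def exponential_lr(t, eta0=0.5, lam=0.01):
    return eta0 * math.exp(-lam * t)          # eta0 * e^(-lambda * t)

def polynomial_lr(t, eta0=0.5, alpha=0.5, beta=1.0):
    return eta0 * (beta * t + 1) ** (-alpha)  # eta0 * (beta*t + 1)^(-alpha)

print([round(exponential_lr(t), 4) for t in (0, 50, 100)])
print([round(polynomial_lr(t), 4) for t in (0, 50, 100)])
```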

Mini-batch Stochastic Gradient Descent

Reading the Data
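The output below is the head of the NASA airfoil self-noise dataset (five features plus a sound-pressure label). A hedged sketch of the loading code; the file name, tab delimiter, and use of pandas are assumptions inferred from the printed output:

```python
import pandas as pd

# airfoil_self_noise.dat: 5 features + 1 label, tab-separated (assumed path)
df = pd.read_csv('airfoil_self_noise.dat', sep='\t', header=None)
print(df.head(10))
```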

```
      0    1       2     3         4        5
0   800  0.0  0.3048  71.3  0.002663  126.201
1  1000  0.0  0.3048  71.3  0.002663  125.201
2  1250  0.0  0.3048  71.3  0.002663  125.951
3  1600  0.0  0.3048  71.3  0.002663  127.591
4  2000  0.0  0.3048  71.3  0.002663  127.461
5  2500  0.0  0.3048  71.3  0.002663  125.571
6  3150  0.0  0.3048  71.3  0.002663  125.201
7  4000  0.0  0.3048  71.3  0.002663  123.061
8  5000  0.0  0.3048  71.3  0.002663  121.301
9  6300  0.0  0.3048  71.3  0.002663  119.541
```

Implementation from Scratch
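The original post's from-scratch code is not reproduced here; below is a minimal stand-in showing the two core pieces, a manual `sgd` update and a mini-batch training loop, on toy linear-regression data (all names and data are illustrative):

```python
import torch

def sgd(params, lr, batch_size):
    """Mini-batch SGD: step each parameter by its averaged gradient."""
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad / batch_size
            p.grad.zero_()

# toy linear-regression data (illustrative, not the airfoil dataset)
X = torch.randn(100, 5)
y = X @ torch.arange(1.0, 6.0) + 0.1 * torch.randn(100)

w = torch.zeros(5, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr, batch_size = 0.05, 10

for epoch in range(5):
    for i in range(0, 100, batch_size):
        Xb, yb = X[i:i + batch_size], y[i:i + batch_size]
        loss = ((Xb @ w + b - yb) ** 2).sum()   # summed squared error
        loss.backward()
        sgd([w, b], lr, batch_size)
    print(epoch, ((X @ w + b - y) ** 2).mean().item())
```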

Comparison
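The comparison here reduces to the batch size: `batch_size=1` is pure SGD (cheap but noisy steps), `batch_size=n` is full-batch gradient descent (stable but \(\mathcal{O}(n)\) per step), and intermediate sizes trade noise against per-step cost while exploiting vectorized hardware.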

Concise Implementation
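The concise version delegates the parameter update to `torch.optim.SGD`; a minimal sketch on the same kind of toy data (model and hyperparameters are illustrative):

```python
import torch
from torch import nn

# toy data as in the from-scratch sketch above (illustrative)
X = torch.randn(100, 5)
y = X @ torch.arange(1.0, 6.0) + 0.1 * torch.randn(100)

net = nn.Linear(5, 1)
optimizer = torch.optim.SGD(net.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for i in range(0, 100, 10):               # mini-batches of size 10
        Xb, yb = X[i:i + 10], y[i:i + 10]
        optimizer.zero_grad()
        loss = loss_fn(net(Xb).squeeze(-1), yb)
        loss.backward()
        optimizer.step()
    print(epoch, loss_fn(net(X).squeeze(-1), y).item())
```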
