Why Does the Scaled Dot-Product Attention Formula Divide by $\sqrt{d_{k}}$?

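While studying Scaled Dot-Product Attention, I came across the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$$

This raises a question: why is the scaling factor $\sqrt{d_{k}}$, rather than $d_{k}$ or some other value?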
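For reference, the formula can be sketched in a few lines of plain Python. This is a minimal illustration and not the paper's implementation: matrices are lists of lists, and the function names `softmax` and `scaled_dot_product_attention` as well as the toy values of `Q`, `K` and `V` are assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q is n x d_k, K is m x d_k, V is m x d_v (lists of lists)."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    output = []
    for q in Q:
        # Score of this query against every key: q . k / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the rows of V
        out_row = [
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ]
        output.append(out_row)
    return output


# Toy inputs, chosen only for illustration
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = scaled_dot_product_attention(Q, K, V)
print(result)
```

Each query here matches one key more closely than the other, so its output row is pulled toward the corresponding row of `V`.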

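The paper Attention Is All You Need offers a brief explanation:

"We suspect that for large values of $d_{k}$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_{k}}}$."

In other words, the dot product of two vectors can become large, which makes the gradients of the softmax function very small, so the dot product has to be divided by some factor. But why $\sqrt{d_{k}}$ specifically?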
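The saturation effect can be observed numerically. The sketch below uses the softmax Jacobian $\partial p_{i} / \partial z_{j} = p_{i} (\delta_{ij} - p_{j})$ and reports its largest entry as the same logits are multiplied by growing factors. The logits in `base_logits`, the scale factors, and the helper name `max_softmax_jacobian_entry` are illustrative assumptions, not values from the paper.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def max_softmax_jacobian_entry(z):
    """Largest |dp_i/dz_j|, using dp_i/dz_j = p_i * (delta_ij - p_j)."""
    p = softmax(z)
    n = len(p)
    return max(
        abs(p[i] * ((1.0 if i == j else 0.0) - p[j]))
        for i in range(n)
        for j in range(n)
    )


base_logits = [1.0, 0.5, -0.5, -1.0]  # illustrative values only
largest_gradients = {}
for scale in (1, 10, 100):
    logits = [scale * x for x in base_logits]
    largest_gradients[scale] = max_softmax_jacobian_entry(logits)
    print(f"scale={scale:>3}  largest |dp_i/dz_j| = {largest_gradients[scale]:.3e}")
```

As the logits grow in magnitude, the softmax output approaches a one-hot vector and every entry of the Jacobian shrinks toward zero, which is the small-gradient region the quote refers to.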

A footnote in the paper adds:

"To illustrate why the dot products get large, assume that the components of $q$ and $k$ are independent random variables with mean $0$ and variance $1$. Then their dot product, $q \cdot k = \sum_{i = 1}^{d_{k}} q_{i} k_{i}$, has mean $0$ and variance $d_{k}$."

In this post, we follow that idea through a complete derivation to show the role that $\sqrt{d_{k}}$ plays.

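Basic assumptions

Suppose the random variables $U_{1}, U_{2}, \dots, U_{d_{k}}$ and $V_{1}, V_{2}, \dots, V_{d_{k}}$ are all mutually independent, and each has mean $0$ and variance $1$, that is,

$$E(U_{i}) = 0, \quad \mathrm{Var}(U_{i}) = 1$$

$$E(V_{i}) = 0, \quad \mathrm{Var}(V_{i}) = 1$$

where $i = 1, 2, \dots, d_{k}$ and $d_{k}$ is a constant.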

Computing the variance of $U_{i} V_{i}$

By the definition of variance, the variance of $U_{i} V_{i}$ is

$$\begin{aligned} \mathrm{Var}(U_{i} V_{i}) &= E[{(U_{i} V_{i} - E(U_{i} V_{i}))}^{2}] \\ &= E[{(U_{i} V_{i})}^{2} - 2 U_{i} V_{i} E(U_{i} V_{i}) + E^{2}(U_{i} V_{i})] \\ &= E[{(U_{i} V_{i})}^{2}] - 2 E[U_{i} V_{i} E(U_{i} V_{i})] + E^{2}(U_{i} V_{i}) \\ &= E(U_{i}^{2} V_{i}^{2}) - 2 E(U_{i} V_{i}) E(U_{i} V_{i}) + E^{2}(U_{i} V_{i}) \\ &= E(U_{i}^{2} V_{i}^{2}) - E^{2}(U_{i} V_{i}) \end{aligned}$$

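Because $U_{i}$ and $V_{i}$ are independent random variables,

$$E(U_{i} V_{i}) = E(U_{i}) E(V_{i})$$

and, since $U_{i}^{2}$ and $V_{i}^{2}$ are then independent as well, $E(U_{i}^{2} V_{i}^{2}) = E(U_{i}^{2}) E(V_{i}^{2})$. Therefore

$$\begin{aligned} \mathrm{Var}(U_{i} V_{i}) &= E(U_{i}^{2}) E(V_{i}^{2}) - {(E(U_{i}) E(V_{i}))}^{2} \\ &= E(U_{i}^{2}) E(V_{i}^{2}) - E^{2}(U_{i}) E^{2}(V_{i}) \end{aligned}$$

Moreover, since $E(U_{i}) = E(V_{i}) = 0$,

$$\mathrm{Var}(U_{i} V_{i}) = E(U_{i}^{2}) E(V_{i}^{2})$$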

Computing $E(U_{i}^{2})$

Since

$$E(U_{i}) = 0$$

$$\mathrm{Var}(U_{i}) = 1$$

$$\mathrm{Var}(U_{i}) = E(U_{i}^{2}) - E^{2}(U_{i})$$

it follows that

$$E(U_{i}^{2}) = 1$$

Similarly,

$$E(V_{i}^{2}) = 1$$
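The results so far depend only on the mean and variance of the components, not on their being Gaussian. As a rough sanity check, the sketch below draws $U_{i}$ and $V_{i}$ from a uniform distribution on $[-\sqrt{3}, \sqrt{3}]$, which has mean $0$ and variance $1$, and estimates $E(U_{i}^{2})$, $E(V_{i}^{2})$ and $\mathrm{Var}(U_{i} V_{i})$. The choice of distribution, the sample size and the seed are assumptions made for this example.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is reproducible
n_samples = 100_000
a = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has mean 0, variance 1

u = [rng.uniform(-a, a) for _ in range(n_samples)]
v = [rng.uniform(-a, a) for _ in range(n_samples)]


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])


second_moment_u = mean([x * x for x in u])  # estimates E(U^2)
second_moment_v = mean([x * x for x in v])  # estimates E(V^2)
var_product = variance([x * y for x, y in zip(u, v)])  # estimates Var(UV)

print(f"E(U^2)  ~ {second_moment_u:.3f}")
print(f"E(V^2)  ~ {second_moment_v:.3f}")
print(f"Var(UV) ~ {var_product:.3f}")
```

All three estimates should land close to $1$, up to sampling noise.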

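Computing the variance of $q \cdot k$

If $q = {[U_{1}, U_{2}, \dots, U_{d_{k}}]}^{T}$ and $k = {[V_{1}, V_{2}, \dots, V_{d_{k}}]}^{T}$, then

$$q \cdot k = \sum_{i = 1}^{d_{k}} U_{i} V_{i}$$

Because the products $U_{i} V_{i}$ are mutually independent, the variance of their sum is the sum of their variances, so the variance of $q \cdot k$ is

$$\begin{aligned} \mathrm{Var}(q \cdot k) &= \mathrm{Var}\left(\sum_{i = 1}^{d_{k}} U_{i} V_{i}\right) \\ &= \sum_{i = 1}^{d_{k}} \mathrm{Var}(U_{i} V_{i}) \\ &= \sum_{i = 1}^{d_{k}} E(U_{i}^{2}) E(V_{i}^{2}) \\ &= \sum_{i = 1}^{d_{k}} 1 \cdot 1 \\ &= d_{k} \end{aligned}$$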
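A small Monte Carlo experiment can be used to check that the variance of the dot product grows linearly with $d_{k}$. The sketch below assumes standard normal components (any zero-mean, unit-variance distribution should behave similarly); the dimensions, number of trials and seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is reproducible


def sample_dot_product(d_k):
    """One sample of q . k with independent standard normal components."""
    return sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d_k))


def empirical_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


n_trials = 4000
variance_by_dim = {}
for d_k in (16, 64, 256):
    dots = [sample_dot_product(d_k) for _ in range(n_trials)]
    variance_by_dim[d_k] = empirical_variance(dots)
    print(f"d_k={d_k:>3}  empirical Var(q.k) ~ {variance_by_dim[d_k]:.1f}")
```

The empirical variance should track $d_{k}$ itself, so the standard deviation of the raw dot product grows like $\sqrt{d_{k}}$.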

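This explains why we divide by $\sqrt{d_{k}}$ at the end. Since $\mathrm{Var}(aX) = a^{2} \mathrm{Var}(X)$ for any constant $a$,

$$\begin{aligned} \mathrm{Var}\left(\frac{q \cdot k}{\sqrt{d_{k}}}\right) &= \frac{\mathrm{Var}(q \cdot k)}{d_{k}} \\ &= \frac{d_{k}}{d_{k}} \\ &= 1 \end{aligned}$$

The mean is unaffected by the scaling, because $E(q \cdot k) = \sum_{i = 1}^{d_{k}} E(U_{i}) E(V_{i}) = 0$. So the purpose of this factor is to normalize the scaled dot product $\frac{q \cdot k}{\sqrt{d_{k}}}$ to mean $0$ and variance $1$ regardless of $d_{k}$, which keeps the softmax inputs out of the small-gradient region and makes training more stable.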
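The same rule $\mathrm{Var}(aX) = a^{2} \mathrm{Var}(X)$ also suggests why $d_{k}$ itself would be a poor divisor: it would give a variance of $1 / d_{k}$, so the scores would shrink toward $0$ as $d_{k}$ grows and the softmax would approach a uniform distribution. The sketch below compares three divisors ($1$, $\sqrt{d_{k}}$ and $d_{k}$) by the average largest attention weight over random queries and keys. It assumes standard normal components; `mean_max_weight`, the dimension, the number of keys, the trial count and the seed are all hypothetical choices for this illustration.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is reproducible


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def mean_max_weight(d_k, n_keys, divisor, n_trials=200):
    """Average of the largest softmax weight over random q and keys."""
    total = 0.0
    for _ in range(n_trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        scores = []
        for _ in range(n_keys):
            k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
            scores.append(sum(qi * ki for qi, ki in zip(q, k)) / divisor)
        total += max(softmax(scores))
    return total / n_trials


d_k, n_keys = 256, 8
unscaled = mean_max_weight(d_k, n_keys, divisor=1.0)
sqrt_scaled = mean_max_weight(d_k, n_keys, divisor=math.sqrt(d_k))
dk_scaled = mean_max_weight(d_k, n_keys, divisor=float(d_k))

print(f"no scaling:        mean max weight ~ {unscaled:.3f}")
print(f"divide by sqrt(d): mean max weight ~ {sqrt_scaled:.3f}")
print(f"divide by d:       mean max weight ~ {dk_scaled:.3f}")
```

Without scaling, the largest weight should sit close to $1$ (a nearly one-hot softmax). Dividing by $d_{k}$ should push it toward $1 / 8$, the uniform value for eight keys, while dividing by $\sqrt{d_{k}}$ should land in between.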

References

Vaswani, A., et al. Attention Is All You Need. 2017.

Posted 2024-10-22 18:05 by 赤川鹤鸣
