AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Overview
AWQ
Code

Lin J., Tang J., Tang H., Yang S., Chen W., Wang W., Xiao G., Dang X., Gan C. and Han S. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys, 2024.

Overview

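As the parameter count of models grows, inference cost rises sharply. This paper proposes a quantization method, AWQ, to alleviate the problem. Its main contributions are a special treatment of "important" (salient) weights and per-channel scaling.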

AWQ

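The authors first observe that the elements of a weight matrix are not equally important: keeping roughly 1% of the weights in higher precision (e.g., FP16) already yields a significant improvement (panels (a) -> (b) of the figure in the original post).
This approach has a clear drawback, however: mixed precision makes a practical implementation extremely cumbersome, so a different technique is needed.
A typical symmetric quantization has the form: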

\[y = \mathbf{w} \mathbf{x} \longrightarrow y = Q(\mathbf{w}) \mathbf{x}, \]
where

\[Q(\mathbf{w}) = \Delta \cdot \text{Round}(\frac{\mathbf{w}}{\Delta}), \quad \Delta = \frac{\max(|\mathbf{w}|)}{2^{N-1}}. \]
Here \(N\) is the number of quantization bits.
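The formula above can be sketched in NumPy as follows. This is a minimal illustration, not the official implementation: the function name `quantize_symmetric`, the example weights, and the choice \(N = 4\) are all assumptions made for the example. It returns the de-quantized values \(\Delta \cdot \text{Round}(\mathbf{w} / \Delta)\) and, like the formula, does no clipping.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Pseudo-quantize w as Q(w) = delta * Round(w / delta).

    Hypothetical helper for illustration; it follows the formula literally
    (no clipping), whereas real integer kernels would also clamp the codes.
    """
    delta = np.max(np.abs(w)) / 2 ** (n_bits - 1)  # step size from max |w|
    if delta == 0:
        return np.zeros_like(w)  # all-zero input: nothing to quantize
    return delta * np.round(w / delta)

# Example weights (made up for the illustration).
w = np.array([0.8, -0.31, 0.07, -0.62])
w_q = quantize_symmetric(w, n_bits=4)

# Each element moves by at most half a step, delta / 2.
max_err = np.max(np.abs(w - w_q))
print(w_q, max_err)
```

The rounding error of every element is bounded by \(\Delta / 2\), which is why the size of \(\Delta\) matters in the discussion that follows.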
We can adopt a more flexible scheme and consider changing how just one element \(w\) is quantized:

\[Q(w \cdot s) \cdot \frac{x}{s} = \Delta' \cdot \text{Round}(\frac{ws}{\Delta'}) \cdot x \cdot \frac{1}{s}. \]
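Here \(\Delta' = \max(|w_1|, |w_2|, \ldots, |w \cdot s|, \ldots) / 2^{N-1}\). Note that:
1. the error introduced by \(\text{Round}(\cdot)\) is about the same in both cases;
2. since only a single element is changed, usually \(\Delta' \approx \Delta\).
When \(s > 1\), \(w \cdot s\) spreads more evenly over the quantization range, so it can usually be quantized more accurately than \(w\): the rounding error is roughly unchanged, but it is multiplied by \(1/s\) when the scale is undone on \(x\).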
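A small numerical experiment can make this argument concrete. The sketch below is only an illustration under stated assumptions: Gaussian random weights, groups of 128 weights sharing one \(\Delta\), \(N = 4\), \(s = 2\), and a fixed seed are all choices made here, not settings taken from the paper. In every group, only the first weight is scaled by \(s\) before quantization and divided by \(s\) afterwards.

```python
import numpy as np

def quantize_symmetric(w, n_bits=4):
    """Q(w) = delta * Round(w / delta), one delta per row (illustrative helper)."""
    delta = np.max(np.abs(w), axis=-1, keepdims=True) / 2 ** (n_bits - 1)
    return delta * np.round(w / delta)

rng = np.random.default_rng(0)
trials, group_size, s = 2000, 128, 2.0       # assumed settings for the demo
W = rng.standard_normal((trials, group_size))

# Plain quantization: error on the first weight of every row.
err_plain = np.abs(quantize_symmetric(W)[:, 0] - W[:, 0])

# Scale only the first weight by s, quantize, then undo the scale.
W_scaled = W.copy()
W_scaled[:, 0] *= s
w0_hat = quantize_symmetric(W_scaled)[:, 0] / s
err_scaled = np.abs(w0_hat - W[:, 0])

mean_plain = err_plain.mean()
mean_scaled = err_scaled.mean()
print(f"mean |error| without scaling: {mean_plain:.4f}")
print(f"mean |error| with s = {s}:     {mean_scaled:.4f}")
```

Individual rows can still get worse, since rounding is not monotone, but averaged over many groups the error on the scaled weight should shrink. Note that whenever \(|w \cdot s|\) becomes the largest entry of its group, \(\Delta'\) grows and the other weights in that group are quantized more coarsely, which is the trade-off the final objective has to balance.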
This leads to the final AWQ objective:

\[\begin{aligned} \mathbf{s}^* &= \mathop{\text{argmin}}_{\mathbf{s}} \mathcal{L}(\mathbf{s}), \\ \mathcal{L}(\mathbf{s}) &= \| Q(\mathbf{W} \text{diag}(\mathbf{s})) (\text{diag}(\mathbf{s})^{-1} \cdot \mathbf{X}) - \mathbf{W} \mathbf{X} \|. \end{aligned}\]
Here \(\mathbf{X}\) is obtained from a small calibration set.
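The objective can be evaluated and minimized with a simple search, sketched below. Several things here are assumptions rather than content of this post: the search space \(\mathbf{s} = \mathbf{s}_X^{\alpha}\) with \(\alpha \in [0, 1]\), where \(\mathbf{s}_X\) is the mean activation magnitude per input channel, follows the AWQ paper; the function names, the per-row \(\Delta\), the grid size, and the random stand-ins for \(\mathbf{W}\) and the calibration activations \(\mathbf{X}\) are illustrative only. See the official code for the actual implementation.

```python
import numpy as np

def quantize_symmetric(w, n_bits=4):
    """Q(w) = delta * Round(w / delta), one delta per output row (assumption)."""
    delta = np.max(np.abs(w), axis=-1, keepdims=True) / 2 ** (n_bits - 1)
    return delta * np.round(w / delta)

def awq_loss(W, X, s, n_bits=4):
    """|| Q(W diag(s)) (diag(s)^-1 X) - W X || in the Frobenius norm."""
    W_q = quantize_symmetric(W * s, n_bits)   # W * s scales input channel j by s[j]
    return np.linalg.norm(W_q @ (X / s[:, None]) - W @ X)

def search_scale(W, X, n_grid=20, n_bits=4):
    """Grid search over alpha for s = s_X ** alpha (hypothetical helper)."""
    s_x = np.abs(X).mean(axis=1)              # mean |activation| per input channel
    best_s, best_loss = None, np.inf
    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = s_x ** alpha                      # alpha = 0 gives s = 1 (no scaling)
        loss = awq_loss(W, X, s, n_bits)
        if loss < best_loss:
            best_s, best_loss = s, loss
    return best_s, best_loss

# Random stand-ins: W is (out_features, in_features), X is (in_features, n_samples).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
X = rng.standard_normal((128, 256))
X[:4] *= 20.0                                 # a few high-magnitude input channels

s_star, loss_star = search_scale(W, X)
loss_plain = awq_loss(W, X, np.ones(128))     # baseline without scaling
print(f"loss without scaling: {loss_plain:.3f}, best loss found: {loss_star:.3f}")
```

Because \(\alpha = 0\) is part of the grid, the searched result can never be worse than plain quantization on the calibration data; whether it is strictly better depends on the weights and activations.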

Code

[official-code]

posted @ 2024-12-03 16:23 馒头and花卷
