[Paper Reading Notes] Large-Model Inference Acceleration: FastV

Paper: https://arxiv.org/pdf/2403.06764
Code: https://github.com/pkunlp-icler/FastV

Introduction

Given an image-question pair \((d, t)\), the decoder generates the answer autoregressively:

\[p(\hat{y})=\prod_{i=1}^Np_M\left(\hat{y}_i\mid\hat{y}_{1\sim i-1};d;t\right) \]
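To make the factorization concrete, here is a minimal sketch that accumulates the per-step conditionals in log space. The probabilities in `step_probs` are made-up placeholders, not outputs of a real model, and `sequence_log_prob` is a name chosen only for this illustration.

```python
import math

def sequence_log_prob(step_probs):
    """Sum of log p_M(y_i | y_<i; d; t) over the N generated tokens."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical per-step conditional probabilities for a 3-token answer.
step_probs = [0.5, 0.8, 0.25]

log_p = sequence_log_prob(step_probs)
# Exponentiating recovers the product of the conditionals.
p = math.exp(log_p)
```

Working in log space avoids underflow when N is large, which is why the product is usually handled this way in practice.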

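Let \(\alpha^{i,j}_{sys},\alpha^{i,j}_{img},\alpha^{i,j}_{ins},\alpha^{i,j}_{out}\) denote the attention scores that the \(i\)-th token assigns in layer \(j\) to the system prompt, the image tokens, the user instruction, and the output tokens, respectively. Two layer-level quantities are then defined, where \(|img|\) is the number of image tokens: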

\[\text{total attention of system prompt in layer}\ j :\ \lambda_{sys}^{j}=\sum_{i=1}^{n}\alpha_{sys}^{i,j} \]

\[\text{attention efficiency of image tokens in layer}\ j:\ \epsilon_{img}^{j}=\frac{\sum_{i=1}^{n}\alpha_{img}^{i,j}}{|img|} \]
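The two quantities can be computed in a few lines. The sketch below assumes the per-token allocations for one layer are already available as dictionaries; the numbers, the token counts `num_img` and `num_sys`, and the helper names are all hypothetical and chosen only to show the arithmetic.

```python
def total_attention(alloc, kind):
    """lambda^j_kind: sum over tokens i of alpha^{i,j}_kind."""
    return sum(a[kind] for a in alloc)

def attention_efficiency(alloc, kind, num_tokens):
    """epsilon^j_kind: total attention divided by the number of tokens of that kind."""
    return total_attention(alloc, kind) / num_tokens

# Hypothetical allocations in one layer j for n = 2 tokens.
# Each row sums to 1 across the four token types.
layer_alloc = [
    {"sys": 0.50, "img": 0.25, "ins": 0.125, "out": 0.125},
    {"sys": 0.50, "img": 0.25, "ins": 0.125, "out": 0.125},
]
num_img = 100  # assumed number of image tokens, |img|
num_sys = 10   # assumed number of system-prompt tokens

lam_sys = total_attention(layer_alloc, "sys")
eps_img = attention_efficiency(layer_alloc, "img", num_img)
eps_sys = attention_efficiency(layer_alloc, "sys", num_sys)
```

With these made-up numbers the image tokens receive half as much total attention as the system prompt but are ten times as numerous, so their per-token efficiency is far lower. That gap is the kind of imbalance the efficiency measure is meant to expose.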

\[\text{ranking function}:\ f_{\phi} \]

\[\text{filtering layer}:\ K \]

\[\text{filtering ratio}:\ R \]

After layer \(K\), \(f_{\phi}\) ranks the image tokens by attention score (the average attention each token receives from all other tokens), and the lowest-ranked \(R\%\) are discarded.
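The pruning step can be sketched as follows. This is an illustration of the ranking criterion as described in these notes, not the code from the FastV repository. It assumes `attn` is the layer-\(K\) attention map already averaged over heads, that only image tokens are candidates for removal, and that `ratio` is \(R\) expressed as a fraction. The function name and the toy matrix are hypothetical.

```python
def prune_image_tokens(attn, image_idx, ratio):
    """Return the positions kept after dropping the lowest-scoring image tokens.

    attn[q][k]: attention from query token q to key token k at layer K.
    image_idx:  positions of the image tokens in the sequence.
    ratio:      filtering ratio R as a fraction in [0, 1].
    """
    n = len(attn)

    def received(k):
        # Mean attention that token k receives from all other tokens.
        return sum(attn[q][k] for q in range(n) if q != k) / (n - 1)

    # Rank image tokens from most to least attended.
    ranked = sorted(image_idx, key=received, reverse=True)
    num_drop = int(len(image_idx) * ratio)
    dropped = set(ranked[len(ranked) - num_drop:])
    # Keep the surviving tokens in their original order.
    return [i for i in range(n) if i not in dropped]

# Toy causal attention map: token 0 = system, 1-4 = image, 5 = instruction.
attn = [
    [1.0,  0.0,  0.0, 0.0,   0.0,   0.0],
    [0.5,  0.5,  0.0, 0.0,   0.0,   0.0],
    [0.5,  0.25, 0.25, 0.0,  0.0,   0.0],
    [0.5,  0.25, 0.0, 0.25,  0.0,   0.0],
    [0.5,  0.25, 0.0, 0.125, 0.125, 0.0],
    [0.25, 0.25, 0.0, 0.25,  0.125, 0.125],
]
kept = prune_image_tokens(attn, image_idx=[1, 2, 3, 4], ratio=0.5)
```

Here image tokens 2 and 4 receive the least attention from the rest of the sequence, so with \(R = 50\%\) they are the ones removed, and later layers only process the tokens in `kept`.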

Like LoRA, the method is so straightforward that anyone can build on it. It is a good starting point for speeding up MLLM inference with a plug-and-play module.

posted @ 2024-10-21 15:47 KeanShi
