Transformer Computation Formulas

LLM inference workflow

Generative Inference. A typical LLM generative inference task consists of two stages: i) the prefill stage, which takes a prompt sequence and generates the key-value cache (KV cache) for each transformer layer of the LLM; and ii) the decoding stage, which utilizes and updates the KV cache to generate tokens step by step, where each new token depends on the previously generated tokens.
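To make the two stages concrete, here is a minimal sketch of the driver loop. All the callables here (`prefill`, `decode_step`, `embed`, `sample`) are hypothetical placeholders I pass in as parameters, not the API of any real library; the per-layer math they would implement is spelled out in the sections below.

```python
# Minimal sketch of the two-stage generative inference workflow.
# `prefill`, `decode_step`, `embed`, `sample` are hypothetical callables,
# not a real library API.
def generate(prompt_emb, prefill, decode_step, embed, sample, n_new):
    # Stage i (prefill): run the full prompt once, producing the last
    # hidden states and a per-layer KV cache.
    hidden, kv_cache = prefill(prompt_emb)
    token = sample(hidden[:, -1])            # first generated token
    out = [token]
    # Stage ii (decode): feed one token at a time, reusing and updating
    # the cache, so each step depends on everything generated so far.
    for _ in range(n_new - 1):
        hidden, kv_cache = decode_step(embed(token), kv_cache)
        token = sample(hidden[:, -1])
        out.append(token)
    return out
```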

Prefill phase

During this phase, let $x^i \in \mathbb{R}^{b \times s \times h_1}$ denote the input of the $i$-th layer, where $b$ is the batch size, $s$ the prompt length, and $h_1$ the hidden size; $w_K^i, w_Q^i, w_V^i, w_O^i, w_1^i, w_2^i$ are the layer's weight matrices. The cached key and value can then be computed by:

$$
x_K^i = x^i \cdot w_K^i; \quad x_V^i = x^i \cdot w_V^i
$$

The rest of the computation in the i-th layer is:

$$
\begin{aligned}
x_Q^i &= x^i \cdot w_Q^i \\
x_{\text{Out}}^i &= f_{\text{Softmax}}\left(\frac{x_Q^i \, (x_K^i)^{\top}}{\sqrt{h}}\right) \cdot x_V^i \cdot w_O^i + x^i \\
x^{i+1} &= f_{\text{relu}}\left(x_{\text{Out}}^i \cdot w_1^i\right) \cdot w_2^i + x_{\text{Out}}^i
\end{aligned}
$$
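As a sanity check on these formulas, here is a single-head NumPy sketch of one layer's prefill computation. It omits multi-head splitting, layer norm, biases, and the causal mask for clarity, and all function and variable names are my own, not from any library:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prefill_layer(x, wK, wV, wQ, wO, w1, w2):
    """Prefill for one simplified layer. x: (b, s, h) prompt embeddings.

    Single head, no layer norm/biases/causal mask; mirrors the equations above.
    """
    h = x.shape[-1]
    xK = x @ wK                               # cached key:   x_K = x w_K
    xV = x @ wV                               # cached value: x_V = x w_V
    xQ = x @ wQ                               # query:        x_Q = x w_Q
    attn = softmax(xQ @ xK.transpose(0, 2, 1) / np.sqrt(h))
    x_out = attn @ xV @ wO + x                # attention output + residual
    x_next = np.maximum(x_out @ w1, 0) @ w2 + x_out  # ReLU FFN + residual
    return x_next, (xK, xV)                   # keep (xK, xV) for decoding
```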

Decode phase

During the decode phase, given $t^i \in \mathbb{R}^{b \times 1 \times h_1}$ as the embedding of the current generated token in the $i$-th layer, the inference computation needs to i) update the KV cache:

$$
x_K^i \leftarrow \text{Concat}\left(x_K^i,\; t^i \cdot w_K^i\right); \quad
x_V^i \leftarrow \text{Concat}\left(x_V^i,\; t^i \cdot w_V^i\right)
$$

and ii) compute the output of the current layer:

$$
\begin{aligned}
t_Q^i &= t^i \cdot w_Q^i \\
t_{\text{Out}}^i &= f_{\text{Softmax}}\left(\frac{t_Q^i \, (x_K^i)^{\top}}{\sqrt{h}}\right) \cdot x_V^i \cdot w_O^i + t^i \\
t^{i+1} &= f_{\text{relu}}\left(t_{\text{Out}}^i \cdot w_1^i\right) \cdot w_2^i + t_{\text{Out}}^i
\end{aligned}
$$
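Continuing the NumPy sketch above (reusing its `softmax` helper and simplifications), a single decode step first appends the new token's key and value to the cache, then attends from the new token's query over every cached position:

```python
def decode_step_layer(t, kv_cache, wK, wV, wQ, wO, w1, w2):
    """Decode one token through one simplified layer. t: (b, 1, h)."""
    h = t.shape[-1]
    xK, xV = kv_cache
    # i) update the KV cache with the new token's key and value
    xK = np.concatenate([xK, t @ wK], axis=1)   # (b, s+1, h)
    xV = np.concatenate([xV, t @ wV], axis=1)
    # ii) compute the output: one query row against all cached positions
    tQ = t @ wQ
    attn = softmax(tQ @ xK.transpose(0, 2, 1) / np.sqrt(h))  # (b, 1, s+1)
    t_out = attn @ xV @ wO + t                  # attention + residual
    t_next = np.maximum(t_out @ w1, 0) @ w2 + t_out  # ReLU FFN + residual
    return t_next, (xK, xV)                     # return the grown cache
```

The asymmetry is what makes the cache worthwhile: prefill pays the full quadratic attention cost over the prompt once, while each decode step computes only a single query row against the growing cache.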
