
Dissecting the Transformer Block

Basic Structure

(Figure: basic structure of a transformer block)

Basic parameters

  • $n_{layers}$ or $L$: total number of transformer blocks

  • $d_{model}$ or $H$: number of units in each bottleneck layer, which is also the number of units of each Q/K/V input

  • $n_{heads}$ or $A$: number of attention heads in each transformer block

  • $n_{ctx}$ or $S$: input sequence length

Derived parameters

  • $d_{head}$: dimension of each attention head, $d_{head} = d_{model} / n_{heads}$

  • $d_{ff}$: number of units in the intermediate layer of the feed-forward sublayer, typically $d_{ff} = 4 \times d_{model}$
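
As a quick sanity check of these two formulas, here is a minimal Python sketch that computes both derived parameters from the basic ones, using values from the model table below (the function name `derived_params` is illustrative):

```python
def derived_params(d_model: int, n_heads: int):
    """Compute (d_head, d_ff) for a standard transformer block."""
    assert d_model % n_heads == 0
    d_head = d_model // n_heads   # each head gets an equal slice of d_model
    d_ff = 4 * d_model            # common 4x convention for the FFN width
    return d_head, d_ff

print(derived_params(12288, 96))  # GPT-3:      (128, 49152)
print(derived_params(768, 12))    # BERT_Base:  (64, 3072)
print(derived_params(1024, 16))   # BERT_Large: (64, 4096)
```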

The diagram below shows where each parameter appears in a transformer block:

(Figure: parameters annotated on a transformer block)
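
To complement the diagram, below is a minimal PyTorch sketch of a single transformer block in the post-LayerNorm style of Attention Is All You Need, with comments marking where each parameter appears. The class and variable names are illustrative and not taken from the referenced BERT code:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One transformer block; a full model stacks n_layers of these."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Multi-head self-attention: Q/K/V inputs each have d_model units;
        # internally they split into n_heads heads of d_head = d_model // n_heads.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Feed-forward sublayer with the common d_ff = 4 * d_model width.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_ctx, d_model), where n_ctx is the input sequence length.
        attn_out, _ = self.attn(x, x, x)
        x = self.ln1(x + attn_out)      # residual connection + LayerNorm
        x = self.ln2(x + self.ffn(x))   # residual connection + LayerNorm
        return x

# BERT_Base-sized block: d_model = 768, n_heads = 12, n_ctx up to 512.
block = TransformerBlock(d_model=768, n_heads=12)
y = block(torch.randn(2, 128, 768))     # (batch=2, n_ctx=128, d_model=768)
```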

Zooming in on the Feed Forward Submodule

(Figure: detail of the feed-forward submodule)
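
For concreteness, here is a minimal sketch of the feed-forward submodule under the common $d_{ff} = 4 \times d_{model}$ convention. The GELU activation follows BERT/GPT; the original Transformer used ReLU:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: applied independently at every token position."""

    def __init__(self, d_model: int):
        super().__init__()
        d_ff = 4 * d_model                   # derived intermediate width
        self.fc1 = nn.Linear(d_model, d_ff)  # expand: d_model -> d_ff
        self.act = nn.GELU()                 # BERT/GPT use GELU; original Transformer used ReLU
        self.fc2 = nn.Linear(d_ff, d_model)  # project back: d_ff -> d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, n_ctx, d_model) -> (batch, n_ctx, 4*d_model) -> (batch, n_ctx, d_model)
        return self.fc2(self.act(self.fc1(x)))

ffn = FeedForward(d_model=768)       # BERT_Base size: d_ff = 3072
out = ffn(torch.randn(2, 128, 768))  # shape preserved: (2, 128, 768)
```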

Basic Parameters of Typical Models

| Application | Model | $n_{layers}$ | $d_{model}$ | $n_{heads}$ | $n_{ctx}$ |
| --- | --- | --- | --- | --- | --- |
| NLP | GPT-3 | 96 | 12288 | 96 | 2048 |
| NLP | BERT_Base | 12 | 768 | 12 | 128/512 |
| NLP | BERT_Large | 24 | 1024 | 16 | 128/512 |
| RecSys | BST | 1 | 128 (max) | 8 | 20 |

  • BST: Behavior Sequence Transformer

References

  1. The GPT-3 Architecture, on a Napkin

  2. GPT-3 An Overview

  3. Language Models are Few-Shot Learners

  4. Improving Language Understanding by Generative Pre-Training

  5. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  6. Attention Is All You Need

  7. BERT transformer block code

  8. Deep Learning Recommendation Model for Personalization and Recommendation Systems

  9. Behavior Sequence Transformer for E-commerce Recommendation in Alibaba
