Comparison of GPT-1, GPT-2, and GPT-3

 Generative Pre-trained Transformer (GPT)
 
Overall, GPT-1, GPT-2, and GPT-3 all use the same unidirectional Transformer decoder architecture trained as a language model. The main differences are the amount of training data and the model size, both of which grow with each generation.
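To make "unidirectional Transformer decoder" concrete, below is a minimal sketch of causal self-attention, the masking trick shared by all three models: each token may only attend to itself and earlier tokens. The shapes and random weights are made up for illustration and are not the real GPT hyperparameters.

```python
# Minimal causal (unidirectional) self-attention sketch: position i may only
# attend to positions <= i. Toy shapes, not real GPT hyperparameters.
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (seq_len, seq_len)

    # Causal mask: hide every position j > i from position i.
    seq_len = x.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v                                 # (seq_len, d_head)

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 8, 4, 5
x = rng.normal(size=(seq_len, d_model))
out = causal_self_attention(x,
                            rng.normal(size=(d_model, d_head)),
                            rng.normal(size=(d_model, d_head)),
                            rng.normal(size=(d_model, d_head)))
print(out.shape)  # (5, 4)
```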
 
 
 
|  | GPT-1 | GPT-2 | GPT-3 |
| --- | --- | --- | --- |
| Paper | Improving Language Understanding by Generative Pre-Training | Language Models are Unsupervised Multitask Learners | Language Models are Few-Shot Learners |
| Training objective | Unsupervised language-model pre-training, then supervised fine-tuning | Multi-task: learn P(output \| input, task); zero-shot task transfer | Few-shot (in-context) learning |
| Main changes | (baseline) | More data, more layers, larger dimensions; layer norm moved to the front of each sub-block, an extra layer norm after the final block, scaled residual initialization (see the sketch after the table) | More data, more layers, larger dimensions |
| Dataset | ~7,000 unpublished books, with many long texts | WebText: 40 GB, 8 million documents | Common Crawl, WebText2, Books1, Books2, and Wikipedia (~45 TB of raw text) |
| Architecture | 12-layer decoder, 12 heads, dim 768, FFN 3072 | 48 layers, dim 1600 | 96 layers, 96 heads, dim 12288 |
| Training setup | 100 epochs, batch size 64, sequence length 512, lr 2.5e-4, BPE vocab 40,000 | vocab 50,257, batch size 512, context window 1024 | context window 2048, Adam β_1 = 0.9, β_2 = 0.95, ε = 10^(-8) |
| Parameters | 117M | 117M (same as GPT-1), 345M, 762M, and 1.5B (GPT-2) | 175B |
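The "layer norm moved to the front" change in GPT-2 refers to switching each decoder block from the post-LN ordering of GPT-1 to a pre-LN ordering, plus one extra layer norm after the last block. A rough sketch of the two orderings, using hypothetical attn, ffn, and layer_norm callables rather than any specific framework:

```python
# Residual/LN layout only; attn, ffn and layer_norm are hypothetical
# callables, so this is an illustration, not a faithful GPT implementation.

def gpt1_block(x, attn, ffn, layer_norm):
    # Post-LN (GPT-1): normalize after each residual addition.
    x = layer_norm(x + attn(x))
    x = layer_norm(x + ffn(x))
    return x

def gpt2_block(x, attn, ffn, layer_norm):
    # Pre-LN (GPT-2/3): normalize before each sub-block; GPT-2 also adds
    # one final layer norm after the last block of the stack.
    x = x + attn(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x
```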
 

 

 
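As a rough sanity check on the parameter counts in the table, a Transformer decoder has about 12 · n_layer · d_model² weights in its blocks (4 d_model² for the attention projections plus 8 d_model² for the 4×-wide feed-forward), and the token embedding adds vocab · d_model. The back-of-the-envelope script below uses the layer counts, dimensions, and vocab sizes from the table; the totals are approximations (biases, layer norms, and position embeddings are ignored), not the exact published counts.

```python
# Back-of-the-envelope parameter counts: ~12 * n_layer * d_model^2 for the
# decoder blocks plus vocab * d_model for token embeddings. Approximate only.
def approx_params(n_layer, d_model, vocab):
    blocks = 12 * n_layer * d_model ** 2   # attention (4 d^2) + FFN (8 d^2)
    embeddings = vocab * d_model
    return blocks + embeddings

for name, cfg in {
    "GPT-1":      (12, 768,   40_000),
    "GPT-2 1.5B": (48, 1600,  50_257),
    "GPT-3 175B": (96, 12288, 50_257),
}.items():
    print(name, f"{approx_params(*cfg) / 1e9:.2f}B")
# Prints about 0.12B, 1.55B and 174.56B -- close to 117M, 1.5B and 175B.
```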
References
https://medium.com/walmartglobaltech/the-journey-of-open-ai-gpt-models-32d95b7b7fb2
 