Comparison of GPT-1, GPT-2, and GPT-3
Generative Pre-trained Transformer (GPT)
Overall, GPT-1, GPT-2, and GPT-3 all use the same unidirectional (causal) Transformer-decoder architecture trained as a language model. The main differences are the amount of training data and the model size, both of which grow from one version to the next.
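To make "unidirectional Transformer decoder" concrete, the sketch below applies the causal (lower-triangular) attention mask that all three models share: each position can attend only to itself and earlier positions, so the network is trained purely to predict the next token. This is a minimal NumPy illustration, not code from any of the papers; the function name and toy sizes are made up for the example.

```python
import numpy as np

def causal_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Apply a causal mask to raw attention scores and softmax over keys.

    scores: (seq_len, seq_len) matrix of query-key dot products.
    Position i may only attend to positions j <= i, which is what makes
    the GPT decoder "unidirectional".
    """
    seq_len = scores.shape[0]
    # Lower-triangular mask: True where attention is allowed.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Disallowed (future) positions get -inf so they vanish after softmax.
    masked = np.where(mask, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy example: 4 tokens, random scores.
rng = np.random.default_rng(0)
attn = causal_attention_weights(rng.normal(size=(4, 4)))
print(np.round(attn, 2))  # upper triangle is 0: no attention to future tokens
```

In the actual models this masking is applied inside every one of the 12 / 48 / 96 attention layers listed in the table below.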
|  | GPT-1 | GPT-2 | GPT-3 |
| --- | --- | --- | --- |
| Paper | Improving Language Understanding by Generative Pre-Training (link) | Language Models are Unsupervised Multitask Learners (link) | Language Models are Few-Shot Learners (link) |
| Training objective | Unsupervised language-model pre-training, then supervised fine-tuning | Multitask: model P(output \| input, task); zero-shot task transfer | Few-shot in-context learning (see the prompt sketch after the table) |
| Main changes | – (baseline) | More data, more layers, larger hidden size; LayerNorm moved to the input of each sub-block (pre-LN), an extra LayerNorm after the final block, scaled residual initialization | More data, more layers, larger hidden size |
| Dataset | ~7,000 unpublished books (BooksCorpus), mostly long-form text | WebText: 40 GB, 8 million documents | Common Crawl, WebText2, Books1, Books2 and Wikipedia, ~45 TB in total |
| Architecture | 12-layer decoder, 12 heads, d_model 768, FFN 3072 | 48 layers, d_model 1600 | 96 layers, 96 heads, d_model 12288 |
| Training setup | 100 epochs, batch size 64, sequence length 512, lr 2.5e-4, BPE vocab 40,000 | Vocab 50,257, batch size 512, context window 1024 | Context window 2048, Adam with β₁ = 0.9, β₂ = 0.95, ε = 10⁻⁸ |
| Parameters (estimate sketch below) | 117M | 117M (same size as GPT-1), 345M, 762M and 1.5B (GPT-2) | 175B |
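The training-objective row is easiest to read with concrete prompts: GPT-2's P(output | input, task) means the task description is written into the input text itself (zero-shot), and GPT-3's few-shot setting simply prepends a handful of solved examples before the query, with no gradient updates in either case. The strings below are an illustrative sketch modeled on the English-to-French translation example in the GPT-3 paper.

```python
# Zero-shot (GPT-2 style): the task description is part of the input text,
# and the model simply continues the text -- no fine-tuning, no examples.
zero_shot_prompt = (
    "Translate English to French:\n"
    "cheese =>"
)

# Few-shot (GPT-3 style): a few in-context examples of the task are
# prepended; the model is still only doing next-token prediction.
few_shot_prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

print(few_shot_prompt)
```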
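As a rough cross-check between the architecture row and the parameter-count row, a GPT-style decoder has about 12·n_layers·d_model² weights in its Transformer blocks (4·d² for the Q/K/V/output projections plus 8·d² for the 4×-wide feed-forward network), plus vocab_size·d_model token embeddings. The snippet below is an approximation sketch under those assumptions; it ignores biases, LayerNorm parameters and positional embeddings, and it assumes GPT-3 reuses GPT-2's 50,257-token vocabulary (not stated in the table).

```python
def approx_gpt_params(n_layers: int, d_model: int, vocab_size: int) -> float:
    """Rough parameter count of a GPT-style decoder, in billions.

    Per block: 4*d^2 for Q/K/V/output projections + 8*d^2 for the
    feed-forward network with inner size 4*d. Embeddings add vocab*d.
    Biases, LayerNorms and positional embeddings are ignored.
    """
    blocks = 12 * n_layers * d_model ** 2
    embeddings = vocab_size * d_model
    return (blocks + embeddings) / 1e9

print(f"GPT-1: ~{approx_gpt_params(12, 768, 40_000):.3f}B")   # ~0.116B (reported 117M)
print(f"GPT-2: ~{approx_gpt_params(48, 1600, 50_257):.2f}B")  # ~1.55B  (reported 1.5B)
print(f"GPT-3: ~{approx_gpt_params(96, 12288, 50_257):.0f}B") # ~175B   (reported 175B)
```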
References
https://medium.com/walmartglobaltech/the-journey-of-open-ai-gpt-models-32d95b7b7fb2