Paper Notes [4]: GPT-1 to GPT-3, a Walkthrough and Comparison
Papers:
Improving Language Understanding by Generative Pre-Training
Language Models are Unsupervised Multitask Learners
Language Models are Few-Shot Learners
Paper links: GPT1 GPT2 GPT3
Team: OpenAI
GPT-1
Motivation & Objectives#
- Most SOTA NLP models were trained specifically for one particular task
- Limitations:
  ① they need large amounts of annotated data, which is not easily available
  ② they fail to generalize
- 2 challenges:
  ① it is unclear which optimization objectives are most effective for learning representations that transfer
  ② there is no consensus on the most efficient way to transfer the learned representations to the target task (existing methods combine task-specific changes to the model architecture, intricate learning schemes, and auxiliary learning objectives)
  ⇒ a semi-supervised approach: unsupervised pre-training combined with supervised fine-tuning
  ⇒ goal: learn a universal representation that transfers with little adaptation to a wide range of tasks
- Approach: learn a generative language model on unlabeled data, then fine-tune it by providing examples of specific downstream tasks
Framework#
Unsupervised Language Modeling (Pre-training):#
- multi-layer Transformer decoder (N = 12)
- NO encoder-decoder attention layer
- The decoder blocks with the encoder-decoder attention layer removed form the body of the model; the decoder output is passed through a softmax layer to produce the output distribution over target tokens.
- $h_n$ is the hidden state of the final block; $h_n W_e^{\top}$ (its product with the token embedding matrix $W_e$) can loosely be read as how strongly the output layer weights each token in the vocabulary. After pre-training, GPT thus stores the semantic and syntactic information learned from the corpus. The objective it maximizes is sketched below.
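A reconstruction of the pre-training objective and forward pass as written in the GPT-1 paper ($\mathcal{U}$ is the unlabeled corpus, $k$ the context window, $\Theta$ the model parameters, $W_e$ the token embedding matrix, $W_p$ the position embedding matrix):

```latex
% Unsupervised pre-training: maximize the standard language-modeling likelihood
L_1(\mathcal{U}) = \sum_i \log P\left(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta\right)

% Forward pass of the n-layer Transformer decoder
h_0 = U W_e + W_p, \qquad
h_l = \mathrm{transformer\_block}(h_{l-1}) \ \ \text{for } l = 1, \dots, n, \qquad
P(u) = \mathrm{softmax}(h_n W_e^{\top})
```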
Supervised Fine-Tuning#
- objective to maximize: the supervised loss plus the language-modeling loss as an auxiliary objective (see the formulas below)
- Advantages of the auxiliary LM objective:
  ① improving generalization of the supervised model
  ② accelerating convergence
- $\lambda$ is a hyperparameter weighting the auxiliary LM objective; it is set to 0.5
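The corresponding fine-tuning objective from the GPT-1 paper, where $\mathcal{C}$ is the labeled dataset, $h_l^m$ is the final transformer activation at the last input token, and $W_y$ is the added linear output layer:

```latex
% Supervised fine-tuning
P(y \mid x^1, \dots, x^m) = \mathrm{softmax}(h_l^m W_y), \qquad
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)

% Combined objective with the auxiliary LM loss (\lambda = 0.5)
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
```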
Experiment#
- Dataset: BooksCorpus (7,000 unpublished books, hence unseen data) containing long stretches of contiguous text, which helped the model learn long-range dependencies
- Unsupervised Training:
- Byte Pair Encoding (BPE) vocabulary with 40,000 merges was used
- 768-dimensional state for encoding tokens
- 12-layer model, 12 attention heads
- position-wise feed forward layer: 3072-dimensional
- 117M parameters
- Supervised Fine-tuning:
- 3 epochs for most of the downstream tasks → already learnt a lot during pre-training. Thus, minimal fine-tuning was enough
- Most of the hyperparameters from unsupervised pre-training were used for fine-tuning
- Works well across datasets of different sizes, from smaller datasets such as STS-B (5.7k training examples) up to the largest, SNLI (550k training examples); the configuration above is collected in the sketch below
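For convenience, the GPT-1 setup listed above gathered into a single config sketch (a plain Python dictionary for illustration, not OpenAI's code):

```python
# GPT-1 hyperparameters collected from the notes above (illustrative summary only).
gpt1_config = {
    "vocab": "BPE with 40,000 merges",
    "n_layers": 12,            # transformer decoder blocks
    "n_heads": 12,             # attention heads per block
    "d_model": 768,            # token/state dimensionality
    "d_ffn": 3072,             # position-wise feed-forward dimension
    "n_params": "117M",
    "fine_tuning_epochs": 3,   # enough for most downstream tasks
    "aux_lm_weight": 0.5,      # lambda for the auxiliary LM objective
}
print(gpt1_config)
```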
Discussion#
- Impact of the number of layers transferred: the more transformer blocks transferred (i.e., the deeper the language model), the better the downstream performance, which suggests that each layer of the pre-trained LM learns something different and useful.
- Evolution of zero-shot performance as a function of LM pre-training updates: without any fine-tuning, zero-shot performance keeps improving as pre-training runs longer, which suggests that LM pre-training really does learn general-purpose knowledge.
- Ablation studies:
  - larger datasets benefit from the auxiliary LM objective during fine-tuning, but smaller datasets do not
  - a 5.6-point average score drop when the Transformer is replaced by an LSTM (single layer, 2048 units)
  - lack of pre-training hurts performance across all tasks, resulting in a 14.8% decrease
Conclusion#
- GPT-1 achieved SOTA on 9 out of 12 tasks
- decent zero-shot performance on various tasks
- GPT-1 showed that language modeling serves as an effective pre-training objective; the architecture facilitated transfer learning and could perform various NLP tasks with very little fine-tuning.
GPT-2
Main Idea#
- Learning Objectives & Concepts
  - learning multiple tasks with the same unsupervised model (no supervised fine-tuning)
  - objective: P(output | input) → P(output | input, task) [task conditioning]
- Zero-Shot Learning and Zero-Shot Task Transfer
  - no examples are provided; the model understands the task from the given instruction alone
  - e.g. (translate to french, english text, french text)
- LM = Unsupervised Multitask Learning
  - the supervised output is a subset of the language-model sequence
  - e.g. 1: "The translation of the word Machine Learning in Chinese is 机器学习."
  - e.g. 2: "The President of the United States is Trump."
  - Compared with supervised multi-task learning, a language model simply does not need an explicit declaration of which fields are the outputs to predict; in effect, the supervised outputs are just a subset of the language-model sequence (see the prompt sketch below).
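A minimal sketch of this idea: because a task is just extra conditioning text, P(output | input, task) can be realized by writing the task and input into one prompt and letting the LM continue it. The helper name and format below are illustrative, not from the paper.

```python
# Express (task, input) as plain text; the model's continuation is treated as the output.
def make_zero_shot_prompt(task: str, text: str) -> str:
    return f"{task}: {text}\n"

# Mirrors the (translate to french, english text, french text) format mentioned above.
print(make_zero_shot_prompt("translate English to French", "The book is on the table."))
print(make_zero_shot_prompt("answer the question", "Who wrote Hamlet?"))
```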
Model Architecture#
- GPT-2 has 1.5 billion parameters (vs. 117M for GPT-1)
- 48 layers, 1600-dimensional states
- Larger vocabulary of 50,257 tokens
- Larger batch size of 512 and larger context window of 1024 tokens
- Layer normalisation was moved to the input of each sub-block, and an additional layer normalisation was added after the final self-attention block
- At initialisation, the weights of the residual layers were scaled by 1/√N, where N is the number of residual layers; this adjusts the residual-path initialisation to the network depth so that deeper stacks start with smaller residual contributions (see the sketch below)
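A minimal PyTorch-style sketch (not OpenAI's code) of the two architectural changes above: pre-layer-norm sub-blocks and residual projections scaled by 1/√N at initialisation.

```python
import math
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """One GPT-2-style block: LayerNorm at the input of each sub-block,
    residual projections scaled by 1/sqrt(N) at init (N = number of residual layers)."""
    def __init__(self, d_model: int, n_heads: int, n_layers: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)   # LN moved to the *input* of the attention sub-block
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)   # LN at the input of the MLP sub-block
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        with torch.no_grad():              # scale the weights feeding the residual stream
            self.attn.out_proj.weight.mul_(1.0 / math.sqrt(n_layers))
            self.mlp[-1].weight.mul_(1.0 / math.sqrt(n_layers))

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x
```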
Dataset & Experiment#
- WebText: 40GB of text from over 8 million documents (Wikipedia pages removed)
- On French-to-English translation, GPT-2 performed better than most unsupervised models in the zero-shot setting but did not outperform the SOTA unsupervised model
- GPT-2 did not perform well on text summarization; its performance was similar to or worse than classic models trained for summarization
Generalization vs Memorization#
Recent computer-vision work has shown that image datasets often contain near-duplicate images; for example, CIFAR-10 has 3.3% overlap between its training and test sets, which leads to an over-estimate of generalization performance.
- Overlap between training and test data → over-reporting of the generalization performance of machine-learning systems
- Bloom filters containing 8-grams of WebText training-set tokens were used to measure this overlap (a simplified check is sketched below)
- Recommendation: use n-gram-overlap-based de-duplication as an important verification step and sanity check when creating training/test splits for new NLP datasets
- Performance on training and test sets is similar and improves together as model size increases → GPT-2 is still underfitting WebText in many ways
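A minimal sketch of the n-gram overlap check described above. The paper uses Bloom filters over 8-grams of the training text; a plain Python set stands in here for simplicity, so memory use is not representative.

```python
def ngrams(tokens, n=8):
    """Set of all n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(train_text: str, test_text: str, n: int = 8) -> float:
    """Fraction of the test set's n-grams that also appear in the training set."""
    train_grams = ngrams(train_text.lower().split(), n)
    test_grams = ngrams(test_text.lower().split(), n)
    return len(test_grams & train_grams) / len(test_grams) if test_grams else 0.0
```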
Summary#
- achieved SOTA results on 7 out of 8 tested language modelling datasets in the zero-shot setting
- larger dataset & more parameters improved the capability of LM to understand tasks
GPT-3
Introduction#
- Limitation: although task-agnostic in architecture, there is still a need for task-specific datasets and task-specific fine-tuning
- Problems with BERT-style fine-tuned models:
  ① excessive reliance on supervised data in the field
  ② overfitting to the fine-tuning data distribution
  ⇒ focus on a more general NLP model
  ⇒ less supervised data, no fine-tuning
- Concepts
  - In-context learning: large language models develop pattern recognition and other skills from the text data they are trained on, and can apply them at inference time from the prompt alone
  - Few-shot, one-shot and zero-shot settings: in-context capability grows as model capacity grows (see the prompt sketch below)
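A minimal sketch of how the zero-/one-/few-shot settings differ only in the prompt, with no gradient updates. The helper name and the `=>` format are illustrative, not from the paper.

```python
def make_prompt(task_description, examples, query):
    """len(examples) = 0 -> zero-shot, 1 -> one-shot, k > 1 -> few-shot."""
    lines = [task_description]
    for src, tgt in examples:            # in-context demonstrations
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")          # the model continues from here
    return "\n".join(lines)

print(make_prompt("Translate English to French:",
                  [("sea otter", "loutre de mer"), ("cheese", "fromage")],
                  "peppermint"))
```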
Model and Implementation details#
- GPT-3 has 96 layers, each with 96 attention heads
- Size of word embeddings: 1600 for GPT-2 → 12288 for GPT-3
- Context window size: 1024 tokens for GPT-2 → 2048 tokens for GPT-3
- Alternating dense and locally banded sparse attention patterns
- Sparse attention: the paper gives no details and refers to Generating Long Sequences with Sparse Transformers. A Sparse Transformer attends only to the k states that contribute most; by explicitly selecting only a few elements, values that are not highly relevant to the query are effectively zeroed out compared with conventional dense attention. (A toy banded mask is sketched below.)
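A toy sketch of the locally banded causal mask that forms the "sparse" half of the alternating pattern above. Only the mask construction is shown; the real Sparse Transformer kernels are more involved, and the window size here is arbitrary.

```python
import numpy as np

def banded_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """mask[i, j] is True where position i may attend to position j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)   # causal AND within a local band of `window` tokens

print(banded_causal_mask(6, window=3).astype(int))
```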
Experiment#
- Dataset (45TB of raw text before filtering): trained on a mix of 5 different corpora, each assigned a sampling weight; higher-quality datasets were sampled more often, so the model sees them for more than one epoch (a weighted-sampling sketch follows this list)
- downloaded and filtered a version of CommonCrawl
- fuzzy deduplication
- added high-quality reference corpora
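A minimal sketch of the weight-based mixture sampling described above: each corpus gets a mixture weight, and each training document is drawn according to those weights, so high-weight (higher-quality) corpora are revisited more often. The weights below are illustrative, not the paper's exact values.

```python
import random

corpora = {
    "common_crawl": 0.60,   # illustrative mixture weights
    "webtext2":     0.22,
    "books":        0.15,
    "wikipedia":    0.03,
}

def sample_corpus(rng: random.Random) -> str:
    """Pick the corpus the next training document is drawn from."""
    names, weights = zip(*corpora.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_corpus(rng) for _ in range(8)])
```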
Discussion & Broader Impacts#
- loses coherence when formulating long passages and tends to repeat sequences
- does not perform very well on tasks like fill-in-the-blanks and some reading-comprehension tasks, etc.
- possibly due to unidirectionality?
- lacks the notion of task or goal-oriented prediction of tokens, suggests:
- augmentation of learning objective, use of reinforcement learning to fine tune models, etc.
- complex & costly, heavy architecture, less interpretability
- misuse of its human-like text generating capability for phishing, spamming, spreading misinformation
- gender, ethnicity, race & religion bias
Conclusion#
- BIGGER
- in the few-shot setting, it surpasses the current fine-tuning SOTA on some NLU tasks
- performs well on downstream NLP tasks in the zero-shot and few-shot settings: writing articles, doing simple arithmetic, writing code, etc.
- most impressive: its generalization