
Paper Notes [4]: GPT-1 to GPT-3, an Overview and Comparison

Papers:
Improving Language Understanding by Generative Pre-Training
Language Models are Unsupervised Multitask Learners
Language Models are Few-Shot Learners
Paper links: GPT-1, GPT-2, GPT-3
Team: OpenAI

GPT-1

Motivation & Objectives

  • most SOTA NLP models were trained specifically on a particular task

    • Limitations:
      ① need large amounts of annotated data, which are not easily available
      ② fail to generalize to other tasks
  • 2 Challenges:
    ① which optimization objectives are most effective?
    ② what’s the most efficient way to transfer learned representations to target task?

    ⇒ A semi-supervised approach using a combination of unsupervised pre-training and supervised fine-tuning
    ⇒ To learn a universal representation that transfers with little adaptation to a wide range of tasks

  • learning a generative language model using unlabeled data and then fine-tuning the model by providing examples of specific downstream tasks

  1. It is unclear which optimization objectives are most effective for learning representations that are useful for transfer
  2. There is no consensus on the most effective way to transfer these learned representations to the target task (existing approaches combine task-specific changes to the model architecture, intricate learning schemes, and auxiliary learning objectives)

Framework

Unsupervised Language Modeling (Pre-training):

- multi-layer Transformer decoder (N = 12)
- NO Encoder-Decoder Attention Layer

  1. The main body of the model is a Transformer decoder with the encoder-decoder attention layer removed; the decoder output is passed through a softmax layer to produce the output distribution over target tokens.

  2. $h_n$ is the final hidden state of the top layer, and $h_n W_e^{\top}$ can be read as a score (loosely, an attention weight) that the output layer assigns to each token in the vocabulary. After pre-training, GPT has stored the semantic and syntactic information learned from the corpus; the formulas are written out below.
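For reference, the pre-training objective and forward pass from the GPT-1 paper, with $\mathcal{U}$ the unlabeled corpus, $k$ the context-window size, $W_e$ the token-embedding matrix and $W_p$ the position-embedding matrix:

$$L_1(\mathcal{U}) = \sum_i \log P\big(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta\big)$$

$$h_0 = U W_e + W_p, \qquad h_l = \mathrm{transformer\_block}(h_{l-1})\ \ \forall l \in [1, n], \qquad P(u) = \mathrm{softmax}\big(h_n W_e^{\top}\big)$$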

Supervised Fine-Tuning


  • objective to maximize: the combined loss $L_3 = L_2 + \lambda L_1$
  • Advantages:
    ① improving generalization of the supervised model
    ② accelerating convergence

$\lambda$ is a hyperparameter, set to 0.5; the objectives are written out below.
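For reference, the fine-tuning objective over a labeled dataset $\mathcal{C}$ and the combined objective with the auxiliary language-modeling term, as given in the paper:

$$P\big(y \mid x^1, \dots, x^m\big) = \mathrm{softmax}\big(h_l^m W_y\big), \qquad L_2(\mathcal{C}) = \sum_{(x, y)} \log P\big(y \mid x^1, \dots, x^m\big)$$

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$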


Experiment

  • Dataset: BooksCorpus (about 7,000 unpublished books, i.e. text unseen by downstream benchmarks), containing long stretches of contiguous text, which helped the model learn long-range dependencies
  • Unsupervised Training:
    • Byte Pair Encoding (BPE) vocabulary with 40,000 merges was used
    • 768-dimensional state for encoding tokens
    • 12 layered model, 12 attention heads
    • position-wise feed forward layer: 3072-dimensional
    • 117M parameters (a rough parameter count is sketched after this list)
  • Supervised Fine-tuning:
    • 3 epochs for most of the downstream tasks → already learnt a lot during pre-training. Thus, minimal fine-tuning was enough
    • Most of the hyperparameters from unsupervised pre-training were used for fine-tuning
  • Works well across datasets of different sizes, from smaller ones such as STS-B (5.7k training examples) to the largest, SNLI (550k training examples)
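A rough back-of-the-envelope check that these hyperparameters give roughly 117M parameters (a sketch: the BPE vocabulary size is approximated as 40k, and layer norms and biases are ignored):

```python
# Approximate parameter count for the GPT-1 configuration listed above.
d_model, n_layers, d_ff = 768, 12, 3072
vocab_size, n_ctx = 40_000, 512            # ~40k BPE vocabulary, 512-token context

token_emb = vocab_size * d_model           # ~30.7M, tied with the output softmax
pos_emb = n_ctx * d_model                  # ~0.4M learned position embeddings
attn_per_layer = 4 * d_model * d_model     # Q, K, V and output projections
ffn_per_layer = 2 * d_model * d_ff         # the two position-wise feed-forward matrices
total = token_emb + pos_emb + n_layers * (attn_per_layer + ffn_per_layer)
print(f"~{total / 1e6:.0f}M parameters")   # ~116M, close to the reported 117M
```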

Discussion

  • Impact of number of layers transferred

  • Evolution of zero-shot performance on different tasks as a function of LM pre-training updates

    • The more transformer blocks that are transferred (i.e., the deeper the language model), the better the results, which suggests that the different layers of the language model really do learn different things.
    • Without fine-tuning the language model, the more pre-training updates it has received, the better the final performance, which suggests that language-model pre-training really does learn something general.
  • Ablation studies

  • larger datasets benefit from the auxiliary objective but smaller datasets do not

  • An average score drop of 5.6 when the Transformer is replaced with an LSTM (a single-layer 2048-unit LSTM)

  • lack of pre-training hurts performance across all the tasks, resulting in a 14.8% decrease

Conclusion

  • GPT-1 performed SOTA in 9 out of 12 tasks

  • decent zero-shot performance on various tasks

  • GPT-1 showed that language modeling serves as an effective pre-training objective. The architecture facilitates transfer learning and can perform various NLP tasks with very little fine-tuning.

GPT-2

Main Idea

  • Learning Objectives & Concepts

    • learning multiple tasks using the same unsupervised model (no supervised fine-tuning)
    • objective: P(output|input) → P(output|input, task) [task conditioning]
  • Zero Shot Learning and Zero Shot Task Transfer

    • no examples are provided and the model understands the task based on the given instruction
    • E.g. (translate to french, english text, french text)
  • LM = Unsupervised Multitask Learning

    • the supervised output is a subset of the language model sequence
    • E.g1. “The translation of word Machine Learning in chinese is 机器学习.”
    • E.g2. “The President of the United States is Trump.”

Compared with supervised multi-task learning, the language model simply does not need an explicit definition of which fields are the outputs to be predicted; in effect, the supervised output is just a subset of the language-model sequence.

Model Architecture


  • GPT-2 has 1.5 billion parameters [GPT-1 (117M parameters)]
  • 48 layers, 1600-dimension
  • Larger vocabulary of 50,257 tokens
  • Larger batch size of 512 and larger context window of 1024 tokens
  • Layer normalisation was moved to input of each sub-block and an additional layer normalisation was added after final self-attention block
  • At initialisation, the weight of residual layers was scaled by 1/√N, where N was the number of residual layers
    The scaling by 1/√N is applied because the initialization of the residual-layer weights is adjusted according to network depth (otherwise the contributions of N residual additions accumulate as the model gets deeper); a minimal sketch of a pre-LN block with this scaling is given below.
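A minimal PyTorch sketch of these two changes (pre-layer-norm placement and 1/√N residual scaling at initialization). This is an illustration under my own module and argument names, not the official implementation; `n_residual_layers` is commonly taken as two per block, since each block adds two residual branches.

```python
import math
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """GPT-2-style block sketch: layer norm at the input of each sub-block
    (pre-LN), and residual-path projections scaled by 1/sqrt(N) at init."""

    def __init__(self, d_model, n_heads, n_residual_layers):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Scale the projections that write into the residual stream by 1/sqrt(N),
        # where N is the number of residual layers in the whole model.
        scale = 1.0 / math.sqrt(n_residual_layers)
        with torch.no_grad():
            self.attn.out_proj.weight.mul_(scale)
            self.mlp[-1].weight.mul_(scale)

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)                      # layer norm moved before the sub-block
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a                            # residual add
        x = x + self.mlp(self.ln2(x))        # pre-LN before the feed-forward sub-block
        return x

# e.g. one of GPT-2's 48 blocks: d_model=1600, 25 heads, N = 2 * 48 residual layers
block = PreLNBlock(1600, 25, n_residual_layers=2 * 48)
```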

Dataset & Experiment

  • WebText: about 40GB of text from over 8 million documents (Wikipedia pages were removed, since Wikipedia is a common source for other datasets' test sets)
  • In French to English translation task, GPT-2 performed better than most unsupervised models in zero shot setting but did not outperform the SOTA unsupervised model
  • GPT-2 could not perform well on text summarization; its performance was similar to or worse than classic models trained specifically for summarization

Generalization vs Memorization

Recent computer-vision research has shown that image datasets often contain near-duplicate images; CIFAR-10, for example, has 3.3% overlap between its training and test sets, which leads to an over-estimate of the generalization performance of machine-learning systems.

  • Overlapping → over-reporting of the generalization performance of machine learning systems
  • Bloom filters containing 8-grams of WebText training set tokens
  • recommend n-gram-overlap-based de-duplication as an important verification step and sanity check when creating training and test splits for new NLP datasets (a simplified version of this check is sketched below)
  • performance on the training and test sets is similar and improves together as model size increases → GPT-2 still underfits WebText in many ways
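A simplified version of this overlap check, using a plain Python set rather than the Bloom filters the paper uses for memory efficiency (function names are illustrative):

```python
def ngrams(tokens, n=8):
    """All n-grams of a token sequence as a set of tuples; the paper uses 8-grams."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_rate(test_tokens, train_ngrams, n=8):
    """Fraction of a test document's n-grams that also occur in the training set."""
    test_ngrams = ngrams(test_tokens, n)
    return len(test_ngrams & train_ngrams) / max(len(test_ngrams), 1)

# Toy usage: build the training-set n-grams once, then score each test document.
train_ngrams = ngrams("the cat sat on the mat while the dog slept nearby".split())
test_tokens = "someone said the cat sat on the mat while the dog slept".split()
print(overlap_rate(test_tokens, train_ngrams))  # 0.6 -> heavy overlap with training data
```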

Summary

  • achieve SOTA results on 7 out of 8 tested language modelling datasets in zero-shot
  • larger dataset & more parameters improved the capability of LM to understand tasks
  • How to use: condition the model on a natural-language prompt describing the task and let it continue the text autoregressively (no fine-tuning)


GPT-3

Introduction

  • Limitation: although task-agnostic, still a need for task-specific datasets and fine-tuning

  • BERT, etc:
    ① Excessive reliance on supervised data in the field
    ② Overfitting to the data distribution
    ⇒ Focusing on a more general NLP model
    ⇒ Less supervised data, no fine-tuning.

  • Concepts

    • In-context learning: during pre-training, large language models develop pattern-recognition and other skills from the text they are trained on; at inference time they use the task description and any demonstrations placed in the context (the prompt) to perform the task, with no gradient updates.
    • Few-shot, one-shot and zero-shot settings: capability in all three settings increases as model capacity increases (the three prompt formats are sketched below)

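A sketch of what the three settings look like as plain-text prompts. Only the number of in-context demonstrations changes, and no gradient updates are performed in any setting; the English-to-French example follows the paper, but the exact formatting here is illustrative:

```python
# Zero-, one- and few-shot prompts differ only in how many demonstrations precede the query.
task_description = "Translate English to French:"
demos = ["sea otter => loutre de mer", "peppermint => menthe poivrée"]
query = "cheese =>"

zero_shot = f"{task_description}\n{query}"
one_shot = f"{task_description}\n{demos[0]}\n{query}"
few_shot = f"{task_description}\n" + "\n".join(demos) + f"\n{query}"

print(few_shot)
```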

Model and Implementation details

  • GPT-3 has 96 layers with each layer having 96 attention heads.
  • Size of word embeddings: 1600 for GPT-2 → 12288 for GPT-3
  • Context window size: 1024 for GPT-2 → 2048 tokens for GPT-3
  • Alternating dense and locally banded sparse attention patterns
    • sparse attention:
      Alternating dense and locally banded sparse attention patterns are used (the paper gives no details and refers to Generating Long Sequences with Sparse Transformers). A Sparse Transformer attends only to the k states that contribute most: by explicitly selecting just a few elements, values that are not highly relevant to the query are zeroed out, in contrast to conventional attention. A toy illustration of such masks is sketched below.
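A toy illustration of the alternating pattern as boolean attention masks (True = may attend). This is a sketch only: the band width and the even/odd layer assignment are my assumptions, since the GPT-3 paper does not spell out the details:

```python
import numpy as np

def causal_attention_mask(seq_len, layer_idx, band=4):
    """Even layers: dense causal attention. Odd layers: locally banded causal
    attention, where each position attends only to the previous `band` positions."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    if layer_idx % 2 == 0:
        return causal                 # dense causal mask
    return causal & (i - j < band)    # locally banded causal mask

print(causal_attention_mask(8, layer_idx=1).astype(int))
```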

Experiment

  • Dataset: trained on a mix of 5 different corpora, each assigned a sampling weight (the CommonCrawl component alone is ~45TB of compressed plaintext before filtering, 570GB after). High-quality datasets were sampled more often, so they were seen for more than one epoch while CommonCrawl was seen less than once (see the sketch below)
    • downloaded and filtered a version of CommonCrawl
    • fuzzy deduplication
    • added high-quality reference corpora
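A sketch of the weighted-mixture sampling, using the weights reported in the GPT-3 paper (Table 2.2); the sampling loop itself is illustrative:

```python
import random

# Sampling weights from the paper (they sum to ~1 after rounding). High-quality
# corpora are seen 2-3 times over training, CommonCrawl and Books2 less than once.
mixture_weights = {
    "CommonCrawl (filtered)": 0.60,
    "WebText2": 0.22,
    "Books1": 0.08,
    "Books2": 0.08,
    "Wikipedia": 0.03,
}

def sample_corpus(rng):
    """Pick which corpus the next training sequence is drawn from."""
    names, weights = zip(*mixture_weights.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_corpus(rng) for _ in range(5)])
```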

Discussion & Broader Impacts


  • loses coherence when formulating long passages and tends to repeat sequences
  • does not perform very well on tasks such as fill-in-the-blank and some reading-comprehension tasks
  • Unidirectionality?
  • lacks the notion of task or goal-oriented prediction of tokens, suggests:
    • augmentation of learning objective, use of reinforcement learning to fine tune models, etc.
  • complex & costly, heavy architecture, less interpretability

  • misuse of its human-like text generating capability for phishing, spamming, spreading misinformation
  • gender, ethnicity, race & religion bias

Conclusion

  • BIGGER

  • In the few-shot setting it surpasses the then-current fine-tuned SOTA on some NLU tasks

  • performs well on downstream NLP tasks in the zero-shot and few-shot settings: writing articles, doing simple arithmetic, writing code, etc.

  • Most impressive: generalization




