拆解追溯 GPT-3.5 各项能力的起源
符尧, yao.fu@ed.ac.uk
爱丁堡大学博士生,硕士毕业于哥伦比亚大学,本科毕业于北京大学
在艾伦人工智能研究院 (Allen Institute for AI) 共同完成英文原稿,与剑桥大学郭志江共同翻译为中文
感谢 Raj Ammanabrolu (Allen Institute for AI)、Peter Liu (Google Brain)、Brendan Dolan-Gavitt (New York University)、Denny Zhou (Google Brain) 对终稿的讨论和建议,他们的建议极大程度上增加了本文的完整度。
最近,OpenAI 的预训练模型 ChatGPT 给人工智能领域的研究人员留下了深刻的印象和启发。毫无疑问,它又强又聪明,且跟它说话很好玩,还会写代码。它在多个方面的能力远远超过了自然语言处理研究者们的预期。于是我们自然就有一个问题:ChatGPT 是怎么变得这么强的?它的各种强大的能力到底从何而来?在这篇文章中,我们试图剖析 ChatGPT 的突现能力(Emergent Ability),追溯这些能力的来源,希望能够给出一个全面的技术路线图,来说明 GPT-3.5 模型系列以及相关的大型语言模型是如何一步步进化成目前的强大形态。
我们希望这篇文章能够促进大型语言模型的透明度,成为开源社区共同努力复现 GPT-3.5 的路线图。
Recently, the field has been greatly impressed and inspired by OpenAI’s ChatGPT. It is undoubtedly clever, capable, and very fun to talk to. Its multi-faceted abilities are significantly beyond many NLP researchers’ and practitioners’ expectations based on the impression of (not-that-strong) original GPT-3. The natural question is how ChatGPT gets there, and where these fantastic abilities come from. In this post, we try to dissect the emergent abilities and trace them to their sources, hoping to give a comprehensive roadmap about how the GPT-3.5 model family, along with related large language models, evolved to their current forms.
We hope this post can promote the transparency of large language models and serve as the roadmap for the community’s ongoing efforts of reproducing GPT-3.5.
多年以后,面对行刑队,奥雷里亚诺·布恩迪亚上校将会回想起父亲带他去见识冰块的那个遥远的下午。 —— 《百年孤独》 加西亚·马尔克斯
一、2020 版初代 GPT-3 与大规模预训练
初代GPT-3展示了三个重要能力:
语言生成:遵循提示词(prompt),然后生成补全提示词的句子 (completion)。这也是今天人类与语言模型最普遍的交互方式(语言生成与上下文学习这两种提示方式的直观对比,见本列表后的示例)。
上下文学习 (in-context learning): 遵循给定任务的几个示例,然后为新的测试用例生成解决方案。很重要的一点是,GPT-3虽然是个语言模型,但它的论文几乎没有谈到“语言建模” (language modeling) —— 作者将他们全部的写作精力都投入到了对上下文学习的愿景上,这才是 GPT-3的真正重点。
世界知识 (world knowledge):包括事实性知识 (factual knowledge) 和常识 (commonsense)。
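为了直观对比上面提到的「语言生成(补全提示词)」和「上下文学习(少样本)」两种提示方式,下面给出一个极简示例。示例中的提示词文本均为假设的演示内容,并非论文中的原始数据:

```python
# 语言生成:给一段开头,让模型补全后续文本
completion_prompt = "人工智能的历史可以追溯到"

# 上下文学习 (in-context learning):先给出同一任务的几个示例,
# 再给出新的测试输入,让模型按示例的格式续写出答案
in_context_prompt = """英译中:sea otter => 海獭
英译中:cheese => 奶酪
英译中:language model =>"""

print(completion_prompt)
print(in_context_prompt)
```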
那么这些能力从何而来呢?
基本上,以上三种能力都来自于大规模预训练:在有 3000 亿单词的语料上预训练拥有 1750 亿参数的模型(训练语料的 60% 来自于 2016 - 2019 的 C4,22% 来自于 WebText2,16% 来自于 Books,3% 来自于 Wikipedia)。其中:
语言生成的能力来自于语言建模的训练目标 (language modeling)(见本列表后的示意代码)。
世界知识来自 3000 亿单词的训练语料库(不然还能是哪儿呢)。
模型的 1750 亿参数是为了存储知识,Liang et al. (2022) 的文章进一步证明了这一点。他们的结论是,知识密集型任务的性能与模型大小息息相关。
上下文学习的能力来源及为什么上下文学习可以泛化,仍然难以溯源。直觉上,这种能力可能来自于同一个任务的数据点在训练时按顺序排列在同一个 batch 中。然而,很少有人研究为什么语言模型预训练会促使上下文学习,以及为什么上下文学习的行为与微调 (fine-tuning) 如此不同。
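作为补充说明,下面给出「语言建模训练目标」的一个极简示意。这是基于 PyTorch 的假设性代码,只演示「预测下一个词」的交叉熵损失,并非 GPT-3 的实际训练实现:

```python
import torch
import torch.nn.functional as F

def language_modeling_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """语言建模目标:用第 t 个位置的输出预测第 t+1 个 token。

    logits: [batch, seq_len, vocab_size],模型对每个位置"下一个词"的打分
    tokens: [batch, seq_len],输入的 token id 序列
    """
    shift_logits = logits[:, :-1, :]   # 去掉最后一个位置(它没有"下一个词"可预测)
    shift_labels = tokens[:, 1:]       # 标签整体右移一位
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

# 用随机张量演示调用方式
vocab_size = 100
logits = torch.randn(2, 8, vocab_size)
tokens = torch.randint(0, vocab_size, (2, 8))
print(language_modeling_loss(logits, tokens))
```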
There are three important abilities that the initial GPT-3 exhibits:
Language generation: to follow a prompt and then generate a completion of the given prompt. Today, this might be the most ubiquitous way of human-LM interaction.
In-context learning: to follow a few examples of a given task and then generate the solution for a new test case. It is interesting to note that, although being a language model, the original GPT-3 paper barely talks about “language modeling” — the authors devoted their writing efforts to their visions of in-context learning, which is the real focus of GPT-3.
World knowledge: including factual knowledge and commonsense.
Where do these abilities come from?
Generally, the above three abilities should come from large-scale pretraining — to pretrain the 175B-parameter model on 300B tokens (60% 2016 - 2019 C4 + 22% WebText2 + 16% Books + 3% Wikipedia). Where:
The language generation ability comes from the language modeling training objective.
The world knowledge comes from the 300B token training corpora (or where else could it be).
The 175B model size is for storing knowledge, which is further evidenced by Liang et al. (2022), who conclude that the performance on tasks requiring knowledge correlates with model size.
The source of the in-context learning ability, as well as its generalization behavior, is still elusive. Intuitively, this ability may come from the fact that data points of the same task are ordered sequentially in the same batch during pretraining. Yet there is little study on why language model pretraining induces in-context learning, and why in-context learning behaves so differently than fine-tuning.
令人好奇的是,初代的 GPT-3 有多强。其实比较难确定初代 GPT-3(在 OpenAI API 中被称为 davinci)到底是“强”还是“弱”。一方面,它合理地回应了某些特定的查询,并在许多数据集中达到了还不错的性能;另一方面,它在许多任务上的表现还不如 T5 这样的小模型(参见其原始论文)。在今天(2022 年 12 月)ChatGPT 的标准下,很难说初代的 GPT-3 是“智能的”。Meta 开源的 OPT 模型试图复现初代 GPT-3,但它的能力与当今的标准也形成了尖锐的对比。许多测试过 OPT 的人也认为与现在的 text-davinci-002 相比,该模型确实“不咋地”。尽管如此,OPT 可能是初代 GPT-3 的一个足够好的开源近似模型了(根据 OPT 论文和斯坦福大学的 HELM 评估)。
虽然初代的 GPT-3 可能表面上看起来很弱,但后来的实验证明,初代 GPT-3 有着非常强的潜力。这些潜力后来被代码训练、指令微调 (instruction tuning) 和基于人类反馈的强化学习 (reinforcement learning with human feedback, RLHF) 解锁,最终展示出极为强大的突现能力。
A curious question is how strong the initial GPT-3 is. It is rather challenging to determine whether the initial GPT-3 (davinci in OpenAI API) is “strong” or “weak.” On the one hand, it responds to certain queries reasonably and achieves OK-ish performance on many benchmarks; on the other, it underperforms small models like T5 on many tasks (see its original paper). It is also very hard to say the initial GPT-3 is “smart” by today's (= Dec 2022) ChatGPT standard. The sharp contrast between the initial GPT-3’s ability and today’s standard is replayed by Meta’s OPT model, which is viewed as “just bad” by many who have tested it (compared to text-davinci-002). Nevertheless, OPT might be a good enough open-source approximation to the initial GPT-3 (according to the OPT paper and Stanford’s HELM evaluation).
Although the initial GPT-3 might be superficially weak, it turns out that these abilities serve as very important foundations for all the emergent abilities later unlocked by training on code, instruction tuning, and reinforcement learning with human feedback (RLHF).
二、从 2020 版 GPT-3 到 2022 版 ChatGPT
为了展示 OpenAI 是如何从最初的 GPT-3 一步步发展到 ChatGPT 的,我们先来看一下 GPT-3.5 的进化树:
(图:GPT-3.5 模型系列的进化树)
在 2020 年 7 月,OpenAI 发布了模型索引为 davinci 的初代 GPT-3 论文,从此它就开始不断进化。在 2021 年 7 月,Codex 的论文发布,其中初始的 Codex 是根据(可能是内部的)120 亿参数的 GPT-3 变体进行微调的。后来这个 120 亿参数的模型演变成 OpenAI API 中的 code-cushman-001。在 2022 年 3 月,OpenAI 发布了指令微调 (instruction tuning) 的论文,其监督微调 (supervised instruction tuning) 的部分对应了 davinci-instruct-beta 和 text-davinci-001。在 2022 年 4 月至 7 月间,OpenAI 开始对 code-davinci-002 模型进行 Beta 测试,也称其为 Codex。然后 text-davinci-002、text-davinci-003 和 ChatGPT 都是从 code-davinci-002 进行指令微调得到的。详细信息请参阅 OpenAI 的模型索引文档。
尽管 Codex 听着像是一个只管代码的模型,但 code-davinci-002 可能是最强大的针对自然语言的 GPT-3.5 变体(优于 text-davinci-002 和 -003)。code-davinci-002 很可能在文本和代码上都经过训练,然后根据指令进行调整(将在下面解释)。然后 2022 年 5-6 月发布的 text-davinci-002 是一个基于 code-davinci-002 的有监督指令微调 (supervised instruction tuned) 模型。在 text-davinci-002 上进行的指令微调很可能降低了模型的上下文学习能力,但是增强了模型的零样本能力(将在下面解释)。然后是 text-davinci-003 和 ChatGPT,它们都在 2022 年 11 月发布,是使用基于人类反馈的强化学习进行指令微调 (instruction tuning with reinforcement learning from human feedback) 的两种不同的模型变体。text-davinci-003 恢复了一些在 text-davinci-002 中丢失的上下文学习能力(但仍然比 code-davinci-002 差,大概是因为它在微调的时候混入了语言建模),并进一步改进了零样本能力(得益于 RLHF)。另一方面,ChatGPT 似乎牺牲了几乎所有的上下文学习的能力,来换取建模对话历史的能力。
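这里补充一个基于人类反馈的强化学习 (RLHF) 中「奖励模型」训练目标的极简示意。这是假设性的 PyTorch 代码,只演示常见的成对偏好损失,并不代表 OpenAI 的实际实现;RLHF 的完整流程还包括用训练好的奖励模型,通过 PPO 等算法继续微调策略模型:

```python
import torch
import torch.nn.functional as F

def reward_pairwise_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """成对偏好损失:让"人类更偏好的回答"得到比"被拒绝的回答"更高的奖励分数。

    reward_chosen / reward_rejected: [batch],奖励模型对两个回答的打分
    """
    # 等价于 -log(sigmoid(r_chosen - r_rejected))
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# 用随机分数演示调用方式
r_chosen = torch.randn(4)
r_rejected = torch.randn(4)
print(reward_pairwise_loss(r_chosen, r_rejected))
```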
总的来说,在 2020 - 2021 年期间,在 code-davinci-002 之前,OpenAI 已经投入了大量的精力,通过代码训练和指令微调来增强 GPT-3。当他们完成 code-davinci-002 时,所有的能力都已经存在了。很可能后续的指令微调,无论是通过有监督的版本还是强化学习的版本,都会做以下事情(稍后会详细说明;监督指令微调的数据形式可参考本列表后的示意代码):
指令微调不会为模型注入新的能力 —— 所有的能力都已经存在了。指令微调的作用是解锁 / 激发这些能力。这主要是因为指令微调的数据量比预训练数据量少几个数量级(基础的能力是通过预训练注入的)。
指令微调将 GPT-3.5 分化到不同的技能树。有些更擅长上下文学习,如 text-davinci-003;有些更擅长对话,如 ChatGPT。
指令微调通过牺牲性能换取与人类的对齐 (alignment)。OpenAI 的作者在他们的指令微调论文中称其为“对齐税” (alignment tax)。许多论文都报道了 code-davinci-002 在基准测试中实现了最佳性能(但模型不一定符合人类期望)。在 code-davinci-002 上进行指令微调后,模型可以生成更加符合人类期待的反馈(或者说模型与人类对齐),例如:零样本问答、生成安全和公正的对话回复、拒绝超出模型知识范围的问题。
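为了直观说明「监督指令微调」的形式,下面给出一个极简示意。数据与字段名均为假设的演示内容,并非 OpenAI 的真实数据:核心就是把(指令,期望输出)对拼接起来继续做语言建模训练,且通常只对「输出」部分计算损失。

```python
# 监督指令微调的数据形式示意:每条样本是 (指令, 期望输出) 对
instruction_data = [
    {"instruction": "把下面这句话翻译成英文:今天天气很好。",
     "output": "The weather is nice today."},
    {"instruction": "用一句话解释什么是语言模型。",
     "output": "语言模型是根据上文预测下一个词的概率模型。"},
]

def build_training_text(example: dict) -> str:
    # 拼接成一段文本;训练时通常只对 output 部分计算语言建模损失
    return example["instruction"] + "\n" + example["output"]

for ex in instruction_data:
    print(build_training_text(ex))
    print("---")
```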
In Jul 2020, OpenAI released the initial GPT-3 paper with the davinci model index, and it started to evolve. In Jul 2021, the Codex paper was released, where the initial Codex is fine-tuned from a (presumably internal) 12B GPT-3 variant. Later this 12B model evolved to be the code-cushman-001 in OpenAI API. In Mar 2022, OpenAI released the instruction tuning paper, and its supervised tuning part corresponds to davinci-instruct-beta and text-davinci-001. At some point in Apr-Jul 2022, OpenAI started to beta test the code-davinci-002 model, also calling it Codex. Then text-davinci-002, text-davinci-003, and ChatGPT are all instruction-tuned from code-davinci-002. See OpenAI’s Model Index document for more details.
Although called Codex, code-davinci-002 is probably the most capable GPT-3.5 variant for natural language (better than text-davinci-002 and -003). It is very likely code-davinci-002 is trained on both text and code, then tuned on instructions (will explain below). Then text-davinci-002, released in May-Jun 2022, is a supervised instruction-tuned model based on code-davinci-002. It is very likely that the instruction tuning on text-davinci-002 decreased the model’s in-context learning ability but increased the model’s zero-shot ability (will explain below). Then text-davinci-003 and ChatGPT, both released in Nov 2022, are two different variants of instruction-tuned models using Reinforcement Learning with Human Feedback. text-davinci-003 recovered (but still worse than code-davinci-002) some in-context learning ability that was lost in text-davinci-002 (presumably because it tunes the model with LM mix-in) and further improved zero-shot ability (thanks to RLHF). On the other hand, ChatGPT seems to have sacrificed nearly all of its in-context learning ability to trade for the ability to model dialog context.
In summary, during 2020-2021, before code-davinci-002, substantial efforts had been devoted to enhancing GPT-3 with code training and instruction tuning. When they reached code-davinci-002, all the abilities were already there. It is likely that the follow-up instruction tuning, either supervised or RLHF, does the following things (will detail later):
Instruction tuning does not inject new abilities into the model — all abilities are already there. Instead, instruction tuning unlocks/elicits these abilities. This is mostly because the instruction-tuning data is orders of magnitude less than the pretraining data.
Instruction tuning adjusts the skillsets of GPT-3.5 towards different branches. Some are better at in-context learning, like text-davinci-003; some are better at dialog, like ChatGPT.
Instruction tuning trades performance for alignment with humans. The OpenAI authors call it “alignment tax” in their instruction tuning paper. Also, many papers have reported that code-davinci-002 achieves the best performance on benchmarks. Instruction tuning on code-davinci-002 gives the subsequent models alignment properties like zero-shot question answering, generating safe and impartial dialog responses, and rejecting questions beyond their knowledge scope.
三、Code-Davinci-002 和 Text-Davinci-002,在代码上训练,在指令上微调
在 code-davinci-002 和 text-davinci-002 之前,有两个中间模型,分别是 davinci-instruct-beta 和 text-davinci-001。两者在很多方面都比上述的两个 -002 模型差(例如,text-davinci-001 的链式思维推理能力不强)。所以我们在本节中重点介绍 -002 模型。
Before code-davinci-002 and text-davinci-002, there are two intermediate models, namely davinci-instruct-beta and text-davinci-001. Both are worse than the two -002 models in many aspects (e.g., text-davinci-001 performs clearly worse on chain-of-thought reasoning). So we focus on the -002 models in this section.
3.1 复杂推理能力的来源和泛化到新任务的能力
我们关注 code-davinci-002 和 text-davinci-002,这两兄弟是第一版的 GPT-3.5 模型,一个用于代码,另一个用于文本。它们表现出了四种与初代 GPT-3 不同的重要能力:
响应人类指令:以前,GPT-3 的输出主要是训练集中常见的句子。现在的模型会针对指令 / 提示词生成更合理的答案(而不是相关但无用的句子)。
泛化到没有见过的任务:当用于调整模型的指令数量超过一定的规模时,模型就可以自动在从没见过的新指令上生成有效的回答。这种能力对于上线部署至关重要,因为用户总会提新的问题,模型得答得出来才行。
代码生成和代码理解:这个能力很显然,因为模型用代码训练过。
利用思维链 (chain-of-thought) 进行复杂推理:初代 GPT-3 模型的思维链推理能力很弱甚至没有。code-davinci-002 和 text-davinci-002 是两个拥有足够强的思维链推理能力的模型(本列表后给出一个思维链提示的示例)。
思维链推理之所以重要,是因为思维链可能是解锁突现能力和超越缩放法则 (scaling laws) 的关键。请参阅上一篇博文。
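下面给出思维链提示 (chain-of-thought prompting) 的一个极简示例,题目与推理文本均为假设的演示内容:与直接让模型给出答案的普通少样本提示不同,思维链提示在示例中展示逐步推理过程,引导模型先写出中间步骤再给出最终答案。

```python
# 普通少样本提示:示例只给"问题 -> 答案"
standard_prompt = """问:小明有 5 个苹果,又买了 3 个,他现在有几个苹果?
答:8

问:食堂原有 23 个面包,用掉 20 个后又买了 6 个,现在有几个面包?
答:"""

# 思维链提示:示例答案里展示逐步推理,引导模型先推理再作答
cot_prompt = """问:小明有 5 个苹果,又买了 3 个,他现在有几个苹果?
答:小明原来有 5 个苹果,又买了 3 个,5 + 3 = 8,所以答案是 8。

问:食堂原有 23 个面包,用掉 20 个后又买了 6 个,现在有几个面包?
答:"""

print(standard_prompt)
print(cot_prompt)
```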
Now let’s look at code-davinci-002 and text-davinci-002, the two first GPT-3.5 models, one for code and the other for text. There are four important abilities they exhibit that differentiate them from the initial GPT-3:
Responding to human instruction: previously, the outputs of GPT-3 were mostly high-frequency prompt-completion patterns within the training set. Now the model generates reasonable answers to the prompt, rather than related but useless sentences.
Generalization to unseen tasks: when the number of instructions used for tuning the model is beyond a certain scale, the model can automatically generate completions for new instructions that are not in the training set. This ability is crucial for deployment, as users will always come up with new prompts.
Code generation and code understanding: obviously, because the model is trained on code.
Complex reasoning with chain-of-thought: previously, the model could not do tasks requiring multi-step reasoning with chain-of-thought. code-davinci-002 and text-davinci-002 are the two initial models exhibiting chain-of-thought reasoning ability.
The reason that chain-of-thought is important is that CoT is likely to be the key to unlocking emergent abilities and transcending scaling laws. See the previous blog post.
这些能力从何而来?
与之前的模型相比,两个主要区别是指令微调和代码训练。具体来说:
能够响应人类指令的能力是指令微调的直接产物。
对没有见过的指令做出反馈的泛化能力是在指令数量超过一定程度之后自动出现的,T0、Flan 和 FlanPaLM 论文进一步证明了这一点。
使用思维链进行复杂推理的能力很可能是代码训练的一个神奇的副产物。对此,我们有以下的事实作为一些支持:
最初的 GPT-3 没有接受过代码训练,它不能做思维链。
text-davinci-001 模型,虽然经过了指令微调,但第一版思维链论文报告说,它的思维链推理能力非常弱 —— 所以指令微调可能不是思维链存在的原因,代码训练才是模型能做思维链推理的最可能原因。
PaLM 有 5% 的代码训练数据,可以做思维链。
Codex 论文中的代码数据量为 159G,大约是初代 GPT-3 570GB 训练数据的 28%。code-davinci-002 及其后续变体可以做思维链推理。
在 HELM 测试中,Liang et al. (2022) 对不同模型进行了大规模评估。他们发现针对代码训练的模型具有很强的语言推理能力,包括 120 亿参数的 code-cushman-001。
我们在 AI2 的工作也表明,当配备复杂的思维链时,code-davinci-002 在 GSM8K 等重要数学基准上是目前表现最好的模型。
直觉来说,面向过程的编程 (procedure-oriented programming) 跟人类逐步解决任务的过程很类似,面向对象编程 (object-oriented programming) 跟人类将复杂任务分解为多个简单任务的过程很类似。
以上所有观察结果都是代码与推理能力 / 思维链之间的相关性,但不一定是因果性。这种相关性很有趣,但现在还是一个待研究的开放性问题。目前看来,我们没有非常确凿的证据证明代码就是思维链和复杂推理的原因。
此外,代码训练另一个可能的副产品是长距离依赖,正如 Peter Liu 所指出:“语言中的下个词语预测通常是非常局部的,而代码通常需要更长的依赖关系来做一些事情,比如前后括号的匹配或引用远处的函数定义”。这里我想进一步补充的是:由于面向对象编程中的类继承,代码也可能有助于模型建立编码层次结构的能力。我们将对这一假设的检验留给未来的工作。
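为了直观说明「代码中的长距离依赖」,下面给出一小段假设的示例代码:开头定义的函数在很多行之后才被调用,而多层嵌套的括号也必须在远处正确闭合,这与自然语言中大多较为局部的「预测下一个词」形成对比。

```python
def tax(amount):
    # 远处的函数定义:要在很多行之后才被引用
    return amount * 0.1

def checkout(prices, discount):
    # 多层嵌套的括号需要在远处正确闭合
    subtotal = sum(
        [
            p * (1 - discount)
            for p in prices
        ]
    )
    # 这里引用了开头定义的 tax():属于长距离依赖
    return subtotal + tax(subtotal)

print(checkout([10.0, 20.0], 0.1))
```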
Where do these abilities come from?
Compared to the previous models, the two major differences are instruction tuning and training on code. Specifically:
The ability to respond to human instructions is a direct product of instruction tuning.
The ability of generalization to unseen instructions is a free lunch given by scaling the types of instructions, as is further evidenced by the T0, Flan, and FlanPaLM papers.
The ability of complex reasoning with chain-of-thought is likely to be a magical side product of training on code:
The initial GPT-3 is not trained on code, and it cannot do chain-of-thought.
The text-davinci-001, although being instruction tuned, can do CoT, but the performance is significantly worse, as is reported by the first version of the CoT paper (corrected by Denny Zhou) — so instruction tuning may not be the reason for CoT. This leaves training on code as the number one suspect.
PaLM has 5% code training data, and it can do chain-of-thought.
The code data in the Codex paper is 159G, approximately 28% of the initial GPT-3’s 570G training data. code-davinci-002 and its subsequent variants can do chain-of-thought.
On the HELM evaluation, a massive-scale evaluation performed by Liang et al. (2022), the authors also found that models trained on/for code have strong language reasoning abilities, including the 12B-sized code-cushman-001.
Our work at AI2 also shows that when equipped with complex chains of thought, code-davinci-002 is the SOTA model on important math benchmarks like GSM8K.
As an intuition, think about how procedure-oriented programming is similar to solving tasks step by step, and how object-oriented programming is similar to decomposing complex tasks into simpler ones.
All the above observations are correlations between code and reasoning ability/CoT. Such a correlation is very intriguing to the community and not well understood. However, there is still no hard evidence showing that training on code is absolutely the reason for CoT and complex reasoning. The source of CoT is still an open research problem.
Additionally, long-term dependency might also be a nice side effect of training on code. As is pointed out by Peter Liu: “Next token prediction for language is usually very local, whereas code often requires longer dependencies to do things like close brackets or refer to distant defs.” I would further add: code may also give the model the ability to encode hierarchy, due to inheritance in object-oriented programming. We leave the test of this hypothesis to future work.
另外还要注意一些细节差异:
text-davinci-002 与 code-davinci-002
code-davinci-002 是基础模型,text-davinci-002 是指令微调 code-davinci-002 的产物(见 OpenAI 的文档)。它在以下数据上作了微调:(一)人工标注的指令和期待的输出;(二)由人工标注者选择的模型输出。
当有上下文示例 (in-context example) 的时候,code-davinci-002 更擅长上下文学习;当没有上下文示例 / 零样本的时候,text-davinci-002 在零样本任务完成方面表现更好。从这个意义上说,text-davinci-002 更符合人类的期待(因为对一个任务写上下文示例可能会比较麻烦)。
OpenAI 不太可能故意牺牲了上下文学习的能力换取零样本能力 —— 上下文学习能力的降低更多是指令学习的一个副作用,OpenAI 管这叫对齐税。
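下面用一个假设的小例子对比这两种用法:零样本(直接给自然语言指令)与少样本(先手写上下文示例)。按上文的说法,text-davinci-002 这类指令微调模型更擅长前者,而 code-davinci-002 在给足上下文示例时往往表现更好。示例句子均为演示用途:

```python
# 零样本:直接用自然语言指令描述任务
zero_shot_prompt = "请判断下面这句话的情感是正面还是负面:这家餐厅的服务太慢了。"

# 少样本 / 上下文学习:先手写几个示例,再给出待判断的句子
few_shot_prompt = """句子:这部电影太精彩了。 情感:正面
句子:排队排了两个小时,体验很差。 情感:负面
句子:这家餐厅的服务太慢了。 情感:"""

print(zero_shot_prompt)
print(few_shot_prompt)
```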
001 模型(code-cushman-001 和 text-davinci-001)vs. 002 模型(code-davinci-002 和 text-davinci-002)
001 模型主要是为了做纯代码 / 纯文本任务; 002 模型则深度融合了代码训练和指令微调,代码和文本都行。