Paper walkthrough -- A Survey of Large Language Models
https://zhuanlan.zhihu.com/p/611403556, a summary of the large models available today
What is a language model? A generative model that continues a piece of text or fills in the blanks.
Technically, language modeling (LM) is one of the major approaches to advancing language intelligence of machines.
In general, LM aims to model the generative likelihood of word sequences, so as to predict the probabilities of future (or missing) tokens.
The research of LM has received extensive attention in the literature, which can be divided into four major development stages:
Statistical language models (SLM). Statistical LMs, e.g., n-gram models.
SLMs [4–7] are developed based on statistical learning methods that rose in the 1990s.
The basic idea is to build the word prediction model based on the Markov assumption, e.g., predicting the next word based on the most recent context.
The SLMs with a fixed context length n are also called n-gram language models, e.g., bigram and trigram language models.
SLMs have been widely applied to enhance task performance in information retrieval (IR) [8, 9] and natural language processing (NLP) [10–12].
Neural language models (NLM). Neural-network LMs, e.g., RNNs and word2vec.
NLMs [15–17] characterize the probability of word sequences by neural networks, e.g., recurrent neural networks (RNNs).
Further, word2vec [19, 20] was proposed to build a simplified shallow neural network for learning distributed word representations, which were demonstrated to be very effective across a variety of NLP tasks.
These studies have initiated the use of language models for representation learning (beyond word sequence modeling), having an important impact on the field of NLP.
Pre-trained language models (PLM). Pre-trained LMs such as ELMo, BERT, and GPT-2, which require task-specific fine-tuning.
As an early attempt, ELMo [21] was proposed to capture context-aware word representations by first pre-training a bidirectional LSTM (biLSTM) network (instead of learning fixed word representations) and then fine-tuning the biLSTM network according to specific downstream tasks.
Further, based on the highly parallelizable Transformer architecture [22] with self-attention mechanisms, BERT [23] was proposed by pre-training bidirectional language models with specially designed pre-training tasks on large-scale unlabeled corpora.
These pre-trained context-aware word representations are very effective as general-purpose semantic features, which have largely raised the performance bar of NLP tasks. This work has inspired a large number of follow-up studies, which established the “pre-training and fine-tuning” learning paradigm.
Following this paradigm, a great number of studies on PLMs have been developed, introducing either different architectures [24, 25] (e.g., GPT-2 [26] and BART [24]) or improved pre-training strategies [27–29]. In this paradigm, the PLM often needs to be fine-tuned to adapt to different downstream tasks.
Large language models (LLM). Much larger PLMs, e.g., GPT-3 and PaLM, which exhibit emergent abilities.
Researchers find that scaling PLM (e.g., scaling model size or data size) often leads to an improved model capacity on downstream tasks (i.e., following the scaling law [30]).
A number of studies have explored the performance limit by training an ever larger PLM (e.g., the 175B-parameter GPT-3 and the 540B-parameter PaLM).
Although scaling is mainly conducted in model size (with similar architectures and pre-training tasks), these large-sized PLMs display different behaviors from smaller PLMs (e.g., 330M-parameter BERT and 1.5B-parameter GPT-2) and show surprising abilities (called emergent abilities [31]) in solving a series of complex tasks.
For example, GPT-3 can solve few-shot tasks through in-context learning, whereas GPT-2 cannot do well.
Thus, the research community coins the term “large language models (LLM)” for these large-sized PLMs [32–35].
A remarkable application of LLMs is ChatGPT, which adapts LLMs from the GPT series for dialogue and exhibits an impressive ability to converse with humans.
Differences between PLMs and LLMs: emergent abilities, prompting as the primary interaction mode, and the extensive engineering experience required to train LLMs.
First, LLMs display some surprising emergent abilities that may not be observed in previous smaller PLMs. These abilities are key to the performance of language models on complex tasks, making AI algorithms unprecedentedly powerful and effective.
Second, LLMs would revolutionize the way that humans develop and use AI algorithms. Unlike small PLMs, the major approach to accessing LLMs is through the prompting interface (e.g., GPT-4 API). Humans have to understand how LLMs work and format their tasks in a way that LLMs can follow.
Third, the development of LLMs no longer draws a clear distinction between research and engineering. The training of LLMs requires extensive practical experiences in large-scale data processing and distributed parallel training. To develop capable LLMs, researchers have to solve complicated engineering issues, working with engineers or being engineers.
The biggest impact of LLMs this time is that they have prompted people to rethink the possibility of AGI. The generalization ability of LLMs beyond NLP has led many to believe that GPT-4 may already represent an early stage of AGI.
Nowadays, LLMs are posing a significant impact on the AI community, and the advent of ChatGPT and GPT-4 leads to the rethinking of the possibilities of artificial general intelligence (AGI).
OpenAI has published a technical article entitled “Planning for AGI and beyond”, which discusses the short-term and long-term plans to approach AGI [40], and a more recent paper has argued that GPT-4 might be considered as an early version of an AGI system [41].
OVERVIEW
Background for LLMs
Typically, large language models (LLMs) refer to Transformer language models that contain hundreds of billions (or more) of parameters, which are trained on massive text data [32], such as GPT-3 [55], PaLM [56], Galactica [35], and LLaMA [57]. LLMs exhibit strong capacities to understand natural language and solve complex tasks (via text generation). To have a quick understanding of how LLMs work, this part introduces the basic background for LLMs, including scaling laws, emergent abilities and key techniques.
LLMs share the same architecture as smaller models; the difference is simply a dramatic increase in model size and data size.
The survey therefore introduces the KM scaling law and the Chinchilla scaling law to quantify the relationship between scale and capacity.
Scaling Laws for LLMs. Currently, LLMs are mainly built upon the Transformer architecture [22], where multi-head attention layers are stacked in a very deep neural network.
Existing LLMs adopt similar Transformer architectures and pre-training objectives (e.g., language modeling) as small language models.
However, LLMs largely scale up the model size, data size, and total compute (by orders of magnitude).
Extensive research has shown that scaling can largely improve the model capacity of LLMs [26, 55, 56].
Thus, it is useful to establish a quantitative approach to characterizing the scaling effect. Next, we introduce two representative scaling laws for Transformer language models [30, 34].
KM scaling law. In 2020, Kaplan et al. [30] (the OpenAI team) first proposed to model the power-law relationship of model performance with respect to three major factors, namely model size (N), dataset size (D), and the amount of training compute (C), for neural language models.
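For reference, the KM scaling law models the language-modeling loss L as a power law in each factor when the other two are not bottlenecks. A hedged sketch of the functional form (the constants N_c, D_c, C_c and the exponents are fitted empirically in [30], with the exponents roughly in the 0.05–0.1 range):

$$L(N)=\left(\frac{N_c}{N}\right)^{\alpha_N},\qquad L(D)=\left(\frac{D_c}{D}\right)^{\alpha_D},\qquad L(C)=\left(\frac{C_c}{C}\right)^{\alpha_C}$$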
The survey gives a definition of emergent abilities (EA) and lists several notable ones:
ICL: completing a task from the examples given in context, without any additional training; its effectiveness depends on the specific downstream task.
Instruction tuning: fine-tuning on a mixture of multi-task datasets; the model can still solve new, unseen tasks, showing strong generalization.
CoT: the ability to reason step by step.
Emergent Abilities of LLMs. In the literature [31], emergent abilities of LLMs are formally defined as “the abilities that are not present in small models but arise in large models”, which is one of the most prominent features that distinguish LLMs from previous PLMs. It further introduces a notable characteristic when emergent abilities occur [31]: performance rises significantly above random when the scale reaches a certain level. By analogy, such an emergent pattern has close connections with the phenomenon of phase transition in physics [31, 58]. In principle, emergent abilities can be defined in relation to some complex tasks [31, 59], while we are more concerned with general abilities that can be applied to solve a variety of tasks.
Here, we briefly introduce three typical emergent abilities for LLMs and representative models that possess such an ability.
In-context learning. The in-context learning (ICL) ability is formally introduced by GPT-3 [55]: assuming that the language model has been provided with a natural language instruction and/or several task demonstrations, it can generate the expected output for the test instances by completing the word sequence of input text, without requiring additional training or gradient update.
Among the GPT-series models, the 175B GPT-3 model exhibited a strong ICL ability in general, but not the GPT-1 and GPT-2 models. However, such an ability also depends on the specific downstream task.
For example, the ICL ability can emerge on arithmetic tasks (e.g., 3-digit addition and subtraction) for the 13B GPT-3, whereas even the 175B GPT-3 cannot work well on the Persian QA task.
Instruction following. By fine-tuning with a mixture of multi-task datasets formatted via natural language descriptions (called instruction tuning), LLMs are shown to perform well on unseen tasks that are also described in the form of instructions [28, 61, 62]. With instruction tuning, LLMs are enabled to follow the task instructions for new tasks without using explicit examples, thus having an improved generalization ability. According to the experiments in [62], instruction-tuned LaMDA-PT [63] started to significantly outperform the untuned one on unseen tasks when the model size reached 68B, but not for 8B or smaller model sizes. A recent study [64] found that a model size of 62B is at least required for PaLM to perform well on various tasks in four evaluation benchmarks (i.e., MMLU, BBH, TyDiQA and MGSM), though a much smaller size might suffice for some specific tasks (e.g., MMLU).
Step-by-step reasoning. For small language models, it is usually difficult to solve complex tasks that involve multiple reasoning steps, e.g., mathematical word problems.
However, with the chain-of-thought (CoT) prompting strategy [33], LLMs can solve such tasks by utilizing a prompting mechanism that involves intermediate reasoning steps for deriving the final answer.
This ability is speculated to be potentially obtained by training on code [33, 47]. An empirical study [33] has shown that CoT prompting can bring performance gains (on arithmetic reasoning benchmarks) when applied to PaLM and LaMDA variants with a model size larger than 60B, while its advantage over standard prompting becomes more evident when the model size exceeds 100B. Besides, the performance improvement with CoT prompting also seems to vary across tasks, e.g., GSM8K > MAWPS > SVAMP for PaLM [33].
What are the core techniques behind LLMs?
Scaling: parameter counts reaching 175B or even 540B.
Training: how to train on massive corpora efficiently and at low cost; LLM training is still very expensive today.
Ability eliciting: using instruction tuning, ICL, or CoT to elicit the latent abilities of LLMs.
Alignment tuning: typified by InstructGPT, using human feedback and reinforcement learning to make LLMs conform to human values.
Tools: using various plugins to complement the capabilities of LLMs.
Key Techniques for LLMs.
Scaling. As discussed in previous parts, there exists an evident scaling effect in Transformer language models: larger model/data sizes and more training compute typically lead to an improved model capacity [30, 34]. As two representative models, GPT-3 and PaLM explored the scaling limits by increasing the model size to 175B and 540B, respectively.
Training. Due to the huge model size, it is very challenging to successfully train a capable LLM.
Distributed training algorithms are needed to learn the network parameters of LLMs, in which various parallel strategies are often jointly utilized.
To support distributed training, several optimization frameworks have been released to facilitate the implementation and deployment of parallel algorithms, such as DeepSpeed [65] and Megatron-LM [66–68]. Besides, optimization tricks are also important for training stability and model performance, e.g., restart to overcome training loss spike [56] and mixed precision training [69].
More recently, GPT-4 [46] proposes to develop special infrastructure and optimization methods that reliably predict the performance of large models with much smaller models.
Ability eliciting. After being pre-trained on large-scale corpora, LLMs are endowed with potential abilities as general-purpose task solvers.
However, these abilities might not be explicitly exhibited when LLMs perform some specific tasks. As a technical approach, it is useful to design suitable task instructions or specific in-context learning strategies to elicit such abilities. For instance, chain-of-thought prompting has been shown to be useful for solving complex reasoning tasks by including intermediate reasoning steps. Besides, we can further perform instruction tuning on LLMs with task descriptions expressed in natural language, to improve the generalizability of LLMs on unseen tasks. Note that these techniques mainly correspond to the emergent abilities of LLMs, which may not show the same effect on small language models.
Alignment tuning. Since LLMs are trained to capture the data characteristics of pre-training corpora (including both high-quality and low-quality data), they are likely to generate toxic, biased, or even harmful content for humans. It is necessary to align LLMs with human values, e.g., helpful, honest, and harmless. For this purpose, InstructGPT [61] designs an effective tuning approach that enables LLMs to follow the expected instructions, which utilizes the technique of reinforcement learning with human feedback [61, 70].
Tools manipulation. In essence, LLMs are trained as text generators over massive plain text corpora, thus performing less well on the tasks that are not best expressed in the form of text (e.g., numerical computation). Besides, their capacities are also limited to the pre-training data, e.g., the inability to capture up-to-date information. To tackle these issues, a recently proposed technique is to employ external tools to compensate for the deficiencies of LLMs [71, 72]. For example, LLMs can utilize a calculator for accurate computation [71] and employ search engines to retrieve unknown information [72]. More recently, ChatGPT has enabled the mechanism of using external plugins (existing or newly created apps), which serve, by analogy, as the “eyes and ears” of LLMs. Such a mechanism can broadly expand the scope of capacities for LLMs.
Technical Evolution of GPT-series Models
OpenAI had the idea of building intelligent systems with language models very early on, back in the RNN days; nowadays RNNs already sound like ancient weaponry.
Early Explorations. According to one interview with Ilya Sutskever (a co-founder and chief scientist of OpenAI), the idea of approaching intelligent systems with language models was already explored in the early days of OpenAI, while it was attempted with recurrent neural networks (RNN) [104]. With the advent of the Transformer, OpenAI developed two initial GPT models, namely GPT-1 [105] and GPT-2 [26], which can be considered as the foundation of the more powerful models that followed, i.e., GPT-3 and GPT-4.
GPT-1: after Google introduced the Transformer model in 2017, OpenAI released GPT-1 in 2018.
GPT-1. In 2017, the Transformer model [22] was introduced by Google, and the OpenAI team quickly adapted their language modeling work to this new neural network architecture.
They released the first GPT model in 2018, i.e., GPT-1 [105], and coined the abbreviation term GPT as the model name, standing for Generative Pre-Training.
GPT-1 was developed based on a generative, decoder-only Transformer architecture, and adopted a hybrid approach of unsupervised pretraining and supervised fine-tuning.
GPT-1 has set up the core architecture for the GPT-series models and established the underlying principle to model natural language text, i.e., predicting the next word.
GPT-2: the same architecture as GPT-1, but scaled to 1.5B parameters and trained on a large corpus of webpages.
In fact, GPT-2's biggest difference is the idea of a general-purpose model: instead of fine-tuning, it solves all kinds of tasks directly via text continuation.
GPT-2. Following a similar architecture of GPT-1, GPT-2 [26] increased the parameter scale to 1.5B, which was trained with a large webpage dataset WebText.
As claimed in the paper of GPT-2, it sought to perform tasks via unsupervised language modeling, without explicit fine-tuning using labeled data.
To motivate the approach, they introduced a probabilistic form for multi-task solving, i.e., p(output|input,task) (similar approaches have been adopted in [106]), which predicts the output conditioned on the input and task information. To model this conditional probability, language text can be naturally employed as a unified way to format input, output and task information.
In this way, the process of solving a task can be cast as a word prediction problem for generating the solution text.
GPT-3: released in 2020, scaling the parameters to 175B and introducing ICL in a few-shot or zero-shot manner.
The arrival of GPT-3 marks the true transition from the PLM era to the LLM era.
GPT-3. GPT-3 [55] was released in 2020, which scaled the model parameters to an ever larger size of 175B. In the GPT-3’s paper, it formally introduced the concept of in-context learning (ICL), which utilizes LLMs in a few-shot or zero-shot way. ICL can teach (or instruct) LLMs to understand the tasks in the form of natural language text.
With ICL, the pre-training and utilization of LLMs converge to the same language modeling paradigm: pre-training predicts the following text sequence conditioned on the context, while ICL predicts the correct task solution, which can be also formatted as a text sequence, given the task description and demonstrations.
GPT-3 demonstrates excellent performance not only on a variety of NLP tasks but also on a number of specially designed tasks that require the abilities of reasoning or domain adaptation.
Overall, GPT-3 can be viewed as a remarkable landmark in the journey evolving from PLMs to LLMs. It has empirically proved that scaling the neural networks to a significant size can lead to a huge increase in model capacity.
GPT-3.5: two main improvements built on GPT-3:
Training on code data, which not only equips the LLM to generate code but also substantially improves its reasoning and CoT abilities.
Reinforcement learning from human feedback (RLHF), typified by InstructGPT.
Training on code data. A major limitation of the original GPT-3 model (pre-trained on plain text) lies in the lack of reasoning ability on complex tasks, e.g., completing the code and solving math problems.
To enhance this ability, Codex [89] was introduced by OpenAI in July 2021: a GPT model fine-tuned on a large corpus of GitHub code. Codex was shown to solve very difficult programming problems, and also led to a significant performance improvement in solving math problems [109].
Further, a contrastive approach [110] to training text and code embeddings was reported in January 2022, which was shown to improve a series of related tasks (i.e., linear-probe classification, text search and code search).
Actually, the GPT-3.5 models are developed based on a code-based GPT model (i.e., code-davinci-002), which indicates that training on code data is a very useful practice to improve the model capacity of GPT models, especially the reasoning ability. Besides, there is also speculation that training on code data can greatly increase the chain-of-thought prompting abilities of LLMs [47], while it is still worth further investigation with more thorough verification.
Human alignment. The related research of human alignment can be dated back to the year 2017 (or earlier) for OpenAI: a blog article entitled “learning from human preferences” was posted on the OpenAI blog describing a work that applied reinforcement learning (RL) to learn from the preference comparisons annotated by humans [70] (similar to the reward training step in the aligning algorithm of InstructGPT in Figure 6). Shortly after the release of this RL paper [70], the paper of the Proximal Policy Optimization (PPO) [111] was published in July 2017, which now has been the foundational RL algorithm for learning from human preferences [61]. Later in January 2020, GPT-2 was fine-tuned using the aforementioned RL algorithms [70, 111], which leveraged human preferences to improve the capacities of GPT-2 on NLP tasks. In the same year, another work [112] trained a summarization model for optimizing human preferences in a similar way. Based on this prior work, InstructGPT [61] was proposed in January 2022 to improve the GPT-3 model for human alignment, which formally established a three-stage reinforcement learning from human feedback (RLHF) algorithm. Note that it seems that the wording of “instruction tuning” has seldom been used in OpenAI’s paper and documentation, which is substituted by supervised fine-tuning on human demonstrations (i.e., the first step of the RLHF algorithm [61]). In addition to improving the instruction following capacity, the RLHF algorithm is particularly useful to mitigate the issues of generating harm or toxic content for LLMs, which is key to the safe deployment of LLMs in practice. OpenAI describes their approach to alignment research in a technical article [113], which has summarized three promising directions: “training AI systems to use human feedback, to assist human evaluation and to do alignment research”. These enhancement techniques lead to improved GPT-3 models with stronger capacities, which are called GPT-3.5 models by OpenAI (see the discussion about the OpenAI API in Section 3.1).
ChatGPT is essentially the same as InstructGPT, with additional optimization for dialogue.
ChatGPT. In November 2022, OpenAI released the conversation model ChatGPT, based on the GPT models (GPT-3.5 and GPT-4).
As the official blog article introduced [114], ChatGPT was trained in a similar way as InstructGPT (called “a sibling model to InstructGPT” in the original post), while specially optimized for dialogue.
They reported a difference between the training of ChatGPT and InstructGPT in the data collection setup: human-generated conversations (playing both the roles of user and AI) are combined with the InstructGPT dataset in a dialogue format for training ChatGPT. ChatGPT exhibited superior capacities in communicating with humans: possessing a vast store of knowledge, skill at reasoning on mathematical problems, tracing the context accurately in multi-turn dialogues, and aligning well with human values for safe use. Later on, the plugin mechanism has been supported in ChatGPT, which further extends the capacities of ChatGPT with existing tools or apps. So far, it appears to be the most powerful chatbot in AI history. The launch of ChatGPT has had a significant impact on AI research, shedding light on the exploration of human-like AI systems.
GPT-4: from text to multimodal input, with stronger capabilities and better safety.
GPT-4. As another remarkable progress, GPT-4 [46] was released in March 2023, which extended the text input to multimodal signals.
Overall, GPT-4 has stronger capacities in solving complex tasks than GPT-3.5, showing a large performance improvement on many evaluation tasks.
A recent study [41] investigated the capacities of GPT-4 by conducting qualitative tests with human-generated problems, spanning a diverse range of difficult tasks, and showed that GPT-4 achieves superior performance compared to prior GPT models such as ChatGPT.
Furthermore, GPT-4 responds more safely to malicious or provocative queries, due to a six-month iterative alignment (with an additional safety reward signal in the RLHF training).
In the technical report, OpenAI has emphasized how to safely develop GPT-4 and applied a number of intervention strategies to mitigate the possible issues of LLMs, such as hallucinations, privacy and overreliance. For example, they introduced the mechanism called red teaming [115] to reduce harmful or toxic content generation. As another important aspect, GPT-4 has been developed on a well-established deep learning infrastructure with improved optimization methods. They introduced a new mechanism called predictable scaling that can accurately predict the final performance with a small proportion of compute during model training.
RESOURCES OF LLMS
Publicly Available Model Checkpoints or APIs
Models at the tens-of-billions (10B+) scale
Models with Tens of Billions of Parameters.
Most of the models in this category have a parameter scale ranging from 10B to 20B, except LLaMA [57] (containing 65B parameters in the largest version) and NLLB [82] (containing 54.5B parameters in the largest version). Other models within this range include mT5 [74], PanGu-α [75], T0 [28], GPT-NeoX-20B [78], CodeGen [77], UL2 [80], Flan-T5 [64], and mT0 [84].
Among them, Flan-T5 (11B version) can serve as a premier model for research on instruction tuning, since it explores the instruction tuning from three aspects [64]: increasing the number of tasks, scaling the model size, and fine-tuning with chain-of-thought prompting data.
Besides, CodeGen (11B version), as an autoregressive language model designed for generating code, can be considered as a good candidate for exploring the code generation ability.
It also introduces a new benchmark MTPB [77] specially for multi-turn program synthesis, which is composed of 115 expert-generated problems. To solve these problems, LLMs need to acquire sufficient programming knowledge (e.g., math, array operations, and algorithms).
As for multilingual tasks, mT0 (13B version) might be a good candidate model, which has been fine-tuned on multilingual tasks with multilingual prompts.
Furthermore, PanGu-α [75] shows good performance in Chinese downstream tasks in zero-shot or few-shot settings, which is developed based on the deep learning framework MindSpore [117]. Note that PanGu-α [75] holds multiple versions of models (up to 200B parameters), while the largest public version has 13B parameters.
As a more recent release, LLaMA (65B version) [57], which contains approximately five times as many parameters as other models, has exhibited superior performance in tasks related to instruction following. Due to the openness and effectiveness, LLaMA has attracted significant attention from the research community, and many efforts [118–121] have been devoted to fine-tuning or continually pre-training its different model versions for implementing new models or tools.
Typically, pre-training models at this scale require hundreds or even thousands of GPUs or TPUs. For instance, GPT-NeoX-20B uses 12 Supermicro servers, each equipped with 8 NVIDIA A100-SXM4-40GB GPUs, while LLaMA utilizes 2,048 A100-80G GPUs as reported in their original publications. To accurately estimate the computation resources needed, it is suggested to use the metrics measuring the number of involved computations such as FLOPS (i.e., FLoating point number Operations Per Second) [30].
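As a back-of-the-envelope illustration (not from the survey), a common rule of thumb in the scaling-law literature is that training compute is roughly 6 × parameters × tokens; a minimal sketch, with the LLaMA-65B numbers above used only as an example input:

```python
def approx_training_flops(num_params: float, num_tokens: float) -> float:
    """Rough training-compute estimate C ~= 6 * N * D (forward + backward),
    a common rule of thumb from the scaling-law literature [30]."""
    return 6.0 * num_params * num_tokens

# Example: a 65B-parameter model trained on 1.4T tokens.
flops = approx_training_flops(65e9, 1.4e12)
print(f"~{flops:.2e} FLOPs")  # on the order of 5.5e23 floating point operations
```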
LLMs at the hundreds-of-billions (100B+) scale require thousands of GPUs to train.
Models with Hundreds of Billions of Parameters.
For models in this category, only a handful of models have been publicly released. For example, OPT [81], OPT-IML [85], BLOOM [69], and BLOOMZ [84] have nearly the same number of parameters as GPT-3 (175B version), while GLM [83] and Galactica [35] have 130B and 120B parameters, respectively.
Among them, OPT (175B version) has been specially motivated for open sharing, which aims to enable researchers to carry out reproducible research at scale.
For research in cross-lingual generalization, BLOOM (176B version) and BLOOMZ (176B version) can be used as base models, due to the competence in multilingual language modeling tasks.
Among these models, OPT-IML has been tuned with instructions, which makes it a good candidate for studying the effect of instruction tuning.
Models of this scale typically require thousands of GPUs or TPUs to train. For instance, OPT (175B version) used 992 A100-80GB GPUs, while GLM (130B version) used a cluster of 96 NVIDIA DGX-A100 (8x40G) GPU nodes.
OpenAI's APIs
Public API of LLMs.
Instead of directly using the model copies, APIs provide a more convenient way for common users to use LLMs, without the need of running the model locally.
As a representative interface for using LLMs, the APIs for the GPT-series models [46, 55, 61, 89] have been widely used for both academia and industry.
OpenAI has provided seven major interfaces to the models in GPT-3 series: ada, babbage, curie, davinci (the most powerful version in GPT-3 series), text-ada-001, text-babbage-001, and text-curie-001.
Among them, the first four interfaces can be further fine-tuned on the host server of OpenAI.
In particular, babbage, curie, and davinci correspond to the GPT-3 (1B), GPT-3 (6.7B), and GPT-3 (175B) models, respectively [55].
Besides, there are also two APIs related to Codex [89], called code-cushman-001 (a powerful and multilingual version of the Codex (12B) [89]) and code-davinci-002.
Further, GPT-3.5 series include one base model code-davinci-002 and three enhanced versions, namely text-davinci-002, text-davinci-003, and gpt-3.5-turbo-0301.
It is worth noting that gpt-3.5-turbo-0301 is the interface to invoke ChatGPT.
More recently, OpenAI has also released the corresponding APIs for GPT-4, including gpt-4, gpt-4-0314, gpt-4-32k, and gpt-4-32k-0314.
Overall, the choice of API interfaces depends on the specific application scenarios and response requirements. The detailed usage can be found on their project websites.
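As a quick illustration (not part of the survey), here is a minimal sketch of invoking these interfaces with the legacy openai Python SDK (the pre-1.0 interface); the model names follow the list above, the API key is a placeholder, and the prompt content is made up:

```python
import openai

openai.api_key = "sk-..."  # placeholder key

# Completion-style interface (e.g., text-davinci-003).
completion = openai.Completion.create(
    model="text-davinci-003",
    prompt="Explain in-context learning in one sentence.",
    max_tokens=64,
)
print(completion["choices"][0]["text"])

# Chat-style interface (e.g., gpt-3.5-turbo-0301 or gpt-4).
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",
    messages=[{"role": "user", "content": "Explain in-context learning in one sentence."}],
)
print(chat["choices"][0]["message"]["content"])
```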
Commonly Used Corpora
- GPT-3 (175B) [55] was trained on a mixed dataset of 300B tokens, including CommonCrawl [132], WebText2 [55], Books1 [55], Books2 [55], and Wikipedia [128].
- PaLM (540B) [56] uses a pre-training dataset of 780B tokens, which is sourced from social media conversations, filtered webpages, books, Github, multilingual Wikipedia, and news.
- LLaMA [57] extracts training data from various sources, including CommonCrawl, C4 [73], Github, Wikipedia, books, ArXiv, and StackExchange. The training data size for LLaMA (7B) and LLaMA (13B) is 1.0T tokens, while 1.4T tokens are used for LLaMA (33B) and LLaMA (65B).
Library Resource
- Transformers [135] is an open-source Python library for building models using the Transformer architecture, which is developed and maintained by Hugging Face. It has a simple and user-friendly API, making it easy to use and customize various pre-trained models. It is a powerful library with a large and active community of users and developers who regularly update and improve the models and algorithms (a brief usage sketch follows this list).
- DeepSpeed [65] is a deep learning optimization library (compatible with PyTorch) developed by Microsoft, which has been used to train a number of LLMs, such as MT- NLG [97] and BLOOM [69]. It provides the support of various optimization techniques for distributed training, such as memory optimization (ZeRO technique, gradient checkpointing), and pipeline parallelism.
- Megatron-LM [66–68] is a deep learning library developed by NVIDIA for training large-scale language models. It also provides rich optimization techniques for distributed training, including model and data parallelism, mixed-precision training, and FlashAttention. These optimization techniques can largely improve the training efficiency and speed, enabling efficient distributed training across GPUs.
- JAX [136] is a Python library for high-performance machine learning algorithms developed by Google, allowing users to easily perform computations on arrays with hardware acceleration (e.g., GPU or TPU). It enables efficient computation on various devices and also supports several featured functions, such as automatic differentiation and just-in-time compilation.
- Colossal-AI [137] is a deep learning library developed by HPC-AI Tech for training large-scale AI models. It is implemented based on PyTorch and supports a rich collection of parallel training strategies. Furthermore, it can also optimize heterogeneous memory management with methods proposed by PatrickStar [138]. Recently, a ChatGPT-like model called ColossalChat [121] has been publicly released with two versions (7B and 13B), which are developed using Colossal-AI based on LLaMA [57].
- BMTrain [139] is an efficient library developed by OpenBMB for training models with large-scale parameters in a distributed manner, which emphasizes code simplicity, low resource, and high availability. BMTrain has already incorporated several common LLMs (e.g., Flan-T5 [64] and GLM [83]) into its ModelCenter, where developers can use these models directly.
- FastMoE [140] is a specialized training library for MoE (i.e., mixture-of-experts) models. It is developed based on PyTorch, prioritizing both efficiency and user-friendliness in its design. FastMoE simplifies the process of transferring Transformer models to MoE models and supports both data parallelism and model parallelism during training.
Besides the above library resources, existing deep learning frameworks (e.g., PyTorch [141], TensorFlow [142], MXNet [143], PaddlePaddle [144], MindSpore [117] and OneFlow [145]) have also provided the support for parallel algorithms, which are commonly used for training large- scale models.
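The usage sketch promised above: a minimal, hedged example of loading a pre-trained causal LM with the Transformers library and generating text ("gpt2" is used only as an example checkpoint available on the Hugging Face Hub):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenize a prompt and greedily generate a continuation.
inputs = tokenizer("In-context learning is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```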
PRE-TRAINING
Data Collection
Data Source
The source of pre-training corpus can be broadly categorized into two types: general data and specialized data.
General data, such as webpages, books, and conversational text, is utilized by most LLMs [55, 56, 81] due to its large, diverse, and accessible nature, which can enhance the language modeling and generalization abilities of LLMs. In light of the impressive generalization capabilities exhibited by LLMs, there are also studies that extend their pre-training corpus to more specialized datasets, such as multilingual data, scientific data, and code, endowing LLMs with specific task-solving capabilities [35, 56, 77]. In what follows, we describe these two types of pre-training data sources and their effects on LLMs.
Data Preprocessing
Architecture
In general, the mainstream architectures of existing LLMs can be roughly categorized into three major types, namely encoder-decoder, causal decoder, and prefix decoder.
The vanilla Transformer architecture has both an encoder and a decoder; among PLMs, T5 and BART use it, but very few LLMs adopt this architecture.
Encoder-decoder Architecture. The vanilla Transformer model is built on the encoder-decoder architecture [22], which consists of two stacks of Transformer blocks as the encoder and decoder, respectively. The encoder adopts stacked multi-head self-attention layers to encode the input sequence for generating its latent representations, while the decoder performs cross-attention on these representations and autoregressively generates the target sequence.
Encoder-decoder PLMs (e.g., T5 [73] and BART [24]) have shown effectiveness on a variety of NLP tasks. So far, there are only a small number of LLMs that are built based on the encoder-decoder architecture, e.g., Flan-T5.
The mainstream architecture: the decoder uses a unidirectional attention mask and only attends to past context; the typical representatives are the GPT-series models, built as a stack of decoder layers.
Causal Decoder Architecture. The causal decoder architecture incorporates the unidirectional attention mask, to guarantee that each input token can only attend to the past tokens and itself.
The input and output tokens are processed in the same fashion through the decoder.
As representative language models of this architecture, the GPT-series models [26, 55, 105] are developed based on the causal-decoder architecture. In particular, GPT-3 [55] has successfully demonstrated the effectiveness of this architecture, also showing an amazing in-context learning capability of LLMs. Interestingly, GPT-1 [105] and GPT-2 [26] do not exhibit such superior abilities as those in GPT-3, and it seems that scaling plays an important role in increasing the model capacity of this model architecture.
So far, causal decoders have been widely adopted as the architecture of LLMs by various existing LLMs, such as OPT [81], BLOOM [69], and Gopher [59]. Note that both the causal decoder and the prefix decoder discussed next belong to decoder-only architectures. In the existing literature, “decoder-only architecture” mainly refers to the causal decoder architecture unless otherwise specified.
The difference lies mainly in the attention pattern: the causal decoder is fully unidirectional, whereas the prefix decoder applies bidirectional attention over the input (prefix) and unidirectional attention over the output.
Prefix Decoder Architecture. The prefix decoder architecture (a.k.a., non-causal decoder [169]) revises the masking mechanism of causal decoders, to enable performing bidirectional attention over the prefix tokens [170] and unidirectional attention only on generated tokens. In this way, like the encoder-decoder architecture, the prefix decoders can bidirectionally encode the prefix sequence and autoregressively predict the output tokens one by one, where the same parameters are shared during encoding and decoding. Instead of pre-training from scratch, a practical suggestion is to continually train causal decoders and then convert them into prefix decoders for accelerating convergence [29], e.g., U-PaLM [102] is derived from PaLM [56]. Existing representative LLMs based on prefix decoders include GLM-130B [83] and U-PaLM [102].
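A minimal sketch (PyTorch, my own illustration rather than any specific model's code) contrasting the two decoder-only masking schemes: a causal decoder masks all future positions, while a prefix (non-causal) decoder additionally allows bidirectional attention within the first `prefix_len` tokens:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True = attention allowed; each token sees itself and the past only.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def prefix_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    mask = causal_mask(seq_len)
    # Prefix tokens may additionally attend to each other bidirectionally.
    mask[:prefix_len, :prefix_len] = True
    return mask

print(causal_mask(5).int())
print(prefix_mask(5, prefix_len=3).int())
```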
Since the launch of Transformer [22], various improvements have been proposed to enhance its training stability, performance, and computational efficiency.
In this part, we will discuss the corresponding configurations for four major parts of the Transformer, including normalization, position embeddings, activation functions, and attention and bias.
Normalization. Training instability is a challenging issue for pre-training LLMs.
To alleviate this problem, layer normalization (Layer Norm, LN) [173] is widely employed in Transformer architectures.
The position of LN is vital to the performance of LLMs. While the initial Transformer [22] uses post-LN, most LLMs employ pre-LN for more stable training in spite of decreasing performance [182].
Activation Functions. To obtain good performance, activation functions also need to be properly set in feed-forward networks.
In existing LLMs, GeLU activations [185] are widely used.
Besides, in the latest LLMs (e.g., PaLM and LaMDA), variants of GLU activation [179, 186] have also been utilized, especially the SwiGLU and GeGLU variants, which often achieve better performance in practice [183].
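A hedged sketch of a SwiGLU feed-forward block, one of the GLU variants mentioned above; the exact hidden dimensions and projection layout differ across implementations, so the sizes below are only illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, hidden: int, intermediate: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: Swish(x W_gate) element-wise multiplied by x W_up, then projected down.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

ffn = SwiGLUFFN(hidden=768, intermediate=2048)
print(ffn(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```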
Position Embeddings. Since the self-attention modules in Transformer are permutation equivariant, position embeddings are employed to inject absolute or relative position information for modeling sequences.
Model Training
ADAPTATION TUNING OF LLMS
In this section, we introduce two major approaches to adapting pre-trained LLMs, namely instruction tuning and alignment tuning.
The former approach mainly aims to enhance (or unlock) the abilities of LLMs, while the latter approach aims to align the behaviors of LLMs with human values or preferences.
From here on, the survey covers techniques for fine-tuning pre-trained models.
Instruction Tuning
In essence, instruction tuning is the approach to fine-tuning pre-trained LLMs on a collection of formatted instances in the form of natural language [62], which is highly related to supervised fine-tuning [61] and multi-task prompted training [28]. In order to perform instruction tuning, we first need to collect or construct instruction-formatted instances. Then, we employ these formatted instances to fine-tune LLMs in a supervised learning way (e.g., training with the sequence-to-sequence loss). After instruction tuning, LLMs can demonstrate superior abilities to generalize to unseen tasks [28, 62, 64], even in a multilingual setting [84].
A recent survey [214] presents a systematic overview of the research on instruction tuning. In comparison to that, we mainly focus on the effect of instruction tuning on LLMs and provide detailed guidelines or strategies for instance collection and tuning. Besides, we also discuss the use of instruction tuning for satisfying the real needs of users, which has been widely applied in existing LLMs, e.g., InstructGPT [61] and GPT-4 [46].
How to construct instructions: Table 6 lists ready-made datasets, and two construction methods are described below.
One method starts from existing labeled NLP datasets and converts them into the task format.
The other collects real human needs, e.g., data gathered from the OpenAI API, QA websites, and chat rooms.
Formatted Instance Construction
Generally, an instruction-formatted instance consists of a task description (called an instruction), an input-output pair, and a small number of demonstrations (optional).
As important public resources, existing studies have released a large number of labeled data formatted in natural language (see the list of available resources in Table 6).
Next, we introduce two major methods for constructing formatted instances (see an illustration in Figure 5) and then discuss several key factors for instance construction.
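To make the format concrete, here is an illustrative instruction-formatted instance; the field names and content are hypothetical, since each dataset in Table 6 uses its own schema:

```python
# One instruction-formatted instance: task description, optional demonstrations,
# and an input-output pair (all content is made up for illustration).
instance = {
    "instruction": "Answer the following question about geography.",
    "demonstrations": [
        {"input": "What is the capital of France?", "output": "Paris"},
    ],
    "input": "What is the capital of Japan?",
    "output": "Tokyo",
}
```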
The Effect of Instruction Tuning
Performance Improvement. Despite being tuned on a moderate number of instances, instruction tuning has become an important way to improve or unlock the abilities of LLMs [64].
Recent studies have experimented with language models in multiple scales (ranging from 77M to 540B), showing that the models of different scales can all benefit from instruction tuning [64, 217], yielding improved performance as the parameter scale increases [84].
Task Generalization. Instruction tuning encourages the model to understand natural language instructions for task completion. It endows LLMs with the ability (often considered as an emergent ability) to follow human instructions [31] to perform specific tasks without demonstrations, even on unseen tasks [64].
Alignment Tuning
How to make LLMs controllable: mainly via reinforcement learning, i.e., RLHF.
It consists of three parts: an LM to be aligned, a reward model that learns from feedback, and an RL algorithm.
Background. LLMs have shown remarkable capabilities in a wide range of NLP tasks [55, 56, 62, 81]. However, these models may sometimes exhibit unintended behaviors, e.g., fabricating false information, pursuing inaccurate objectives, and producing harmful, misleading, and biased expressions [61, 222].
To avert these unexpected behaviors, human alignment has been proposed to make LLMs act in line with human expectations [61, 100]. However, unlike the original pre-training and adaptation tuning (e.g., instruction tuning), such an alignment requires considering very different criteria (e.g., helpfulness, honesty, and harmlessness).
Reinforcement Learning from Human Feedback
To align LLMs with human values, reinforcement learning from human feedback (RLHF) [70, 226] has been proposed to fine-tune LLMs with the collected human feedback data, which is useful to improve the alignment criteria (e.g., helpfulness, honesty, and harmlessness).
RLHF employs reinforcement learning (RL) algorithms (e.g., Proximal Policy Optimization (PPO) [111]) to adapt LLMs to human feedback by learning a reward model.
Such an approach incorporates humans in the training loop for developing well-aligned LLMs, as exemplified by InstructGPT [61].
RLHF System. The RLHF system mainly comprises three key components: a pre-trained LM to be aligned, a reward model learning from human feedback, and an RL algorithm training the LM.
The process consists of the following steps:
Step 1: supervised learning, e.g., instruction tuning.
Step 2: reward model. Human annotators label the LM's outputs, and these labels are used to train an RM that can rank the outputs.
Step 3: train the LM with reinforcement learning; the key elements of the RL setup are given below.
Supervised fine-tuning. To make the LM initially perform desired behaviors, it usually needs to collect a supervised dataset containing input prompts (instruction) and desired outputs for fine-tuning the LM. These prompts and outputs can be written by human labelers for some specific tasks
while ensuring the diversity of tasks. For example, InstructGPT [61] asks human labelers to compose prompts (e.g., “List five ideas for how to regain enthusiasm for my career”) and desired outputs for several generative tasks such as open QA, brainstorming, chatting, and rewriting. Note that the first step is optional in specific settings or scenarios.
Reward model training. The second step is to train the RM using human feedback data.
Specifically, we employ the LM to generate a certain number of output texts using sampled prompts (from either the supervised dataset or the human-generated prompt) as input. We then invite human labelers to annotate the preference for these pairs. The annotation process can be conducted in multiple forms, and a common approach is to annotate by ranking the generated candidate texts, which can reduce the inconsistency among annotators. Then, the RM is trained to predict the human-preferred output. In InstructGPT, labelers rank model-generated outputs from best to worst, and the RM (i.e., 6B GPT-3) is trained to predict the ranking.
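A minimal sketch of the pairwise ranking loss commonly used for this kind of reward-model training (the per-pair form; InstructGPT's actual objective averages over all ranked pairs): the RM score of the human-preferred response should exceed that of the less-preferred one.

```python
import torch
import torch.nn.functional as F

def rm_pairwise_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

loss = rm_pairwise_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
print(loss.item())
```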
RL fine-tuning. At this step, aligning (i.e., fine-tuning) the LM is formalized as an RL problem.
In this setting, the pre-trained LM acts as the policy that takes as input a prompt and returns an output text, its action space is the vocabulary, the state is the currently generated token sequence, and the reward is provided by the RM. To avoid deviating significantly from the initial (before tuning) LM, a penalty term is commonly incorporated into the reward function. For example, InstructGPT optimizes the LM against the RM using the PPO algorithm. For each input prompt, InstructGPT calculates the KL divergence between the generated results from the current LM and the initial LM as the penalty. It is noted that the second and final steps can be iterated in multiple turns for better aligning LLMs.
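A hedged sketch of the KL-shaped reward described above: the RM score is reduced in proportion to how far the current policy has drifted from the initial model, with the KL term approximated here by a per-token log-probability difference (the coefficient beta and the estimator are illustrative choices, not InstructGPT's exact ones):

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  logprobs_policy: torch.Tensor,
                  logprobs_init: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    # Per-sequence KL estimate: sum over tokens of (log pi(y|x) - log pi_init(y|x)).
    kl_penalty = (logprobs_policy - logprobs_init).sum(dim=-1)
    return rm_score - beta * kl_penalty

r = shaped_reward(torch.tensor([1.5]), torch.randn(1, 12), torch.randn(1, 12))
print(r)
```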
Efficient Tuning
In this section, we will discuss how to conduct efficient tuning on LLMs.
We first review several representative parameter-efficient fine-tuning methods for Transformer language models, and then summarize existing work on parameter-efficient fine-tuned LLMs.
Parameter-Efficient Fine-Tuning Methods
There are four fine-tuning approaches: adapter tuning, prefix tuning, prompt tuning, and LoRA.
Reference: https://zhuanlan.zhihu.com/p/632009060
Adapter Tuning.
The idea behind adapter tuning is intuitive: the large model has too many parameters, so we insert small adapter modules into the model and, during fine-tuning, only adjust the adapter parameters while leaving the large model's parameters untouched.
This makes tuning efficient; the remaining questions are how to design the adapter structure and where to insert the adapters.
Adapter tuning incorporates small neural network modules (called adapter) into the Transformer models [233].
To implement the adapter module, a bottleneck architecture has been proposed in [233, 234],
which first compresses the original feature vector into a smaller dimension (followed by a nonlinear transformation) and then recovers it to the original dimension.
The adapter module would be integrated into each Transformer layer,
typically using a serial insertion after each of the two core parts (i.e., attention layer and feed-forward layer) of a Transformer layer.
Alternatively, parallel adapters [235] can be also used in Transformer layers,
where it places two adapter modules in parallel with the attention layer and feed-forward layer accordingly.
During fine-tuning, the adapter modules would be optimized according to the specific task goals,
while the parameters of the original language model are frozen in this process. In this way,
we can effectively reduce the number of trainable parameters during fine-tuning.
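A minimal sketch of such a bottleneck adapter (down-project, nonlinearity, up-project, plus a residual connection); where it is inserted in each Transformer layer follows the serial or parallel choices described above, and the dimensions are illustrative:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only these adapter parameters are trained; the base model stays frozen.
        return x + self.up(self.act(self.down(x)))

adapter = Adapter(hidden=768)
print(adapter(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```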
Prefix Tuning.
Since the PLM itself is not tuned, the idea is again to add some tunable parameters, here in the form of prefix vectors.
These prefix vectors are task-specific and can be viewed as virtual token embeddings.
A dedicated method is needed to tune the prefix vectors, called the reparameterization trick; P-tuning v2 is mentioned as following a similar idea.
The figure makes this clear: each Transformer block has prefix vectors concatenated to its input.
Prefix tuning [230] prepends a sequence of prefixes, which are a set of trainable continuous vectors, to each Transformer layer in language models.
These prefix vectors are task-specific, which can be considered as virtual token embeddings.
To optimize the prefix vectors, a reparameterization trick [230] has been proposed by learning an MLP function
that maps a smaller matrix to the parameter matrix of prefixes, instead of directly optimizing the prefixes.
It has been shown that this trick is useful for stable training.
After optimization, the mapping function would be discarded, and only the derived prefix vectors are kept to enhance task-specific performance.
Since only the prefix parameters would be trained, it can lead to a parameter-efficient model optimization.
Similar to prefix tuning, p-tuning v2 [236] incorporates layer-wise prompt vectors into the Transformer architecture specially for natural language understanding,
which also utilizes multi-task learning for jointly optimizing shared prompts.
It has been shown to be useful in improving the model performance of different parameter scales on natural language understanding tasks.
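A hedged sketch of the reparameterization trick: a small trainable matrix is mapped by an MLP to the per-layer prefix key/value vectors. The dimensions and the MLP shape below are illustrative, and in practice the MLP is discarded after training, keeping only the derived prefixes:

```python
import torch
import torch.nn as nn

class PrefixEncoder(nn.Module):
    def __init__(self, prefix_len=20, hidden=768, num_layers=12, bottleneck=512):
        super().__init__()
        self.prefix_len, self.num_layers, self.hidden = prefix_len, num_layers, hidden
        self.small = nn.Parameter(torch.randn(prefix_len, bottleneck))  # small trainable matrix
        self.mlp = nn.Sequential(
            nn.Linear(bottleneck, bottleneck), nn.Tanh(),
            nn.Linear(bottleneck, num_layers * 2 * hidden),  # one key and one value per layer
        )

    def forward(self) -> torch.Tensor:
        # -> (prefix_len, num_layers, 2, hidden), to be prepended in each layer's attention
        return self.mlp(self.small).view(self.prefix_len, self.num_layers, 2, self.hidden)

print(PrefixEncoder()().shape)  # torch.Size([20, 12, 2, 768])
```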
Prompt Tuning.
Prompt tuning comes in two flavors: hard and soft.
Hard: essentially humans optimizing the prompt template so that the model can solve the problem more easily.
Soft: inserting a segment of task-specific, tunable prompt tokens into the prompt.
The difference from prefix tuning is that prompt tuning only adds prompt vectors at the input layer, whereas prefix tuning adds them to every Transformer layer.
Different from prefix tuning, prompt tuning [231, 237] mainly focuses on incorporating trainable prompt vectors at the input layer.
Based on the discrete prompting methods [239, 240],
it augments the input text by including a group of soft prompt tokens (either in a free form [237] or a prefix form [231]),
and then takes the prompt-augmented input to solve specific downstream tasks.
In implementation, task-specific prompt embeddings are combined with the input text embeddings, which are subsequently fed into language models.
P-tuning [237] has proposed a free form to combine the context, prompt and target tokens,
which can be applied to the architectures for both natural language understanding and generation.
They further learn the representations of soft prompt tokens by a bidirectional LSTM.
Another representative approach [231] named prompt tuning directly prepends prefix prompts to the input.
During training, only the prompt embeddings would be learned according to task-specific supervisions.
However, since this method only includes a small number of trainable parameters at the input layer,
it has been found that the performance highly relies on the model capacity of the underlying language models [231].
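A minimal sketch of the soft-prompt idea: trainable prompt embeddings are prepended to the input token embeddings at the input layer only, while the base model stays frozen (sizes are illustrative):

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, num_prompt_tokens=20, hidden=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, hidden))  # the only trainable part

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the soft prompt: (batch, num_prompt_tokens + seq_len, hidden)
        return torch.cat([prompt, input_embeds], dim=1)

sp = SoftPrompt()
print(sp(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 36, 768])
```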
Low-Rank Adaptation (LoRA).
The basic idea is to keep the large model's weight matrix W unchanged
and to approximate the update ∆W with low-rank decomposition matrices.
LoRA [232] imposes the low-rank constraint for approximating the update matrix at each dense layer,
so as to reduce the trainable parameters for adapting to downstream tasks. Consider the case of optimizing a parameter matrix W.
The update process can be written in a general form as: W ← W + ∆W.
The basic idea of LoRA is to freeze the original matrix W ∈ Rm×n while approximating the parameter update ∆W by low-rank decomposition matrices,
i.e., ∆W = A · B⊤, where A ∈ Rm×r and B ∈ Rn×r are the trainable parameters for task adaptation and r ≪ min(m, n) is the reduced rank.
The major merit of LoRA is that it can largely save the memory and storage usage (e.g., VRAM).
Further, one can only keep a single large model copy, while maintaining a number of task-specific low-rank decomposition matrices for adapting to different downstream tasks.
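A minimal sketch of a LoRA-augmented linear layer following the notation above: the frozen weight W is kept, only the low-rank factors A and B are trained, and the update is applied as W + (alpha/r) · A·B⊤ (the alpha/r scaling is a common formulation, used here as an illustrative choice):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, m: int, n: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(m, n), requires_grad=False)  # frozen W in R^{m x n}
        self.A = nn.Parameter(torch.randn(m, r) * 0.01)   # trainable, R^{m x r}
        self.B = nn.Parameter(torch.zeros(n, r))          # trainable, R^{n x r}; zero init so dW starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight + self.scaling * (self.A @ self.B.T)  # W + dW
        return x @ w

layer = LoRALinear(m=768, n=768)
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
```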
This study compares the effect of several fine-tuning methods on a few open-source LLMs.
Overall, LoRA performs relatively well; these methods fall short of GPT-3.5 on complex tasks but are comparable on simple ones.
The scale gap in this comparison is huge: the experimental models are tens of times smaller than GPT-3.5, so it is questionable how meaningful the comparison is.
Further, an empirical study [234] has been conducted to examine the effect of different tuning methods on language models.
They compare several efficient tuning methods, including serial adapter tuning [233], parallel adapter tuning [235, 245], and LoRA [232],
on three open-source LLMs, namely GPT-J (6B), BLOOM (7.1B) and LLaMA (7B), for evaluation.
Based on the experimental results on six math reasoning datasets,
they show that these efficient-tuning methods under-perform the reference baseline GPT-3.5 on difficult tasks,
while achieving a comparable performance on simple tasks.
Overall, LoRA performs relatively well among these comparison methods, using significantly fewer trainable parameters.
As an important resource, the library PEFT [246] (standing for parameter-efficient fine-tuning) has been released on GitHub.
It has included several widely used efficient tuning methods, including LoRA [232]/AdaLoRA [241], prefix-tuning [230, 236], P-Tuning [237], and prompt-tuning [231]. Further, it supports a number of language models such as GPT-2 and LLaMA, and also covers several representative vision Transformer models (e.g., ViT and Swin Transformer).
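A hedged sketch of applying LoRA with the PEFT library [246]; the argument names follow PEFT's public API at the time of writing and may change across versions, and "gpt2" is just an example base checkpoint:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the LoRA parameters are trainable
```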
UTILIZATION
This chapter is mainly about how to use large models, or in other words, how to become a competent prompt engineer.
In-Context Learning
ICL was proposed along with GPT-3 and is a special form of prompting.
As the figure shows, it consists of three parts: task description, demonstrations, and query, with the goal of obtaining an answer to the query.
As a special prompting form, in-context learning (ICL) is first proposed along with GPT-3 [55], which has become a typical approach to utilizing LLMs.
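An illustrative ICL prompt with the three parts described above (task description, demonstrations, query); the content is made up:

```python
prompt = (
    "Classify the sentiment of each review as positive or negative.\n"   # task description
    "Review: The food was wonderful. Sentiment: positive\n"              # demonstration
    "Review: Terrible service, never again. Sentiment: negative\n"       # demonstration
    "Review: A delightful little restaurant. Sentiment:"                 # query
)
# The LLM completes the word sequence, e.g., producing " positive",
# without any parameter update.
```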
"A Survey on In-context Learning" 这篇survey详细的介绍ICL.
本文主要讨论两点,如何设计demo,ICL潜在的机制是什么
A comprehensive review of ICL has been presented in the survey paper [50],
and we suggest that readers refer to it for a more general, detailed discussion of this topic.
Compared with this survey, we specially focus on the discussion of applying ICL to LLMs in two major aspects,
i.e., demonstration design and the underlying mechanism of ICL.
Here the survey notes that ICL is closely related to instruction tuning.
The biggest difference between them is that instruction tuning requires fine-tuning, while ICL only uses prompting.
Moreover, instruction tuning can substantially improve the ICL ability of LLMs, especially in the zero-shot setting.
Besides, ICL also has a close connection with instruction tuning (discussed in Section 5.1) in that both utilize natural language to format the task or instances.
However, instruction tuning needs to fine-tune LLMs for adaptation, while ICL only prompts LLMs for utilization.
Furthermore, instruction tuning can enhance the ICL ability of LLMs to perform target tasks, especially in the zero-shot setting (only using task descriptions) [64].
First, consider demonstration design, which strongly affects ICL performance.
It covers demonstration selection, format, and order.
Demonstration Selection.
The performance of ICL tends to have a large variance with different demonstration examples [250],
so it is important to select a subset of examples that can effectively leverage the ICL capability of LLMs.
Heuristic approaches, e.g., using k-NN to find examples similar to the query; however, this does not consider the relationships among the examples.
Newer methods take both the relevance and the diversity of the examples into account.
Heuristic approaches. Due to the simplicity and low costs, existing work widely adopts heuristic methods to select demonstrations.
Several studies employ a k-NN based retriever to select examples that are semantically relevant to the query [250, 251].
However, they perform the selection individually for each example, rather than evaluating the example set as a whole.
To resolve this issue, diversity-based selection strategies are proposed to choose the most representative set of examples for specific tasks [252, 253].
Furthermore, in [254], both relevance and diversity are taken into consideration when selecting demonstrations.
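A hedged sketch of the k-NN style selection described above: embed the candidate pool and the query with some sentence encoder, then pick the k most similar examples. The encoder checkpoint name is only an example; any embedding function would do:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

def select_demonstrations(query: str, pool: list, k: int = 4) -> list:
    pool_emb = encoder.encode(pool, normalize_embeddings=True)
    query_emb = encoder.encode([query], normalize_embeddings=True)[0]
    scores = pool_emb @ query_emb          # cosine similarity (embeddings are normalized)
    top = np.argsort(-scores)[:k]          # indices of the k most similar examples
    return [pool[i] for i in top]
```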
LLM-based approaches: an LLM can be used to measure the information gain of each example or to rank examples, or reinforcement learning can be applied with the LLM serving as the reward function.
The most direct approach is to have the LLM generate the demonstrations itself.
LLM-based approaches. Another line of work selects demonstrations by making use of LLMs.
For example, LLMs can be utilized to directly measure the informativeness of each example according to the performance gain after adding the example [255].
Besides, EPR [256] proposes a two-stage retrieval approach that first recalls similar examples with an unsupervised method (e.g., BM25)
and then ranks them using a dense retriever (trained with positive and negative examples labeled by LLMs).
As an alternative approach, the task of demonstration selection can be formulated into a RL problem, where LLMs serve as the reward function to provide feedback for training the policy model [257].
Since LLMs perform well at text annotation [258], some recent studies employ the LLM itself as the demonstration generator without human intervention [259, 260].
Next, let's look at why ICL works.
ICL was proposed with GPT-3, and the effect is pronounced on large models.
However, studies have found that with specially designed training tasks, small-scale PLMs can exhibit ICL that even surpasses LLMs.
Studies also find that ICL depends more on the pre-training corpora than on model scale.
Studies further find that ICL is related to the distribution of the training data.
ICL is first proposed in GPT-3 [55], and it has shown that the ICL ability becomes more significant with a larger model size.
However, some studies reveal that small-scale PLMs can also demonstrate a strong ICL ability with specially designed training tasks
(e.g., learning to predict the label with task examples and the query as the input), and may even surpass larger models [265].
It suggests that the design of training tasks is an important influence factor of the ICL capability of LLMs.
Besides training tasks, recent studies have also investigated the relationship between ICL and the pre-training corpora [261, 266, 267].
It has been shown that the performance of ICL heavily depends on the source of pre-training corpora rather than the scale [267].
Another study [266] provides an in-depth analysis of the impact of training data distribution.
They find that ICL emerges when the training data can be clustered into numerous infrequent classes, instead of being uniformly distributed.
Furthermore, the authors in [261] theoretically explain ICL as the product of pre-training on documents that exhibit long-range coherence.
At the inference stage, researchers focus on analyzing how the ICL capability operates based on given demonstrations since no explicit learning or updating is involved.
They typically analyze from the perspective of gradient descent and consider ICL as implicit fine-tuning [60, 268].
Under this framework, the ICL process can be explained as follows:
by means of forward computation, LLMs generate meta-gradients with respect to demonstrations and implicitly perform gradient descent via the attention mechanism.
Experiments also show that certain attention heads in LLMs are capable of performing task-agnostic atomic operations (e.g., copying and prefix matching), which are closely related to the ICL ability [269, 270]. To further explore the working mechanism of ICL, some studies abstract ICL as an algorithm learning process [271–273].
Specifically, the authors in [272] find that LLMs essentially encode implicit models through their parameters during pre-training.
With the examples provided in ICL, LLMs can implement learning algorithms such as gradient descent or
directly compute the closed-form solution to update these models during forward computation.
Under this explanation framework, it has been shown that LLMs can effectively learn simple linear functions and even some complex functions
like decision trees with ICL [271–273].
To be continued.