InstructGPT
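Data collection details
Data collection is a key part of InstructGPT: it covers what kinds of data to collect, how to screen labelers, and so on. The kinds of data correspond to the three training stages of InstructGPT, while labeler screening is meant to raise the quality of the collected data. The details below show why labeler screening was necessary.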
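Labeler screening
Labelers must be able to judge a broadly distributed dataset of natural-language prompts [Note 1], part of which is sensitive. The InstructGPT team therefore ran a test to select labelers with a high propensity to detect sensitive content. The screening criteria were:
- Ranking of model outputs: given a set of prompts, each paired with several model outputs, labelers rank the outputs by overall quality. Their rankings are then compared with the rankings produced by the researchers.
- Sensitive speech: a dataset of (prompt, completion) pairs was created in which some prompts or completions are sensitive (anything that can elicit strong negative feelings, e.g. toxic, sexual, violent, judgmental, or political content). The InstructGPT team labeled this data themselves and then compared the candidate labelers' labels against their own.
- Self-assessed ability to identify sensitive speech aimed at different groups: the team wanted to hire labelers who could recognize a broad range of sensitive content, but for legal reasons could not hire based on demographic criteria. Candidates were instead asked a question such as "For what topics or cultural groups are you comfortable identifying sensitive speech?", and the answer was used as part of the screening.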
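The first two criteria both boil down to measuring how closely a candidate agrees with the researchers. The notes do not say which agreement metric was used, so the following is only a minimal sketch under assumptions: simple match rate for sensitive-speech flags and pairwise order agreement for rankings. The function names and the sample data are hypothetical.

```python
from itertools import combinations


def flag_agreement(candidate, reference):
    """Fraction of items where the candidate's sensitive/not-sensitive flag matches the reference."""
    if len(candidate) != len(reference) or not reference:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(c == r for c, r in zip(candidate, reference))
    return matches / len(reference)


def ranking_agreement(candidate, reference):
    """Fraction of output pairs that both rankings order the same way.

    Each ranking lists output ids from best to worst, without ties.
    """
    if len(reference) < 2 or sorted(candidate) != sorted(reference):
        raise ValueError("rankings must cover the same outputs (at least two)")
    cand_pos = {out: i for i, out in enumerate(candidate)}
    ref_pos = {out: i for i, out in enumerate(reference)}
    pairs = list(combinations(reference, 2))
    same = sum(
        (cand_pos[a] < cand_pos[b]) == (ref_pos[a] < ref_pos[b]) for a, b in pairs
    )
    return same / len(pairs)


# Hypothetical screening data: researcher labels vs. one candidate's labels.
researcher_flags = [True, False, True, True, False]
candidate_flags = [True, False, False, True, False]
print(flag_agreement(candidate_flags, researcher_flags))

researcher_rank = ["A", "B", "C", "D"]
candidate_rank = ["A", "C", "B", "D"]
print(ranking_agreement(candidate_rank, researcher_rank))
```

A candidate could then be accepted when both scores exceed thresholds chosen by the team; those thresholds are not given in these notes.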
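Labeling instructions
When labeling training data, labelers were asked to treat helpfulness as the most important criterion, above truthfulness and harmlessness. In the final evaluations, however, labelers were asked to prioritize truthfulness and harmlessness. The authors are also exploring ways to make the model prioritize truthfulness and harmlessness over helpfulness during training, in particular by having the model refuse certain instructions. This brings its own challenges: different applications carry different levels of risk, so what the model refuses should ideally be configurable at inference time. There is also the risk that the model over-generalizes and refuses harmless instructions, which is undesirable for most applications.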
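Labeler demographic data
A voluntary, anonymous survey was sent to the labelers to learn about their demographics. This suggests that InstructGPT takes bias seriously and tries to reduce bias in the data labels at the labeling stage.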
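The labeling interface is shown in Figure 1:
- Rate the model output for a given prompt/instruct on a scale of 1 to 7
- Label individual aspects: whether the instruction was followed correctly; whether the answer is appropriate for a customer assistant; whether it contains sexual content; whether it contains violent content; whether it encourages or fails to discourage violence, abuse, terrorism, or self-harm; whether it denigrates a protected class; whether it gives harmful advice; whether it expresses moral judgment
- Rank the different model outputs for the same prompt/instruct by quality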
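To make the three parts of the interface concrete, here is a minimal sketch of how one annotation could be stored. The class names, field names, and flag names are hypothetical and chosen only to mirror the list above; the actual storage format used by the InstructGPT team is not described in these notes.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical flag names, one per aspect listed above.
FLAG_NAMES = (
    "follows_instruction",
    "appropriate_for_customer_assistant",
    "contains_sexual_content",
    "contains_violent_content",
    "encourages_violence_abuse_terrorism_self_harm",
    "denigrates_protected_class",
    "gives_harmful_advice",
    "expresses_moral_judgment",
)


@dataclass
class OutputLabel:
    """Rating and per-aspect flags for a single model output."""
    output_id: str
    rating: int  # overall quality, 1 (worst) to 7 (best)
    flags: Dict[str, bool] = field(default_factory=dict)

    def __post_init__(self):
        if not 1 <= self.rating <= 7:
            raise ValueError("rating must be between 1 and 7")
        unknown = set(self.flags) - set(FLAG_NAMES)
        if unknown:
            raise ValueError(f"unknown flags: {sorted(unknown)}")


@dataclass
class PromptAnnotation:
    """All labels for one prompt, plus the ranking of its outputs (best first)."""
    prompt: str
    labels: List[OutputLabel]
    ranking: List[str]

    def __post_init__(self):
        ids = [label.output_id for label in self.labels]
        if sorted(ids) != sorted(self.ranking):
            raise ValueError("ranking must contain every labeled output exactly once")


annotation = PromptAnnotation(
    prompt="Summarize this for a second-grade student:",
    labels=[
        OutputLabel("A", 6, {"follows_instruction": True}),
        OutputLabel("B", 3, {"follows_instruction": False}),
    ],
    ranking=["A", "B"],
)
print(annotation.ranking[0])
```

Validating the rating range and the ranking in `__post_init__` catches malformed annotations at load time rather than during training.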
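[Note 1]
About the instruct dataset: each record has three parts, (instruction, input, output), written here as (instruct, input, output).
Example:
- instruct: Write an essay of at least 800 words on the following themes
- input: helping others, acting bravely for a just cause
- output: xxx
There is no sharp boundary between the instruction and the input, however, so the input may be omitted or left empty.
Example:
- instruct: Write an essay of at least 800 words on the themes of helping others and acting bravely for a just cause
- input: ""
- output: xxx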
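The two record layouts above can be flattened into a single prompt string before being fed to a model. The sketch below assumes a simple template (instruction, blank line, input); that template and the function name `build_prompt` are illustrative choices, not something specified by InstructGPT.

```python
def build_prompt(instruct: str, input_text: str = "") -> str:
    """Merge an (instruct, input) pair into one prompt string.

    When the input is empty, the instruction alone is the prompt.
    """
    instruct = instruct.strip()
    input_text = input_text.strip()
    if not instruct:
        raise ValueError("instruct must not be empty")
    if input_text:
        return f"{instruct}\n\n{input_text}"
    return instruct


# Record with a separate input field.
with_input = build_prompt(
    "Write an essay of at least 800 words on the following themes",
    "helping others, acting bravely for a just cause",
)

# Record where the input has been folded into the instruction.
without_input = build_prompt(
    "Write an essay of at least 800 words on the themes of helping others "
    "and acting bravely for a just cause",
    "",
)

print(with_input)
print(without_input)
```

Both records end up describing the same task, which is why the split between instruct and input is a matter of convention.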
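How is the instruct dataset obtained?
instruct: submitted by users to the API or written by labelers, both of which are human-generated; instructs can also be generated by a model, as in the method described in self-instruct.
output: likewise a mix of model-generated and human-written text.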
Model
TODO
Model evaluation
TODO
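InstructGPT overview
Larger NLP models are not inherently better at understanding user intent: their outputs can be untruthful, harmful, biased, or simply unhelpful, and this does not improve as models grow. The InstructGPT paper aims to fix this. The ability of a model to understand user intent and to produce outputs consistent with the legal and ethical norms of present-day society is called alignment with humans. The model acquires this ability in three steps, as shown in Figure 2.
- Step 1: Collect instruction data. The instruct samples come from prompts submitted by users to the API and from prompts written by labelers; the corresponding input-output pairs are written by labelers. GPT-3 is then fine-tuned on this dataset.
- Step 2: Starting from a large number of API instructs, several different outputs are generated by different models for each instruct, and human labelers rank them. A reward model (RM) is trained on this dataset.
- Step 3: Use the RM from Step 2 to score the outputs of GPT-3, and keep optimizing the model with reinforcement learning so that its outputs receive higher RM scores.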
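Step 2 turns human rankings into a training signal for the reward model. These notes do not state the RM objective, so the sketch below assumes a pairwise logistic loss over every ordered pair taken from a ranking, which is a common way to train on comparison data. The function names and reward values are hypothetical, and a real RM would be a neural network rather than a lookup table of scores.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranking):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranking, 2))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_ranking_loss(rewards, ranking):
    """Mean of -log(sigmoid(r_preferred - r_rejected)) over all ranked pairs.

    `rewards` maps each output id to the scalar score the RM assigns it.
    """
    pairs = ranking_to_pairs(ranking)
    if not pairs:
        raise ValueError("ranking must contain at least two outputs")
    total = 0.0
    for preferred, rejected in pairs:
        diff = rewards[preferred] - rewards[rejected]
        total += softplus(-diff)  # equals -log(sigmoid(diff))
    return total / len(pairs)


# Hypothetical labeler ranking for one prompt, best output first.
ranking = ["A", "B", "C"]

# RM scores that agree with the labeler give a low loss ...
good_rewards = {"A": 2.0, "B": 0.5, "C": -1.0}
# ... and scores that contradict the labeler give a high loss.
bad_rewards = {"A": -1.0, "B": 0.5, "C": 2.0}

print(pairwise_ranking_loss(good_rewards, ranking))
print(pairwise_ranking_loss(bad_rewards, ranking))
```

A ranking of K outputs yields K(K-1)/2 pairs, so a single ranked prompt contributes several comparisons. In Step 3 the trained RM score would then serve as the reward that the reinforcement learning algorithm maximizes.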
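Three kinds of instruct data written by labelers
- Plain: labelers are simply asked to write arbitrary instruct examples, with the only requirement that the examples be sufficiently diverse.
- Few-shot: labelers are asked to write an instruct example together with multiple (input, output) samples that correspond to that instruct.
- User-based: starting from instruct examples that users submitted to the OpenAI API, labelers are asked to write, for each one, examples that are similar to it or have the same meaning.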
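Figure 3 shows a breakdown by category of the instruct samples submitted to the API. Most of them are generative tasks rather than classification or question answering.
Table 1 lists example user-submitted prompts from the InstructGPT distribution. I compared them with the user-submitted prompt examples from the GPT-3 distribution but could not tell the difference.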
Use Case | Example |
---|---|
brainstorming | List five ideas for how to regain enthusiasm for my career |
brainstorming | What are some key points I should know when studying Ancient Greece? |
brainstorming | What are 4 questions a user might have after reading the instruction manual for a trash compactor? {user manual} 1. |
brainstorming | What are 10 science fiction books I should read next? |
classification | Take the following text and rate, on a scale from 1-10, how sarcastic the person is being (1 = not at all, 10 = extremely sarcastic). Also give an explanation {text} Rating: |
classification | This is a list of tweets and the sentiment categories they fall into. Tweet: {tweet_content1} Sentiment: {sentiment1} Tweet: {tweet_content2} Sentiment: |
classification | {java code} What language is the code above written in? |
classification | You are a very serious professor, and you check papers to see if they contain missing citations. Given the text, say whether it is missing an important citation (YES/NO) and which sentence(s) require citing. |
extract | Extract all course titles from the table below: |
extract | Extract all place names from the article below: |
extract | Given the following list of movie titles, write down any names of cities in the titles. |
generation | Write a creative ad for the following product to run on Facebook aimed at parents: Product: |
generation | Write a short story where a brown bear to the beach, makes friends with a seal, and then return home. |
generation | Here’s a message to me: — {email} — Here are some bullet points for a reply: — {message} — Write a detailed reply |
generation | This is an article about how to write a cover letter when applying for jobs: — It’s important to spend some time |
generation | write rap lyrics on the topics mentioned in this news article: —- {article} —- |
rewrite | This is the summary of a Broadway play: """ {summary} """ This is the outline of the commercial for that play: """ |
rewrite | Translate this sentence to Spanish: |
rewrite | Create turn-by-turn navigation given this text: Go west on {road1} unto you hit {road2}.Desination will be a red barn on the right then take it east to {road3}. 1. |
rewrite | Rewrite the following text to be more light-hearted: — {very formal text} — |
chat | The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly. Human: Hello, who are you? AI: I am an AI created by OpenAI. How can I help you today? Human: I’d like to cancel my subscription. AI: |
chat | Marv is a chatbot that reluctantly answers questions with sarcastic responses: You: How many pounds are in a kilogram? Marv: This again? There are 2.2 pounds in a kilogram. Please make a note of this. You: What does HTML stand for? Marv: Was Google too busy? Hypertext Markup Language. The T is for try to ask better questions in the future. You: When did the first airplane fly? Marv: |
chat | This is a conversation with an enlightened Buddha. Every response is full of wisdom and love. Me: How can I achieve greater peace and equanimity? Buddha: |
closed qa | Help me answer questions about the following short story: {story} What is the moral of the story? |
closed qa | Answer the following question: What shape is the earth? A) A circle B) A sphere C) An ellipse D) A plane |
closed qa | Tell me how hydrogen and helium are different, using the following facts: |
open qa | I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer. If you ask me a question that is nonsense, trickery, or has no clear answer, I will respond with "Unknown". Q: What is human life expectancy in the United States? A: Human life expectancy in the United States is 78 years. Q: Who was president of the United States in 1955? A: |
open qa | Who built the statue of liberty? |
open qa | How do you take the derivative of the sin function? |
open qa | who are the indiginous people of New Zealand? |
summarization | Summarize this for a second-grade student: |
summarization | {news article} Tl;dr: |
summarization | {chat transcript} Summarize the above conversation between a customer and customer assistant. Make sure to state any complaints that the customer has. |
other | start with where |
other | Look up "cowboy" on Google and give me the results. |
other | Johnathan Silver goes to the market every day, and brings back a |