openai官网提示词
2024-05-15 00:20 l_v_y_forever 阅读(155) 评论(0) 编辑 收藏 举报Prompt engineering 提示工程
This guide shares strategies and tactics for getting better results from large language models (sometimes referred to as GPT models) like GPT-4o. The methods described here can sometimes be deployed in combination for greater effect. We encourage experimentation to find the methods that work best for you.
本指南分享了从 GPT-4o 等大型语言模型(有时称为 GPT 模型)中获得更好结果的策略和战术。此处描述的方法有时可以组合部署以获得更大的效果。我们鼓励您进行实验,以找到最适合您的方法。
You can also explore example prompts which showcase what our models are capable of:
您还可以浏览示例提示,这些提示展示了我们的模型的功能:
浏览提示示例,了解 GPT 模型可以做什么
Six strategies for getting better results
获得更好结果的六种策略
Write clear instructions
写清楚的说明
These models can’t read your mind. If outputs are too long, ask for brief replies. If outputs are too simple, ask for expert-level writing. If you dislike the format, demonstrate the format you’d like to see. The less the model has to guess at what you want, the more likely you’ll get it.
这些模型无法读懂你的想法。如果输出太长,请要求简短回复。如果输出太简单,请要求专家级写作。如果您不喜欢这种格式,请演示您希望看到的格式。模型对你想要什么的猜测越少,你得到它的可能性就越大。
Tactics: 策略:
- Include details in your query to get more relevant answers
在查询中包含详细信息以获得更相关的答案 - Ask the model to adopt a persona
要求模型采用角色 - Use delimiters to clearly indicate distinct parts of the input
使用分隔符清楚地指示输入的不同部分 - Specify the steps required to complete a task
指定完成任务所需的步骤 - Provide examples 举例说明
- Specify the desired length of the output
指定所需的输出长度
Provide reference text 提供参考文本
Language models can confidently invent fake answers, especially when asked about esoteric topics or for citations and URLs. In the same way that a sheet of notes can help a student do better on a test, providing reference text to these models can help in answering with fewer fabrications.
语言模型可以自信地发明虚假答案,尤其是在被问及深奥的话题或引文和 URL 时。就像一张笔记可以帮助学生在考试中做得更好一样,为这些模型提供参考文本可以帮助回答更少的捏造。
Tactics: 策略:
- Instruct the model to answer using a reference text
指示模型使用参考文本回答问题 - Instruct the model to answer with citations from a reference text
指示模型使用参考文本的引用来回答
Split complex tasks into simpler subtasks
将复杂的任务拆分为更简单的子任务
Just as it is good practice in software engineering to decompose a complex system into a set of modular components, the same is true of tasks submitted to a language model. Complex tasks tend to have higher error rates than simpler tasks. Furthermore, complex tasks can often be re-defined as a workflow of simpler tasks in which the outputs of earlier tasks are used to construct the inputs to later tasks.
正如软件工程中的良好做法是将复杂系统分解为一组模块化组件一样,提交给语言模型的任务也是如此。复杂任务往往比简单任务具有更高的错误率。此外,复杂任务通常可以重新定义为更简单任务的工作流,其中早期任务的输出用于构造后续任务的输入。
Tactics: 策略:
- Use intent classification to identify the most relevant instructions for a user query
使用意图分类来标识与用户查询最相关的指令 - For dialogue applications that require very long conversations, summarize or filter previous dialogue
对于需要很长对话的对话应用程序,请总结或过滤以前的对话 - Summarize long documents piecewise and construct a full summary recursively
分段总结长文档,并以递归方式构建完整的摘要
Give the model time to "think"
给模型时间“思考”
If asked to multiply 17 by 28, you might not know it instantly, but can still work it out with time. Similarly, models make more reasoning errors when trying to answer right away, rather than taking time to work out an answer. Asking for a "chain of thought" before an answer can help the model reason its way toward correct answers more reliably.
如果要求将 17 乘以 28,您可能不会立即知道,但仍然可以随着时间的推移计算出来。同样,模型在试图立即回答时会犯更多的推理错误,而不是花时间找出答案。在回答之前要求一个“思维链”可以帮助模型更可靠地推理出正确的答案。
Tactics: 策略:
- Instruct the model to work out its own solution before rushing to a conclusion
在匆忙得出结论之前,指示模型制定自己的解决方案 - Use inner monologue or a sequence of queries to hide the model's reasoning process
使用内心独白或一系列查询来隐藏模型的推理过程 - Ask the model if it missed anything on previous passes
询问模型在之前的刀路中是否遗漏了任何内容
Use external tools 使用外部工具
Compensate for the weaknesses of the model by feeding it the outputs of other tools. For example, a text retrieval system (sometimes called RAG or retrieval augmented generation) can tell the model about relevant documents. A code execution engine like OpenAI's Code Interpreter can help the model do math and run code. If a task can be done more reliably or efficiently by a tool rather than by a language model, offload it to get the best of both.
通过向模型提供其他工具的输出来补偿模型的弱点。例如,文本检索系统(有时称为 RAG 或检索增强生成)可以告诉模型有关相关文档的信息。像 OpenAI 的 Code Interpreter 这样的代码执行引擎可以帮助模型进行数学运算和运行代码。如果一项任务可以通过工具而不是语言模型更可靠或更高效地完成,请卸载它以充分利用两者。
Tactics: 策略:
- Use embeddings-based search to implement efficient knowledge retrieval
使用基于嵌入的搜索实现高效的知识检索 - Use code execution to perform more accurate calculations or call external APIs
使用代码执行来执行更准确的计算或调用外部 API - Give the model access to specific functions
授予模型对特定函数的访问权限
Test changes systematically
系统地测试更改
Improving performance is easier if you can measure it. In some cases a modification to a prompt will achieve better performance on a few isolated examples but lead to worse overall performance on a more representative set of examples. Therefore to be sure that a change is net positive to performance it may be necessary to define a comprehensive test suite (also known an as an "eval").
如果可以衡量性能,则更容易提高性能。在某些情况下,对提示的修改将在几个孤立的示例上获得更好的性能,但在更具代表性的示例集上会导致整体性能较差。因此,为了确保更改对性能有净积极影响,可能需要定义一个全面的测试套件(也称为“eval”)。
Tactic: 策略:
Tactics 策略
Each of the strategies listed above can be instantiated with specific tactics. These tactics are meant to provide ideas for things to try. They are by no means fully comprehensive, and you should feel free to try creative ideas not represented here.
上面列出的每种策略都可以用特定的策略进行实例化。这些策略旨在为尝试事物提供想法。它们绝不是完全全面的,您应该随意尝试此处未代表的创意。
Strategy: Write clear instructions
策略:写清楚的指令
Tactic: Include details in your query to get more relevant answers
策略:在查询中包含详细信息以获得更相关的答案
In order to get a highly relevant response, make sure that requests provide any important details or context. Otherwise you are leaving it up to the model to guess what you mean.
为了获得高度相关的响应,请确保请求提供任何重要的详细信息或上下文。否则,你就要让模型来猜测你的意思了。
Worse | Better |
How do I add numbers in Excel? 如何在 Excel 中添加数字? |
How do I add up a row of dollar amounts in Excel? I want to do this automatically for a whole sheet of rows with all the totals ending up on the right in a column called "Total". 如何在Excel中将一行美元金额相加?我想对一整行自动执行此操作,所有总数都位于名为“总计”的列的右侧。 |
Who’s president? 谁是总统? | Who was the president of Mexico in 2021, and how frequently are elections held? 谁是 2021 年墨西哥总统,选举多久举行一次? |
Write code to calculate the Fibonacci sequence. 编写代码来计算斐波那契数列。 |
Write a TypeScript function to efficiently calculate the Fibonacci sequence. Comment the code liberally to explain what each piece does and why it's written that way. 编写一个 TypeScript 函数来有效地计算斐波那契数列。对代码进行大量注释,以解释每篇文章的作用以及为什么这样编写。 |
Summarize the meeting notes. 总结会议记录。 |
Summarize the meeting notes in a single paragraph. Then write a markdown list of the speakers and each of their key points. Finally, list the next steps or action items suggested by the speakers, if any. 在单个段落中总结会议记录。然后写一个演讲者的降价列表和他们的每个关键点。最后,列出演讲者建议的后续步骤或行动项目(如果有)。 |
Tactic: Ask the model to adopt a persona
策略:要求模特采用角色
The system message can be used to specify the persona used by the model in its replies.
系统消息可用于指定模型在其回复中使用的角色。
Tactic: Use delimiters to clearly indicate distinct parts of the input
策略:使用分隔符清楚地指示输入的不同部分
Delimiters like triple quotation marks, XML tags, section titles, etc. can help demarcate sections of text to be treated differently.
三引号、XML 标记、章节标题等分隔符可以帮助划分要区别对待的文本部分。
For straightforward tasks such as these, using delimiters might not make a difference in the output quality. However, the more complex a task is the more important it is to disambiguate task details. Don’t make the model work to understand exactly what you are asking of them.
对于此类简单明了的任务,使用分隔符可能不会对输出质量产生影响。但是,任务越复杂,消除任务细节的歧义就越重要。不要让模型工作以准确理解您对他们的要求。
Tactic: Specify the steps required to complete a task
策略:指定完成任务所需的步骤
Some tasks are best specified as a sequence of steps. Writing the steps out explicitly can make it easier for the model to follow them.
某些任务最好指定为一系列步骤。显式写出步骤可以使模型更容易遵循它们。
Tactic: Provide examples
策略:举例说明
Providing general instructions that apply to all examples is generally more efficient than demonstrating all permutations of a task by example, but in some cases providing examples may be easier. For example, if you intend for the model to copy a particular style of responding to user queries which is difficult to describe explicitly. This is known as "few-shot" prompting.
提供适用于所有示例的一般说明通常比通过示例演示任务的所有排列更有效,但在某些情况下,提供示例可能更容易。例如,如果您打算让模型复制一种难以明确描述的响应用户查询的特定样式。这称为“少镜头”提示。
Tactic: Specify the desired length of the output
策略:指定所需的输出长度
You can ask the model to produce outputs that are of a given target length. The targeted output length can be specified in terms of the count of words, sentences, paragraphs, bullet points, etc. Note however that instructing the model to generate a specific number of words does not work with high precision. The model can more reliably generate outputs with a specific number of paragraphs or bullet points.
您可以要求模型生成给定目标长度的输出。目标输出长度可以根据单词、句子、段落、项目符号等的数量来指定。但请注意,指示模型生成特定数量的单词并不能高精度地工作。该模型可以更可靠地生成具有特定段落或项目符号的输出。
Strategy: Provide reference text
策略:提供参考文本
Tactic: Instruct the model to answer using a reference text
策略:指示模型使用参考文本进行回答
If we can provide a model with trusted information that is relevant to the current query, then we can instruct the model to use the provided information to compose its answer.
如果我们可以为模型提供与当前查询相关的可信信息,那么我们可以指示模型使用提供的信息来编写其答案。
Given that all models have limited context windows, we need some way to dynamically lookup information that is relevant to the question being asked. Embeddings can be used to implement efficient knowledge retrieval. See the tactic "Use embeddings-based search to implement efficient knowledge retrieval" for more details on how to implement this.
鉴于所有模型的上下文窗口都有限,我们需要某种方法来动态查找与所提出问题相关的信息。嵌入可用于实现高效的知识检索。有关如何实现此操作的更多详细信息,请参阅策略“使用基于嵌入的搜索实现高效的知识检索”。
Tactic: Instruct the model to answer with citations from a reference text
策略:指示模型使用参考文本的引用来回答
If the input has been supplemented with relevant knowledge, it's straightforward to request that the model add citations to its answers by referencing passages from provided documents. Note that citations in the output can then be verified programmatically by string matching within the provided documents.
如果输入已补充了相关知识,则直接要求模型通过引用所提供文档中的段落来在其答案中添加引用。请注意,输出中的引文可以通过所提供文档中的字符串匹配以编程方式进行验证。
Strategy: Split complex tasks into simpler subtasks
策略:将复杂的任务拆分为更简单的子任务
Tactic: Use intent classification to identify the most relevant instructions for a user query
策略:使用意图分类来识别与用户查询最相关的指令
For tasks in which lots of independent sets of instructions are needed to handle different cases, it can be beneficial to first classify the type of query and to use that classification to determine which instructions are needed. This can be achieved by defining fixed categories and hardcoding instructions that are relevant for handling tasks in a given category. This process can also be applied recursively to decompose a task into a sequence of stages. The advantage of this approach is that each query will contain only those instructions that are required to perform the next stage of a task which can result in lower error rates compared to using a single query to perform the whole task. This can also result in lower costs since larger prompts cost more to run (see pricing information).
对于需要大量独立指令集来处理不同情况的任务,首先对查询类型进行分类并使用该分类来确定需要哪些指令可能是有益的。这可以通过定义与处理给定类别中的任务相关的固定类别和硬编码指令来实现。此过程也可以递归地应用,以将任务分解为一系列阶段。此方法的优点是,每个查询将仅包含执行任务下一阶段所需的指令,与使用单个查询执行整个任务相比,这可能会导致更低的错误率。这还可以降低成本,因为较大的提示运行成本更高(请参阅定价信息)。
Suppose for example that for a customer service application, queries could be usefully classified as follows:
例如,假设对于客户服务应用程序,查询可以按以下方式进行有用的分类:
Based on the classification of the customer query, a set of more specific instructions can be provided to a model for it to handle next steps. For example, suppose the customer requires help with "troubleshooting".
根据客户查询的分类,可以向模型提供一组更具体的指令,以便其处理后续步骤。例如,假设客户需要“故障排除”方面的帮助。
Notice that the model has been instructed to emit special strings to indicate when the state of the conversation changes. This enables us to turn our system into a state machine where the state determines which instructions are injected. By keeping track of state, what instructions are relevant at that state, and also optionally what state transitions are allowed from that state, we can put guardrails around the user experience that would be hard to achieve with a less structured approach.
请注意,已指示模型发出特殊字符串,以指示会话状态何时更改。这使我们能够将我们的系统变成一个状态机,其中状态决定了注入哪些指令。通过跟踪状态、在该状态下相关的指令,以及可选地允许从该状态转换哪些状态,我们可以在用户体验周围设置护栏,而这些护栏是用结构化程度较低的方法难以实现的。
Tactic: For dialogue applications that require very long conversations, summarize or filter previous dialogue
策略:对于需要很长对话的对话应用程序,总结或过滤之前的对话
Since models have a fixed context length, dialogue between a user and an assistant in which the entire conversation is included in the context window cannot continue indefinitely.
由于模型具有固定的上下文长度,因此用户和助手之间的对话(其中整个对话都包含在上下文窗口中)不能无限期地继续。
There are various workarounds to this problem, one of which is to summarize previous turns in the conversation. Once the size of the input reaches a predetermined threshold length, this could trigger a query that summarizes part of the conversation and the summary of the prior conversation could be included as part of the system message. Alternatively, prior conversation could be summarized asynchronously in the background throughout the entire conversation.
这个问题有多种解决方法,其中之一是总结对话中的前几轮。一旦输入的大小达到预定的阈值长度,这可能会触发一个查询,该查询汇总了部分对话,并且可以将先前对话的摘要作为系统消息的一部分包含在内。或者,可以在整个对话过程中在后台异步总结先前的对话。
An alternative solution is to dynamically select previous parts of the conversation that are most relevant to the current query. See the tactic "Use embeddings-based search to implement efficient knowledge retrieval".
另一种解决方案是动态选择与当前查询最相关的对话的先前部分。请参阅策略“使用基于嵌入的搜索实现高效的知识检索”。
Tactic: Summarize long documents piecewise and construct a full summary recursively
策略:分段总结长文档,并递归地构建完整的摘要
Since models have a fixed context length, they cannot be used to summarize a text longer than the context length minus the length of the generated summary in a single query.
由于模型具有固定的上下文长度,因此它们不能用于汇总文本,其长度超过上下文长度减去单个查询中生成的摘要的长度。
To summarize a very long document such as a book we can use a sequence of queries to summarize each section of the document. Section summaries can be concatenated and summarized producing summaries of summaries. This process can proceed recursively until an entire document is summarized. If it’s necessary to use information about earlier sections in order to make sense of later sections, then a further trick that can be useful is to include a running summary of the text that precedes any given point in the book while summarizing content at that point. The effectiveness of this procedure for summarizing books has been studied in previous research by OpenAI using variants of GPT-3.
要总结一个很长的文档,比如一本书,我们可以使用一系列查询来总结文档的每个部分。章节摘要可以串联和汇总,从而生成摘要的摘要。此过程可以递归进行,直到汇总整个文档。如果有必要使用有关前面部分的信息来理解后面的部分,那么另一个有用的技巧是包括书中任何给定点之前的文本的运行摘要,同时总结该点的内容。OpenAI 在之前使用 GPT-3 变体的研究中研究了此程序总结书籍的有效性。
Strategy: Give models time to "think"
策略:给模型时间“思考”
Tactic: Instruct the model to work out its own solution before rushing to a conclusion
策略:在匆忙得出结论之前,指示模型制定自己的解决方案
Sometimes we get better results when we explicitly instruct the model to reason from first principles before coming to a conclusion. Suppose for example we want a model to evaluate a student’s solution to a math problem. The most obvious way to approach this is to simply ask the model if the student's solution is correct or not.
有时,当我们明确指示模型在得出结论之前从第一性原理进行推理时,我们会得到更好的结果。例如,假设我们想要一个模型来评估学生对数学问题的解决方案。解决这个问题的最明显方法是简单地询问模型学生的解决方案是否正确。
But the student's solution is actually not correct! We can get the model to successfully notice this by prompting the model to generate its own solution first.
但学生的解法其实是不对的!我们可以通过提示模型首先生成自己的解决方案来让模型成功注意到这一点。
Tactic: Use inner monologue or a sequence of queries to hide the model's reasoning process
策略:使用内心独白或一系列查询来隐藏模型的推理过程
The previous tactic demonstrates that it is sometimes important for the model to reason in detail about a problem before answering a specific question. For some applications, the reasoning process that a model uses to arrive at a final answer would be inappropriate to share with the user. For example, in tutoring applications we may want to encourage students to work out their own answers, but a model’s reasoning process about the student’s solution could reveal the answer to the student.
前面的策略表明,模型在回答特定问题之前详细推理问题有时很重要。对于某些应用程序,模型用于得出最终答案的推理过程不适合与用户共享。例如,在辅导应用程序中,我们可能希望鼓励学生找出自己的答案,但模型对学生解决方案的推理过程可能会向学生揭示答案。
Inner monologue is a tactic that can be used to mitigate this. The idea of inner monologue is to instruct the model to put parts of the output that are meant to be hidden from the user into a structured format that makes parsing them easy. Then before presenting the output to the user, the output is parsed and only part of the output is made visible.
内心独白是一种可以用来缓解这种情况的策略。内心独白的思想是指示模型将输出中对用户隐藏的部分放入结构化格式中,以便于解析它们。然后,在向用户呈现输出之前,将解析输出,并且仅显示部分输出。
Alternatively, this can be achieved with a sequence of queries in which all except the last have their output hidden from the end user.
或者,这可以通过一系列查询来实现,其中除最后一个查询外的所有查询都对最终用户隐藏其输出。
First, we can ask the model to solve the problem on its own. Since this initial query doesn't require the student’s solution, it can be omitted. This provides the additional advantage that there is no chance that the model’s solution will be biased by the student’s attempted solution.
首先,我们可以要求模型自行解决问题。由于此初始查询不需要学生的解决方案,因此可以省略。这提供了额外的优势,即模型的解决方案不会因学生尝试的解决方案而产生偏差。
Next, we can have the model use all available information to assess the correctness of the student’s solution.
接下来,我们可以让模型使用所有可用信息来评估学生解决方案的正确性。
Finally, we can let the model use its own analysis to construct a reply in the persona of a helpful tutor.
最后,我们可以让模型使用自己的分析来构建一个乐于助人的导师的角色的回复。
Tactic: Ask the model if it missed anything on previous passes
策略:询问模型在之前的传递中是否遗漏了任何内容
Suppose that we are using a model to list excerpts from a source which are relevant to a particular question. After listing each excerpt the model needs to determine if it should start writing another or if it should stop. If the source document is large, it is common for a model to stop too early and fail to list all relevant excerpts. In that case, better performance can often be obtained by prompting the model with followup queries to find any excerpts it missed on previous passes.
假设我们正在使用一个模型来列出与特定问题相关的来源摘录。在列出每个摘录后,模型需要确定是否应该开始编写另一个摘录,或者是否应该停止编写另一个摘录。如果源文档很大,则模型通常会过早停止并且无法列出所有相关摘录。在这种情况下,通常可以通过提示模型进行后续查询来查找它在先前传递中遗漏的任何摘录来获得更好的性能。
Strategy: Use external tools
策略:使用外部工具
Tactic: Use embeddings-based search to implement efficient knowledge retrieval
策略:使用基于嵌入的搜索实现高效的知识检索
A model can leverage external sources of information if provided as part of its input. This can help the model to generate more informed and up-to-date responses. For example, if a user asks a question about a specific movie, it may be useful to add high quality information about the movie (e.g. actors, director, etc…) to the model’s input. Embeddings can be used to implement efficient knowledge retrieval, so that relevant information can be added to the model input dynamically at run-time.
如果模型作为其输入的一部分提供,则可以利用外部信息源。这可以帮助模型生成更明智和最新的响应。例如,如果用户询问有关特定电影的问题,则将有关电影的高质量信息(例如演员、导演等)添加到模型的输入中可能很有用。嵌入可用于实现高效的知识检索,以便在运行时将相关信息动态添加到模型输入中。
A text embedding is a vector that can measure the relatedness between text strings. Similar or relevant strings will be closer together than unrelated strings. This fact, along with the existence of fast vector search algorithms means that embeddings can be used to implement efficient knowledge retrieval. In particular, a text corpus can be split up into chunks, and each chunk can be embedded and stored. Then a given query can be embedded and vector search can be performed to find the embedded chunks of text from the corpus that are most related to the query (i.e. closest together in the embedding space).
文本嵌入是一种向量,可以测量文本字符串之间的相关性。相似或相关的字符串将比不相关的字符串更接近。这一事实,以及快速向量搜索算法的存在,意味着嵌入可用于实现高效的知识检索。特别是,文本语料库可以拆分为块,并且每个块都可以嵌入和存储。然后,可以嵌入给定的查询,并执行向量搜索,以从语料库中找到与查询最相关的嵌入文本块(即在嵌入空间中最接近的文本块)。
Example implementations can be found in the OpenAI Cookbook. See the tactic “Instruct the model to use retrieved knowledge to answer queries” for an example of how to use knowledge retrieval to minimize the likelihood that a model will make up incorrect facts.
可以在 OpenAI Cookbook 中找到示例实现。有关如何使用知识检索来最大程度地减少模型编造错误事实的可能性的示例,请参阅策略“指示模型使用检索到的知识来回答查询”。
Tactic: Use code execution to perform more accurate calculations or call external APIs
策略:使用代码执行来执行更准确的计算或调用外部 API
Language models cannot be relied upon to perform arithmetic or long calculations accurately on their own. In cases where this is needed, a model can be instructed to write and run code instead of making its own calculations. In particular, a model can be instructed to put code that is meant to be run into a designated format such as triple backtick. After an output is produced, the code can be extracted and run. Finally, if necessary, the output from the code execution engine (i.e. Python interpreter) can be provided as an input to the model for the next query.
不能依靠语言模型本身准确地执行算术或长计算。在需要的情况下,可以指示模型编写和运行代码,而不是进行自己的计算。特别是,可以指示模型将要运行的代码转换为指定的格式,例如三重反引号。生成输出后,可以提取并运行代码。最后,如有必要,代码执行引擎(即 Python 解释器)的输出可以作为下一个查询的模型输入提供。
Another good use case for code execution is calling external APIs. If a model is instructed in the proper use of an API, it can write code that makes use of it. A model can be instructed in how to use an API by providing it with documentation and/or code samples showing how to use the API.
代码执行的另一个很好的用例是调用外部 API。如果指示模型正确使用 API,则可以编写使用它的代码。通过向模型提供演示如何使用 API 的文档和/或代码示例,可以指导模型如何使用 API。
WARNING: Executing code produced by a model is not inherently safe and precautions should be taken in any application that seeks to do this. In particular, a sandboxed code execution environment is needed to limit the harm that untrusted code could cause.
警告:执行模型生成的代码本身并不安全,在任何试图执行此操作的应用程序中都应采取预防措施。特别是,需要沙盒代码执行环境来限制不受信任的代码可能造成的危害。
Tactic: Give the model access to specific functions
策略:授予模型对特定函数的访问权限
The Chat Completions API allows passing a list of function descriptions in requests. This enables models to generate function arguments according to the provided schemas. Generated function arguments are returned by the API in JSON format and can be used to execute function calls. Output provided by function calls can then be fed back into a model in the following request to close the loop. This is the recommended way of using OpenAI models to call external functions. To learn more see the function calling section in our introductory text generation guide and more function calling examples in the OpenAI Cookbook.
聊天完成 API 允许在请求中传递函数描述列表。这使模型能够根据提供的架构生成函数参数。生成的函数参数由 API 以 JSON 格式返回,可用于执行函数调用。然后,可以在以下请求中将函数调用提供的输出反馈到模型中以关闭循环。这是使用 OpenAI 模型调用外部函数的推荐方式。要了解更多信息,请参阅我们的介绍性文本生成指南中的函数调用部分,以及 OpenAI Cookbook 中的更多函数调用示例。
Strategy: Test changes systematically
策略:系统地测试变更
Sometimes it can be hard to tell whether a change — e.g., a new instruction or a new design — makes your system better or worse. Looking at a few examples may hint at which is better, but with small sample sizes it can be hard to distinguish between a true improvement or random luck. Maybe the change helps performance on some inputs, but hurts performance on others.
有时很难判断更改(例如,新指令或新设计)是使您的系统变得更好还是更糟。看几个例子可能会暗示哪个更好,但是由于样本量很小,很难区分真正的改进还是随机运气。也许这种变化有助于某些输入的性能,但会损害其他输入的性能。
Evaluation procedures (or "evals") are useful for optimizing system designs. Good evals are:
评估程序(或“评估”)对于优化系统设计非常有用。好的评估是:
- Representative of real-world usage (or at least diverse)
代表实际使用(或至少多样化) - Contain many test cases for greater statistical power (see table below for guidelines)
包含许多测试用例以获得更高的统计功效(有关指南,请参阅下表) - Easy to automate or repeat
易于自动化或重复
Difference to detect 要检测的差异 | Sample size needed for 95% confidence 95% 置信度所需的样本量 |
---|---|
30% | ~10 |
10% | ~100 |
3% | ~1,000 |
1% | ~10,000 |
Evaluation of outputs can be done by computers, humans, or a mix. Computers can automate evals with objective criteria (e.g., questions with single correct answers) as well as some subjective or fuzzy criteria, in which model outputs are evaluated by other model queries. OpenAI Evals is an open-source software framework that provides tools for creating automated evals.
输出的评估可以由计算机、人类或混合完成。计算机可以使用客观标准(例如,具有单个正确答案的问题)以及一些主观或模糊标准自动进行评估,其中模型输出由其他模型查询评估。OpenAI Evals 是一个开源软件框架,提供用于创建自动评估的工具。
Model-based evals can be useful when there exists a range of possible outputs that would be considered equally high in quality (e.g. for questions with long answers). The boundary between what can be realistically evaluated with a model-based eval and what requires a human to evaluate is fuzzy and is constantly shifting as models become more capable. We encourage experimentation to figure out how well model-based evals can work for your use case.
当存在一系列被认为质量同样高的可能输出时,基于模型的评估可能很有用(例如,对于答案较长的问题)。使用基于模型的评估可以实际评估的内容与需要人工评估的内容之间的界限是模糊的,并且随着模型的能力越来越强而不断变化。我们鼓励进行实验,以确定基于模型的评估在您的用例中的效果如何。
Tactic: Evaluate model outputs with reference to gold-standard answers
策略:参考黄金标准答案评估模型输出
Suppose it is known that the correct answer to a question should make reference to a specific set of known facts. Then we can use a model query to count how many of the required facts are included in the answer.
假设已知一个问题的正确答案应该参考一组特定的已知事实。然后,我们可以使用模型查询来计算答案中包含多少所需事实。
For example, using the following system message:
例如,使用以下系统消息:
Here's an example input where both points are satisfied:
下面是一个同时满足这两点的示例输入:
Here's an example input where only one point is satisfied:
下面是一个仅满足一个点的示例输入:
Here's an example input where none are satisfied:
下面是一个不满足任何条件的示例输入:
There are many possible variants on this type of model-based eval. Consider the following variation which tracks the kind of overlap between the candidate answer and the gold-standard answer, and also tracks whether the candidate answer contradicts any part of the gold-standard answer.
这种基于模型的评估有许多可能的变体。考虑以下变体,该变体跟踪候选答案和黄金标准答案之间的重叠类型,并跟踪候选答案是否与黄金标准答案的任何部分相矛盾。
Here's an example input with a substandard answer which nonetheless does not contradict the expert answer:
下面是一个示例输入,其答案不合格,但与专家答案并不矛盾:
Here's an example input with answer that directly contradicts the expert answer:
下面是一个示例输入,其答案与专家答案直接矛盾:
Here's an example input with a correct answer that also provides a bit more detail than is necessary:
下面是一个示例输入,其中包含一个正确答案,该输入还提供了比必要更多的详细信息:
Other resources 其他资源
For more inspiration, visit the OpenAI Cookbook, which contains example code and also links to third-party resources such as:
如需更多灵感,请访问 OpenAI Cookbook,其中包含示例代码以及指向第三方资源的链接,例如:
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步