Few-shot prompting
Overview
One of the most effective ways to improve model performance is to give a model examples of what you want it to do. The technique of adding example inputs and expected outputs to a model prompt is known as "few-shot prompting". The technique is based on the Language Models are Few-Shot Learners paper. There are a few things to think about when doing few-shot prompting:
How are examples generated?
How many examples are in each prompt?
How are examples selected at runtime?
How are examples formatted in the prompt?
Here are the considerations for each.
1. Generating examples
The first and most important step of few-shot prompting is coming up with a good dataset of examples. Good examples should be relevant at runtime, clear, informative, and provide information that was not already known to the model.
At a high level, the basic ways to generate examples are:
Manual: a person or people generate examples they think are useful.
Better model: a better (presumably more expensive/slower) model's responses are used as examples for a worse (presumably cheaper/faster) model.
User feedback: users (or labelers) leave feedback on interactions with the application and examples are generated based on that feedback (for example, all interactions with positive feedback could be turned into examples).
LLM feedback: same as user feedback, but the process is automated by having models evaluate themselves.
Which approach is best depends on your task. For tasks where a small number of core principles need to be understood really well, it can be valuable to hand-craft a few really good examples. For tasks where the space of correct behaviors is broader and more nuanced, it can be useful to generate many examples in a more automated fashion so that there's a higher likelihood of some highly relevant examples for any runtime input.
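The user-feedback approach above can be sketched in plain Python. Here `interactions` and the `thumbs_up` field are hypothetical stand-ins for whatever your application actually logs:

```python
# Hypothetical interaction log: each record holds the user input, the model
# output, and the feedback the user left ("thumbs_up" is an assumed field name).
interactions = [
    {"input": "2+2", "output": "4", "thumbs_up": True},
    {"input": "capital of France", "output": "Berlin", "thumbs_up": False},
    {"input": "spell 'cat' backwards", "output": "tac", "thumbs_up": True},
]

# Turn every positively-rated interaction into a few-shot example.
examples = [
    {"input": i["input"], "output": i["output"]}
    for i in interactions
    if i["thumbs_up"]
]

print(examples)
```

Only the positively-rated interactions survive, so the resulting dataset reflects behavior users actually endorsed.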
Single-turn vs. multi-turn examples
Another dimension to think about when generating examples is what the example is actually showing.
The simplest types of examples just have a user input and an expected model output. These are single-turn examples.
A more complex type of example is one where the example is an entire conversation, usually one in which a model initially responds incorrectly and a user then tells the model how to correct its answer. This is called a multi-turn example. Multi-turn examples can be useful for more nuanced tasks where it's helpful to show common errors and spell out exactly why they're wrong and what should be done instead.
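The two example shapes can be illustrated with plain role/content message dicts (the exact message format depends on your model provider; the content here is made up):

```python
# A single-turn example: one input and one expected output.
single_turn_example = [
    {"role": "user", "content": "What is 3 * 4?"},
    {"role": "assistant", "content": "12"},
]

# A multi-turn example: the model first answers incorrectly and the user
# corrects it, demonstrating both the common error and the fix.
multi_turn_example = [
    {"role": "user", "content": "What is 3 * 4?"},
    {"role": "assistant", "content": "7"},
    {"role": "user", "content": "That's 3 + 4. I asked for the product."},
    {"role": "assistant", "content": "Sorry, 3 * 4 = 12."},
]
```

The multi-turn form costs more tokens per example but encodes the reasoning behind the correction, not just the final answer.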
2. Number of examples
Once we have a dataset of examples, we need to think about how many examples should be in each prompt. The key tradeoff is that more examples generally improve performance, but larger prompts increase costs and latency. And beyond some threshold, having too many examples can start to confuse the model. Finding the right number of examples is highly dependent on the model, the task, the quality of the examples, and your cost and latency constraints. Anecdotally, the better the model is, the fewer examples it needs to perform well and the more quickly you hit steeply diminishing returns on adding more examples. But the best/only way to reliably answer this question is to run some experiments with different numbers of examples.
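Such an experiment is just a sweep over candidate example counts. A minimal harness sketch, where `run_eval` is a hypothetical placeholder you would replace with a real prompt-building and evaluation step:

```python
import random

def run_eval(num_examples: int) -> float:
    """Hypothetical stand-in for your real evaluation: build a prompt with
    `num_examples` few-shot examples, run it over a held-out test set, and
    return accuracy. The body here just produces a deterministic dummy score."""
    random.seed(num_examples)
    return round(random.random(), 2)

# Sweep over candidate example counts and record the score for each.
results = {k: run_eval(k) for k in (0, 2, 4, 8, 16)}
best_k = max(results, key=results.get)
print(results, "best:", best_k)
```

Plotting score against example count usually makes the diminishing-returns threshold obvious.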
3. Selecting examples
Assuming we are not adding our entire example dataset into each prompt, we need to have a way of selecting examples from our dataset based on a given input. We can do this:
Randomly
By (semantic or keyword-based) similarity of the inputs
Based on some other constraints, like token size
LangChain has a number of ExampleSelectors which make it easy to use any of these techniques.
Generally, selecting by semantic similarity leads to the best model performance. But how important this is, again, is model- and task-specific, and is something worth experimenting with.
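To make the selection step concrete, here is a minimal keyword-overlap selector written from scratch (LangChain's ExampleSelectors provide semantic, embedding-based variants of the same idea; the example data is invented):

```python
def select_examples(query: str, examples: list[dict], k: int = 2) -> list[dict]:
    """Pick the k examples whose inputs share the most words with the query.
    A deliberately simple keyword-overlap heuristic, not semantic similarity."""
    query_words = set(query.lower().split())

    def overlap(example: dict) -> int:
        return len(query_words & set(example["input"].lower().split()))

    # Sort by descending word overlap and keep the top k.
    return sorted(examples, key=overlap, reverse=True)[:k]

examples = [
    {"input": "convert 10 miles to km", "output": "16.09 km"},
    {"input": "what rhymes with cat", "output": "hat"},
    {"input": "convert 3 kg to pounds", "output": "6.61 lb"},
]
print(select_examples("convert 5 miles to km", examples, k=2))
```

Swapping `overlap` for a cosine similarity over embeddings turns this into the semantic-similarity selection the text recommends.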
4. Formatting examples
Most state-of-the-art models these days are chat models, so we'll focus on formatting examples for those. Our basic options are to insert the examples:
In the system prompt as a string
As their own messages
If we insert our examples into the system prompt as a string, we'll need to make sure it's clear to the model where each example begins and which parts are the input versus output. Different models respond better to different syntaxes, like ChatML, XML, TypeScript, etc.
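A sketch of the string-in-system-prompt option using XML-style delimiters (one of several syntaxes mentioned above; the tag names and example data are illustrative, not a fixed standard):

```python
def format_examples_xml(examples: list[dict]) -> str:
    """Render examples as an XML-style block so the model can see exactly
    where each example starts and which part is input vs. output."""
    blocks = []
    for ex in examples:
        blocks.append(
            "<example>\n"
            f"  <input>{ex['input']}</input>\n"
            f"  <output>{ex['output']}</output>\n"
            "</example>"
        )
    return "\n".join(blocks)

system_prompt = (
    "Answer unit-conversion questions.\n\n"
    "Here are some examples:\n"
    + format_examples_xml([
        {"input": "10 miles in km", "output": "16.09 km"},
        {"input": "3 kg in lb", "output": "6.61 lb"},
    ])
)
print(system_prompt)
```

The same helper could emit ChatML-style or TypeScript-style delimiters instead; which syntax works best is worth testing per model.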
If we insert our examples as messages, where each example is represented as a sequence of Human and AI messages, we might also want to assign names to our messages, like "example_user" and "example_assistant", to make it clear that these messages correspond to different actors than the latest input message.
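The examples-as-messages option can be sketched with OpenAI-style role/name/content dicts (with LangChain you would build HumanMessage/AIMessage objects carrying a `name` instead; the dict shape here is a simplification):

```python
def examples_to_messages(examples: list[dict]) -> list[dict]:
    """Expand input/output examples into named chat messages. The
    "example_user"/"example_assistant" names distinguish these few-shot
    messages from the real conversation turns."""
    messages = []
    for ex in examples:
        messages.append(
            {"role": "user", "name": "example_user", "content": ex["input"]}
        )
        messages.append(
            {"role": "assistant", "name": "example_assistant", "content": ex["output"]}
        )
    return messages

msgs = examples_to_messages([{"input": "2+2", "output": "4"}])
print(msgs)
```

These messages would be placed after the system prompt and before the latest user input when assembling the final prompt.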
Formatting tool call examples
One area where formatting examples as messages can be tricky is when our example outputs have tool calls. This is because different models have different constraints on what types of message sequences are allowed when any tool calls are generated:
Some models require that any AIMessage with tool calls be immediately followed by ToolMessages for every tool call.
Some models additionally require that any ToolMessages be immediately followed by an AIMessage before the next HumanMessage.
Some models require that tools are passed in to the model if there are any tool calls / ToolMessages in the chat history.
These requirements are model-specific and should be checked for the model you are using. If your model requires ToolMessages after tool calls and/or AIMessages after ToolMessages, and your examples only include expected tool calls and not the actual tool outputs, you can try adding dummy ToolMessages / AIMessages to the end of each example with generic contents to satisfy the API constraints.
In these cases it's especially worth experimenting with inserting your examples as strings versus messages, as having dummy messages can adversely affect certain models.
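The dummy-message padding described above can be sketched as follows, using simplified OpenAI-style message dicts (the exact tool-call fields vary by provider, and the example content is invented):

```python
def pad_tool_call_example(messages: list[dict]) -> list[dict]:
    """If an example ends on an assistant message with tool calls, append a
    dummy tool result per call plus a generic assistant reply, so the sequence
    satisfies APIs that require tool calls to be followed by tool messages."""
    padded = list(messages)
    last = padded[-1]
    if last.get("role") == "assistant" and last.get("tool_calls"):
        for call in last["tool_calls"]:
            padded.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": "(tool output omitted in example)",
            })
        padded.append({"role": "assistant", "content": "Done."})
    return padded

example = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": "",
     "tool_calls": [{"id": "call_1", "name": "get_weather",
                     "args": {"city": "Paris"}}]},
]
print(pad_tool_call_example(example))
```

The generic filler content keeps the API happy while still letting the example demonstrate the tool call you actually care about.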
You can see a case study of how Anthropic and OpenAI respond to different few-shot prompting techniques on two different tool calling benchmarks here.