
How to stream responses from an LLM

All LLMs implement the Runnable interface, which comes with default implementations of standard runnable methods (i.e. ainvoke, batch, abatch, stream, astream, astream_events).

The default streaming implementations provide an AsyncGenerator that yields a single value: the final output from the underlying chat model provider.

The ability to stream the output token-by-token depends on whether the provider has implemented proper streaming support.

See which integrations support token-by-token streaming here.

:::{.callout-note}

The default implementation does not provide support for token-by-token streaming, but it ensures that the model can be swapped in for any other model as it supports the same standard interface.

:::
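
In addition to .stream() (shown below), the same Runnable interface exposes event-based streaming; in the JS API the method corresponding to astream_events is streamEvents. The following is a rough sketch rather than a definitive reference, and the exact event names and payload shapes depend on the model class and event-format version:

import { OpenAI } from "@langchain/openai";

const model = new OpenAI({ maxTokens: 25 });

// streamEvents yields structured events (run start, streamed chunks, run end)
// instead of raw output chunks.
const eventStream = model.streamEvents("Tell me a joke.", { version: "v2" });

for await (const event of eventStream) {
  // Log each event name (e.g. "on_llm_stream") together with its payload.
  console.log(event.event, event.data);
}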

Using .stream()

The easiest way to stream is to use the .stream() method. This returns a readable stream that you can also iterate over:

npm install @langchain/openai @langchain/core
import { OpenAI } from "@langchain/openai";

const model = new OpenAI({
  maxTokens: 25,
});

const stream = await model.stream("Tell me a joke.");

for await (const chunk of stream) {
  console.log(chunk);
}

/*


Q
:
What
did
the
fish
say
when
it
hit
the
wall
?


A
:
Dam
!
*/


For models that do not support streaming, the entire response will be returned as a single chunk.
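
Since each chunk yielded by an LLM's .stream() is a string, the full completion can be rebuilt by concatenating the chunks as they arrive. A minimal sketch, assuming the same model setup as above:

import { OpenAI } from "@langchain/openai";

const model = new OpenAI({
  maxTokens: 25,
});

const stream = await model.stream("Tell me a joke.");

// Accumulate the streamed string chunks into the final completion.
let output = "";
for await (const chunk of stream) {
  output += chunk;
}
console.log(output);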

Using a callback handler

You can also use a CallbackHandler like so:

import { OpenAI } from "@langchain/openai";

// To enable streaming, we pass in `streaming: true` to the LLM constructor.
// Additionally, we pass in a handler for the `handleLLMNewToken` event.
const model = new OpenAI({
  maxTokens: 25,
  streaming: true,
});

const response = await model.invoke("Tell me a joke.", {
  callbacks: [
    {
      handleLLMNewToken(token: string) {
        console.log({ token });
      },
    },
  ],
});
console.log(response);
/*
{ token: '\n' }
{ token: '\n' }
{ token: 'Q' }
{ token: ':' }
{ token: ' Why' }
{ token: ' did' }
{ token: ' the' }
{ token: ' chicken' }
{ token: ' cross' }
{ token: ' the' }
{ token: ' playground' }
{ token: '?' }
{ token: '\n' }
{ token: 'A' }
{ token: ':' }
{ token: ' To' }
{ token: ' get' }
{ token: ' to' }
{ token: ' the' }
{ token: ' other' }
{ token: ' slide' }
{ token: '.' }


Q: Why did the chicken cross the playground?
A: To get to the other slide.
*/


We still have access to the final LLMResult if we use generate. However, tokenUsage may not currently be supported for all model providers when streaming.
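
For example, generate returns an LLMResult that you can inspect once the stream has finished. A minimal sketch; whether tokenUsage is populated depends on the provider:

import { OpenAI } from "@langchain/openai";

const model = new OpenAI({
  maxTokens: 25,
  streaming: true,
});

// generate() resolves to an LLMResult once streaming has finished.
const llmResult = await model.generate(["Tell me a joke."]);

console.log(llmResult.generations[0][0].text);
// tokenUsage may be undefined for some providers when streaming is enabled.
console.log(llmResult.llmOutput?.tokenUsage);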