
Multimodality

Overview

Multimodality refers to the ability to work with data that comes in different forms, such as text, audio, images, and video. Multimodality can appear in various components, allowing models and systems to handle and process a mix of these data types seamlessly.

  • Chat Models: These could, in theory, accept and generate multimodal inputs and outputs, handling a variety of data types like text, images, audio, and video.

  • Embedding Models: Embedding models can represent multimodal content, embedding various forms of data, such as text, images, and audio, into vector spaces.

  • Vector Stores: Vector stores could search over embeddings that represent multimodal data, enabling retrieval across different types of information.

Multimodality in chat models

Multimodal support is still relatively new and less common, and model providers have not yet standardized on the "best" way to define the API. As such, LangChain's multimodal abstractions are lightweight and flexible, designed to accommodate different providers' APIs and interaction patterns, but they are not standardized across models.

How to use multimodal models

What kind of multimodality is supported?

Inputs

Some models can accept multimodal inputs, such as images, audio, video, or files. The types of multimodal inputs supported depend on the model provider. For instance, Google's Gemini supports documents like PDFs as inputs.

Most chat models that support multimodal inputs also accept those values in OpenAI's content blocks format. So far this is restricted to image inputs. For models like Gemini, which support video and other bytes input, the APIs also support the native, model-specific representations.

The gist of passing multimodal inputs to a chat model is to use content blocks that specify a type and the corresponding data. For example, to pass an image to a chat model:

import { HumanMessage } from "@langchain/core/messages";

// `model` is assumed to be any multimodal-capable chat model instance.
// `image_url` is a placeholder; point it at a real, publicly reachable image.
const image_url = "https://example.com/weather.jpg";

const message = new HumanMessage({
  content: [
    { type: "text", text: "describe the weather in this image" },
    { type: "image_url", image_url: { url: image_url } },
  ],
});
const response = await model.invoke([message]);
caution

The exact format of the content blocks may vary depending on the model provider. Please refer to the chat model's integration documentation for the correct format. You can find the integration in the chat model integration table.
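Many providers also accept inline base64 data in place of a URL. As a hedged sketch (the file name here is hypothetical, and the data-URL form follows OpenAI's content block format; confirm against your provider's integration page):

import * as fs from "node:fs";
import { HumanMessage } from "@langchain/core/messages";

// Hypothetical local file, encoded as a base64 data URL.
const base64 = fs.readFileSync("weather.jpg").toString("base64");

const message = new HumanMessage({
  content: [
    { type: "text", text: "describe the weather in this image" },
    {
      type: "image_url",
      image_url: { url: `data:image/jpeg;base64,${base64}` },
    },
  ],
});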

Outputs

Virtually no popular chat models support multimodal outputs at the time of writing (October 2024).

The only exception is OpenAI's chat model (gpt-4o-audio-preview), which can generate audio outputs.

Multimodal outputs will appear as part of the AIMessage response object.

See the ChatOpenAI integration documentation for more information on how to use multimodal outputs.
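As a hedged sketch of audio output with @langchain/openai (assuming access to gpt-4o-audio-preview; the modalities and audio options are forwarded to OpenAI's API, and the exact response shape may vary by version):

import { ChatOpenAI } from "@langchain/openai";

const audioModel = new ChatOpenAI({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
});

const aiMessage = await audioModel.invoke("Tell me a short story.");

// The audio payload (base64 data plus a transcript) arrives on the
// AIMessage's additional_kwargs rather than in `content`.
console.log(aiMessage.additional_kwargs.audio);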

Tools

Currently, no chat model is designed to work directly with multimodal data in a tool call request or ToolMessage result.

However, a chat model can easily interact with multimodal data by invoking tools with references (e.g., a URL) to the multimodal data, rather than the data itself. For example, any model capable of tool calling can be equipped with tools to download and process images, audio, or video.
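As a minimal sketch of this pattern (describe_image and its schema are hypothetical; tool and bindTools are LangChain's standard helpers for defining and attaching tools):

import { tool } from "@langchain/core/tools";
import { z } from "zod";

// Hypothetical tool: the model passes a URL, not the image bytes themselves.
const describeImage = tool(
  async ({ url }) => {
    // Download and analyze the image here (logic omitted in this sketch),
    // then return a plain-text result the model can reason over.
    return `Fetched image from ${url}; analysis omitted.`;
  },
  {
    name: "describe_image",
    description: "Download an image from a URL and describe its contents.",
    schema: z.object({ url: z.string().describe("URL of the image to fetch") }),
  }
);

// Any tool-calling chat model can then be bound to this tool.
const modelWithTools = model.bindTools([describeImage]);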

Multimodality in embedding models

Embeddings are vector representations of data used for tasks like similarity search and retrieval.

The current embedding interface used in LangChain is optimized entirely for text-based data and will not work with multimodal data.
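To illustrate the text-only shape of the current interface (a sketch using OpenAIEmbeddings as one concrete implementation; any Embeddings class exposes the same string-based methods):

import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });

// Both methods accept only strings, which is why multimodal data
// cannot flow through this interface today.
const queryVector = await embeddings.embedQuery("a photo of a sunset");
const docVectors = await embeddings.embedDocuments(["doc one", "doc two"]);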

As use cases involving multimodal search and retrieval tasks become more common, we expect to expand the embedding interface to accommodate other data types like images, audio, and video.

Multimodality in vector stores

Vector stores are databases for storing and retrieving embeddings, which are typically used in search and retrieval tasks. Similar to embeddings, vector stores are currently optimized for text-based data.
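A brief sketch of this text-oriented workflow, using the in-memory store from the langchain package and the embeddings instance from the previous sketch (both choices are illustrative):

import { MemoryVectorStore } from "langchain/vectorstores/memory";

// Text in, text out: documents are embedded and searched as strings.
const store = await MemoryVectorStore.fromTexts(
  ["LangChain supports many model providers.", "Vector stores hold embeddings."],
  [{ id: 1 }, { id: 2 }],
  embeddings
);
const results = await store.similaritySearch("what holds embeddings?", 1);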

As use cases involving multimodal search and retrieval tasks become more common, we expect to expand the vector store interface to accommodate other data types like images, audio, and video.