Skip to main content

OpenAI 函数元数据标记器

¥OpenAI functions metadata tagger

使用结构化元数据(例如文档的标题、语气或长度)标记已提取的文档通常很有用,以便以后进行更有针对性的相似性搜索。但是,对于大量文档,手动执行此标记过程可能非常繁琐。

¥It can often be useful to tag ingested documents with structured metadata, such as the title, tone, or length of a document, to allow for more targeted similarity search later. However, for large numbers of documents, performing this labelling process manually can be tedious.

MetadataTagger 文档转换器通过根据提供的模式从每个提供的文档中提取元数据来自动化此过程。它在底层使用可配置的 OpenAI Functions 驱动链,因此如果你传递自定义 LLM 实例,则该实例必须是支持函数的 OpenAI 模型。

¥The MetadataTagger document transformer automates this process by extracting metadata from each provided document according to a provided schema. It uses a configurable OpenAI Functions-powered chain under the hood, so if you pass a custom LLM instance, it must be an OpenAI model with functions support.

注意:此文档转换器最适合处理完整文档,因此最好先对整个文档运行它,然后再进行任何其他拆分或处理!

¥Note: This document transformer works best with complete documents, so it's best to run it first with whole documents before doing any other splitting or processing!

用法

¥Usage

例如,假设你想索引一组电影评论。你可以按如下方式初始化文档转换器:

¥For example, let's say you wanted to index a set of movie reviews. You could initialize the document transformer as follows:

npm install @langchain/openai @langchain/core
import { z } from "zod";
import { createMetadataTaggerFromZod } from "langchain/document_transformers/openai_functions";
import { ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";

const zodSchema = z.object({
movie_title: z.string(),
critic: z.string(),
tone: z.enum(["positive", "negative"]),
rating: z
.optional(z.number())
.describe("The number of stars the critic rated the movie"),
});

const metadataTagger = createMetadataTaggerFromZod(zodSchema, {
llm: new ChatOpenAI({ model: "gpt-3.5-turbo" }),
});

const documents = [
new Document({
pageContent:
"Review of The Bee Movie\nBy Roger Ebert\nThis is the greatest movie ever made. 4 out of 5 stars.",
}),
new Document({
pageContent:
"Review of The Godfather\nBy Anonymous\n\nThis movie was super boring. 1 out of 5 stars.",
metadata: { reliable: false },
}),
];
const taggedDocuments = await metadataTagger.transformDocuments(documents);

console.log(taggedDocuments);

/*
[
Document {
pageContent: 'Review of The Bee Movie\n' +
'By Roger Ebert\n' +
'This is the greatest movie ever made. 4 out of 5 stars.',
metadata: {
movie_title: 'The Bee Movie',
critic: 'Roger Ebert',
tone: 'positive',
rating: 4
}
},
Document {
pageContent: 'Review of The Godfather\n' +
'By Anonymous\n' +
'\n' +
'This movie was super boring. 1 out of 5 stars.',
metadata: {
movie_title: 'The Godfather',
critic: 'Anonymous',
tone: 'negative',
rating: 1,
reliable: false
}
}
]
*/

API Reference:

还有一个额外的 createMetadataTagger 方法,它同样接受有效的 JSON Schema 对象。

¥There is an additional createMetadataTagger method that accepts a valid JSON Schema object as well.

自定义

¥Customization

你可以在第二个选项参数中将标准 LLMChain 参数传递给底层标记链。例如,如果你想让 LLM 关注输入文档中的特定细节,或者以某种风格提取元数据,你可以传入一个自定义提示:

¥You can pass the underlying tagging chain the standard LLMChain arguments in the second options parameter. For example, if you wanted to ask the LLM to focus specific details in the input documents, or extract metadata in a certain style, you could pass in a custom prompt:

import { z } from "zod";
import { createMetadataTaggerFromZod } from "langchain/document_transformers/openai_functions";
import { ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { PromptTemplate } from "@langchain/core/prompts";

const taggingChainTemplate = `Extract the desired information from the following passage.
Anonymous critics are actually Roger Ebert.

Passage:
{input}
`;

const zodSchema = z.object({
movie_title: z.string(),
critic: z.string(),
tone: z.enum(["positive", "negative"]),
rating: z
.optional(z.number())
.describe("The number of stars the critic rated the movie"),
});

const metadataTagger = createMetadataTaggerFromZod(zodSchema, {
llm: new ChatOpenAI({ model: "gpt-3.5-turbo" }),
prompt: PromptTemplate.fromTemplate(taggingChainTemplate),
});

const documents = [
new Document({
pageContent:
"Review of The Bee Movie\nBy Roger Ebert\nThis is the greatest movie ever made. 4 out of 5 stars.",
}),
new Document({
pageContent:
"Review of The Godfather\nBy Anonymous\n\nThis movie was super boring. 1 out of 5 stars.",
metadata: { reliable: false },
}),
];
const taggedDocuments = await metadataTagger.transformDocuments(documents);

console.log(taggedDocuments);

/*
[
Document {
pageContent: 'Review of The Bee Movie\n' +
'By Roger Ebert\n' +
'This is the greatest movie ever made. 4 out of 5 stars.',
metadata: {
movie_title: 'The Bee Movie',
critic: 'Roger Ebert',
tone: 'positive',
rating: 4
}
},
Document {
pageContent: 'Review of The Godfather\n' +
'By Anonymous\n' +
'\n' +
'This movie was super boring. 1 out of 5 stars.',
metadata: {
movie_title: 'The Godfather',
critic: 'Roger Ebert',
tone: 'negative',
rating: 1,
reliable: false
}
}
]
*/

API Reference: