Skip to main content

如何为每个文档生成多个嵌入

¥How to generate multiple embeddings per document

Prerequisites

本指南假设你熟悉以下概念:

¥This guide assumes familiarity with the following concepts:

嵌入原始文档的不同表示形式,然后在任何一种表示形式导致搜索结果时返回原始文档,这可以让你调整和提升检索性能。LangChain 有一个基础 MultiVectorRetriever 库,专门为此而设计!

¥Embedding different representations of an original document, then returning the original document when any of the representations result in a search hit, can allow you to tune and improve your retrieval performance. LangChain has a base MultiVectorRetriever designed to do just this!

很大程度上复杂在于如何为每个文档创建多个向量。本指南介绍创建这些向量和使用 MultiVectorRetriever 的一些常用方法。

¥A lot of the complexity lies in how to create the multiple vectors per document. This guide covers some of the common ways to create those vectors and use the MultiVectorRetriever.

为每个文档创建多个向量的一些方法包括:

¥Some methods to create multiple vectors per document include:

  • 更小的块:将文档拆分成更小的块,并嵌入这些块(例如 ParentDocumentRetriever

    ¥smaller chunks: split a document into smaller chunks, and embed those (e.g. the ParentDocumentRetriever)

  • 摘要:为每个文档创建摘要,并将其与文档一起嵌入(或代替文档嵌入)

    ¥summary: create a summary for each document, embed that along with (or instead of) the document

  • 假设性问题:创建每个文档都适合回答的假设问题,并将其与文档一起嵌入(或代替文档嵌入)

    ¥hypothetical questions: create hypothetical questions that each document would be appropriate to answer, embed those along with (or instead of) the document

请注意,这也启用了另一种添加嵌入的方法。 - 手动。这很棒,因为你可以明确添加问题或查询,这些问题或查询应该会导致文档被恢复,从而赋予你更多控制权。

¥Note that this also enables another method of adding embeddings - manually. This is great because you can explicitly add questions or queries that should lead to a document being recovered, giving you more control.

更小的块

¥Smaller chunks

通常,检索较大的信息块,但嵌入较小的信息块会很有用。这允许嵌入尽可能紧密地捕捉语义,但同时尽可能多地将上下文传递到下游。注意:这就是 ParentDocumentRetriever 的作用。这里我们展示了底层是如何运作的。

¥Often times it can be useful to retrieve larger chunks of information, but embed smaller chunks. This allows for embeddings to capture the semantic meaning as closely as possible, but for as much context as possible to be passed downstream. NOTE: this is what the ParentDocumentRetriever does. Here we show what is going on under the hood.

npm install @langchain/openai @langchain/community @langchain/core
import * as uuid from "uuid";

import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { InMemoryStore } from "@langchain/core/stores";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { Document } from "@langchain/core/documents";

const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();

const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 10000,
chunkOverlap: 20,
});

const docs = await splitter.splitDocuments(parentDocuments);

const idKey = "doc_id";
const docIds = docs.map((_) => uuid.v4());

const childSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 400,
chunkOverlap: 0,
});

const subDocs = [];
for (let i = 0; i < docs.length; i += 1) {
const childDocs = await childSplitter.splitDocuments([docs[i]]);
const taggedChildDocs = childDocs.map((childDoc) => {
// eslint-disable-next-line no-param-reassign
childDoc.metadata[idKey] = docIds[i];
return childDoc;
});
subDocs.push(...taggedChildDocs);
}

// The byteStore to use to store the original chunks
const byteStore = new InMemoryStore<Uint8Array>();

// The vectorstore to use to index the child chunks
const vectorstore = await FaissStore.fromDocuments(
subDocs,
new OpenAIEmbeddings()
);

const retriever = new MultiVectorRetriever({
vectorstore,
byteStore,
idKey,
// Optional `k` parameter to search for more child documents in VectorStore.
// Note that this does not exactly correspond to the number of final (parent) documents
// retrieved, as multiple child documents can point to the same parent.
childK: 20,
// Optional `k` parameter to limit number of final, parent documents returned from this
// retriever and sent to LLM. This is an upper-bound, and the final count may be lower than this.
parentK: 5,
});

const keyValuePairs: [string, Document][] = docs.map((originalDoc, i) => [
docIds[i],
originalDoc,
]);

// Use the retriever to add the original chunks to the document store
await retriever.docstore.mset(keyValuePairs);

// Vectorstore alone retrieves the small chunks
const vectorstoreResult = await retriever.vectorstore.similaritySearch(
"justice breyer"
);
console.log(vectorstoreResult[0].pageContent.length);
/*
390
*/

// Retriever returns larger result
const retrieverResult = await retriever.invoke("justice breyer");
console.log(retrieverResult[0].pageContent.length);
/*
9770
*/

API Reference:

摘要

¥Summary

通常,摘要可以更准确地提炼出信息块的内容,从而实现更好的检索。这里我们展示了如何创建摘要,然后嵌入它们。

¥Oftentimes a summary may be able to distill more accurately what a chunk is about, leading to better retrieval. Here we show how to create summaries, and then embed those.

import * as uuid from "uuid";

import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { InMemoryStore } from "@langchain/core/stores";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence } from "@langchain/core/runnables";
import { Document } from "@langchain/core/documents";

const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();

const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 10000,
chunkOverlap: 20,
});

const docs = await splitter.splitDocuments(parentDocuments);

const chain = RunnableSequence.from([
{ content: (doc: Document) => doc.pageContent },
PromptTemplate.fromTemplate(`Summarize the following document:\n\n{content}`),
new ChatOpenAI({
maxRetries: 0,
}),
new StringOutputParser(),
]);

const summaries = await chain.batch(docs, {
maxConcurrency: 5,
});

const idKey = "doc_id";
const docIds = docs.map((_) => uuid.v4());
const summaryDocs = summaries.map((summary, i) => {
const summaryDoc = new Document({
pageContent: summary,
metadata: {
[idKey]: docIds[i],
},
});
return summaryDoc;
});

// The byteStore to use to store the original chunks
const byteStore = new InMemoryStore<Uint8Array>();

// The vectorstore to use to index the child chunks
const vectorstore = await FaissStore.fromDocuments(
summaryDocs,
new OpenAIEmbeddings()
);

const retriever = new MultiVectorRetriever({
vectorstore,
byteStore,
idKey,
});

const keyValuePairs: [string, Document][] = docs.map((originalDoc, i) => [
docIds[i],
originalDoc,
]);

// Use the retriever to add the original chunks to the document store
await retriever.docstore.mset(keyValuePairs);

// We could also add the original chunks to the vectorstore if we wish
// const taggedOriginalDocs = docs.map((doc, i) => {
// doc.metadata[idKey] = docIds[i];
// return doc;
// });
// retriever.vectorstore.addDocuments(taggedOriginalDocs);

// Vectorstore alone retrieves the small chunks
const vectorstoreResult = await retriever.vectorstore.similaritySearch(
"justice breyer"
);
console.log(vectorstoreResult[0].pageContent.length);
/*
1118
*/

// Retriever returns larger result
const retrieverResult = await retriever.invoke("justice breyer");
console.log(retrieverResult[0].pageContent.length);
/*
9770
*/

API Reference:

假设查询

¥Hypothetical queries

LLM 还可用于生成针对特定文档提出的一系列假设性问题。然后可以嵌入这些问题并用于检索原始文档:

¥An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document. These questions can then be embedded and used to retrieve the original document:

import * as uuid from "uuid";

import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { InMemoryStore } from "@langchain/core/stores";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PromptTemplate } from "@langchain/core/prompts";
import { RunnableSequence } from "@langchain/core/runnables";
import { Document } from "@langchain/core/documents";
import { JsonKeyOutputFunctionsParser } from "@langchain/core/output_parsers/openai_functions";

const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();

const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 10000,
chunkOverlap: 20,
});

const docs = await splitter.splitDocuments(parentDocuments);

const functionsSchema = [
{
name: "hypothetical_questions",
description: "Generate hypothetical questions",
parameters: {
type: "object",
properties: {
questions: {
type: "array",
items: {
type: "string",
},
},
},
required: ["questions"],
},
},
];

const functionCallingModel = new ChatOpenAI({
maxRetries: 0,
model: "gpt-4",
})
.bindTools(functionsSchema)
.withConfig({
function_call: { name: "hypothetical_questions" },
});

const chain = RunnableSequence.from([
{ content: (doc: Document) => doc.pageContent },
PromptTemplate.fromTemplate(
`Generate a list of 3 hypothetical questions that the below document could be used to answer:\n\n{content}`
),
functionCallingModel,
new JsonKeyOutputFunctionsParser<string[]>({ attrName: "questions" }),
]);

const hypotheticalQuestions = await chain.batch(docs, {
maxConcurrency: 5,
});

const idKey = "doc_id";
const docIds = docs.map((_) => uuid.v4());
const hypotheticalQuestionDocs = hypotheticalQuestions
.map((questionArray, i) => {
const questionDocuments = questionArray.map((question) => {
const questionDocument = new Document({
pageContent: question,
metadata: {
[idKey]: docIds[i],
},
});
return questionDocument;
});
return questionDocuments;
})
.flat();

// The byteStore to use to store the original chunks
const byteStore = new InMemoryStore<Uint8Array>();

// The vectorstore to use to index the child chunks
const vectorstore = await FaissStore.fromDocuments(
hypotheticalQuestionDocs,
new OpenAIEmbeddings()
);

const retriever = new MultiVectorRetriever({
vectorstore,
byteStore,
idKey,
});

const keyValuePairs: [string, Document][] = docs.map((originalDoc, i) => [
docIds[i],
originalDoc,
]);

// Use the retriever to add the original chunks to the document store
await retriever.docstore.mset(keyValuePairs);

// We could also add the original chunks to the vectorstore if we wish
// const taggedOriginalDocs = docs.map((doc, i) => {
// doc.metadata[idKey] = docIds[i];
// return doc;
// });
// retriever.vectorstore.addDocuments(taggedOriginalDocs);

// Vectorstore alone retrieves the small chunks
const vectorstoreResult = await retriever.vectorstore.similaritySearch(
"justice breyer"
);
console.log(vectorstoreResult[0].pageContent);
/*
"What measures will be taken to crack down on corporations overcharging American businesses and consumers?"
*/

// Retriever returns larger result
const retrieverResult = await retriever.invoke("justice breyer");
console.log(retrieverResult[0].pageContent.length);
/*
9770
*/

API Reference:

后续步骤

¥Next steps

现在你已经了解了一些为每个文档生成多个嵌入的方法。

¥You've now learned a few ways to generate multiple embeddings per document.

接下来,查看各个部分,深入了解特定检索器、关于 RAG 的更广泛教程,或查看本节,了解如何使用 基于任何数据源创建你自己的自定义检索器

¥Next, check out the individual sections for deeper dives on specific retrievers, the broader tutorial on RAG, or this section to learn how to create your own custom retriever over any data source.