Skip to main content

如何检索整个文档以形成一个块

¥How to retrieve the whole document for a chunk

Prerequisites

本指南假设你熟悉以下概念:

¥This guide assumes familiarity with the following concepts:

在拆分文档进行检索时,通常存在以下相互冲突的需求:

¥When splitting documents for retrieval, there are often conflicting desires:

  1. 你可能需要使用较小的文档,以便其嵌入能够最准确地反映其含义。如果文档过长,则嵌入可能会失去意义。

    ¥You may want to have small documents, so that their embeddings can most accurately reflect their meaning. If documents are too long, then the embeddings can lose meaning.

  2. 你希望文档足够长,以便保留每个块的上下文。

    ¥You want to have long enough documents that the context of each chunk is retained.

ParentDocumentRetriever 通过分割和存储小块数据来达到这种平衡。在检索过程中,它首先获取小块,然后查找这些块的父 ID,并返回这些较大的文档。

¥The ParentDocumentRetriever strikes that balance by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks but then looks up the parent ids for those chunks and returns those larger documents.

请注意,"父文档" 指的是小块的来源文档。这可以是整个原始文档,也可以是更大的块。

¥Note that "parent document" refers to the document that a small chunk originated from. This can either be the whole raw document OR a larger chunk.

这是 为每个文档生成多个嵌入 的更具体形式。

¥This is a more specific form of generating multiple embeddings per document.

用法

¥Usage

npm install @langchain/openai @langchain/core
import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { InMemoryStore } from "@langchain/core/stores";

const vectorstore = new MemoryVectorStore(new OpenAIEmbeddings());
const byteStore = new InMemoryStore<Uint8Array>();

const retriever = new ParentDocumentRetriever({
vectorstore,
byteStore,
// Optional, not required if you're already passing in split documents
parentSplitter: new RecursiveCharacterTextSplitter({
chunkOverlap: 0,
chunkSize: 500,
}),
childSplitter: new RecursiveCharacterTextSplitter({
chunkOverlap: 0,
chunkSize: 50,
}),
// Optional `k` parameter to search for more child documents in VectorStore.
// Note that this does not exactly correspond to the number of final (parent) documents
// retrieved, as multiple child documents can point to the same parent.
childK: 20,
// Optional `k` parameter to limit number of final, parent documents returned from this
// retriever and sent to LLM. This is an upper-bound, and the final count may be lower than this.
parentK: 5,
});
const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();

// We must add the parent documents via the retriever's addDocuments method
await retriever.addDocuments(parentDocuments);

const retrievedDocs = await retriever.invoke("justice breyer");

// Retrieved chunks are the larger parent chunks
console.log(retrievedDocs);
/*
[
Document {
pageContent: 'Tonight, I call on the Senate to pass — pass the Freedom to Vote Act. Pass the John Lewis Act — Voting Rights Act. And while you’re at it, pass the DISCLOSE Act so Americans know who is funding our elections.\n' +
'\n' +
'Look, tonight, I’d — I’d like to honor someone who has dedicated his life to serve this country: Justice Breyer — an Army veteran, Constitutional scholar, retiring Justice of the United States Supreme Court.',
metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
},
Document {
pageContent: 'As I did four days ago, I’ve nominated a Circuit Court of Appeals — Ketanji Brown Jackson. One of our nation’s top legal minds who will continue in just Brey- — Justice Breyer’s legacy of excellence. A former top litigator in private practice, a former federal public defender from a family of public-school educators and police officers — she’s a consensus builder.',
metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
},
Document {
pageContent: 'Justice Breyer, thank you for your service. Thank you, thank you, thank you. I mean it. Get up. Stand — let me see you. Thank you.\n' +
'\n' +
'And we all know — no matter what your ideology, we all know one of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.',
metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
}
]
*/

API Reference:

包含分数阈值

¥With Score Threshold

通过设置 scoreThresholdOptions 中的选项,我们可以强制 ParentDocumentRetriever 在后台使用 ScoreThresholdRetriever。这会将 ScoreThresholdRetriever 中的向量存储设置为我们在初始化 ParentDocumentRetriever 时传递的向量存储,同时还允许我们为检索器设置分数阈值。

¥By setting the options in scoreThresholdOptions we can force the ParentDocumentRetriever to use the ScoreThresholdRetriever under the hood. This sets the vector store inside ScoreThresholdRetriever as the one we passed when initializing ParentDocumentRetriever, while also allowing us to also set a score threshold for the retriever.

当你不确定需要多少文档(或者如果你确定,只需设置 maxK 选项),但又想确保获取的文档在一定的相关性阈值范围内时,此功能会很有帮助。

¥This can be helpful when you're not sure how many documents you want (or if you are sure, just set the maxK option), but you want to make sure that the documents you do get are within a certain relevancy threshold.

注意:如果传递了检索器,ParentDocumentRetriever 将默认使用它来检索小块数据,并通过 addDocuments 方法添加文档。

¥Note: if a retriever is passed, ParentDocumentRetriever will default to use it for retrieving small chunks, as well as adding documents via the addDocuments method.

import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { InMemoryStore } from "@langchain/core/stores";
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { ScoreThresholdRetriever } from "langchain/retrievers/score_threshold";

const vectorstore = new MemoryVectorStore(new OpenAIEmbeddings());
const byteStore = new InMemoryStore<Uint8Array>();

const childDocumentRetriever = ScoreThresholdRetriever.fromVectorStore(
vectorstore,
{
minSimilarityScore: 0.01, // Essentially no threshold
maxK: 1, // Only return the top result
}
);
const retriever = new ParentDocumentRetriever({
vectorstore,
byteStore,
childDocumentRetriever,
// Optional, not required if you're already passing in split documents
parentSplitter: new RecursiveCharacterTextSplitter({
chunkOverlap: 0,
chunkSize: 500,
}),
childSplitter: new RecursiveCharacterTextSplitter({
chunkOverlap: 0,
chunkSize: 50,
}),
});
const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();

// We must add the parent documents via the retriever's addDocuments method
await retriever.addDocuments(parentDocuments);

const retrievedDocs = await retriever.invoke("justice breyer");

// Retrieved chunk is the larger parent chunk
console.log(retrievedDocs);
/*
[
Document {
pageContent: 'Tonight, I call on the Senate to pass — pass the Freedom to Vote Act. Pass the John Lewis Act — Voting Rights Act. And while you’re at it, pass the DISCLOSE Act so Americans know who is funding our elections.\n' +
'\n' +
'Look, tonight, I’d — I’d like to honor someone who has dedicated his life to serve this country: Justice Breyer — an Army veteran, Constitutional scholar, retiring Justice of the United States Supreme Court.',
metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
},
]
*/

API Reference:

包含上下文块头

¥With Contextual chunk headers

设想这样一种场景:你需要将文档集合存储在向量存储中,并对其执行问答任务。简单地拆分文本重叠的文档可能无法为 LLM 提供足够的上下文,使其无法确定多个块是否引用相同的信息,或者无法确定如何解析来自相互矛盾来源的信息。

¥Consider a scenario where you want to store collection of documents in a vector store and perform Q&A tasks on them. Simply splitting documents with overlapping text may not provide sufficient context for LLMs to determine if multiple chunks are referencing the same information, or how to resolve information from contradictory sources.

如果你知道要过滤的内容,那么使用元数据标记每个文档是一种解决方案,但你可能无法提前知道向量存储需要处理哪种类型的查询。在每个块中以标题的形式直接包含额外的上下文信息,有助于处理任意查询。

¥Tagging each document with metadata is a solution if you know what to filter against, but you may not know ahead of time exactly what kind of queries your vector store will be expected to handle. Including additional contextual information directly in each chunk in the form of headers can help deal with arbitrary queries.

如果你有多个需要从向量存储中正确检索的细粒度子块,这一点尤为重要。

¥This is particularly important if you have several fine-grained child chunks that need to be correctly retrieved from the vector store.

import { OpenAIEmbeddings } from "@langchain/openai";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { InMemoryStore } from "@langchain/core/stores";
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 1500,
chunkOverlap: 0,
});

const jimDocs = await splitter.createDocuments([`My favorite color is blue.`]);
const jimChunkHeaderOptions = {
chunkHeader: "DOC NAME: Jim Interview\n---\n",
appendChunkOverlapHeader: true,
};

const pamDocs = await splitter.createDocuments([`My favorite color is red.`]);
const pamChunkHeaderOptions = {
chunkHeader: "DOC NAME: Pam Interview\n---\n",
appendChunkOverlapHeader: true,
};

const vectorstore = await HNSWLib.fromDocuments([], new OpenAIEmbeddings());
const byteStore = new InMemoryStore<Uint8Array>();

const retriever = new ParentDocumentRetriever({
vectorstore,
byteStore,
// Very small chunks for demo purposes.
// Use a bigger chunk size for serious use-cases.
childSplitter: new RecursiveCharacterTextSplitter({
chunkSize: 10,
chunkOverlap: 0,
}),
childK: 50,
parentK: 5,
});

// We pass additional option `childDocChunkHeaderOptions`
// that will add the chunk header to child documents
await retriever.addDocuments(jimDocs, {
childDocChunkHeaderOptions: jimChunkHeaderOptions,
});
await retriever.addDocuments(pamDocs, {
childDocChunkHeaderOptions: pamChunkHeaderOptions,
});

// This will search child documents in vector store with the help of chunk header,
// returning the unmodified parent documents
const retrievedDocs = await retriever.invoke("What is Pam's favorite color?");

// Pam's favorite color is returned first!
console.log(JSON.stringify(retrievedDocs, null, 2));
/*
[
{
"pageContent": "My favorite color is red.",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
}
}
},
{
"pageContent": "My favorite color is blue.",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
}
}
}
]
*/

const rawDocs = await vectorstore.similaritySearch(
"What is Pam's favorite color?"
);

// Raw docs in vectorstore are short but have chunk headers
console.log(JSON.stringify(rawDocs, null, 2));

/*
[
{
"pageContent": "DOC NAME: Pam Interview\n---\n(cont'd) color is",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
},
"doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
}
},
{
"pageContent": "DOC NAME: Pam Interview\n---\n(cont'd) favorite",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
},
"doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
}
},
{
"pageContent": "DOC NAME: Pam Interview\n---\n(cont'd) red.",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
},
"doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
}
},
{
"pageContent": "DOC NAME: Pam Interview\n---\nMy",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
},
"doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
}
}
]
*/

API Reference:

包含重新排序

¥With Reranking

由于向量存储中有许多文档传递给 LLM,最终答案有时会包含不相关数据块的信息,导致其准确性降低,有时甚至不正确。此外,传递多个不相关的文档会增加开销。因此,使用 rerank 有两个原因。 - 精度和成本。

¥With many documents from the vector store that are passed to LLM, final answers sometimes consist of information from irrelevant chunks, making it less precise and sometimes incorrect. Also, passing multiple irrelevant documents makes it more expensive. So there are two reasons to use rerank - precision and costs.

import { OpenAIEmbeddings } from "@langchain/openai";
import { CohereRerank } from "@langchain/cohere";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { InMemoryStore } from "@langchain/core/stores";
import {
ParentDocumentRetriever,
type SubDocs,
} from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// init Cohere Rerank. Remember to add COHERE_API_KEY to your .env
const reranker = new CohereRerank({
topN: 50,
model: "rerank-multilingual-v2.0",
});

export function documentCompressorFiltering({
relevanceScore,
}: { relevanceScore?: number } = {}) {
return (docs: SubDocs) => {
let outputDocs = docs;

if (relevanceScore) {
const docsRelevanceScoreValues = docs.map(
(doc) => doc?.metadata?.relevanceScore
);
outputDocs = docs.filter(
(_doc, index) =>
(docsRelevanceScoreValues?.[index] || 1) >= relevanceScore
);
}

return outputDocs;
};
}

const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 500,
chunkOverlap: 0,
});

const jimDocs = await splitter.createDocuments([`Jim favorite color is blue.`]);

const pamDocs = await splitter.createDocuments([`Pam favorite color is red.`]);

const vectorstore = await HNSWLib.fromDocuments([], new OpenAIEmbeddings());
const byteStore = new InMemoryStore<Uint8Array>();

const retriever = new ParentDocumentRetriever({
vectorstore,
byteStore,
// Very small chunks for demo purposes.
// Use a bigger chunk size for serious use-cases.
childSplitter: new RecursiveCharacterTextSplitter({
chunkSize: 10,
chunkOverlap: 0,
}),
childK: 50,
parentK: 5,
// We add Reranker
documentCompressor: reranker,
documentCompressorFilteringFn: documentCompressorFiltering({
relevanceScore: 0.3,
}),
});

const docs = jimDocs.concat(pamDocs);
await retriever.addDocuments(docs);

// This will search for documents in vector store and return for LLM already reranked and sorted document
// with appropriate minimum relevance score
const retrievedDocs = await retriever.invoke("What is Pam's favorite color?");

// Pam's favorite color is returned first!
console.log(JSON.stringify(retrievedDocs, null, 2));
/*
[
{
"pageContent": "My favorite color is red.",
"metadata": {
"relevanceScore": 0.9
"loc": {
"lines": {
"from": 1,
"to": 1
}
}
}
}
]
*/

API Reference:

后续步骤

¥Next steps

现在你已经了解了如何使用 ParentDocumentRetriever

¥You've now learned how to use the ParentDocumentRetriever.

接下来,查看 为每个文档生成多个嵌入 的更通用形式、关于 RAG 的更广泛教程,或查看本节,了解如何使用 基于任何数据源创建你自己的自定义检索器

¥Next, check out the more general form of generating multiple embeddings per document, the broader tutorial on RAG, or this section to learn how to create your own custom retriever over any data source.