Retrieval

[Security]

Some of the concepts reviewed here use models to generate queries (e.g., for SQL or graph databases). There are inherent risks in doing this. Make sure that your database connection permissions are scoped as narrowly as possible for your application's needs. This will mitigate, though not eliminate, the risks of building a model-driven system capable of querying databases. For more on general security best practices, see our security guide.

Overview

Retrieval systems are fundamental to many AI applications, efficiently identifying relevant information from large datasets. These systems accommodate various data formats:

  • Unstructured text (e.g., documents) is often stored in vector stores or lexical search indexes.

  • Structured data is typically housed in relational or graph databases with defined schemas.

Despite this diversity in data formats, modern AI applications increasingly aim to make all types of data accessible through natural language interfaces. Models play a crucial role in this process by translating natural language queries into formats compatible with the underlying search index or database. This translation enables more intuitive and flexible interactions with complex data structures.

Key concepts

[Figure: Retrieval]

(1) Query analysis: A process where models transform or construct search queries to optimize retrieval.

(2) Information retrieval: Search queries are used to fetch information from various retrieval systems.

Query analysis

While users typically prefer to interact with retrieval systems using natural language, retrieval systems may require specific query syntax or benefit from particular keywords. Query analysis serves as a bridge between raw user input and optimized search queries. Some common applications of query analysis include:

  1. Query re-writing: Queries can be re-written or expanded to improve semantic or lexical searches.

  2. Query construction: Search indexes may require structured queries (e.g., SQL for databases).

Query analysis employs models to transform or construct optimized search queries from raw user input.

Query re-writing

Retrieval systems should ideally handle a wide spectrum of user inputs, from simple and poorly worded queries to complex, multi-faceted questions. To achieve this versatility, a popular approach is to use models to transform raw user queries into more effective search queries. This transformation can range from simple keyword extraction to sophisticated query expansion and reformulation. Here are some key benefits of using models for query analysis in unstructured data retrieval:

  1. Query clarification: Models can rephrase ambiguous or poorly worded queries for clarity.

  2. Semantic understanding: They can capture the intent behind a query, going beyond literal keyword matching.

  3. Query expansion: Models can generate related terms or concepts to broaden the search scope.

  4. Complex query handling: They can break down multi-part questions into simpler sub-queries.

Various techniques have been developed to leverage models for query re-writing, including:

| Name | When to use | Description |
| --- | --- | --- |
| Decomposition | When a question can be broken down into smaller subproblems. | Decompose a question into a set of subproblems / questions, which can either be solved sequentially (use the answer from the first plus retrieval to answer the second) or in parallel (consolidate each answer into a final answer). |
| Step-back | When a higher-level conceptual understanding is required. | First prompt the LLM to ask a generic step-back question about higher-level concepts or principles, and retrieve relevant facts about them. Use this grounding to help answer the user question. Paper. |
| HyDE | If you have challenges retrieving relevant documents using the raw user inputs. | Use an LLM to convert questions into hypothetical documents that answer the question. Use the embedded hypothetical documents to retrieve real documents, on the premise that doc-doc similarity search can produce more relevant matches. Paper. |
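
To make the HyDE row more concrete, here is a minimal sketch of the flow it describes. Everything in it is an assumption made for illustration: `generateHypotheticalDocument` is a hypothetical stand-in for an LLM call, `embed` is a toy word-count embedding over a tiny vocabulary, and similarity is a plain dot product. None of these are LangChain APIs.

```typescript
// Toy embedding: counts how often each vocabulary word occurs in the text.
// A real system would call an embedding model instead.
const VOCAB = ["agent", "planning", "memory", "tool", "recipe", "pasta", "sauce"];

function embed(text: string): number[] {
  const tokens = text.toLowerCase().split(/\W+/);
  return VOCAB.map((word) => tokens.filter((t) => t === word).length);
}

function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Hypothetical stand-in for an LLM that drafts a passage answering the question.
// The stub ignores its input and returns a fixed passage.
function generateHypotheticalDocument(_question: string): string {
  return "An agent combines planning, memory, and tool use to complete tasks.";
}

const corpus = [
  "Planning lets an agent split a task into steps; memory stores past results.",
  "A simple pasta recipe needs a good tomato sauce.",
];

// HyDE: embed the hypothetical answer (not the raw question) and rank real
// documents by similarity to it.
function hydeRetrieve(question: string, k: number): string[] {
  const queryVector = embed(generateHypotheticalDocument(question));
  return corpus
    .map((doc) => ({ doc, score: dot(queryVector, embed(doc)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.doc);
}

const hydeResults = hydeRetrieve("How do these systems work?", 1);
```

In this toy setup the raw question shares no vocabulary with either document, while the hypothetical answer overlaps with the first one, which is the situation HyDE is meant to help with.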

As an example, query decomposition can be accomplished simply by using prompting and a structured output that enforces a list of sub-questions. These can then be run sequentially or in parallel on a downstream retrieval system.

import { z } from "zod";
import { ChatOpenAI } from "@langchain/openai";
import { SystemMessage, HumanMessage } from "@langchain/core/messages";

// Define a zod object for the structured output
const Questions = z.object({
  questions: z
    .array(z.string())
    .describe("A list of sub-questions related to the input query."),
});

// Create an instance of the model and enforce the output structure
const model = new ChatOpenAI({ modelName: "gpt-4", temperature: 0 });
const structuredModel = model.withStructuredOutput(Questions);

// Define the system prompt
const system = `You are a helpful assistant that generates multiple sub-questions related to an input question.
The goal is to break down the input into a set of sub-problems / sub-questions that can be answered in isolation.`;

// Pass the question to the model; the result has the shape { questions: string[] }
const question =
  "What are the main components of an LLM-powered autonomous agent system?";
const questions = await structuredModel.invoke([
  new SystemMessage(system),
  new HumanMessage(question),
]);
Tip: See our RAG from Scratch videos for a few different specific approaches.
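
Once sub-questions are available, the parallel variant mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions: `retrieve` is a hypothetical keyword-matching stand-in for a real retrieval system, the corpus is an in-memory array, and the sub-questions are hard-coded rather than produced by a model.

```typescript
interface RetrievedDoc {
  id: string;
  text: string;
}

// Hypothetical in-memory corpus standing in for a real retrieval system.
const docs: RetrievedDoc[] = [
  { id: "d1", text: "Planning breaks a task into smaller steps." },
  { id: "d2", text: "Memory lets an agent recall earlier steps." },
  { id: "d3", text: "Tool use lets an agent call external APIs." },
];

// Hypothetical retriever: returns documents containing any query term
// longer than three characters.
async function retrieve(query: string): Promise<RetrievedDoc[]> {
  const terms = query.toLowerCase().split(/\W+/).filter((t) => t.length > 3);
  return docs.filter((d) =>
    terms.some((t) => d.text.toLowerCase().indexOf(t) !== -1)
  );
}

// Merge several result sets, keeping the first occurrence of each document id.
function mergeResults(resultSets: RetrievedDoc[][]): RetrievedDoc[] {
  const merged = new Map<string, RetrievedDoc>();
  for (const resultSet of resultSets) {
    for (const doc of resultSet) {
      if (!merged.has(doc.id)) merged.set(doc.id, doc);
    }
  }
  return Array.from(merged.values());
}

// Run every sub-question in parallel, then consolidate the results.
async function retrieveForSubQuestions(
  subQuestions: string[]
): Promise<RetrievedDoc[]> {
  const resultSets = await Promise.all(subQuestions.map((q) => retrieve(q)));
  return mergeResults(resultSets);
}

const subQuestions = [
  "What does planning involve?",
  "What does memory provide?",
  "How do planning and memory interact?",
];
const mergedPromise = retrieveForSubQuestions(subQuestions);
```

The sequential variant would instead await each retrieval in turn and feed the earlier answer into the next sub-question.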

Query construction

Query analysis can also focus on translating natural language queries into specialized query languages or filters. This translation is crucial for effectively interacting with various types of databases that house structured or semi-structured data.

  1. Structured data examples: For relational and graph databases, Domain-Specific Languages (DSLs) are used to query data.

  2. Semi-structured data examples: For vectorstores, queries can combine semantic search with metadata filtering.

These approaches leverage models to bridge the gap between user intent and the specific query requirements of different data storage systems. Here are some popular techniques:

| Name | When to use | Description |
| --- | --- | --- |
| Self Query | If users are asking questions that are better answered by fetching documents based on metadata rather than similarity with the text. | This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filter to go along with it. This is useful because oftentimes questions are about the METADATA of documents (not the content itself). |
| Text-to-SQL | If users are asking questions that require information housed in a relational database, accessible via SQL. | This uses an LLM to transform user input into a SQL query. |
| Text-to-Cypher | If users are asking questions that require information housed in a graph database, accessible via Cypher. | This uses an LLM to transform user input into a Cypher query. |
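
To illustrate the "two things" a self-query step produces, the sketch below applies a search string and a metadata filter to a small in-memory list. The `SelfQueryOutput` shape is a simplified, hypothetical structure invented for this example (it is not LangChain's internal representation), the model output is hard-coded, and a substring match stands in for semantic search.

```typescript
interface Movie {
  summary: string;
  metadata: { year: number; genre: string };
}

// Hypothetical, simplified result of a self-query step:
// a string to search for plus a metadata filter.
interface SelfQueryOutput {
  query: string;
  filter: { genre?: string; yearAfter?: number };
}

const movies: Movie[] = [
  { summary: "Dinosaurs escape in a theme park", metadata: { year: 1993, genre: "science fiction" } },
  { summary: "Toys come alive when humans leave", metadata: { year: 1995, genre: "animated" } },
  { summary: "A thief enters dreams to plant an idea", metadata: { year: 2010, genre: "science fiction" } },
];

// For the input "science fiction movies after 2000 about dreams",
// a model might produce something like this (hard-coded here).
const structuredQuery: SelfQueryOutput = {
  query: "dreams",
  filter: { genre: "science fiction", yearAfter: 2000 },
};

// Apply only the filter conditions that are present.
function matchesFilter(movie: Movie, filter: SelfQueryOutput["filter"]): boolean {
  if (filter.genre !== undefined && movie.metadata.genre !== filter.genre) return false;
  if (filter.yearAfter !== undefined && movie.metadata.year <= filter.yearAfter) return false;
  return true;
}

// Stand-in for semantic search: case-insensitive substring match on the summary.
function matchesQuery(movie: Movie, query: string): boolean {
  return movie.summary.toLowerCase().indexOf(query.toLowerCase()) !== -1;
}

function selfQuerySearch(input: SelfQueryOutput): Movie[] {
  return movies.filter(
    (m) => matchesFilter(m, input.filter) && matchesQuery(m, input.query)
  );
}

const selfQueryResults = selfQuerySearch(structuredQuery);
```

Separating the filter from the search string lets the metadata conditions be enforced exactly, while the remaining text is left to similarity search.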

As an example, here is how to use the SelfQueryRetriever to convert natural language queries into metadata filters.

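import { SelfQueryRetriever } from "langchain/retrievers/self_query";
import { AttributeInfo } from "langchain/chains/query_constructor";
import { ChatOpenAI } from "@langchain/openai";

// `schemaForMetadata` (the metadata attribute descriptions) and `vectorStore`
// are assumed to be defined elsewhere in your application.
const attributeInfo: AttributeInfo[] = schemaForMetadata;
const documentContents = "Brief summary of a movie";
const llm = new ChatOpenAI({ temperature: 0 });
// Depending on your LangChain version, fromLLM may also expect a
// `structuredQueryTranslator` that matches your vector store.
const retriever = SelfQueryRetriever.fromLLM({
  llm,
  vectorStore,
  documentContents,
  attributeInfo,
});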

Information retrieval

Common retrieval systems

Lexical search indexes

Many search engines are based upon matching words in a query to the words in each document. This approach is called lexical retrieval, and it uses search algorithms that are typically based upon word frequencies. The intuition is simple: if a word appears frequently both in the user's query and in a particular document, then this document might be a good match.

The particular data structure used to implement this is often an inverted index. This type of index contains a list of words and a mapping of each word to a list of locations at which it occurs in various documents. Using this data structure, it is possible to efficiently match the words in search queries to the documents in which they appear. BM25 and TF-IDF are two popular lexical search algorithms.
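
The inverted index described above can be sketched in a few lines. This is a minimal illustration, not a production design: the corpus, the tokenizer, and the scoring formula (raw term count multiplied by log(N / document frequency), a basic TF-IDF variant) are all choices made for this example. BM25 refines this kind of scoring with term-frequency saturation and document-length normalization.

```typescript
const documents = [
  "the cat sat on the mat",
  "the dog chased the cat",
  "dogs and cats make good pets",
];

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter((t) => t.length > 0);
}

// Inverted index: term -> (document id -> positions of the term in that document)
type Postings = Map<number, number[]>;
const invertedIndex = new Map<string, Postings>();

documents.forEach((text, docId) => {
  tokenize(text).forEach((term, position) => {
    let postings = invertedIndex.get(term);
    if (!postings) {
      postings = new Map<number, number[]>();
      invertedIndex.set(term, postings);
    }
    const positions = postings.get(docId) ?? [];
    positions.push(position);
    postings.set(docId, positions);
  });
});

// Score each document as the sum, over query terms, of tf * idf.
// Only documents listed in the postings of a query term are ever touched.
function search(query: string): { docId: number; score: number }[] {
  const scores = new Map<number, number>();
  for (const term of tokenize(query)) {
    const postings = invertedIndex.get(term);
    if (!postings) continue;
    const idf = Math.log(documents.length / postings.size);
    postings.forEach((positions, docId) => {
      scores.set(docId, (scores.get(docId) ?? 0) + positions.length * idf);
    });
  }
  return Array.from(scores.entries())
    .map(([docId, score]) => ({ docId, score }))
    .sort((a, b) => b.score - a.score);
}

const ranked = search("dog cat");
```

Here document 1 ranks first for "dog cat" because it contains both terms, and "dog" is the rarer of the two, so it carries more weight.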

[Further reading]
  • See the BM25 retriever integration.

Vector indexes

Vector indexes are an alternative way to index and store unstructured data. See our conceptual guide on vectorstores for a detailed overview. In short, rather than using word frequencies, vectorstores use an embedding model to compress documents into high-dimensional vector representations. This allows for efficient similarity search over embedding vectors using simple mathematical operations like cosine similarity.
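
As a rough illustration of similarity search over embeddings, the sketch below ranks stored vectors by cosine similarity to a query vector. The three-dimensional vectors are made-up values for this example (real embedding models produce hundreds or thousands of dimensions), and the brute-force scan stands in for the approximate nearest-neighbor indexes that vector stores typically use.

```typescript
interface EmbeddedDoc {
  text: string;
  vector: number[];
}

// Hypothetical embeddings, hand-written for illustration.
const store: EmbeddedDoc[] = [
  { text: "How to train a puppy", vector: [0.9, 0.1, 0.0] },
  { text: "Caring for a kitten", vector: [0.7, 0.3, 0.1] },
  { text: "Quarterly revenue report", vector: [0.0, 0.2, 0.9] },
];

// Cosine similarity: dot(a, b) / (|a| * |b|); returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-k search: score every stored vector, keep the k best.
function similaritySearch(queryVector: number[], k: number): EmbeddedDoc[] {
  return store
    .map((doc) => ({ doc, score: cosineSimilarity(queryVector, doc.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.doc);
}

// A query vector that points in roughly the same direction as the pet documents.
const topMatches = similaritySearch([0.8, 0.2, 0.0], 2);
```

Because cosine similarity depends only on direction, not magnitude, the two pet-related documents score close to 1 against this query, while the revenue report scores near 0.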

Relational databases

Relational databases are a fundamental type of structured data storage used in many applications. They organize data into tables with predefined schemas, where each table represents an entity or relationship. Data is stored in rows (records) and columns (attributes), allowing for efficient querying and manipulation through SQL (Structured Query Language). Relational databases excel at maintaining data integrity, supporting complex queries, and handling relationships between different data entities.

[Further reading]
  • See our tutorial for working with SQL databases.

Graph databases

Graph databases are a specialized type of database designed to store and manage highly interconnected data. Unlike traditional relational databases, graph databases use a flexible structure consisting of nodes (entities), edges (relationships), and properties. This structure allows for efficient representation and querying of complex, interconnected data. Graph databases are particularly useful for storing and querying complex relationships between data points, such as social networks, supply-chain management, fraud detection, and recommendation services.
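
The node, edge, and property structure can be pictured with a small in-memory sketch. The types, the sample data, and the `reachable` helper are all hypothetical and exist only to illustrate the model; a graph database would store this natively and express the traversal in a query language such as Cypher.

```typescript
interface GraphNode {
  id: string;
  label: string;
  properties: Record<string, string | number>;
}

interface GraphEdge {
  from: string;
  to: string;
  type: string;
}

const nodes: GraphNode[] = [
  { id: "alice", label: "Person", properties: { age: 34 } },
  { id: "bob", label: "Person", properties: { age: 29 } },
  { id: "carol", label: "Person", properties: { age: 41 } },
  { id: "dave", label: "Person", properties: { age: 52 } },
];

const edges: GraphEdge[] = [
  { from: "alice", to: "bob", type: "FOLLOWS" },
  { from: "bob", to: "carol", type: "FOLLOWS" },
  { from: "alice", to: "dave", type: "BLOCKS" },
];

// Breadth-first traversal: follow edges of one type outward from a start
// node, up to maxHops steps, and return the nodes reached (excluding the start).
function reachable(startId: string, edgeType: string, maxHops: number): GraphNode[] {
  const visited = new Set<string>([startId]);
  let frontier = [startId];
  for (let hop = 0; hop < maxHops; hop++) {
    const next: string[] = [];
    for (const id of frontier) {
      for (const edge of edges) {
        if (edge.from === id && edge.type === edgeType && !visited.has(edge.to)) {
          visited.add(edge.to);
          next.push(edge.to);
        }
      }
    }
    frontier = next;
  }
  return nodes.filter((n) => n.id !== startId && visited.has(n.id));
}

const withinTwoHops = reachable("alice", "FOLLOWS", 2);
```

Multi-hop questions like "who does Alice follow, directly or through one intermediary?" are the kind of relationship query that graph databases are built to answer.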

Retriever

LangChain provides a unified interface for interacting with various retrieval systems through the retriever concept. The interface is straightforward:

  1. Input: A query (string)

  2. Output: A list of documents (standardized LangChain Document objects)

You can create a retriever using any of the retrieval systems mentioned earlier. The query analysis techniques we discussed are particularly useful here, as they enable natural language interfaces for databases that typically require structured query languages. For example, you can build a retriever for a SQL database using text-to-SQL conversion. This allows a natural language query (string) to be transformed into a SQL query behind the scenes. Regardless of the underlying retrieval system, all retrievers in LangChain share a common interface. You can use them with the simple invoke method:

const docs = await retriever.invoke(query);
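
To show why a shared "string in, documents out" contract is convenient, here is a minimal sketch of that shape. `SimpleDocument`, `SimpleRetriever`, and `KeywordRetriever` are hypothetical local types written for this example; they are simplified stand-ins and not LangChain's own Document or retriever classes, which carry more functionality.

```typescript
// Simplified stand-in for a document: text content plus arbitrary metadata.
interface SimpleDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// The whole contract: a query string in, a list of documents out.
interface SimpleRetriever {
  invoke(query: string): Promise<SimpleDocument[]>;
}

// Hypothetical retriever backed by keyword matching over an in-memory list.
class KeywordRetriever implements SimpleRetriever {
  private readonly documents: SimpleDocument[];

  constructor(documents: SimpleDocument[]) {
    this.documents = documents;
  }

  async invoke(query: string): Promise<SimpleDocument[]> {
    const terms = query.toLowerCase().split(/\W+/).filter((t) => t.length > 0);
    return this.documents.filter((doc) =>
      terms.some((t) => doc.pageContent.toLowerCase().indexOf(t) !== -1)
    );
  }
}

// Calling code depends only on the interface, so the backing system
// (keyword index, vector store, SQL database) can be swapped freely.
async function countMatches(retriever: SimpleRetriever, query: string): Promise<number> {
  const results = await retriever.invoke(query);
  return results.length;
}

const keywordRetriever = new KeywordRetriever([
  { pageContent: "Vector indexes store embeddings.", metadata: { source: "a" } },
  { pageContent: "Relational databases store rows.", metadata: { source: "b" } },
]);
const matchCount = countMatches(keywordRetriever, "vector indexes");
```

A text-to-SQL or vector-store retriever would implement the same `invoke` signature, differing only in how the query string is translated and executed internally.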