How to deal with large databases
This guide assumes familiarity with the following:
In order to write valid queries against a database, we need to feed the model the table names, table schemas, and feature values for it to query over. When there are many tables, columns, and/or high-cardinality columns, it becomes impossible for us to dump the full information about our database in every prompt. Instead, we must find ways to dynamically insert into the prompt only the most relevant information. Let's take a look at some techniques for doing this.
Setup
First, install the required packages and set your environment variables. This example will use OpenAI as the LLM.
npm install langchain @langchain/community @langchain/openai typeorm sqlite3
export OPENAI_API_KEY="your api key"
# Uncomment the below to use LangSmith. Not required.
# export LANGSMITH_API_KEY="your api key"
# export LANGSMITH_TRACING=true
# Reduce tracing latency if you are not in a serverless environment
# export LANGCHAIN_CALLBACKS_BACKGROUND=true
The example below will use a SQLite connection with the Chinook database. Follow these installation steps to create Chinook.db in the same directory as this notebook:
1. Save this file as Chinook_Sqlite.sql
2. Run sqlite3 Chinook.db
3. Run .read Chinook_Sqlite.sql
4. Test with SELECT * FROM Artist LIMIT 10;
Now, Chinook.db is in our directory and we can interface with it using the Typeorm-driven SqlDatabase class:
import { SqlDatabase } from "langchain/sql_db";
import { DataSource } from "typeorm";
const datasource = new DataSource({
type: "sqlite",
database: "../../../../Chinook.db",
});
const db = await SqlDatabase.fromDataSourceParams({
appDataSource: datasource,
});
console.log(db.allTables.map((t) => t.tableName));
/**
[
'Album', 'Artist',
'Customer', 'Employee',
'Genre', 'Invoice',
'InvoiceLine', 'MediaType',
'Playlist', 'PlaylistTrack',
'Track'
]
*/
API Reference:
- SqlDatabase from langchain/sql_db
Many tables
One of the main pieces of information we need to include in our prompt is the schemas of the relevant tables. When we have very many tables, we can't fit all of the schemas in a single prompt. What we can do in such cases is first extract the names of the tables related to the user input, and then include only their schemas.
One easy and reliable way to do this is using OpenAI function-calling together with a Zod schema. The model's withStructuredOutput method lets us do just this:
import { ChatPromptTemplate } from "@langchain/core/prompts";
import {
RunnablePassthrough,
RunnableSequence,
} from "@langchain/core/runnables";
import { ChatOpenAI } from "@langchain/openai";
import { createSqlQueryChain } from "langchain/chains/sql_db";
import { SqlDatabase } from "langchain/sql_db";
import { DataSource } from "typeorm";
import { z } from "zod";
const datasource = new DataSource({
type: "sqlite",
database: "../../../../Chinook.db",
});
const db = await SqlDatabase.fromDataSourceParams({
appDataSource: datasource,
});
const llm = new ChatOpenAI({ model: "gpt-4", temperature: 0 });
const Table = z.object({
names: z.array(z.string()).describe("Names of tables in SQL database"),
});
const tableNames = db.allTables.map((t) => t.tableName).join("\n");
const system = `Return the names of ALL the SQL tables that MIGHT be relevant to the user question.
The tables are:
${tableNames}
Remember to include ALL POTENTIALLY RELEVANT tables, even if you're not sure that they're needed.`;
const prompt = ChatPromptTemplate.fromMessages([
["system", system],
["human", "{input}"],
]);
const tableChain = prompt.pipe(llm.withStructuredOutput(Table));
console.log(
await tableChain.invoke({
input: "What are all the genres of Alanis Morisette songs?",
})
);
/**
{ names: [ 'Artist', 'Track', 'Genre' ] }
*/
// -------------
// You can see a LangSmith trace of the above chain here:
// https://smith.langchain.com/public/5ca0c91e-4a40-44ef-8c45-9a4247dc474c/r
// -------------
/**
This works pretty well! Except, as we’ll see below, we actually need a few other tables as well.
This would be pretty difficult for the model to know based just on the user question.
In this case, we might think to simplify our model’s job by grouping the tables together.
We’ll just ask the model to choose between categories “Music” and “Business”, and then take care of selecting all the relevant tables from there:
*/
const prompt2 = ChatPromptTemplate.fromMessages([
[
"system",
`Return the names of the SQL tables that are relevant to the user question.
The tables are:
Music
Business`,
],
["human", "{input}"],
]);
const categoryChain = prompt2.pipe(llm.withStructuredOutput(Table));
console.log(
await categoryChain.invoke({
input: "What are all the genres of Alanis Morisette songs?",
})
);
/**
{ names: [ 'Music' ] }
*/
// -------------
// You can see a LangSmith trace of the above chain here:
// https://smith.langchain.com/public/12b62e78-bfbe-42ff-86f2-ad738a476554/r
// -------------
const getTables = (categories: z.infer<typeof Table>): Array<string> => {
let tables: Array<string> = [];
for (const category of categories.names) {
if (category === "Music") {
tables = tables.concat([
"Album",
"Artist",
"Genre",
"MediaType",
"Playlist",
"PlaylistTrack",
"Track",
]);
} else if (category === "Business") {
tables = tables.concat([
"Customer",
"Employee",
"Invoice",
"InvoiceLine",
]);
}
}
return tables;
};
const tableChain2 = categoryChain.pipe(getTables);
console.log(
await tableChain2.invoke({
input: "What are all the genres of Alanis Morisette songs?",
})
);
/**
[
'Album',
'Artist',
'Genre',
'MediaType',
'Playlist',
'PlaylistTrack',
'Track'
]
*/
// -------------
// You can see a LangSmith trace of the above chain here:
// https://smith.langchain.com/public/e78c10aa-e923-4a24-b0c8-f7a6f5d316ce/r
// -------------
// Now that we’ve got a chain that can output the relevant tables for any query we can combine this with our createSqlQueryChain, which can accept a list of tableNamesToUse to determine which table schemas are included in the prompt:
const queryChain = await createSqlQueryChain({
llm,
db,
dialect: "sqlite",
});
const tableChain3 = RunnableSequence.from([
{
input: (i: { question: string }) => i.question,
},
tableChain2,
]);
const fullChain = RunnablePassthrough.assign({
tableNamesToUse: tableChain3,
}).pipe(queryChain);
const query = await fullChain.invoke({
question: "What are all the genres of Alanis Morisette songs?",
});
console.log(query);
/**
SELECT DISTINCT "Genre"."Name"
FROM "Genre"
JOIN "Track" ON "Genre"."GenreId" = "Track"."GenreId"
JOIN "Album" ON "Track"."AlbumId" = "Album"."AlbumId"
JOIN "Artist" ON "Album"."ArtistId" = "Artist"."ArtistId"
WHERE "Artist"."Name" = 'Alanis Morissette'
LIMIT 5;
*/
console.log(await db.run(query));
/**
[{"Name":"Rock"}]
*/
// -------------
// You can see a LangSmith trace of the above chain here:
// https://smith.langchain.com/public/c7d576d0-3462-40db-9edc-5492f10555bf/r
// -------------
// We might rephrase our question slightly to remove redundancy in the answer
const query2 = await fullChain.invoke({
question: "What is the set of all unique genres of Alanis Morisette songs?",
});
console.log(query2);
/**
SELECT DISTINCT Genre.Name FROM Genre
JOIN Track ON Genre.GenreId = Track.GenreId
JOIN Album ON Track.AlbumId = Album.AlbumId
JOIN Artist ON Album.ArtistId = Artist.ArtistId
WHERE Artist.Name = 'Alanis Morissette'
*/
console.log(await db.run(query2));
/**
[{"Name":"Rock"}]
*/
// -------------
// You can see a LangSmith trace of the above chain here:
// https://smith.langchain.com/public/6e80087d-e930-4f22-9b40-f7edb95a2145/r
// -------------
API Reference:
- ChatPromptTemplate from @langchain/core/prompts
- RunnablePassthrough from @langchain/core/runnables
- RunnableSequence from @langchain/core/runnables
- ChatOpenAI from @langchain/openai
- createSqlQueryChain from langchain/chains/sql_db
- SqlDatabase from langchain/sql_db
We've seen how to dynamically include a subset of table schemas in a prompt within a chain. Another possible approach to this problem is to let an Agent decide for itself when to look up tables by giving it a Tool to do so.
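As a rough illustration of that agent-with-a-tool approach, here is a minimal sketch, assuming the db instance from the setup above. The tool name and description are hypothetical; an agent would call it to pull in only the schemas it decides it needs:

import { DynamicTool } from "@langchain/core/tools";

// Hypothetical tool an agent could call to fetch table schemas on demand,
// instead of having every schema included in the prompt up front.
const getTableSchemaTool = new DynamicTool({
  name: "get_table_schema",
  description:
    "Returns the schema for a comma-separated list of table names. " +
    `Available tables: ${db.allTables.map((t) => t.tableName).join(", ")}`,
  func: async (input: string) => {
    const requested = input.split(",").map((name) => name.trim());
    // SqlDatabase.getTableInfo accepts an optional list of table names.
    return db.getTableInfo(requested);
  },
});

With a tool like this, the agent decides at runtime which tables to inspect rather than relying on a fixed category mapping like the one above.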
High-cardinality columns
High-cardinality refers to columns in a database that have a vast range of unique values. These columns are characterized by a high level of uniqueness in their data entries, such as individual names, addresses, or product serial numbers. High-cardinality data can pose challenges for indexing and querying, as it requires more sophisticated strategies to efficiently filter and retrieve specific entries.
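As a quick illustrative check (assuming the db connection from the setup above), you can gauge a column's cardinality by counting its distinct values:

// Rough cardinality check for a couple of columns (illustrative only).
console.log(await db.run("SELECT COUNT(DISTINCT Name) AS n FROM Artist;"));
console.log(await db.run("SELECT COUNT(DISTINCT Name) AS n FROM Genre;"));
// Artist.Name has far more unique values than Genre.Name, so it is the kind
// of column that benefits from the retrieval approach described below.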
In order to filter columns that contain proper nouns such as addresses, song names or artists, we first need to double-check the spelling in order to filter the data correctly.
One naive strategy is to create a vector store with all the distinct proper nouns that exist in the database. We can then query that vector store for each user input and inject the most relevant proper nouns into the prompt.
First we need the unique values for each entity we want, for which we define a function that parses the result into a list of elements:
import { DocumentInterface } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import {
RunnablePassthrough,
RunnableSequence,
} from "@langchain/core/runnables";
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { createSqlQueryChain } from "langchain/chains/sql_db";
import { SqlDatabase } from "langchain/sql_db";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { DataSource } from "typeorm";
const datasource = new DataSource({
type: "sqlite",
database: "../../../../Chinook.db",
});
const db = await SqlDatabase.fromDataSourceParams({
appDataSource: datasource,
});
async function queryAsList(database: any, query: string): Promise<string[]> {
const res: Array<{ [key: string]: string }> = JSON.parse(
await database.run(query)
)
.flat()
.filter((el: any) => el != null);
const justValues: Array<string> = res.map((item) =>
Object.values(item)[0]
.replace(/\b\d+\b/g, "")
.trim()
);
return justValues;
}
let properNouns: string[] = await queryAsList(db, "SELECT Name FROM Artist");
properNouns = properNouns.concat(
await queryAsList(db, "SELECT Title FROM Album")
);
properNouns = properNouns.concat(
await queryAsList(db, "SELECT Name FROM Genre")
);
console.log(properNouns.length);
/**
647
*/
console.log(properNouns.slice(0, 5));
/**
[
'AC/DC',
'Accept',
'Aerosmith',
'Alanis Morissette',
'Alice In Chains'
]
*/
// Now we can embed and store all of our values in a vector database:
const vectorDb = await MemoryVectorStore.fromTexts(
properNouns,
{},
new OpenAIEmbeddings()
);
const retriever = vectorDb.asRetriever(15);
// And put together a query construction chain that first retrieves values from the database and inserts them into the prompt:
const system = `You are a SQLite expert. Given an input question, create a syntactically correct SQLite query to run.
Unless otherwise specified, do not return more than {top_k} rows.
Here is the relevant table info: {table_info}
Here is a non-exhaustive list of possible feature values.
If filtering on a feature value make sure to check its spelling against this list first:
{proper_nouns}`;
const prompt = ChatPromptTemplate.fromMessages([
["system", system],
["human", "{input}"],
]);
const llm = new ChatOpenAI({ model: "gpt-4", temperature: 0 });
const queryChain = await createSqlQueryChain({
llm,
db,
prompt,
dialect: "sqlite",
});
const retrieverChain = RunnableSequence.from([
(i: { question: string }) => i.question,
retriever,
(docs: Array<DocumentInterface>) =>
docs.map((doc) => doc.pageContent).join("\n"),
]);
const chain = RunnablePassthrough.assign({
proper_nouns: retrieverChain,
}).pipe(queryChain);
// To try out our chain, let’s see what happens when we try filtering on “elenis moriset”, a misspelling of Alanis Morissette, without and with retrieval:
// Without retrieval
const query = await queryChain.invoke({
question: "What are all the genres of Elenis Moriset songs?",
proper_nouns: "",
});
console.log("query", query);
/**
query SELECT DISTINCT Genre.Name
FROM Genre
JOIN Track ON Genre.GenreId = Track.GenreId
JOIN Album ON Track.AlbumId = Album.AlbumId
JOIN Artist ON Album.ArtistId = Artist.ArtistId
WHERE Artist.Name = 'Elenis Moriset'
LIMIT 5;
*/
console.log("db query results", await db.run(query));
/**
db query results []
*/
// -------------
// You can see a LangSmith trace of the above chain here:
// https://smith.langchain.com/public/b153cb9b-6fbb-43a8-b2ba-4c86715183b9/r
// -------------
// With retrieval:
const query2 = await chain.invoke({
question: "What are all the genres of Elenis Moriset songs?",
});
console.log("query2", query2);
/**
query2 SELECT DISTINCT Genre.Name
FROM Genre
JOIN Track ON Genre.GenreId = Track.GenreId
JOIN Album ON Track.AlbumId = Album.AlbumId
JOIN Artist ON Album.ArtistId = Artist.ArtistId
WHERE Artist.Name = 'Alanis Morissette';
*/
console.log("db query results", await db.run(query2));
/**
db query results [{"Name":"Rock"}]
*/
// -------------
// You can see a LangSmith trace of the above chain here:
// https://smith.langchain.com/public/2f4f0e37-3b7f-47b5-837c-e2952489cac0/r
// -------------
API Reference:
- DocumentInterface from @langchain/core/documents
- ChatPromptTemplate from @langchain/core/prompts
- RunnablePassthrough from @langchain/core/runnables
- RunnableSequence from @langchain/core/runnables
- ChatOpenAI from @langchain/openai
- OpenAIEmbeddings from @langchain/openai
- createSqlQueryChain from langchain/chains/sql_db
- SqlDatabase from langchain/sql_db
- MemoryVectorStore from langchain/vectorstores/memory
We can see that with retrieval we're able to correct the spelling and get back a valid result.
Another possible approach to this problem is to let an Agent decide for itself when to look up proper nouns.
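As a rough sketch of that idea (assuming the retriever built above; the tool name is hypothetical), the vector store lookup can be wrapped as a tool that an agent invokes only when it needs to resolve a possibly misspelled value:

import { DynamicTool } from "@langchain/core/tools";

// Hypothetical tool: the agent calls this only when it needs to check the
// spelling of an artist, album, or genre before filtering on it.
const searchProperNounsTool = new DynamicTool({
  name: "search_proper_nouns",
  description:
    "Look up the closest matching proper nouns (artists, albums, genres) " +
    "in the database for a possibly misspelled value.",
  func: async (input: string) => {
    const docs = await retriever.invoke(input);
    return docs.map((doc) => doc.pageContent).join("\n");
  },
});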
Next steps
You've now learned about some prompting strategies to improve SQL generation.
Next, check out some of the other guides in this section, like how to validate queries. You might also be interested in the query analysis guide on handling high cardinality.