Cassandra

Compatibility

Only available on Node.js.

Apache Cassandra® is a NoSQL, row-oriented, highly scalable, and highly available database.

The latest version of Apache Cassandra natively supports Vector Similarity Search.

Setup

First, install the Cassandra Node.js driver along with the LangChain packages used in the examples:

npm install cassandra-driver @langchain/community @langchain/openai @langchain/core

How you connect depends on your database provider. We will create a connection document, configConnection, which will be used as part of the vector store configuration.

Apache Cassandra®

Vector search is supported in Apache Cassandra® 5.0 and above. You can use a standard connection document, for example:

const configConnection = {
  contactPoints: ['h1', 'h2'],
  localDataCenter: 'datacenter1',
  credentials: {
    username: <...> as string,
    password: <...> as string,
  },
};

Astra DB

Astra DB is a cloud-native Cassandra-as-a-Service platform.

  1. Create an Astra DB account.

  2. Create a vector-enabled database.

  3. Create a token for your database.

const configConnection = {
  serviceProviderArgs: {
    astra: {
      token: <...> as string,
      endpoint: <...> as string,
    },
  },
};

Instead of endpoint:, you may provide the property datacenterID: and, optionally, regionName:.

Indexing docs

import { CassandraStore } from "@langchain/community/vectorstores/cassandra";
import { OpenAIEmbeddings } from "@langchain/openai";

// The configConnection document is defined above
const cassandraConfig = {
  ...configConnection,
  keyspace: "test",
  dimensions: 1536, // must match the length of the embedding vectors
  table: "test",
  // Index on the "name" metadata column so it can be used in filters
  indices: [{ name: "name", value: "(name)" }],
  primaryKey: {
    name: "id",
    type: "int",
  },
  metadataColumns: [
    {
      name: "name",
      type: "text",
    },
  ],
};

// Texts and metadata entries are matched by position
const vectorStore = await CassandraStore.fromTexts(
  ["I am blue", "Green yellow purple", "Hello there hello"],
  [
    { id: 2, name: "2" },
    { id: 1, name: "1" },
    { id: 3, name: "3" },
  ],
  new OpenAIEmbeddings(),
  cassandraConfig
);

Querying docs

const results = await vectorStore.similaritySearch("Green yellow purple", 1);

Or, with a filter:

const results = await vectorStore.similaritySearch("B", 1, { name: "Bubba" });

Vector Types

Cassandra supports cosine (the default), dot_product, and euclidean similarity search. The similarity function is defined when the vector store is first created and is specified in the constructor parameter vectorType, for example:

  ...,
vectorType: "dot_product",
...
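To make the three options concrete, here is a minimal TypeScript sketch of the underlying measures. It only illustrates the math: Cassandra evaluates similarity server-side, and the scores it returns may be scaled differently from these raw values. The function names are hypothetical and are not part of any library.

```typescript
// Illustrative only: plain implementations of the three measures named above.
// These helper names are hypothetical; Cassandra computes similarity itself.

function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function cosineSimilarity(a: number[], b: number[]): number {
  // Dot product divided by the product of the two vector lengths.
  const norm = (v: number[]) => Math.sqrt(dotProduct(v, v));
  return dotProduct(a, b) / (norm(a) * norm(b));
}

function euclideanDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}
```

For unit-length vectors, the dot product and the cosine similarity are equal, so the two options rank results identically in that case.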

Indices

With version 5, Cassandra introduced Storage Attached Indexes (SAIs). These allow WHERE filtering without specifying the partition key, and allow additional operator types such as inequalities. You can define them with the indices parameter, which accepts zero or more dictionaries, each containing name and value entries.

Indices are optional, but they are required for filtered queries on non-partition columns.

  • The name entry is part of the object name; on a table named test_table, an index with name: "some_column" would be idx_test_table_some_column.

  • The value entry is the column on which the index is created, surrounded by ( and ). For the column some_column above, it would be specified as value: "(some_column)".

  • An optional options entry is a map passed to the WITH OPTIONS = clause of the CREATE CUSTOM INDEX statement. The specific entries in this map depend on the index type.

  indices: [{ name: "some_column", value: "(some_column)" }],
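As a rough illustration of how these entries could map onto a statement, the sketch below assembles a CREATE CUSTOM INDEX string from an index entry. The helper describeIndex, the IF NOT EXISTS clause, the keyspace-qualified table name, and the way options is rendered as a CQL map literal are all assumptions made for this example; the statement the library actually issues may differ.

```typescript
// Hypothetical helper for illustration; not the library's implementation.
interface IndexSpec {
  name: string; // becomes part of the index name
  value: string; // column wrapped in parentheses, e.g. "(some_column)"
  options?: Record<string, string>; // assumed shape for the WITH OPTIONS map
}

function describeIndex(keyspace: string, table: string, index: IndexSpec): string {
  // Naming convention described above: idx_<table>_<name>
  const indexName = `idx_${table}_${index.name}`;

  // Render the optional map as a CQL map literal, e.g. {'k': 'v'}
  const entries = Object.entries(index.options ?? {})
    .map(([key, val]) => `'${key}': '${val}'`)
    .join(", ");
  const withOptions = entries ? ` WITH OPTIONS = {${entries}}` : "";

  return (
    `CREATE CUSTOM INDEX IF NOT EXISTS ${indexName} ` +
    `ON ${keyspace}.${table} ${index.value} ` +
    `USING 'StorageAttachedIndex'${withOptions};`
  );
}
```

Calling describeIndex("test", "test_table", { name: "some_column", value: "(some_column)" }) yields a statement that creates idx_test_table_some_column on the some_column column.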

Advanced Filtering

By default, filters are applied with equality (=). For fields that have an indices entry, you may provide an operator: a string containing any operator the index supports. In this case, you specify one or more filters, either as a single object or as a list (whose entries are AND-ed together). For example:

   { name: "create_datetime", operator: ">", value: some_datetime_variable }

or

[
  { userid: userid_variable },
  { name: "create_datetime", operator: ">", value: some_date_variable },
];

value can be a single value or an array. If it is not an array, or if the array has only one element, the resulting query will be along the lines of ${name} ${operator} ?, with value bound to the ?.

If the value array has more than one element, the unquoted ? placeholders in name are counted, and that count is subtracted from the length of value. The resulting number of ? placeholders is put on the right side of the operator; if there is more than one, they are wrapped in ( and ), e.g. (?, ?, ?).

This facilitates binding values on the left of the operator, which is useful for some functions, for example a geo-distance filter:

{
  name: "GEO_DISTANCE(coord, ?)",
  operator: "<",
  value: [new Float32Array([53.3730617, -6.3000515]), 10000],
},
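The placeholder rule described above can be sketched as a small function. This is an illustration of the documented behavior under stated assumptions, not the library's actual code: the names Filter, countUnquotedPlaceholders, and buildClause are hypothetical, and quoting is assumed to mean single or double quotes.

```typescript
// Hypothetical sketch of the placeholder rule; not the library's implementation.
interface Filter {
  name: string;
  operator?: string; // defaults to "=" as described above
  value: unknown;
}

// Count "?" characters that are not inside single- or double-quoted runs.
function countUnquotedPlaceholders(expr: string): number {
  let count = 0;
  let quote: string | null = null;
  for (const ch of expr) {
    if (quote !== null) {
      if (ch === quote) quote = null; // closing quote
    } else if (ch === "'" || ch === '"') {
      quote = ch; // opening quote
    } else if (ch === "?") {
      count += 1;
    }
  }
  return count;
}

function buildClause(filter: Filter): string {
  const operator = filter.operator ?? "=";
  const values = Array.isArray(filter.value) ? filter.value : [filter.value];

  // A single value always binds to one "?" on the right of the operator.
  if (values.length <= 1) {
    return `${filter.name} ${operator} ?`;
  }

  // Values already consumed by "?" in name are not repeated on the right.
  const rightCount = values.length - countUnquotedPlaceholders(filter.name);
  if (rightCount < 1) {
    throw new Error("no values left to bind on the right of the operator");
  }
  const marks = Array(rightCount).fill("?").join(", ");
  return `${filter.name} ${operator} ${rightCount > 1 ? `(${marks})` : marks}`;
}
```

For the geo-distance filter above, name contains one ? and value has two elements, so one ? remains for the right side and the clause becomes GEO_DISTANCE(coord, ?) < ?.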

Data Partitioning and Composite Keys

In some systems, you may wish to partition the data, for example by user or by session. Data in Cassandra is always partitioned; by default, this library partitions by the first primary key field. You may specify multiple columns that together make up the primary (unique) key of a record, and optionally indicate which of those fields should be part of the partition key. For example, the vector store could be partitioned by both userid and collectionid, with the additional fields docid and docpart making an individual entry unique:

  ...,
  primaryKey: [
    { name: "userid", type: "text", partition: true },
    { name: "collectionid", type: "text", partition: true },
    { name: "docid", type: "text" },
    { name: "docpart", type: "int" },
  ],
  ...

When searching, you may include partition keys in the filter without defining indices for those columns. You do not need to specify all partition keys, but you must specify them in key order, starting with the first. In the above example, you could specify a filter of {userid: userid_variable} or {userid: userid_variable, collectionid: collectionid_variable}, but to filter on only {collectionid: collectionid_variable} you would have to include collectionid in the indices list.
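The prefix rule above can be expressed as a short check. The sketch below is a hypothetical helper, not part of the library: given the partition key columns in order and the columns used in a filter, it returns the filter columns that would need an entry in indices according to the rule described here.

```typescript
// Hypothetical helper illustrating the partition-key prefix rule above.
function columnsNeedingIndex(
  partitionKeys: string[],
  filterColumns: string[]
): string[] {
  const requested = new Set(filterColumns);

  // Walk the partition key in order; stop at the first column not in the filter.
  const covered = new Set<string>();
  for (const key of partitionKeys) {
    if (!requested.has(key)) break;
    covered.add(key);
  }

  // Anything outside that leading run of partition keys needs an index.
  return filterColumns.filter((column) => !covered.has(column));
}
```

With partition keys ["userid", "collectionid"], a filter on userid alone needs no index, while a filter on collectionid alone reports collectionid as needing one.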

Additional Configuration Options

The configuration document accepts further optional parameters; their defaults are:

  ...,
  maxConcurrency: 25,
  batchSize: 1,
  withClause: "",
  ...

Parameter | Usage
--- | ---
maxConcurrency | How many concurrent requests will be sent to Cassandra at a given time.
batchSize | How many documents will be sent on a single request to Cassandra. When using a value > 1, you should ensure your batch size will not exceed the Cassandra parameter batch_size_fail_threshold_in_kb. Batches are unlogged.
withClause | Cassandra tables may be created with an optional WITH clause; this is generally not needed but provided for completeness.
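As a rough illustration of how batchSize shapes the writes, the sketch below splits a list of documents into groups of at most batchSize items. Each group would correspond to one request, with up to maxConcurrency requests in flight at a time. The helper toBatches is hypothetical and is not the library's code.

```typescript
// Hypothetical helper: split items into groups of at most batchSize.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

With the default batchSize of 1, five documents become five single-document requests; with batchSize: 2 they become three requests, the last holding one document.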
