Cassandra
Only available on Node.js.
Apache Cassandra® is a NoSQL, row-oriented, highly scalable and highly available database.
The latest version of Apache Cassandra natively supports Vector Similarity Search.
Setup
First, install the Cassandra Node.js driver:
- npm: npm install cassandra-driver @langchain/community @langchain/openai @langchain/core
- Yarn: yarn add cassandra-driver @langchain/community @langchain/openai @langchain/core
- pnpm: pnpm add cassandra-driver @langchain/community @langchain/openai @langchain/core
Depending on your database provider, the specifics of how to connect to the database will vary. We will create a document configConnection, which will be used as part of the vector store configuration.
Apache Cassandra®
Vector search is supported in Apache Cassandra® 5.0 and above. You can use a standard connection document, for example:
const configConnection = {
  contactPoints: ['h1', 'h2'],
  localDataCenter: 'datacenter1',
  credentials: {
    username: <...> as string,
    password: <...> as string,
  },
};
Astra DB
Astra DB is a cloud-native Cassandra-as-a-Service platform.
- Create an Astra DB account.
- Create a vector-enabled database.
- Create a token for your database.
const configConnection = {
  serviceProviderArgs: {
    astra: {
      token: <...> as string,
      endpoint: <...> as string,
    },
  },
};
Instead of endpoint:, you may provide the property datacenterID: and optionally regionName:.
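As a rough sketch of that alternative form, assuming the same serviceProviderArgs.astra shape as the example above: the variable name configConnectionByDatacenter and the placeholder strings are illustrative only and stand in for your own values.

```typescript
// Sketch only: placeholder strings, not real credentials or identifiers.
// The variable name is hypothetical; the property names come from the text above.
const configConnectionByDatacenter = {
  serviceProviderArgs: {
    astra: {
      token: "<your-astra-token>",
      datacenterID: "<your-datacenter-id>", // provided instead of endpoint
      regionName: "<your-region-name>", // optional
    },
  },
};
```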
Indexing docs
import { CassandraStore } from "@langchain/community/vectorstores/cassandra";
import { OpenAIEmbeddings } from "@langchain/openai";
// The configConnection document is defined above
const config = {
  ...configConnection,
  keyspace: "test",
  dimensions: 1536,
  table: "test",
  indices: [{ name: "name", value: "(name)" }],
  primaryKey: {
    name: "id",
    type: "int",
  },
  metadataColumns: [
    {
      name: "name",
      type: "text",
    },
  ],
};
// Embed the texts and store each one with its metadata (id and name)
const vectorStore = await CassandraStore.fromTexts(
  ["I am blue", "Green yellow purple", "Hello there hello"],
  [
    { id: 2, name: "2" },
    { id: 1, name: "1" },
    { id: 3, name: "3" },
  ],
  new OpenAIEmbeddings(),
  config
);
Querying docs
const results = await vectorStore.similaritySearch("Green yellow purple", 1);
Or, as a filtered query:
const results = await vectorStore.similaritySearch("B", 1, { name: "Bubba" });
Vector Types
Cassandra supports cosine (the default), dot_product, and euclidean similarity search; this is defined when the vector store is first created, and specified in the constructor parameter vectorType, for example:
...,
vectorType: "dot_product",
...
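To make the three measures concrete, here is a minimal plain-TypeScript sketch of the underlying math. It is illustrative only: Cassandra computes similarity server-side and may scale its scores differently, and the function names below are hypothetical rather than part of any library.

```typescript
// Illustrative math only; not the library's or Cassandra's implementation.
// Both vectors are assumed to have the same length.
function dotProduct(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Cosine of the angle between a and b: 1 for same direction, 0 for orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  const norm = (v: number[]) => Math.sqrt(dotProduct(v, v));
  return dotProduct(a, b) / (norm(a) * norm(b));
}

// Straight-line distance between a and b: smaller means more similar.
function euclideanDistance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}
```

Cosine ignores vector length and compares direction only, while the dot product and Euclidean distance are both sensitive to magnitude.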
Indices
With version 5, Cassandra introduced Storage Attached Indexes, or SAIs. These allow WHERE filtering without specifying the partition key, and allow for additional operator types such as non-equalities. You can define these with the indices parameter, which accepts zero or more dictionaries, each containing name and value entries.
Indices are optional, though required if using filtered queries on non-partition columns.
- The name entry is part of the object name; on a table named test_table, an index with name: "some_column" would be idx_test_table_some_column.
- The value entry is the column on which the index is created, surrounded by ( and ). With the above column some_column, it would be specified as value: "(some_column)".
- An optional options entry is a map passed to the WITH OPTIONS = clause of the CREATE CUSTOM INDEX statement. The specific entries in this map depend on the index type.
indices: [{ name: "some_column", value: "(some_column)" }],
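The naming rule and the statement it feeds can be sketched as follows. This is an assumption-laden illustration, not the library's code: the helper names (IndexSpec, indexObjectName, createIndexCql) are hypothetical, and the CQL shape simply follows the standard CREATE CUSTOM INDEX syntax for a Storage Attached Index.

```typescript
// Hypothetical helpers mirroring the rules described above.
interface IndexSpec {
  name: string;
  value: string; // column wrapped in parentheses, e.g. "(some_column)"
  options?: Record<string, string>;
}

// idx_<table>_<name>, as described for the name entry.
function indexObjectName(table: string, index: IndexSpec): string {
  return `idx_${table}_${index.name}`;
}

// Sketch of a CQL statement that such an entry could correspond to.
function createIndexCql(keyspace: string, table: string, index: IndexSpec): string {
  const base =
    `CREATE CUSTOM INDEX IF NOT EXISTS ${indexObjectName(table, index)} ` +
    `ON ${keyspace}.${table} ${index.value} USING 'StorageAttachedIndex'`;
  const options = index.options;
  if (!options) return base;
  const entries = Object.keys(options)
    .map((key) => `'${key}': '${options[key]}'`)
    .join(", ");
  return `${base} WITH OPTIONS = { ${entries} }`;
}
```

Called with the keyspace test, the table test_table, and the entry shown above, createIndexCql produces a statement that creates idx_test_table_some_column on (some_column).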
Advanced Filtering
By default, filters are applied with an equality =. For those fields that have an indices entry, you may provide an operator, given as a string, that the index supports; in this case, you specify one or more filters, as either a singleton or in a list (which will be AND-ed together). For example:
{ name: "create_datetime", operator: ">", value: some_datetime_variable }
or
[
{ userid: userid_variable },
{ name: "create_datetime", operator: ">", value: some_date_variable },
];
value can be a single value or an array. If it is not an array, or there is only one element in value, the resulting query will be along the lines of ${name} ${operator} ? with value bound to the ?.
If there is more than one element in the value array, the number of unquoted ? in name is counted and subtracted from the length of value, and this number of ? is put on the right side of the operator; if there is more than one ?, they will be encapsulated in ( and ), e.g. (?, ?, ?).
This facilitates binding values on the left of the operator, which is useful for some functions; for example, a geo-distance filter:
{
name: "GEO_DISTANCE(coord, ?)",
operator: "<",
value: [new Float32Array([53.3730617,-6.3000515]), 10000],
},
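The placeholder rule can be sketched in a few lines. This is a hypothetical re-implementation for illustration, not the library's code: the names Filter, countUnquotedPlaceholders, and whereFragment are made up here, "unquoted" is assumed to mean outside single-quoted literals, and at least one placeholder is assumed to remain for the right-hand side.

```typescript
// Hypothetical helpers illustrating the placeholder rule described above.
interface Filter {
  name: string;
  operator?: string;
  value: unknown;
}

// Count ? characters that fall outside single-quoted string literals.
function countUnquotedPlaceholders(expr: string): number {
  let count = 0;
  let inQuote = false;
  for (let i = 0; i < expr.length; i++) {
    if (expr[i] === "'") inQuote = !inQuote;
    else if (expr[i] === "?" && !inQuote) count++;
  }
  return count;
}

// Build the WHERE fragment for one filter; equality is the default operator.
function whereFragment(filter: Filter): string {
  const operator = filter.operator ?? "=";
  const values = Array.isArray(filter.value) ? filter.value : [filter.value];
  if (values.length <= 1) return `${filter.name} ${operator} ?`;
  // Values not consumed by ? in name go to the right of the operator.
  const rightCount = values.length - countUnquotedPlaceholders(filter.name);
  const marks: string[] = [];
  for (let i = 0; i < rightCount; i++) marks.push("?");
  const right = rightCount > 1 ? `(${marks.join(", ")})` : marks.join(", ");
  return `${filter.name} ${operator} ${right}`;
}
```

For the geo-distance filter above, one of the two values binds to the ? inside name, leaving a single ? on the right: GEO_DISTANCE(coord, ?) < ?.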
Data Partitioning and Composite Keys
In some systems, you may wish to partition the data for various reasons, perhaps by user or by session. Data in Cassandra is always partitioned; by default this library will partition by the first primary key field. You may specify multiple columns which comprise the primary (unique) key of a record, and optionally indicate those fields which should be part of the partition key. For example, the vector store could be partitioned by both userid and collectionid, with additional fields docid and docpart making an individual entry unique:
...,
primaryKey: [
{name: "userid", type: "text", partition: true},
{name: "collectionid", type: "text", partition: true},
{name: "docid", type: "text"},
{name: "docpart", type: "int"},
],
...
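To show how such a primaryKey list maps onto CQL, here is a small sketch that derives a PRIMARY KEY clause from it. The helper names (KeyColumn, primaryKeyClause) are hypothetical, and the exact statement the library issues may differ; the sketch only follows the rules stated above, including the default of partitioning by the first field, and assumes a non-empty column list.

```typescript
// Hypothetical helper; follows the partitioning rules described above.
interface KeyColumn {
  name: string;
  type: string;
  partition?: boolean;
}

function primaryKeyClause(columns: KeyColumn[]): string {
  let partition = columns.filter((c) => c.partition).map((c) => c.name);
  // Default: with no explicit partition flags, the first field is the partition key.
  if (partition.length === 0) partition = [columns[0].name];
  // Remaining key columns become clustering columns, in declaration order.
  const clustering = columns
    .map((c) => c.name)
    .filter((name) => partition.indexOf(name) === -1);
  const tail = clustering.map((name) => `, ${name}`).join("");
  return `PRIMARY KEY ((${partition.join(", ")})${tail})`;
}
```

For the four-column example above, this yields PRIMARY KEY ((userid, collectionid), docid, docpart).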
When searching, you may include partition keys on the filter without defining indices for these columns; you do not need to specify all partition keys, but those you do specify must start from the first key and follow key order. In the above example, you could specify a filter of {userid: userid_variable} or {userid: userid_variable, collectionid: collectionid_variable}, but if you wanted to specify a filter of only {collectionid: collectionid_variable} you would have to include collectionid on the indices list.
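A minimal sketch of that rule, as this page states it, is shown below. The function columnsNeedingIndex is hypothetical and only models the description above; it does not query Cassandra or reflect the library's internals.

```typescript
// Hypothetical check of the rule as described above; not the library's code.
// Returns the filter columns that would need an entry in the indices list.
function columnsNeedingIndex(filterColumns: string[], partitionKeys: string[]): string[] {
  // Length of the leading run of partition keys that appear in the filter.
  let prefix = 0;
  while (
    prefix < partitionKeys.length &&
    filterColumns.indexOf(partitionKeys[prefix]) !== -1
  ) {
    prefix++;
  }
  const covered = partitionKeys.slice(0, prefix);
  return filterColumns.filter((column) => covered.indexOf(column) === -1);
}
```

With partition keys userid and collectionid, a filter on userid alone needs no index, while a filter on collectionid alone reports collectionid as needing one.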
Additional Configuration Options
In the configuration document, further optional parameters are provided; their defaults are:
...,
maxConcurrency: 25,
batchSize: 1,
withClause: "",
...
Parameter | Usage |
---|---|
maxConcurrency | How many concurrent requests will be sent to Cassandra at a given time. |
batchSize | How many documents will be sent on a single request to Cassandra. When using a value > 1, you should ensure your batch size will not exceed the Cassandra parameter batch_size_fail_threshold_in_kb. Batches are unlogged. |
withClause | Cassandra tables may be created with an optional WITH clause; this is generally not needed but provided for completeness. |
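To illustrate how batchSize shapes the requests, here is a minimal sketch that splits documents into batches. The chunk helper is hypothetical and not the library's implementation; it only shows the grouping that a batchSize setting implies.

```typescript
// Hypothetical helper: groups items into batches of at most batchSize.
function chunk<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

With the default batchSize of 1, every document is its own request, and maxConcurrency then bounds how many of those requests are in flight at once. Raising batchSize reduces the number of requests but makes each one larger, which is where the batch_size_fail_threshold_in_kb limit becomes relevant.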
Related
Vector store conceptual guide
Vector store how-to guides