Text splitters
Overview

Document splitting is often a crucial preprocessing step for many applications. It involves breaking down large texts into smaller, manageable chunks. This process offers several benefits, such as ensuring consistent processing of varying document lengths, overcoming input size limitations of models, and improving the quality of text representations used in retrieval systems. There are several strategies for splitting documents, each with its own advantages.
Key concepts

Text splitters split documents into smaller chunks for use in downstream applications.
Why split documents?

There are several reasons to split documents:

Handling non-uniform document lengths: Real-world document collections often contain texts of varying sizes. Splitting ensures consistent processing across all documents.
Overcoming model limitations: Many embedding models and language models have maximum input size constraints. Splitting allows us to process documents that would otherwise exceed these limits.
Improving representation quality: For longer documents, the quality of embeddings or other representations may degrade as they try to capture too much information. Splitting can lead to more focused and accurate representations of each section.
Enhancing retrieval precision: In information retrieval systems, splitting can improve the granularity of search results, allowing for more precise matching of queries to relevant document sections.
Optimizing computational resources: Working with smaller chunks of text can be more memory-efficient and allow for better parallelization of processing tasks.

Now, the next question is how to split the documents into chunks! There are several strategies, each with its own advantages.
Approaches
Length-based

The most intuitive strategy is to split documents based on their length. This simple yet effective approach ensures that each chunk doesn't exceed a specified size limit. Key benefits of length-based splitting:

Straightforward implementation
Consistent chunk sizes
Easily adaptable to different model requirements

Types of length-based splitting:

Token-based: Splits text based on the number of tokens, which is useful when working with language models.
Character-based: Splits text based on the number of characters, which can be more consistent across different types of text.
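To make the character-based idea concrete, here is a minimal standalone sketch of fixed-size chunking with optional overlap. This is illustrative only, not LangChain's implementation; the `splitByChars` name and its parameters are made up for this example.

```typescript
// Fixed-length character splitting with overlap (illustrative sketch).
// `chunkSize` and `chunkOverlap` mirror the option names used by
// LangChain's splitters, but this is not the library's implementation.
function splitByChars(text: string, chunkSize: number, chunkOverlap = 0): string[] {
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap; // how far the window advances each time
  for (let i = 0; i < text.length; i += step) {
    chunks.push(text.slice(i, i + chunkSize));
    if (i + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

// With overlap, the tail of each chunk repeats at the head of the next:
// splitByChars("abcdefghij", 4, 1) → ["abcd", "defg", "ghij"]
```

Overlap trades a little redundancy for context continuity: a sentence cut at a chunk boundary still appears whole in at least one chunk.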
Example implementation using LangChain's CharacterTextSplitter with character-based splitting:
import { CharacterTextSplitter } from "@langchain/textsplitters";
const textSplitter = new CharacterTextSplitter({
chunkSize: 100,
chunkOverlap: 0,
});
const texts = await textSplitter.splitText(document);
See the how-to guide for token-based splitting.
See the how-to guide for character-based splitting.
Text-structured based

Text is naturally organized into hierarchical units such as paragraphs, sentences, and words. We can leverage this inherent structure to inform our splitting strategy, creating splits that maintain natural language flow, maintain semantic coherence within each split, and adapt to varying levels of text granularity. LangChain's RecursiveCharacterTextSplitter implements this concept:
The RecursiveCharacterTextSplitter attempts to keep larger units (e.g., paragraphs) intact.
If a unit exceeds the chunk size, it moves to the next level (e.g., sentences).
This process continues down to the word level if necessary.
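The fallback behavior described above can be sketched in plain TypeScript. This is a simplified illustration of the recursive idea, not LangChain's actual algorithm (the real splitter also merges small pieces back together and can keep separators):

```typescript
// Recursive splitting sketch: try the largest separator first (paragraphs),
// and only fall back to smaller units for pieces that are still too long.
function recursiveSplit(
  text: string,
  chunkSize: number,
  separators: string[] = ["\n\n", ". ", " "]
): string[] {
  if (text.length <= chunkSize) return [text]; // already small enough
  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // No separators left: hard-split by length as a last resort.
    const out: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      out.push(text.slice(i, i + chunkSize));
    }
    return out;
  }
  // Split on the current separator and recurse into oversized pieces.
  return text.split(sep).flatMap((piece) => recursiveSplit(piece, chunkSize, rest));
}
```

Paragraph-sized pieces that already fit are returned untouched; only oversized pieces descend to sentence- and then word-level splitting.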
Here is example usage:
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 100,
chunkOverlap: 0,
});
const texts = await textSplitter.splitText(document);
See the how-to guide for recursive text splitting.
Document-structured based

Some documents have an inherent structure, such as HTML, Markdown, or JSON files. In these cases, it's beneficial to split the document based on its structure, as it often naturally groups semantically related text. Key benefits of structure-based splitting:

Preserves the logical organization of the document
Maintains context within each chunk
Can be more effective for downstream tasks like retrieval or summarization

Examples of structure-based splitting:

Markdown: Split based on headers (e.g., #, ##, ###)
HTML: Split using tags
JSON: Split by object or array elements
Code: Split by functions, classes, or logical blocks
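As one concrete example, Markdown header-based splitting can be sketched as follows. This standalone version (the `splitMarkdownByHeaders` name and `MarkdownSection` shape are invented for illustration) groups each line under the most recent header, so every chunk stays within one section:

```typescript
interface MarkdownSection {
  header: string; // the header line, or "" for content before the first header
  content: string;
}

// Group Markdown lines under their nearest preceding header (sketch only).
function splitMarkdownByHeaders(markdown: string): MarkdownSection[] {
  const sections: MarkdownSection[] = [];
  let current: MarkdownSection = { header: "", content: "" };
  for (const line of markdown.split("\n")) {
    if (/^#{1,6}\s/.test(line)) {
      // New header: close out the previous section if it has anything in it.
      if (current.header || current.content.trim()) sections.push(current);
      current = { header: line, content: "" };
    } else {
      current.content += line + "\n";
    }
  }
  sections.push(current);
  return sections;
}
```

A production splitter would also track header nesting (a ### section belongs inside its ## parent) and enforce a maximum chunk size, but the grouping principle is the same.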
See the how-to guide for code splitting.
Semantic meaning based

Unlike the previous methods, semantic-based splitting actually considers the content of the text. While other approaches use document or text structure as proxies for semantic meaning, this method directly analyzes the text's semantics. There are several ways to implement this, but conceptually the approach is to split text when there are significant changes in text meaning. As an example, we can use a sliding window approach to generate embeddings, and compare the embeddings to find significant differences:
Start with the first few sentences and generate an embedding.
Move to the next group of sentences and generate another embedding (e.g., using a sliding window approach).
Compare the embeddings to find significant differences, which indicate potential "break points" between semantic sections.

This technique helps create chunks that are more semantically coherent, potentially improving the quality of downstream tasks like retrieval or summarization.
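The sliding-window comparison described above can be sketched as follows. Here `embed` is a hypothetical stand-in for a real embedding model (it is not part of any library), and the 0.8 threshold is an arbitrary example value you would tune for your data:

```typescript
type Embedding = number[];

// Standard cosine similarity between two equal-length vectors.
function cosineSimilarity(a: Embedding, b: Embedding): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Start a new chunk wherever consecutive sentence embeddings diverge
// beyond a similarity threshold (conceptual sketch of semantic splitting).
function semanticSplit(
  sentences: string[],
  embed: (s: string) => Embedding,
  threshold = 0.8
): string[][] {
  if (sentences.length === 0) return [];
  const chunks: string[][] = [];
  let current: string[] = [sentences[0]];
  for (let i = 1; i < sentences.length; i++) {
    const sim = cosineSimilarity(embed(sentences[i - 1]), embed(sentences[i]));
    if (sim < threshold) {
      chunks.push(current); // similarity dropped: a semantic "break point"
      current = [];
    }
    current.push(sentences[i]);
  }
  chunks.push(current);
  return chunks;
}
```

In practice each window would cover several sentences rather than one, and the break-point threshold is often set statistically (e.g., from the distribution of similarity drops) rather than as a fixed constant.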