Skip to main content

Docx 文件

¥Docx files

DocxLoader 允许你从 Microsoft Word 文档中提取文本数据。它支持现代 .docx 格式和传统 .doc 格式。根据文件类型,需要其他依赖。

¥The DocxLoader allows you to extract text data from Microsoft Word documents. It supports both the modern .docx format and the legacy .doc format. Depending on the file type, additional dependencies are required.


设置

¥Setup

要使用 DocxLoader,你需要 @langchain/community 集成以及 mammothword-extractor 软件包:

¥To use DocxLoader, you'll need the @langchain/community integration along with either mammoth or word-extractor package:

  • mammoth:用于处理 .docx 文件。

    ¥mammoth: For processing .docx files.

  • word-extractor:用于处理 .doc 文件。

    ¥word-extractor: For handling .doc files.

安装

¥Installation

对于 .docx 文件

¥For .docx Files

npm install @langchain/community @langchain/core mammoth

对于 .doc 文件

¥For .doc Files

npm install @langchain/community @langchain/core word-extractor

用法

¥Usage

加载 .docx 文件

¥Loading .docx Files

对于 .docx 文件,初始化加载器时无需明确指定任何参数:

¥For .docx files, there is no need to explicitly specify any parameters when initializing the loader:

import { DocxLoader } from "@langchain/community/document_loaders/fs/docx";

const loader = new DocxLoader(
"src/document_loaders/tests/example_data/attention.docx"
);

const docs = await loader.load();

加载 .doc 文件

¥Loading .doc Files

对于 .doc 文件,初始化加载器时必须明确将 type 指定为 doc

¥For .doc files, you must explicitly specify the type as doc when initializing the loader:

import { DocxLoader } from "@langchain/community/document_loaders/fs/docx";

const loader = new DocxLoader(
"src/document_loaders/tests/example_data/attention.doc",
{
type: "doc",
}
);

const docs = await loader.load();