How to load PDF files
Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.
This guide covers how to load PDF documents into the Document format that we use downstream.
By default, one document is created for each page in the PDF file. You can change this behavior by setting the splitPages option to false.
Setup
Install the pdf-parse package with your package manager:
- npm: npm install pdf-parse
- Yarn: yarn add pdf-parse
- pnpm: pnpm add pdf-parse
Usage, one document per page
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
// Or, in web environments:
// import { WebPDFLoader } from "@langchain/community/document_loaders/web/pdf";
// const blob = new Blob(); // e.g. from a file input
// const loader = new WebPDFLoader(blob);
const loader = new PDFLoader("src/document_loaders/example_data/example.pdf");
const docs = await loader.load();
Usage, one document per file
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", {
  splitPages: false,
});
const docs = await loader.load();
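To make the effect of splitPages concrete, here is a minimal, self-contained sketch that does not call the loader at all. The SketchDocument shape and the buildDocuments helper are hypothetical names introduced for illustration, and joining pages with a blank line is an assumption about how merged content is assembled, not a statement of the loader's actual implementation.

```typescript
// Hypothetical, simplified document shape; the real loader attaches richer metadata.
interface SketchDocument {
  pageContent: string;
  metadata: { source: string; pageNumber?: number };
}

// Hypothetical helper that mimics the two output modes described above.
function buildDocuments(
  source: string,
  pageTexts: string[],
  splitPages = true
): SketchDocument[] {
  // One entry per page, numbered from 1.
  const perPage = pageTexts.map((text, index) => ({
    pageContent: text,
    metadata: { source, pageNumber: index + 1 },
  }));
  if (splitPages) {
    return perPage;
  }
  // Assumption: pages are merged with a blank line between them.
  return [{ pageContent: pageTexts.join("\n\n"), metadata: { source } }];
}

const pages = ["Page one text", "Page two text", "Page three text"];
const splitDocs = buildDocuments("example.pdf", pages);
const mergedDocs = buildDocuments("example.pdf", pages, false);
```

In this sketch splitDocs holds three entries and mergedDocs holds one. Per-page documents tend to suit retrieval, where page-level granularity helps, while a single document can be simpler when the whole file is passed to a text splitter afterwards.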
Usage, custom pdfjs build
By default, we use the pdfjs build bundled with pdf-parse, which is compatible with most environments, including Node.js and modern browsers. To use a more recent version or a custom build of pdfjs-dist, provide a custom pdfjs function that returns a promise resolving to the PDFJS object.
The following example uses the "legacy" build of pdfjs-dist (see the pdfjs docs), which includes several polyfills not included in the default build.
- npm: npm install pdfjs-dist
- Yarn: yarn add pdfjs-dist
- pnpm: pnpm add pdfjs-dist
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", {
  // you may need to add `.then(m => m.default)` to the end of the import
  pdfjs: () => import("pdfjs-dist/legacy/build/pdf.js"),
});
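The comment about `.then(m => m.default)` exists because, depending on the bundler and module format, a dynamic import may resolve either to the library object itself or to a module wrapper that exposes it under default. The sketch below shows one way to normalize both cases. PdfjsLike, unwrapDefault, and makePdfjsResolver are hypothetical names, and the { version } shape is only a stand-in for the real pdfjs object.

```typescript
// Hypothetical minimal stand-in for the pdfjs library object.
interface PdfjsLike {
  version: string;
}

// A dynamic import may resolve to the library itself or to { default: library }.
type ImportedModule = PdfjsLike | { default: PdfjsLike };

// Returns the library object regardless of which shape was imported.
function unwrapDefault(mod: ImportedModule): PdfjsLike {
  if ("default" in mod) {
    return mod.default;
  }
  return mod;
}

// Wraps an importer so it always resolves to the unwrapped library object.
function makePdfjsResolver(
  importer: () => Promise<ImportedModule>
): () => Promise<PdfjsLike> {
  return () => importer().then(unwrapDefault);
}

// Stand-ins for the two shapes an import might produce.
const direct: ImportedModule = { version: "0.0.0-example" };
const wrapped: ImportedModule = { default: { version: "0.0.0-example" } };

const fromDirect = unwrapDefault(direct);
const fromWrapped = unwrapDefault(wrapped);
```

With the real library, the importer would be the dynamic import of pdfjs-dist shown in the example above, and the resulting function would be passed as the pdfjs option. Whether the unwrapping step is needed depends on your build setup.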
Eliminating extra spaces
PDFs come in many varieties, which makes reading them a challenge. By default, the loader parses individual text elements and joins them with a space. If you see excessive spaces in the output, you can override the separator with an empty string:
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", {
  parsedItemSeparator: "",
});
const docs = await loader.load();
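The effect of the separator is easiest to see on a small input. The sketch below is a simplified illustration only: joinTextItems is a hypothetical helper, not part of the loader, and the fragments array is made-up data standing in for the text elements a PDF might yield when words are split across several elements.

```typescript
// Hypothetical helper, not part of LangChain: joins text elements with a separator.
function joinTextItems(items: string[], separator: string = " "): string {
  return items.join(separator);
}

// Made-up example: a PDF that emits word fragments as separate text elements,
// with the real space already included in one of the fragments.
const fragments = ["Port", "able", " Docu", "ment"];

// Joining with a space breaks words apart and doubles the genuine space.
const withSpaces = joinTextItems(fragments); // "Port able  Docu ment"

// Joining with an empty string keeps only the spaces present in the fragments.
const withoutSpaces = joinTextItems(fragments, ""); // "Portable Document"
```

An empty separator only helps when the PDF's text elements already carry their own whitespace. If they do not, words may run together, so it is worth checking the output on a sample of your files.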