Apify Dataset

This guide shows how to use Apify with LangChain to load documents from an Apify Dataset.

Overview

Apify is a cloud platform for web scraping and data extraction, which provides an ecosystem of more than two thousand ready-made apps called Actors for various web scraping, crawling, and data extraction use cases.

This guide shows how to load documents from an Apify Dataset, a scalable append-only storage built for storing structured web scraping results, such as a list of products or Google SERPs, which can then be exported to various formats like JSON, CSV, or Excel.
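
To see what a dataset actually contains, you can read its items directly with the official apify-client package. The following is a minimal sketch; "your-dataset-id" is a hypothetical placeholder for a real dataset ID, and it assumes APIFY_API_TOKEN is set in your environment:

import { ApifyClient } from "apify-client";

// Assumes APIFY_API_TOKEN is set in the environment.
const client = new ApifyClient({ token: process.env.APIFY_API_TOKEN });

// Fetch the first few items of a dataset ("your-dataset-id" is a placeholder).
const { items } = await client.dataset("your-dataset-id").listItems({ limit: 5 });
console.log(items);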

Datasets are typically used to save the results of different Actors. For example, the Website Content Crawler Actor deeply crawls websites such as documentation, knowledge bases, help centers, or blogs, and then stores the text content of the webpages in a dataset, from which you can feed the documents into a vector database and use them for information retrieval. Another example is the RAG Web Browser Actor, which queries Google Search, scrapes the top N pages from the results, and returns the cleaned content in Markdown format for further processing by a large language model.
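
As a rough sketch of that second use case, the RAG Web Browser Actor can be driven through the same loader interface shown later in this guide. The input fields (query, maxResults) and output fields (markdown, metadata.url) below are assumptions based on that Actor's published input/output schema, so verify them against its documentation before relying on them:

import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { Document } from "@langchain/core/documents";

// Hypothetical call to the RAG Web Browser Actor; the input and output field
// names are assumptions, check the Actor's schema on Apify before use.
const ragLoader = await ApifyDatasetLoader.fromActorCall(
  "apify/rag-web-browser",
  { query: "what is LangChain", maxResults: 3 },
  {
    datasetMappingFunction: (item) =>
      new Document({
        pageContent: (item.markdown || "") as string,
        metadata: { source: (item.metadata as { url?: string })?.url },
      }),
    clientOptions: { token: process.env.APIFY_API_TOKEN },
  }
);

const webDocs = await ragLoader.load();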

Setup

You'll first need to install the official Apify client:

npm install apify-client
npm install hnswlib-node @langchain/openai @langchain/community @langchain/core

You'll also need to sign up and retrieve your Apify API token.
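
Rather than hard-coding the token in your source, you can export it as an environment variable and read it at startup. A minimal sketch:

// In your shell: export APIFY_API_TOKEN="apify_api_..."
const APIFY_API_TOKEN = process.env.APIFY_API_TOKEN;
if (!APIFY_API_TOKEN) {
  throw new Error("APIFY_API_TOKEN is not set");
}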

Usage

From a New Dataset (Crawl a Website and Store the Data in an Apify Dataset)

If you don't already have an existing dataset on the Apify platform, you'll need to initialize the document loader by calling an Actor and waiting for the results. In the example below, we use the Website Content Crawler Actor to crawl the LangChain documentation, store the results in an Apify Dataset, and then load the dataset using the ApifyDatasetLoader. For this demonstration, we'll use the fast Cheerio crawler type and limit the number of crawled pages to 10.

Note: Running the Website Content Crawler may take some time, depending on the size of the website. For large sites, it can take several hours or even days!

Here's an example:

import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";

const APIFY_API_TOKEN = "YOUR-APIFY-API-TOKEN"; // or set as process.env.APIFY_API_TOKEN
const OPENAI_API_KEY = "YOUR-OPENAI-API-KEY"; // or set as process.env.OPENAI_API_KEY

/*
 * datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
 * In the below example, the Apify dataset format looks like this:
 * {
 *   "url": "https://apify.com",
 *   "text": "Apify is the best web scraping and automation platform."
 * }
 */
const loader = await ApifyDatasetLoader.fromActorCall(
  "apify/website-content-crawler",
  {
    maxCrawlPages: 10,
    crawlerType: "cheerio",
    startUrls: [{ url: "https://js.langchain.com/docs/" }],
  },
  {
    datasetMappingFunction: (item) =>
      new Document({
        pageContent: (item.text || "") as string,
        metadata: { source: item.url },
      }),
    clientOptions: {
      token: APIFY_API_TOKEN,
    },
  }
);

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(
  docs,
  new OpenAIEmbeddings({ apiKey: OPENAI_API_KEY })
);

const model = new ChatOpenAI({
  temperature: 0,
  apiKey: OPENAI_API_KEY,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
  llm: model,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorStore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
  LangChain is a framework for developing applications powered by language models.
  [
    'https://js.langchain.com/docs/',
    'https://js.langchain.com/docs/modules/chains/',
    'https://js.langchain.com/docs/modules/chains/llmchain/',
    'https://js.langchain.com/docs/category/functions-4'
  ]
*/


From an Existing Dataset

If you've already run an Actor and have an existing dataset on the Apify platform, you can initialize the document loader directly using its constructor:

import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";

const APIFY_API_TOKEN = "YOUR-APIFY-API-TOKEN"; // or set as process.env.APIFY_API_TOKEN
const OPENAI_API_KEY = "YOUR-OPENAI-API-KEY"; // or set as process.env.OPENAI_API_KEY

/*
 * datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
 * In the below example, the Apify dataset format looks like this:
 * {
 *   "url": "https://apify.com",
 *   "text": "Apify is the best web scraping and automation platform."
 * }
 */
const loader = new ApifyDatasetLoader("your-dataset-id", {
  datasetMappingFunction: (item) =>
    new Document({
      pageContent: (item.text || "") as string,
      metadata: { source: item.url },
    }),
  clientOptions: {
    token: APIFY_API_TOKEN,
  },
});

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(
  docs,
  new OpenAIEmbeddings({ apiKey: OPENAI_API_KEY })
);

const model = new ChatOpenAI({
  temperature: 0,
  apiKey: OPENAI_API_KEY,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
  llm: model,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorStore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
  LangChain is a framework for developing applications powered by language models.
  [
    'https://js.langchain.com/docs/',
    'https://js.langchain.com/docs/modules/chains/',
    'https://js.langchain.com/docs/modules/chains/llmchain/',
    'https://js.langchain.com/docs/category/functions-4'
  ]
*/
