Webpages, with Playwright

Compatibility

Only available on Node.js.

This example shows how to load data from webpages using Playwright. One document is created for each webpage.

Playwright is a Node.js library that provides a high-level API for controlling multiple browser engines, including Chromium, Firefox, and WebKit. You can use Playwright to automate web page interactions, including extracting data from dynamic web pages that require JavaScript to render.

If you want a lighter-weight solution and the webpages you want to load do not require JavaScript to render, you can use the CheerioWebBaseLoader instead.

Setup

npm install @langchain/community @langchain/core playwright

Usage

import { PlaywrightWebBaseLoader } from "@langchain/community/document_loaders/web/playwright";

/**
 * The loader uses `page.content()`
 * as the default evaluate function.
 */
const loader = new PlaywrightWebBaseLoader("https://www.tabnews.com.br/");

const docs = await loader.load();

Options

The PlaywrightWebBaseLoader constructor accepts the following parameters, described by the PlaywrightWebBaseLoaderOptions interface:

type PlaywrightWebBaseLoaderOptions = {
  launchOptions?: LaunchOptions;
  gotoOptions?: PlaywrightGotoOptions;
  evaluate?: PlaywrightEvaluate;
};

  1. launchOptions: an optional object that specifies additional options to pass to the playwright.chromium.launch() method. This can include options such as the headless flag, which launches the browser in headless mode.

  2. gotoOptions: an optional object that specifies additional options to pass to the page.goto() method. This can include the timeout option, which sets the maximum navigation time in milliseconds, or the waitUntil option, which specifies when the navigation is considered successful.

  3. evaluate: an optional custom function that runs against the loaded page in place of the default evaluation. This can be useful for extracting data from the page, interacting with page elements, or handling specific HTTP responses. The function should return a Promise that resolves to a string containing the result of the evaluation.

By passing these options to the PlaywrightWebBaseLoader constructor, you can customize the behavior of the loader and use Playwright's features to scrape and interact with web pages.
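The evaluate option described above can be sketched as a standalone function. The following is a minimal sketch, not the library's own implementation: PageLike, extractTitleAndText, and stubPage are hypothetical names introduced here. PageLike lists only the two Playwright Page methods the function calls (title() and innerText()), so the function can be exercised with a stub instead of a real browser.

```typescript
// Hypothetical structural type: only the Playwright Page methods used below.
interface PageLike {
  title(): Promise<string>;
  innerText(selector: string): Promise<string>;
}

// Custom evaluate function: resolves to a string, as the loader expects.
async function extractTitleAndText(page: PageLike): Promise<string> {
  const title = await page.title();
  const body = await page.innerText("body");
  return `${title}\n\n${body.trim()}`;
}

// Stub page (an assumption for illustration) so no browser is needed.
const stubPage: PageLike = {
  title: async () => "Example",
  innerText: async () => "  Hello from the page  ",
};

extractTitleAndText(stubPage).then((text) => console.log(text));
```

Since a real Playwright Page provides both methods, passing { evaluate: extractTitleAndText } as the second constructor argument should type-check, but treat that as an assumption to verify against your installed versions.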

Here is a basic example:

import { PlaywrightWebBaseLoader } from "@langchain/community/document_loaders/web/playwright";

const url = "https://www.tabnews.com.br/";
const loader = new PlaywrightWebBaseLoader(url);
const docs = await loader.load();

// raw HTML page content
const extractedContents = docs[0].pageContent;
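Because pageContent holds raw HTML by default, you may want plain text before passing it on. The following is a deliberately naive sketch under that assumption: htmlToPlainText and sampleHtml are hypothetical names, the regular expressions only handle simple markup and two common entities, and a real HTML parser would be more robust.

```typescript
// Naive HTML-to-text conversion; not a substitute for a real HTML parser.
function htmlToPlainText(html: string): string {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ") // drop script/style blocks
    .replace(/<[^>]+>/g, " ") // drop remaining tags
    .replace(/&nbsp;/g, " ") // decode two common entities only
    .replace(/&amp;/g, "&")
    .replace(/\s+/g, " ") // collapse whitespace
    .trim();
}

// Sample input standing in for docs[0].pageContent.
const sampleHtml =
  "<html><head><style>p{color:red}</style></head><body><h1>Title</h1><p>Fish &amp; chips</p></body></html>";

console.log(htmlToPlainText(sampleHtml)); // prints "Title Fish & chips"
```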

And a more advanced example:

import {
  PlaywrightWebBaseLoader,
  Page,
  Browser,
} from "@langchain/community/document_loaders/web/playwright";
// Playwright's own Response type (not the global fetch Response)
import type { Response } from "playwright";

const loader = new PlaywrightWebBaseLoader("https://www.tabnews.com.br/", {
  launchOptions: {
    headless: true,
  },
  gotoOptions: {
    waitUntil: "domcontentloaded",
  },
  /** Pass a custom evaluate function; it receives the page, browser, and response instances */
  async evaluate(page: Page, browser: Browser, response: Response | null) {
    // Wait for a specific network response before reading the DOM
    await page.waitForResponse("https://www.tabnews.com.br/va/view");

    const result = await page.evaluate(() => document.body.innerHTML);
    return result;
  },
});

const docs = await loader.load();
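Each loader in the examples above is constructed with a single URL, so loading several pages means creating one loader per URL. The following is a minimal sketch of that pattern: LoadedDoc, LoaderLike, loadAll, and stubFactory are hypothetical names, the types only mirror the shape used in this page (pageContent plus metadata), and the metadata.source field in the stub is an assumption.

```typescript
// Hypothetical minimal shapes, standing in for the real document and loader.
type LoadedDoc = { pageContent: string; metadata: Record<string, unknown> };
type LoaderLike = { load(): Promise<LoadedDoc[]> };

// Load each URL with its own loader; record failures instead of aborting.
async function loadAll(
  urls: string[],
  makeLoader: (url: string) => LoaderLike
): Promise<{ docs: LoadedDoc[]; failed: string[] }> {
  const docs: LoadedDoc[] = [];
  const failed: string[] = [];
  for (const url of urls) {
    try {
      docs.push(...(await makeLoader(url).load()));
    } catch {
      failed.push(url);
    }
  }
  return { docs, failed };
}

// Stub factory (an assumption for illustration) so no browser is needed.
const stubFactory = (url: string): LoaderLike => ({
  load: async () => {
    if (url.includes("bad")) throw new Error("navigation failed");
    return [{ pageContent: `<html>${url}</html>`, metadata: { source: url } }];
  },
});

loadAll(["https://example.com/a", "https://example.com/bad"], stubFactory).then(
  ({ docs, failed }) => console.log(docs.length, failed)
);
```

With the real loader, makeLoader could be (url) => new PlaywrightWebBaseLoader(url). The loop is sequential on purpose, so that only one page is loaded at a time; a concurrent version is possible but would use more memory.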