Spider

Spider is the fastest crawler. It converts any website into pure HTML, markdown, metadata, or text, while enabling you to crawl with custom actions using AI.

Overview

Spider lets you use high-performance proxies to avoid detection, caches AI actions, provides webhooks for crawl status, supports scheduled crawls, and more.

This guide shows how to scrape or crawl a website using Spider and load the resulting LLM-ready documents with SpiderLoader in LangChain.

Setup

Get your own Spider API key at spider.cloud.

Usage

Here's an example of how to use the SpiderLoader:

Spider offers two scraping modes, "scrape" and "crawl". Scrape fetches only the content of the URL provided, while crawl fetches that URL and then follows its subpages for deeper scraping.

npm install @langchain/community @langchain/core @spider-cloud/spider-client

import { SpiderLoader } from "@langchain/community/document_loaders/web/spider";

const loader = new SpiderLoader({
  url: "https://spider.cloud", // The URL to scrape
  apiKey: process.env.SPIDER_API_KEY, // Optional, defaults to `SPIDER_API_KEY` in your env.
  mode: "scrape", // "scrape" for a single URL, or "crawl" to follow subpages
  // params: {
  //   // Optional parameters per the Spider API docs:
  //   // https://spider.cloud/docs/api
  // },
});

const docs = await loader.load();

API Reference:

  • SpiderLoader from @langchain/community/document_loaders/web/spider
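The example above runs in scrape mode, which fetches a single page. Switching to crawl mode only changes the mode option. A minimal sketch, assuming the same option names as the scrape example (the loader call itself is shown in comments so the snippet stays self-contained):

```typescript
// Crawl-mode options: identical to the scrape example except for `mode`.
const crawlOptions = {
  url: "https://spider.cloud", // starting URL; subpages are followed from here
  apiKey: process.env.SPIDER_API_KEY, // optional, read from the env by default
  mode: "crawl", // "crawl" follows subpages; "scrape" fetches only this URL
};

// Usage, as in the scrape example above:
//   const loader = new SpiderLoader(crawlOptions);
//   const docs = await loader.load();
console.log(crawlOptions.mode); // prints "crawl"
```

Crawl mode can return many documents from a single call, so consider limiting the crawl via params (see below) before pointing it at a large site.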

Additional Parameters

See the Spider documentation for all the available params.
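Params are passed through to the Spider API unchanged. A hedged sketch follows; the parameter names used here (limit, return_format) are assumptions based on Spider's API docs and may change, so confirm them at https://spider.cloud/docs/api before relying on them:

```typescript
// Sketch of loader options with pass-through params.
// `limit` and `return_format` are assumed parameter names; verify them
// against https://spider.cloud/docs/api before use.
const options = {
  url: "https://spider.cloud",
  mode: "crawl",
  params: {
    limit: 5, // stop after 5 pages (assumed param name)
    return_format: "markdown", // request markdown output (assumed param name)
  },
};

// Usage: const loader = new SpiderLoader(options); const docs = await loader.load();
console.log(options.params.limit); // prints 5
```

Keeping a small limit while experimenting avoids consuming API credits on an unbounded crawl.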