Skip to main content

GitHub

本示例介绍如何从 GitHub 存储库加载数据。你可以将 GITHUB_ACCESS_TOKEN 环境变量设置为 GitHub 访问令牌,以提高速率限制并访问私有仓库。

¥This example goes over how to load data from a GitHub repository. You can set the GITHUB_ACCESS_TOKEN environment variable to a GitHub access token to increase the rate limit and access private repositories.

设置

¥Setup

GitHub 加载器需要 忽略 npm 包 作为对等依赖。按如下方式安装:

¥The GitHub loader requires the ignore npm package as a peer dependency. Install it like this:

npm install @langchain/community @langchain/core ignore

用法

¥Usage

import { GithubRepoLoader } from "@langchain/community/document_loaders/web/github";

export const run = async () => {
const loader = new GithubRepoLoader(
"https://github.com/langchain-ai/langchainjs",
{
branch: "main",
recursive: false,
unknown: "warn",
maxConcurrency: 5, // Defaults to 2
}
);
const docs = await loader.load();
console.log({ docs });
};

API Reference:

加载器将忽略图片等二进制文件。

¥The loader will ignore binary files like images.

使用 .gitignore 语法

¥Using .gitignore Syntax

要忽略特定文件,你可以将 ignorePaths 数组传入构造函数:

¥To ignore specific files, you can pass in an ignorePaths array into the constructor:

import { GithubRepoLoader } from "@langchain/community/document_loaders/web/github";

export const run = async () => {
const loader = new GithubRepoLoader(
"https://github.com/langchain-ai/langchainjs",
{ branch: "main", recursive: false, unknown: "warn", ignorePaths: ["*.md"] }
);
const docs = await loader.load();
console.log({ docs });
// Will not include any .md files
};

API Reference:

使用不同的 GitHub 实例

¥Using a Different GitHub Instance

你可能需要使用与 github.com 不同的 GitHub 实例,例如,如果你的公司有一个 GitHub Enterprise 实例。为此,你需要两个附加参数:

¥You may want to target a different GitHub instance than github.com, e.g. if you have a GitHub Enterprise instance for your company. For this you need two additional parameters:

  • baseUrl - GitHub 实例的基 URL,因此 githubUrl 与 <baseUrl>/<owner>/<repo>/... 匹配

    ¥baseUrl - the base URL of your GitHub instance, so the githubUrl matches <baseUrl>/<owner>/<repo>/...

  • apiUrl - 你的 GitHub 实例的 API 端点的 URL

    ¥apiUrl - the URL of the API endpoint of your GitHub instance

import { GithubRepoLoader } from "@langchain/community/document_loaders/web/github";

export const run = async () => {
const loader = new GithubRepoLoader(
"https://github.your.company/org/repo-name",
{
baseUrl: "https://github.your.company",
apiUrl: "https://github.your.company/api/v3",
accessToken: "ghp_A1B2C3D4E5F6a7b8c9d0",
branch: "main",
recursive: true,
unknown: "warn",
}
);
const docs = await loader.load();
console.log({ docs });
};

API Reference:

处理子模块

¥Dealing with Submodules

如果你的存储库包含子模块,你必须决定加载器是否应该遵循它们。你可以使用布尔值 processSubmodules 参数控制这一点。默认情况下,不处理子模块。请注意,处理子模块仅与将 recursive 参数设置为 true 结合使用。

¥In case your repository has submodules, you have to decide if the loader should follow them or not. You can control this with the boolean processSubmodules parameter. By default, submodules are not processed. Note that processing submodules works only in conjunction with setting the recursive parameter to true.

import { GithubRepoLoader } from "@langchain/community/document_loaders/web/github";

export const run = async () => {
const loader = new GithubRepoLoader(
"https://github.com/langchain-ai/langchainjs",
{
branch: "main",
recursive: true,
processSubmodules: true,
unknown: "warn",
}
);
const docs = await loader.load();
console.log({ docs });
};

API Reference:

请注意,加载器不会跟踪位于当前存储库之外的其他 GitHub 实例上的子模块。

¥Note, that the loader will not follow submodules which are located on another GitHub instance than the one of the current repository.

大型代码库的流式传输

¥Stream large repository

使用分数进行相似度搜索:你可以使用 loadAsStream 方法从整个 GitHub 存储库异步传输文档。

¥For situations where processing large repositories in a memory-efficient manner is required. You can use the loadAsStream method to asynchronously streams documents from the entire GitHub repository.

import { GithubRepoLoader } from "@langchain/community/document_loaders/web/github";

export const run = async () => {
const loader = new GithubRepoLoader(
"https://github.com/langchain-ai/langchainjs",
{
branch: "main",
recursive: false,
unknown: "warn",
maxConcurrency: 3, // Defaults to 2
}
);

const docs = [];
for await (const doc of loader.loadAsStream()) {
docs.push(doc);
}

console.log({ docs });
};

API Reference: