html-to-text

When ingesting HTML documents for later retrieval, we usually care only about the actual content of the webpage, not its markup and semantic structure. Stripping HTML tags from documents with the HtmlToTextTransformer can produce more content-rich chunks, which makes retrieval more effective.
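
To make the idea concrete, here is a dependency-free sketch of what "stripping tags" means. It is a deliberately naive illustration, not how html-to-text works internally (the package uses a real HTML parser and handles far more cases); the function name `naiveStripTags` and the sample markup are made up for this example.

```typescript
// Simplified illustration only: a regex-based stripper, NOT the html-to-text implementation.
function naiveStripTags(html: string): string {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ") // drop script/style blocks entirely
    .replace(/<[^>]+>/g, " ") // replace every remaining tag with a space
    .replace(/&nbsp;/g, " ") // decode a couple of common entities
    .replace(/&amp;/g, "&")
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim();
}

// Hypothetical markup, loosely shaped like a link-aggregator row.
const rawHtml =
  '<tr class="athing"><td class="title"><a href="https://example.com/story">Example story</a></td>' +
  '<td class="subtext"><span class="score">75 points</span></td></tr>';

const text = naiveStripTags(rawHtml);

console.log(text);
console.log(`${rawHtml.length} characters of HTML, ${text.length} characters of text`);
```

Most of the raw string is markup, so a chunk of fixed size holds far more readable content once the tags are gone.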

Setup

You'll need to install the html-to-text npm package:

npm install html-to-text

Though the transformer itself does not require it, the usage example below needs cheerio for scraping:

npm install cheerio
npm install @langchain/community @langchain/core @langchain/textsplitters

Usage

The example below scrapes a Hacker News thread, splits it on HTML tags so that chunks are grouped by the semantic information in the markup, then extracts the text content from each chunk:

import { HTMLWebBaseLoader } from "@langchain/community/document_loaders/web/html";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { HtmlToTextTransformer } from "@langchain/community/document_transformers/html_to_text";

const loader = new HTMLWebBaseLoader(
  "https://news.ycombinator.com/item?id=34817881"
);

const docs = await loader.load();

const splitter = RecursiveCharacterTextSplitter.fromLanguage("html");
const transformer = new HtmlToTextTransformer();

const sequence = splitter.pipe(transformer);

const newDocuments = await sequence.invoke(docs);

console.log(newDocuments);

/*
  [
    Document {
      pageContent: 'Hacker News new | past | comments | ask | show | jobs | submit login What Lights\n' +
        'the Universe’s Standard Candles? (quantamagazine.org) 75 points by Amorymeltzer\n' +
        '5 months ago | hide | past | favorite | 6 comments delta_p_delta_x 5 months ago\n' +
        '| next [–] Astrophysical and cosmological simulations are often insightful.\n' +
        "They're also very cross-disciplinary; besides the obvious astrophysics, there's\n" +
        'networking and sysadmin, parallel computing and algorithm theory (so that the\n' +
        'simulation programs are actually fast but still accurate), systems design, and\n' +
        'even a bit of graphic design for the visualisations.Some of my favourite\n' +
        'simulation projects:- IllustrisTNG:',
      metadata: {
        source: 'https://news.ycombinator.com/item?id=34817881',
        loc: [Object]
      }
    },
    Document {
      pageContent: 'that the simulation programs are actually fast but still accurate), systems\n' +
        'design, and even a bit of graphic design for the visualisations.Some of my\n' +
        'favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/-\n' +
        'SWIFT: https://swift.dur.ac.uk/- CO5BOLD:\n' +
        'https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations\n' +
        'of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)-\n' +
        'AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the\n' +
        'simulations in the article, too. froeb 5 months ago | parent | next [–]\n' +
        'Supernova simulations are especially interesting too. I have heard them\n' +
        'described as the only time in physics when all 4 of the fundamental forces are\n' +
        'important. The explosion can be quite finicky too. If I remember right, you\n' +
        "can't get supernova to explode",
      metadata: {
        source: 'https://news.ycombinator.com/item?id=34817881',
        loc: [Object]
      }
    },
    Document {
      pageContent: 'heard them described as the only time in physics when all 4 of the fundamental\n' +
        'forces are important. The explosion can be quite finicky too. If I remember\n' +
        "right, you can't get supernova to explode properly in 1D simulations, only in\n" +
        'higher dimensions. This was a mystery until the realization that turbulence is\n' +
        'necessary for supernova to trigger--there is no turbulent flow in 1D. andrewflnr\n' +
        "5 months ago | prev | next [–] Whoa. I didn't know the accretion theory of Ia\n" +
        'supernovae was dead, much less that it had been since 2011. andreareina 5 months\n' +
        'ago | prev | next [–] This seems to be the paper',
      metadata: {
        source: 'https://news.ycombinator.com/item?id=34817881',
        loc: [Object]
      }
    },
    Document {
      pageContent: 'andreareina 5 months ago | prev | next [–] This seems to be the paper\n' +
        'https://academic.oup.com/mnras/article/517/4/5260/6779709 andreareina 5 months\n' +
        "ago | prev [–] Wouldn't double detonation show up as variance in the brightness?\n" +
        'yencabulator 5 months ago | parent [–] Or widening of the peak. If one type Ia\n' +
        'supernova goes 1,2,3,2,1, the sum of two could go 1+0=1 2+1=3 3+2=5 2+3=5 1+2=3\n' +
        '0+1=1 Guidelines | FAQ | Lists |',
      metadata: {
        source: 'https://news.ycombinator.com/item?id=34817881',
        loc: [Object]
      }
    },
    Document {
      pageContent: 'the sum of two could go 1+0=1 2+1=3 3+2=5 2+3=5 1+2=3 0+1=1 Guidelines | FAQ |\n' +
        'Lists | API | Security | Legal | Apply to YC | Contact Search:',
      metadata: {
        source: 'https://news.ycombinator.com/item?id=34817881',
        loc: [Object]
      }
    }
  ]
*/
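
For experimenting without a network request, the same two steps can also be run one at a time on an inline document. The following is a minimal sketch under a few assumptions: the HTML string, the `inline-example` source label, and the `chunkSize`/`chunkOverlap` values are invented for illustration, and the exact text produced depends on the html-to-text defaults in your installed version.

```typescript
import { Document } from "@langchain/core/documents";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { HtmlToTextTransformer } from "@langchain/community/document_transformers/html_to_text";

// Hypothetical input standing in for a scraped page.
const inlineHtml =
  "<html><body><h1>Standard candles</h1>" +
  "<p>Type Ia supernovae are used to measure cosmic distances.</p></body></html>";

const inlineDocs = [
  new Document({ pageContent: inlineHtml, metadata: { source: "inline-example" } }),
];

// Split on HTML-aware separators first (sizes chosen arbitrarily for this sketch)...
const htmlSplitter = RecursiveCharacterTextSplitter.fromLanguage("html", {
  chunkSize: 200,
  chunkOverlap: 0,
});
const splitDocs = await htmlSplitter.transformDocuments(inlineDocs);

// ...then strip the markup from each chunk.
const textTransformer = new HtmlToTextTransformer();
const textDocs = await textTransformer.transformDocuments(splitDocs);

for (const doc of textDocs) {
  console.log(doc.metadata.source, JSON.stringify(doc.pageContent));
}
```

Running the steps separately makes it easy to inspect the intermediate HTML chunks before the tags are removed.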

Customization

You can pass the transformer any of the options accepted by the html-to-text package to customize how it works.
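
As a hedged illustration, the sketch below passes a few html-to-text options (`wordwrap`, plus `selectors` entries that drop link URLs and skip images) to the transformer's constructor. The sample HTML is invented, and which options suit your data is an assumption you should check against the html-to-text documentation for your installed version.

```typescript
import { Document } from "@langchain/core/documents";
import { HtmlToTextTransformer } from "@langchain/community/document_transformers/html_to_text";

// Options are forwarded to html-to-text; the values here are illustrative choices.
const customTransformer = new HtmlToTextTransformer({
  wordwrap: false, // do not hard-wrap lines at a fixed column
  selectors: [
    { selector: "a", options: { ignoreHref: true } }, // keep link text, drop the URL
    { selector: "img", format: "skip" }, // omit images entirely
  ],
});

// Hypothetical input document.
const sampleDocs = [
  new Document({
    pageContent:
      '<p>Read <a href="https://example.com">the docs</a>.</p><img src="logo.png" alt="logo">',
  }),
];

const customized = await customTransformer.transformDocuments(sampleDocs);
const plain = customized[0].pageContent;

console.log(plain);
```

Dropping link targets and images like this can keep chunks focused on readable text, at the cost of losing URLs that might otherwise be useful as metadata.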