upsertFromWebpageRecursive
Inserts or updates vectors from a webpage and its linked pages recursively.
upsertFromWebpageRecursive(
webpage: string,
options?: {
excludeDirs?: string[]
maxDepth?: number
timeout?: number
preventOutside?: boolean
metadata?: Record<string, any>
textSplitter?: SplitterParams
}
): Promise<string[]>
Reference
import { myVectorStore } from "#elements";
export default async function () {
const ids = await myVectorStore.upsertFromWebpageRecursive(
"https://docs.babel.cloud/docs/overview",
{ maxDepth: 2 }
);
console.log(`${ids.length} vectors upserted`);
}
Parameters
webpage
: The URL of the webpage to start extracting content from.
options
: Optional configuration parameters, including:
excludeDirs
: (optional) Webpage directories to exclude from the crawl.
maxDepth
: (optional) The maximum depth to crawl. Defaults to 2. To crawl the whole website, set it to a sufficiently large number.
timeout
: (optional) The timeout for each request, in milliseconds. Defaults to 10000 (10 seconds).
preventOutside
: (optional) Whether to prevent crawling outside the root URL. Defaults to true.
metadata
: (optional) The metadata to associate with the vectors.
textSplitter
: (optional) The text splitter used to divide the content into multiple vectors. If no splitter is provided, the token splitter is used by default.
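To make the interplay of these options concrete, here is a minimal sketch of the gating logic a recursive crawler like this typically applies before visiting a link. The `shouldCrawl` helper is hypothetical and illustrative only; it is not part of the actual API, and the real implementation may differ.

```typescript
// Hypothetical helper sketching how maxDepth, preventOutside, and
// excludeDirs might decide whether a discovered link gets crawled.
function shouldCrawl(
  url: string,
  depth: number,
  rootUrl: string,
  options: { maxDepth?: number; preventOutside?: boolean; excludeDirs?: string[] } = {}
): boolean {
  const { maxDepth = 2, preventOutside = true, excludeDirs = [] } = options;
  if (depth > maxDepth) return false;                              // deeper than maxDepth
  if (preventOutside && !url.startsWith(rootUrl)) return false;    // outside the root URL
  if (excludeDirs.some((dir) => url.startsWith(dir))) return false; // in an excluded directory
  return true;
}

const root = "https://docs.babel.cloud/docs";
console.log(shouldCrawl(`${root}/overview`, 1, root)); // true
console.log(shouldCrawl(`${root}/overview`, 3, root)); // false: exceeds default maxDepth of 2
console.log(shouldCrawl("https://example.com/page", 1, root)); // false: outside the root URL
```

With the defaults shown, only pages up to two links deep under the starting URL are visited; raising maxDepth or disabling preventOutside widens the crawl accordingly.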
Returns
Promise of an array of IDs of the upserted vectors.
Caveats
- This method will insert a new vector for each webpage, or update the existing vector if the webpage has been previously upserted.
- Depending on how the website's navigation menu is structured, this method may not recurse through pages in the way you expect.
- You can query all the resulting vectors by filtering on the metadata field `source-by-babel` with the value `webpage`.
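The metadata filter above can be sketched as plain TypeScript. The `VectorRecord` shape and `filterBySource` helper below are hypothetical illustrations of the filter, not calls into the actual vector store API, whose query interface may differ.

```typescript
// Hypothetical record shape: each upserted vector carries its metadata.
interface VectorRecord {
  id: string;
  metadata: Record<string, any>;
}

// Hypothetical helper: keep only records whose source-by-babel field matches.
function filterBySource(records: VectorRecord[], source: string): VectorRecord[] {
  return records.filter((r) => r.metadata["source-by-babel"] === source);
}

const records: VectorRecord[] = [
  { id: "a", metadata: { "source-by-babel": "webpage" } },
  { id: "b", metadata: { "source-by-babel": "file" } },
];
// Logs the ids of records whose source is "webpage": ["a"]
console.log(filterBySource(records, "webpage").map((r) => r.id));
```

Filtering on `source-by-babel` lets you isolate vectors that came from this crawl from vectors upserted by other methods.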