upsertFromWebpageRecursive

Inserts or updates vectors from a webpage and its linked pages recursively.

upsertFromWebpageRecursive(
  webpage: string,
  options?: {
    excludeDirs?: string[]
    maxDepth?: number
    timeout?: number
    preventOutside?: boolean
    metadata?: Record<string, any>
    textSplitter?: SplitterParams
  }
): Promise<string[]>

Reference

import { myVectorStore } from "#elements";

export default async function () {
  const ids = await myVectorStore.upsertFromWebpageRecursive(
    "https://docs.babel.cloud/docs/overview",
    { maxDepth: 2 }
  );
  console.log(`${ids.length} vectors upserted`);
}

Parameters

  • webpage: The URL of the webpage to start extracting content from.
  • options: Optional configuration parameters, including:
    • excludeDirs: (optional) Webpage directories to exclude.
    • maxDepth: (optional) The maximum depth to crawl. Defaults to 2. To crawl an entire website, set it to a sufficiently large number.
    • timeout: (optional) The timeout for each request, in milliseconds. Defaults to 10000 (10 seconds).
    • preventOutside: (optional) Whether to prevent crawling outside the root URL. Defaults to true.
    • metadata: (optional) The metadata to associate with the vectors.
    • textSplitter: (optional) The text splitter used to divide the content into multiple vectors. If no splitter is provided, the token splitter is used by default.
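The options above can be combined in a single call. The sketch below follows the Reference example (same `#elements` import and `myVectorStore` element); the directory, timeout, and metadata values are purely illustrative:

```typescript
import { myVectorStore } from "#elements";

export default async function () {
  // Crawl the docs site, skipping an excluded directory (illustrative URL),
  // staying inside the root URL, and tagging every resulting vector with
  // custom metadata.
  const ids = await myVectorStore.upsertFromWebpageRecursive(
    "https://docs.babel.cloud/docs/overview",
    {
      excludeDirs: ["https://docs.babel.cloud/docs/changelog"],
      maxDepth: 3,
      timeout: 15000, // 15 seconds per request
      preventOutside: true,
      metadata: { project: "docs-search" }, // illustrative value
    }
  );
  console.log(`${ids.length} vectors upserted`);
}
```

The custom `metadata` entries are attached to every vector produced by the crawl, so they can later be used to filter queries to just this content.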

Returns

A Promise that resolves to an array of IDs of the upserted vectors.

Caveats

  • This method will insert a new vector for each webpage, or update the existing vector if the webpage has been previously upserted.
  • Depending on how the website's navigation menu is structured, the crawl may not recurse in the way you expect.
  • You can query all the results by filtering on the metadata field source-by-babel with the value webpage.
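The query API itself is not covered on this page. As an illustration only, assuming a hypothetical similaritySearch method that accepts a metadata filter, the upserted vectors could be retrieved like this:

```typescript
import { myVectorStore } from "#elements";

export default async function () {
  // NOTE: similaritySearch and its filter shape are hypothetical here;
  // consult the vector store's query documentation for the actual API.
  // The "source-by-babel" metadata field is set automatically for
  // webpage-sourced vectors, as described in the caveats above.
  const results = await myVectorStore.similaritySearch("how do I get started?", {
    filter: { "source-by-babel": "webpage" },
  });
  console.log(results);
}
```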