Web Crawler and Textual Data Vectorization Project

This project implements a web crawler in JavaScript/TypeScript that fetches and extracts data from any provided URL or website, including dynamic content and elements that pages load asynchronously. It also converts the crawled text into vector representations using an appropriate vectorization technique, such as Word Embeddings, and stores them in a Vector DB.

Features

Web Crawling: Implements a web crawling mechanism to fetch and extract data from any provided URL or website.
Dynamic Content Handling: Capable of handling dynamic content or elements loaded asynchronously on web pages.
Textual Data Vectorization: Converts the crawled textual data into vectorized representations using an appropriate technique (e.g., Word Embeddings).
Vector Database: Stores the vectorized data in a Vector DB for efficient retrieval and analysis.
Text Query System: Lets users submit text queries, vectorizes each query with the same technique used for the crawled data, and returns the top 3 most relevant pieces of crawled data.
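
The ranking step behind the text query system can be illustrated with a minimal in-memory sketch. It assumes cosine similarity as the relevance measure and a hypothetical `StoredChunk` shape with `text` and `embedding` fields; neither is confirmed by this README, and the project itself may delegate ranking to the Vector DB instead.

```typescript
// Hypothetical shape of a stored chunk; the field names are assumptions.
interface StoredChunk {
  text: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query vector and keep the k best matches.
function topK(
  queryVector: number[],
  chunks: StoredChunk[],
  k: number = 3
): Array<{ text: string; score: number }> {
  return chunks
    .map((chunk) => ({
      text: chunk.text,
      score: cosineSimilarity(queryVector, chunk.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Calling `topK(queryVector, chunks)` returns at most three entries ordered from most to least similar. A linear scan like this is only practical for small collections; a vector index avoids comparing the query against every stored chunk.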

Project Structure

The project is structured into the following directories:

server: Contains the backend server code.
client: Includes the frontend React application code.

Technologies Used

Express: Backend framework for handling HTTP requests and responses.
Puppeteer: Headless browser library used to extract text from the requested URL.
MongoDB Atlas: Hosted NoSQL database for storing and querying vectorized data.
Gemini API: API used for creating embeddings.
React: Frontend library for building user interfaces.

Note

This project works with a hosted MongoDB Atlas cluster once a vector search index has been added to it.

Installation

Clone the repository: git clone https://github.com/weekash/vectoroid.git
Navigate to the server directory from the repository root: cd server
Install dependencies: npm install
Add a .env file at the root of the server directory containing the variables from .env.example
Start the server: npm start
Navigate to the client directory from the repository root: cd client
Install dependencies: npm install
Add a .env file at the root of the client directory containing the variables from .env.example
Start the client: npm run dev

Setting Up

Get your Gemini API key here: https://aistudio.google.com/app/apikey
Enable vector search in hosted MongoDB Atlas: https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-overview/
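
To make the Atlas setup step more concrete, the sketch below builds an Atlas Vector Search index definition and a matching `$vectorSearch` aggregation pipeline as plain objects. The field names `embedding` and `text`, the index name `vector_index`, and the 768-dimension figure are all assumptions for illustration; use the names from your own collection and the dimension of the embedding model you call.

```typescript
// Index definition in the format Atlas Vector Search expects.
// `path` is the document field holding the embedding (assumed name),
// `numDimensions` must match the embedding model's output size.
function buildVectorIndexDefinition(path: string, numDimensions: number) {
  return {
    fields: [
      {
        type: "vector",
        path,
        numDimensions,
        similarity: "cosine",
      },
    ],
  };
}

// Aggregation pipeline that returns the `limit` nearest documents.
// `index` is the name given to the search index in Atlas (assumed name).
function buildVectorSearchPipeline(
  index: string,
  path: string,
  queryVector: number[],
  limit: number = 3
) {
  return [
    {
      $vectorSearch: {
        index,
        path,
        queryVector,
        // Candidates considered before the final cut; must be >= limit.
        numCandidates: limit * 20,
        limit,
      },
    },
    {
      // Return only the stored text (assumed field) and the similarity score.
      $project: {
        _id: 0,
        text: 1,
        score: { $meta: "vectorSearchScore" },
      },
    },
  ];
}

// Example values only; none of these names come from the project itself.
const exampleIndex = buildVectorIndexDefinition("embedding", 768);
const examplePipeline = buildVectorSearchPipeline(
  "vector_index",
  "embedding",
  new Array(768).fill(0)
);
```

The pipeline array can be passed to a collection's `aggregate` call in the MongoDB Node.js driver. The default `limit` of 3 mirrors the top 3 results this project returns for each query.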

Usage

Access the web application in your browser.
Enter a URL or website to crawl and extract data.
Submit text queries to retrieve the 3 most relevant pieces of crawled data.
