IntelliSearch using Crawler and RAG

IntelliSearch is a web crawler that combines large language models (LLMs) with modern search techniques to deliver precise, context-aware answers. It uses dense vector retrieval (via Qdrant) with RAG Fusion for re-ranking, and it is designed to be extended with techniques such as Late Interaction and token-level refinement.

Technologies

Features

Hybrid LLM Integration:
- Local LLMs: Run directly on your machine for enhanced data privacy and control.
- Paid API LLMs: Utilize cutting-edge models like OpenAI’s GPT-4 for superior performance and real-time capabilities.
Efficient Vector Search with Qdrant:
- Search Results: Google search results are fetched through the SerpAPI free tier.
- Fast & Accurate: Qdrant efficiently stores and retrieves dense embeddings to ensure quick and precise search results at scale.
Advanced Retrieval Techniques:
- RAG Fusion Reranking: Merges multiple search results using reciprocal rank fusion to prioritize the most relevant documents.
- Planned Enhancements: Integrate Late Interaction techniques (e.g., ColBERT-style token-level re-ranking) and hybrid search methods (combining dense embeddings with BM25).
Enterprise-Ready:
- Customizable for Closed Systems: Easily tunable for internal databases and proprietary search systems, in the style of industry-leading apps such as Perplexity.
User-Friendly Interface:
- Gradio UI: A simple, interactive web-based interface for seamless user interactions.

📂 Folder Structure

WebcrawlerRAG/
  ├── components/
  │   ├── chat_logic.py          # Contains the main logic for handling chat interactions and RAG techniques
  │   ├── ranking_modes.py       # Contains functions for different ranking modes like reciprocal rank fusion and unique union
  ├── services/
  │   ├── search_service.py      # Handles document search and loading
  ├── utils/
  │   ├── config.py              # Configuration settings for the project
  ├── app.py                     # Main application file to launch the Gradio UI
  ├── models.properties          # Configuration file listing available models
  ├── requirements.txt           # List of dependencies required for the project
  ├── README.md                  # Project documentation and instructions
  ├── .env                       # Environment variables (e.g., API keys, database URLs)

How It Works

Document Loading & Processing
- The system fetches documents via the search service and splits them into manageable chunks.
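
The splitting step can be sketched as below. This is a minimal illustration, not the repository's implementation: the function name `split_into_chunks` and the `chunk_size` and `overlap` defaults are assumptions, and the actual splitter in `search_service.py` may work differently (for example, by splitting on sentences or tokens).

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap slightly.

    Hypothetical helper: the names and default sizes are illustrative only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("abcdefghij", chunk_size=4, overlap=1):
        print(chunk)
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing a little duplicate text.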
Vector Storage & Retrieval
- Chunks are embedded using a dense embedding model and stored in Qdrant. Retrieval is performed using dense vector search.
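
To show what dense vector retrieval computes, here is a small pure-Python stand-in. It is not Qdrant and does not use the Qdrant client: it assumes the embeddings already exist as plain lists of floats and ranks them by cosine similarity with a brute-force scan. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], stored: dict[str, list[float]], k: int = 3):
    """Return the k stored chunk ids most similar to the query, best first.

    `stored` maps a chunk id to its embedding. This brute-force scan only
    illustrates the idea; a vector database uses an index instead.
    """
    scored = [
        (chunk_id, cosine_similarity(query_vector, vector))
        for chunk_id, vector in stored.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
    stored = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(top_k([1.0, 0.0], stored, k=2))
```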
RAG Fusion Re-ranking
- Multiple search queries are generated, and their results are merged either with reciprocal rank fusion, which prioritizes the most relevant matches, or with unique union for broader search use cases.
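
The two merge modes can be sketched as follows. This is an illustrative version under stated assumptions, not the code in `ranking_modes.py`: the function names are hypothetical, documents are represented by string ids, and the smoothing constant `k=60` is the value commonly used for reciprocal rank fusion rather than one confirmed by this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several rankings: each document scores sum(1 / (k + rank)).

    Ranks start at 1, so documents near the top of many lists score highest.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def unique_union(ranked_lists: list[list[str]]) -> list[str]:
    """Keep every document once, in order of first appearance, with no scoring."""
    seen: set[str] = set()
    merged = []
    for ranking in ranked_lists:
        for doc_id in ranking:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


if __name__ == "__main__":
    # One ranked result list per generated query.
    results = [["a", "b", "c"], ["b", "a", "d"], ["b", "c"]]
    print([doc_id for doc_id, _ in reciprocal_rank_fusion(results)])
    print(unique_union(results))
```

Reciprocal rank fusion rewards documents that several queries agree on, while unique union simply widens coverage without re-ordering by agreement.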
Answer Synthesis
- The retrieved context is fed into an LLM (local or API-based) to generate a final answer in markdown format with links to sources.
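
One way the retrieved context might be packaged for the LLM is sketched below. The prompt wording, the `build_prompt` name, and the `text`/`url` keys are all assumptions made for illustration; the actual prompt in `chat_logic.py` is not shown in this README. The resulting string would then be sent to whichever local or API-based model is configured.

```python
def build_prompt(question: str, sources: list[dict]) -> str:
    """Assemble a prompt asking for a markdown answer that links to its sources.

    Each source is assumed to be a dict with 'text' and 'url' keys.
    """
    context_blocks = []
    for number, source in enumerate(sources, start=1):
        context_blocks.append(f"[{number}] {source['text']}\nURL: {source['url']}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below.\n"
        "Format the answer in markdown and link each claim to its source URL.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


if __name__ == "__main__":
    sources = [
        {"text": "Qdrant stores dense embeddings.", "url": "https://example.com/qdrant"},
        {"text": "RRF merges ranked lists.", "url": "https://example.com/rrf"},
    ]
    print(build_prompt("How are results ranked?", sources))
```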

License

This project is licensed under the MIT License. See the LICENSE file for details.

Feel free to star, fork, and contribute to this project, and share your feedback!

ashwantmanikoth/IntellilSearch
