Welcome to the KIIT LLM Chatbot, an AI-powered system designed to retrieve and generate relevant information from PDFs, CSVs, and DOCs using Retrieval-Augmented Generation (RAG). This project leverages the power of Streamlit for the UI, providing an interactive experience for users to query documents with ease.
- Multi-Source Document Support: Integrates PDFs, CSVs, and DOC files, allowing users to query across multiple file types.
- Advanced RAG Model: Combines retrieval and generation techniques for highly relevant and accurate answers.
- Streamlit Interface: Simple, user-friendly UI for easy interaction with the system.
- Scalable Design: Built to handle a wide variety of document types and queries.
- Customizable LLM Options: Users can choose between Gemini 1.5 Flash, Hugging Face models (via serverless inference endpoint), or Ollama on-device models (currently using Deepseek-R1-1.5b).
- Ensemble Retriever: Combines PDF, CSV, and TXT retrievers using LangChain's EnsembleRetriever with equal weights (0.33 each).
- History-Aware Retrieval: Implements LangChain's create_history_aware_retriever for context-aware querying.
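The EnsembleRetriever merges the result lists from the PDF, CSV, and TXT retrievers using weighted Reciprocal Rank Fusion. A minimal sketch of that fusion step in plain Python (the document IDs and result lists below are hypothetical, purely for illustration):

```python
def weighted_rrf(ranked_lists, weights, k=60):
    """Weighted Reciprocal Rank Fusion: merge ranked result lists from
    several retrievers into a single ranking. Documents appearing near
    the top of multiple lists accumulate the highest score."""
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking):
            # Each retriever contributes weight / (k + rank) per document.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical hits from the PDF, CSV, and TXT retrievers:
pdf_hits = ["doc_a", "doc_b", "doc_c"]
csv_hits = ["doc_b", "doc_d"]
txt_hits = ["doc_b", "doc_a"]
merged = weighted_rrf([pdf_hits, csv_hits, txt_hits], [0.33, 0.33, 0.33])
print(merged[0])  # "doc_b" ranks first: it is near the top of all three lists
```

With equal weights, no single source dominates; documents ranked highly by several retrievers float to the top of the merged list.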
- Python: Core language for the system.
- LangChain: Framework used for managing retrieval and generation.
- Streamlit: Provides the interactive frontend for users.
- FAISS: Vector store for efficient document retrieval.
- Gemini API / Hugging Face Models: To handle the LLM-based query generation.
- PyPDF2, pandas, docx2txt: Used to parse PDFs, CSVs, and DOC files.
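Each file type is routed to the appropriate parsing library by its extension. A minimal dispatch sketch (the mapping is illustrative; the actual loading calls live in the project code):

```python
from pathlib import Path

# Illustrative map from extension to the library used to parse it.
PARSERS = {
    ".pdf": "PyPDF2",
    ".csv": "pandas",
    ".doc": "docx2txt",
    ".docx": "docx2txt",
    ".txt": "built-in text reader",
}

def pick_parser(path: str) -> str:
    """Return the name of the library that should parse this file."""
    ext = Path(path).suffix.lower()
    if ext not in PARSERS:
        raise ValueError(f"Unsupported file type: {ext}")
    return PARSERS[ext]

print(pick_parser("data/notes.PDF"))  # PyPDF2
```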
Make sure you have the following installed:
- Python 3.8+
- Required packages (listed in requirements.txt)
- Clone the repository:

  ```bash
  git clone https://github.com/Manodeepray/kiit-chatbot-llm.git
  cd kiit-chatbot-llm
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Add your documents to the data folder.

- Create the vectorstores:

  ```bash
  python vector_db.py
  ```

- Run the application:

  ```bash
  streamlit run app.py
  ```
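The vector_db.py step builds the FAISS vectorstores from the files in the data folder. Before embedding, documents are normally split into overlapping chunks; here is a minimal sketch of that chunking step (the chunk size and overlap values are illustrative, not the project's actual settings):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50):
    """Split text into overlapping character chunks, the usual
    preprocessing step before embedding into a FAISS index.
    The overlap keeps context that straddles a chunk boundary."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

chunks = chunk_text("x" * 500, chunk_size=200, overlap=50)
print(len(chunks))  # 4 chunks: starts at 0, 150, 300, 450
```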
📂 kiit-llm-rag-chatbot/
│
├── 📁 src/
│   └── 📁 data/
│       ├── example.pdf
│       ├── example.csv
│       └── example.txt
│
├── app.py              # Main Streamlit app
├── main_rag.py         # RAG logic and document processing
├── requirements.txt    # Python package requirements
├── README.md           # Project documentation
└── LICENSE
- Upload Documents: Users can upload PDFs, CSVs, and TXT files to the interface.
- Query the System: After the files are processed, users can input their queries.
- Retrieval & Generation:
- The system uses an Ensemble Retriever to fetch information from PDFs, CSVs, and TXT files.
- Retrieval incorporates history awareness for more contextually relevant results.
- Response Generation: The RAG system retrieves relevant information and generates a coherent, human-like response using the selected LLM.
- Display Answers: Answers are displayed interactively within the Streamlit app.
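Under the hood, the retrieval step reduces to nearest-neighbour search over embedding vectors, which is what the FAISS vectorstore accelerates. A toy illustration of the idea in plain Python (the 3-dimensional "embeddings" below are hand-made for demonstration; real embeddings come from the configured embedding model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical embeddings for three stored document chunks.
index = {
    "meeting notes": [0.9, 0.1, 0.0],
    "budget table": [0.1, 0.9, 0.0],
    "readme text": [0.0, 0.2, 0.9],
}

def retrieve(query_vec, k=1):
    """Return the k chunks whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda d: cosine(query_vec, index[d]), reverse=True)
    return ranked[:k]

print(retrieve([0.8, 0.2, 0.1]))  # ['meeting notes']
```

FAISS performs the same similarity search, but over thousands of high-dimensional vectors with optimized index structures rather than a brute-force sort.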
Once the system is running, upload your documents and enter a query such as:
"What are the key highlights from the meeting notes in the uploaded TXT file?"
The system will return a detailed response based on the contents of the uploaded TXT file.
- Multi-Language Support: Expand capabilities to support document querying in multiple languages.
- Real-Time Document Updating: Allow dynamic updates when documents are added or modified.
- Improved Performance: Further optimize the RAG pipeline for faster responses.
- Implement Advanced RAG Techniques: Add multimodal RAG (MM-RAG) and Cache-Augmented Generation (CAG) to the pipeline.
This project is licensed under the MIT License. See the LICENSE file for more details.
Contributions are welcome! To get started:
- Fork the repository.
- Create your feature branch (git checkout -b feature/AmazingFeature).
- Commit your changes (git commit -m 'Add some AmazingFeature').
- Push to the branch (git push origin feature/AmazingFeature).
- Open a pull request.
For any inquiries or questions, feel free to reach out at:
Manodeep Ray
Email: Manodeep
Happy Querying!
