An advanced AI-powered chatbot that enables users to upload images, ask questions in text or audio, and receive real-time responses in both text and audio formats.
- Image Upload and Analysis: Accepts uploaded images or captures photos directly via webcam.
- Speech-to-Text: Converts spoken questions into text using the Google Speech Recognition API (via the SpeechRecognition library).
- Multimodal Chat Interaction: Integrates visual (image), text, and audio data for comprehensive user interaction.
- Multi-Round Conversations: Retains conversation context across multiple turns for dynamic and coherent interactions (see the sketch after this list).
- Real-Time Responses: Generates answers in real time using the InternVL multimodal model served through an OpenAI-compatible API.
- Text-to-Speech: Reads out chatbot responses using pyttsx3 for enhanced accessibility.
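The multi-turn behaviour can be pictured with a minimal Streamlit session-state sketch; the variable and function names below are illustrative assumptions, not the app's actual code.

```python
import streamlit as st

# Conversation history kept in Streamlit's session state so it survives reruns.
# (Illustrative names only; not the app's actual code.)
if "messages" not in st.session_state:
    st.session_state.messages = []

def add_turn(role: str, content) -> None:
    """Record one chat turn (a user question or an assistant answer)."""
    st.session_state.messages.append({"role": role, "content": content})

# Sending the full st.session_state.messages list with every model request is
# what lets the bot answer follow-up questions with the context of earlier turns.
```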
- Frontend: Streamlit for building an interactive user interface.
- Backend: OpenAI-compatible multimodal API integration for conversational processing, built around the state-of-the-art InternVL vision-language model.
- RunPod: The InternVL model is served for inference through vLLM-powered endpoints hosted on the RunPod platform (see the sketch after this list).
- Cloudinary: For secure image upload and hosting.
- Google Speech Recognition API (via the SpeechRecognition library): Converts spoken input to text.
- pyttsx3: For generating voice responses.
- Pillow (PIL): Processes and displays images.
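To see how these pieces fit together, here is a minimal sketch, assuming the image is uploaded to Cloudinary and the resulting URL is sent to a vLLM endpoint on RunPod through the standard OpenAI client; the endpoint URL, model id, and environment-variable names are placeholders, not the project's actual values.

```python
import os

import cloudinary
import cloudinary.uploader
from openai import OpenAI

# Cloudinary credentials (placeholders; use your own account settings).
cloudinary.config(
    cloud_name=os.environ["CLOUDINARY_CLOUD_NAME"],
    api_key=os.environ["CLOUDINARY_API_KEY"],
    api_secret=os.environ["CLOUDINARY_API_SECRET"],
)

# Upload the local image and get a public URL the model can fetch.
image_url = cloudinary.uploader.upload("photo.jpg")["secure_url"]

# vLLM exposes an OpenAI-compatible API, so the standard OpenAI client works
# once base_url points at the RunPod endpoint (URL and model id are illustrative).
client = OpenAI(
    base_url="https://<your-runpod-endpoint>/v1",
    api_key=os.environ["RUNPOD_API_KEY"],
)

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-8B",  # assumed model id; use whichever InternVL variant you serve
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```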
- Upload or Capture an Image:
  - Upload an image file in `.png`, `.jpg`, or `.jpeg` format.
  - Alternatively, use your webcam to capture a photo directly.
- Ask Your Question:
  - Speak into your microphone or type your question (see the speech sketch after these steps).
  - The bot processes both your question and the image for context.
- Receive Responses:
  - Textual responses appear on-screen.
  - Optionally, listen to the response via text-to-speech by pressing the `Speak` button below the generated response.
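The speech path can be sketched as follows, assuming the SpeechRecognition and pyttsx3 packages are used in the usual way; the function names are illustrative, not the app's actual helpers.

```python
import speech_recognition as sr
import pyttsx3

def listen() -> str:
    """Capture a spoken question from the microphone and transcribe it with
    the free Google Web Speech backend used by SpeechRecognition."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)

def speak(text: str) -> None:
    """Read an answer aloud with the offline pyttsx3 engine."""
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()
```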
- Python 3.8 or higher.
- Install dependencies using `pip` (a sketch of a typical dependency list follows).
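The exact contents and version pins of `requirements.txt` live in the repository; based on the stack listed above, it looks roughly like this (PyAudio is an assumption, added because SpeechRecognition needs it for microphone input):

```
streamlit
openai
cloudinary
SpeechRecognition
PyAudio        # microphone support for SpeechRecognition (assumption)
pyttsx3
Pillow
```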
- Clone the repository and enter its directory:
  - `git clone https://github.com/fork123aniket/VLM-powered-Multimodal-Conversational-AI-Bot.git`
  - `cd VLM-powered-Multimodal-Conversational-AI-Bot`
- Install dependencies: `pip install -r requirements.txt`
- Set up API Keys and Endpoint: Replace the placeholders in the script for OpenAI, the RunPod endpoint, and Cloudinary (see the configuration sketch after these steps).
- Run the application: `streamlit run multimodal_chatbot.py`
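As a rough guide to the placeholder step, the values to fill in look like the following; the variable names are illustrative and may not match those used in `multimodal_chatbot.py`.

```python
# Illustrative placeholders; the actual variable names in multimodal_chatbot.py may differ.
OPENAI_API_KEY = "<your-openai-or-runpod-api-key>"         # key passed to the OpenAI client
RUNPOD_ENDPOINT_URL = "https://<your-runpod-endpoint>/v1"  # vLLM OpenAI-compatible endpoint on RunPod
CLOUDINARY_CLOUD_NAME = "<your-cloudinary-cloud-name>"
CLOUDINARY_API_KEY = "<your-cloudinary-api-key>"
CLOUDINARY_API_SECRET = "<your-cloudinary-api-secret>"
```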
- Launch the Web App: Open the provided local server link in your browser.
- Interact: Upload or capture an image, then speak or type your question by pressing the `Ask Me Anything!` button.
- Explore Responses: Read or listen to the chatbot's answers by pressing the `Submit` button.
├── multimodal_chatbot.py # Main application script
├── requirements.txt # Dependencies
└── README.md # Project documentation