An advanced AI-powered chatbot that enables users to upload images, ask questions in text or audio, and receive real-time responses in both text and audio formats.
- Image Upload and Analysis: Accepts uploaded images or captures photos directly via webcam.
- Speech-to-Text: Converts spoken questions into text using the Google Speech Recognition API (via the SpeechRecognition library).
- Multimodal Chat Interaction: Integrates visual (image), text, and audio data for comprehensive user interaction.
- Multi-Round Conversations: Retains conversation context across multiple turns for dynamic and coherent interactions (see the sketch after this list).
- Real-Time Responses: Generates answers in real time using the InternVL multimodal model served through an OpenAI-compatible API.
- Text-to-Speech: Reads out chatbot responses using pyttsx3 for enhanced accessibility.
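The multi-turn behaviour can be pictured with a minimal Streamlit session-state sketch; the variable and function names below are illustrative assumptions, not the app's actual code.

```python
import streamlit as st

# Conversation history kept in Streamlit's session state so it survives reruns.
# (Illustrative names only; not the app's actual code.)
if "messages" not in st.session_state:
    st.session_state.messages = []

def add_turn(role: str, content) -> None:
    """Record one chat turn (a user question or an assistant answer)."""
    st.session_state.messages.append({"role": role, "content": content})

# Sending the full st.session_state.messages list with every model request is
# what lets the bot answer follow-up questions with the context of earlier turns.
```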
- Frontend: Streamlit for building an interactive user interface.
- Backend: OpenAI-compatible multimodal API integration for conversational processing, built around the state-of-the-art InternVL vision-language model.
- RunPod: The InternVL model is served for inference through vLLM-powered endpoints hosted on the RunPod platform (see the sketch after this list).
- Cloudinary: For secure image upload and hosting.
- Google Speech Recognition API (via the SpeechRecognition library): Converts spoken input to text.
- pyttsx3: For generating voice responses.
- Pillow (PIL): Processes and displays images.
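To see how these pieces fit together, here is a minimal sketch, assuming the image is uploaded to Cloudinary and the resulting URL is sent to a vLLM endpoint on RunPod through the standard OpenAI client; the endpoint URL, model id, and environment-variable names are placeholders, not the project's actual values.

```python
import os

import cloudinary
import cloudinary.uploader
from openai import OpenAI

# Cloudinary credentials (placeholders; use your own account settings).
cloudinary.config(
    cloud_name=os.environ["CLOUDINARY_CLOUD_NAME"],
    api_key=os.environ["CLOUDINARY_API_KEY"],
    api_secret=os.environ["CLOUDINARY_API_SECRET"],
)

# Upload the local image and get a public URL the model can fetch.
image_url = cloudinary.uploader.upload("photo.jpg")["secure_url"]

# vLLM exposes an OpenAI-compatible API, so the standard OpenAI client works
# once base_url points at the RunPod endpoint (URL and model id are illustrative).
client = OpenAI(
    base_url="https://<your-runpod-endpoint>/v1",
    api_key=os.environ["RUNPOD_API_KEY"],
)

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-8B",  # assumed model id; use whichever InternVL variant you serve
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```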
- Upload or Capture an Image:
  - Upload an image file in `.png`, `.jpg`, or `.jpeg` format.
  - Alternatively, use your webcam to capture a photo directly.
- Ask Your Question:
  - Speak into your microphone or type your question (see the speech sketch after these steps).
  - The bot processes both your question and the image for context.
- Receive Responses:
  - Textual responses appear on-screen.
  - Optionally, listen to the response via text-to-speech by pressing the `Speak` button below the generated response.
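The speech path can be sketched as follows, assuming the SpeechRecognition and pyttsx3 packages are used in the usual way; the function names are illustrative, not the app's actual helpers.

```python
import speech_recognition as sr
import pyttsx3

def listen() -> str:
    """Capture a spoken question from the microphone and transcribe it with
    the free Google Web Speech backend used by SpeechRecognition."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)

def speak(text: str) -> None:
    """Read an answer aloud with the offline pyttsx3 engine."""
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()
```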
- Python 3.8 or higher.
- Install dependencies using `pip` (a sketch of a typical dependency list follows).
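The exact contents and version pins of `requirements.txt` live in the repository; based on the stack listed above, it looks roughly like this (PyAudio is an assumption, added because SpeechRecognition needs it for microphone input):

```
streamlit
openai
cloudinary
SpeechRecognition
PyAudio        # microphone support for SpeechRecognition (assumption)
pyttsx3
Pillow
```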
- Clone the repository and enter its directory:
  - `git clone https://github.com/fork123aniket/VLM-powered-Multimodal-Conversational-AI-Bot.git`
  - `cd VLM-powered-Multimodal-Conversational-AI-Bot`
- Install dependencies: `pip install -r requirements.txt`
- Set up API Keys and Endpoint: Replace the placeholders in the script for OpenAI, the RunPod endpoint, and Cloudinary (see the configuration sketch after these steps).
- Run the application: `streamlit run multimodal_chatbot.py`
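As a rough guide to the placeholder step, the values to fill in look like the following; the variable names are illustrative and may not match those used in `multimodal_chatbot.py`.

```python
# Illustrative placeholders; the actual variable names in multimodal_chatbot.py may differ.
OPENAI_API_KEY = "<your-openai-or-runpod-api-key>"         # key passed to the OpenAI client
RUNPOD_ENDPOINT_URL = "https://<your-runpod-endpoint>/v1"  # vLLM OpenAI-compatible endpoint on RunPod
CLOUDINARY_CLOUD_NAME = "<your-cloudinary-cloud-name>"
CLOUDINARY_API_KEY = "<your-cloudinary-api-key>"
CLOUDINARY_API_SECRET = "<your-cloudinary-api-secret>"
```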
- Launch the Web App: Open the provided local server link in your browser.
- Interact: Upload or capture an image, then speak or type your question by pressing the `Ask Me Anything!` button.
- Explore Responses: Read or listen to the chatbot's answers by pressing the `Submit` button.
├── multimodal_chatbot.py # Main application script
├── requirements.txt # Dependencies
└── README.md # Project documentation