A production-ready multimodal AI copilot that blends Retrieval-Augmented Generation (RAG), NVIDIA NIM-hosted large language models, and real-time voice + vision understanding to answer questions across text, code, audio, and imagery.
- Enterprise-grade RAG: Combines NVIDIA `ai-embed-qa-4` embeddings with LangChain memory to ground answers in private knowledge bases.
- Multimodal mastery: Seamlessly routes prompts to Meta Llama 3 (text), IBM Granite (code), Microsoft Phi-3 Vision (image reasoning), and Whisper ASR (speech-to-text).
- Voice-first UX: Streamlit chat with live transcription, hallucination-safe streaming responses, and persistent conversation history.
- Extensible toolchain: Plug in new assistants, tools, or retrieval sources without rewriting the UI.
- Context-aware chat powered by Meta Llama 3 70B.
- Developer-first code pairer with IBM Granite 34B Code Instruct.
- Visual question answering & OCR-style reasoning via Phi-3 Vision.
- Studio-quality speech transcription courtesy of Whisper/NVIDIA ASR.
- Long-term memory using LangChain `ConversationBufferMemory`.
- RAG on your documents with a persistent `vectorstore.pkl` or custom embeddings.
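The media-based routing described above can be sketched as a small dispatcher. This is an illustrative, stdlib-only sketch — the `Turn` dataclass, rule order, and assistant names are assumptions, not the repo's actual `AssistantRouter` API:

```python
# Minimal routing sketch (illustrative; not the repo's AssistantRouter).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Turn:
    """One user turn: text plus optional attached media."""
    text: str = ""
    image: Optional[bytes] = None
    audio: Optional[bytes] = None

def route(turn: Turn) -> str:
    """Pick an assistant based on attached media, then text content."""
    if turn.audio is not None:
        return "whisper_asr"      # transcribe first, then re-route the text
    if turn.image is not None:
        return "phi3_vision"      # captioning / OCR / visual reasoning
    if "```" in turn.text or "def " in turn.text:
        return "granite_code"     # code-looking input goes to Granite
    return "llama3_text"          # default conversational model
```

In practice the heuristics for detecting code would be richer (language detection, file extensions), but the shape — check media first, fall back to text classification — matches the pipeline described above.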
- Interface: Streamlit, PIL
- Orchestration: LangChain, custom Assistant Router
- Models via NVIDIA NIM: `meta/llama3-70b-instruct`, `ibm/granite-34b-code-instruct`, `microsoft/phi-3-vision-128k-instruct`, Whisper/NeMo ASR
- Embeddings: `ai-embed-qa-4` (passage + query dual encoders)
- Persistence: `vectorstore.pkl`, `ConversationBufferMemory`, local `.wav` captures
- Python ≥ 3.9
- NVIDIA API access (NIM endpoints) and OpenAI-compatible Whisper key
- `ffmpeg` installed for audio capture (optional but recommended)
```bash
git clone https://github.com/skyroom07/Multi-LLM-Voice-Agent-with-RAG.git
cd Multi-LLM-Voice-Agent-with-RAG
python -m venv .venv
.\.venv\Scripts\activate   # On macOS/Linux: source .venv/bin/activate
pip install -r requirements.txt
```

Create a `.env` file in the project root:
```env
NVIDIA_API_KEY=nvapi_xxx
OPENAI_API_KEY=sk-xxx        # Needed for Whisper fallback
WHISPER_MODEL=base
EMBEDDINGS_MODEL=ai-embed-qa-4
```

💡 Add any custom retriever credentials (S3, Elastic, etc.) in the same `.env`.
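A minimal way to validate these variables at startup is to fail fast on the required key and fall back to defaults for the rest. This is a stdlib-only sketch (the repo may load `.env` via python-dotenv instead); the function name and defaults mirror the sample `.env` above:

```python
import os

# NVIDIA_API_KEY is required; the rest have sensible defaults
# matching the sample .env (illustrative helper, not the repo's code).
REQUIRED = ["NVIDIA_API_KEY"]
DEFAULTS = {"WHISPER_MODEL": "base", "EMBEDDINGS_MODEL": "ai-embed-qa-4"}

def load_config(env=os.environ) -> dict:
    """Collect required and defaulted settings, raising on missing keys."""
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {missing}")
    cfg = {k: env[k] for k in REQUIRED}
    for key, default in DEFAULTS.items():
        cfg[key] = env.get(key, default)
    return cfg
```

Failing at import time with a clear message beats a cryptic 401 from the NIM endpoint halfway through the first chat turn.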
```bash
streamlit run app.py
```

Open the local URL (default `http://localhost:8501`) and start chatting, uploading vision files, or recording audio.
```
.
├── agent/                    # Future agent tools & notebooks
├── chains/                   # Assistant router, memory, and model wrappers
│   ├── models/               # Whisper / NeMo ASR interfaces
│   ├── language_assistant.py # Llama 3 text assistant
│   ├── code_assistant.py     # Granite code assistant
│   └── vision_assistant.py   # Phi-3 Vision assistant
├── utils/                    # Helper utilities (image, routing, etc.)
├── test/                     # Reference screenshots & assets
├── app.py                    # Streamlit entry point
├── requirements.txt          # Python dependencies
├── vectorstore.pkl           # Sample persisted embeddings
└── recorded_audio.wav        # Latest captured utterance
```
- Embed uploaded docs with `ai-embed-qa-4` (see `chains/embedding_models.py`).
- Persist vectors to disk (`vectorstore.pkl`, `.npy`, or your DB of choice).
- Route each user turn through `AssistantRouter`, which inspects the input and attached media.
- Ground the response using the central LangChain memory + nearest vector hits.
- Stream answers back to the Streamlit UI with inline code blocks, citations, or structured JSON.

Need enterprise storage? Swap `vectorstore.pkl` for Pinecone, Milvus, or pgvector by extending `EmbeddingModels`.
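The embed → persist → retrieve steps above boil down to nearest-neighbour search over stored vectors. A stdlib-only sketch of the pickle persistence and cosine-similarity lookup — the function names and the dict-of-vectors store layout are illustrative assumptions, not the actual `EmbeddingModels` internals:

```python
import math
import pickle

def save_store(store: dict, path: str = "vectorstore.pkl") -> None:
    """Persist {doc_id: embedding} to disk, pickle-style."""
    with open(path, "wb") as f:
        pickle.dump(store, f)

def load_store(path: str = "vectorstore.pkl") -> dict:
    """Reload the persisted {doc_id: embedding} mapping."""
    with open(path, "rb") as f:
        return pickle.load(f)

def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query_vec, store: dict, k: int = 3) -> list:
    """Return the k doc ids most similar to the query embedding."""
    scored = sorted(store.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```

Swapping in Pinecone, Milvus, or pgvector mostly means replacing `save_store`/`load_store`/`nearest` with the vendor's upsert and query calls; the embed-then-rank flow stays the same.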
- Text: Type into the chat box and watch Llama 3 stream responses with traceable reasoning.
- Code: Paste snippets or debugging logs; Granite returns fixes, tests, and refactors with syntax-highlighted blocks.
- Vision: Upload PNG/JPG files; Phi-3 Vision handles captioning, OCR-style extraction, or multimodal reasoning.
- Voice: Click “Record and Transcribe Audio”; Whisper converts to text and routes to the best assistant automatically.
- Memory: Prior turns persist per-session via `ConversationBufferMemory`, so follow-ups stay contextual.
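The per-session memory behaviour is easy to picture without pulling in LangChain. This tiny buffer is an illustrative stand-in that mirrors what `ConversationBufferMemory` does conceptually — accumulate turns and replay them as prompt context — not the library's actual API:

```python
class BufferMemory:
    """Minimal stand-in for LangChain's ConversationBufferMemory:
    append every (human, ai) exchange and replay it as context."""

    def __init__(self):
        self.turns = []  # list of (role, text) pairs, oldest first

    def save_context(self, user_input: str, ai_output: str) -> None:
        self.turns.append(("human", user_input))
        self.turns.append(("ai", ai_output))

    def as_prompt(self) -> str:
        """Render the full history for inclusion in the next prompt."""
        return "\n".join(f"{role}: {text}" for role, text in self.turns)
```

Because the whole buffer is replayed each turn, long sessions eventually hit the model's context window — one reason the roadmap below mentions conversation summarization.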
- Use `streamlit run app.py --server.headless true` for remote servers.
- Use `st.secrets` in Streamlit Cloud; locally, prefer `.env`.
- Capture troubleshooting logs by toggling `verbose=True` in LangChain chains.
- Regenerate embeddings whenever your knowledge base changes:

```python
from chains.embedding_models import EmbeddingModels

emb = EmbeddingModels()
emb.save_embedding("my_doc", emb.embed_documents(["content here"]))
```
- Plug-and-play tool executor (SQL, browser, automation hooks)
- GPU-accelerated on-device Whisper + NeMo fallback
- Conversation summarization + CRM handoff webhooks
- Native deployment template for Streamlit Community Cloud + NVIDIA Inference Microservices
- Fork the repo & create a feature branch.
- Run `ruff`/`black` (or your formatter of choice) before committing.
- Submit a PR describing the use case and any new environment variables.
Let’s build safer, smarter multimodal copilots together. 🔊🖼️💻

