This repository contains the core orchestration and microservices for an interactive, AI-powered holographic retail assistant. The system utilizes a distributed microservice architecture, integrating large language models, retrieval-augmented generation, dynamic gesture control, speech processing, and a 3D React-based avatar.
A video demonstration of the Intelligent Holographic AI system in action will be uploaded soon!
While the foundational architecture builds upon established research, this project introduces system-level optimizations to satisfy the latency, accuracy, and responsiveness constraints of a real-time retail deployment:
- Length-Aware Reranking: The cross-encoder reranking stage was optimized by introducing length-aware document arrangement prior to inference. This design minimizes padding inefficiencies, reducing overall inference latency while preserving retrieval quality. Performance was benchmarked against MS MARCO and custom retail datasets, maintaining strong Mean Reciprocal Rank (MRR) and Hit Rate metrics.
- Instruction-Tuned Semantic Routing: Traditional precomputed query matching was replaced with a dynamic, instruction-tuned semantic routing mechanism. Incoming queries are encoded using a task-specific instruction function Φ with an instruction prefix (I_task) and compared directly against raw document embeddings. Evaluation on retail datasets showed measurable improvements in macro recall, F1 score, and precision, enabling more adaptive and context-aware retrieval.
- Real-Time Boxgate Logic: The baseline gesture capture pipeline was re-architected from a manual, keyboard-triggered termination model to a fully automated, continuous inference loop using custom boxgate logic. This enables real-time segmentation without user intervention.
- Performance Optimization: By eliminating manual termination overhead, the system achieves higher gesture segmentation purity and lower latency variance, resulting in smoother interaction and improved perceptual continuity for the holographic avatar.
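The padding argument behind length-aware reranking can be made concrete. The sketch below is illustrative only (the function names and the dummy scorer are not part of this repository): it groups (query, document) pairs by document length so each cross-encoder batch is padded only to its own longest member rather than the global maximum, and maps the scores back to the caller's ordering.

```python
def length_aware_batches(pairs, batch_size):
    """Group (query, doc) pairs into batches of similar document length.

    Sorting by length first means each batch is padded only to its own
    longest member, cutting wasted cross-encoder computation. Returns
    each batch together with the original indices of its pairs.
    """
    order = sorted(range(len(pairs)), key=lambda i: len(pairs[i][1].split()))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batches.append((idx, [pairs[i] for i in idx]))
    return batches

def rerank(pairs, score_batch, batch_size=8):
    """Score all pairs batch-by-batch, then restore the original order."""
    scores = [0.0] * len(pairs)
    for idx, batch in length_aware_batches(pairs, batch_size):
        for i, s in zip(idx, score_batch(batch)):
            scores[i] = s
    return scores
```

In a real pipeline, `score_batch` would wrap the actual cross-encoder forward pass; because the returned scores are re-mapped by index, the length-aware ordering is invisible to downstream ranking code, which is why retrieval quality is preserved.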
📊 Detailed Evaluation & Metrics: For a comprehensive breakdown of the empirical data supporting these improvements (MS MARCO benchmarks, retail dataset F1/precision scores, and latency tests), please refer to the `experiment_metric.md` file (coming soon).
The project is divided into specialized directories. Each acts as an independent microservice with its own virtual environment and dependencies, all communicating with the central main_orchestrator.py.
- `Chatbot_Phi2/`: Core LLM engine directory. Contains code for fine-tuning and real-time inference, running as an independent `main.py` microservice.
- `Gesture_System/`: Dynamic hand gesture control system utilizing ResNet. Handles both model training and real-time vision inference via its own `main.py`.
- `RAG/`: Retrieval-Augmented Generation pipeline using ChromaDB for contextual memory and knowledge retrieval.
- `STT/`: Speech-to-Text voice transcription layer powered by OpenAI Whisper.
- `TTS/`: Text-to-Speech voice generation layer using Coqui TTS.
- `react_avatar/`: Frontend 3D avatar rendering layer built with React.
- `mediamtx/`: Contains the configuration files for real-time media routing and streaming.
Before running the system, several external binaries and large model assets must be downloaded.
Download the following tools and place them in the root directory (or respective folder):
- FFmpeg: Required for audio/video processing. Download the latest `ffmpeg-master-latest-win64-gpl-shared.zip` asset from https://github.com/BtbN/FFmpeg-Builds/releases, then extract it to the root `ffmpeg/` directory.
- Rhubarb Lip Sync: Required for avatar lip-sync generation. Download the `Rhubarb-Lip-Sync-1.14.0-Windows.zip` asset from https://github.com/DanielSWolf/rhubarb-lip-sync/releases/tag/v1.14.0, then extract it to the root `rhubarb/` directory.
- MediaMTX: Required for media streaming. Download the `mediamtx_v1.16.1_windows_amd64.zip` binary from https://github.com/bluenviron/mediamtx/releases/tag/v1.16.1, then place it inside the `mediamtx/` directory alongside the configuration files.
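Missing binaries tend to surface only as opaque runtime errors, so a small preflight check before launch can help. This is a hypothetical helper, not part of the repository; the folder names simply follow the layout described above.

```python
from pathlib import Path

# Assumed layout: each external binary is extracted into its own
# root-level folder, as described in the download steps above.
REQUIRED = {
    "FFmpeg": Path("ffmpeg"),
    "Rhubarb Lip Sync": Path("rhubarb"),
    "MediaMTX": Path("mediamtx"),
}

def check_assets(root: str = ".") -> list[str]:
    """Return human-readable problems; an empty list means all assets exist."""
    problems = []
    for name, rel in REQUIRED.items():
        path = Path(root) / rel
        if not path.is_dir():
            problems.append(f"{name}: expected directory '{path}' is missing")
    return problems
```

Running `check_assets()` from the repository root and printing the result gives an immediate list of anything still to be downloaded.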
Due to file size limits, datasets, fine-tuned models, and heavy 3D assets are hosted externally on Hugging Face: [INSERT_HUGGINGFACE_PROFILE_LINK]
Please download and place the following assets into their respective directories:
- `Chatbot_Phi2/`: Download the specific datasets and model weights.
- `Gesture_System/`: Download the ResNet training datasets and inference models.
- `react_avatar/`: Download the `public/` directory containing the rendered 3D avatar files and place it inside the frontend folder.
Because this project uses a microservice architecture, each Python directory requires its own separate virtual environment.
For each of the following directories (Chatbot_Phi2, Gesture_System, RAG, STT, TTS), navigate into the folder, create a virtual environment, and install its specific dependencies:
```sh
cd [Directory_Name]
python -m venv venv
# Activate the venv (Windows):
venv\Scripts\activate
# OR activate the venv (Mac/Linux):
source venv/bin/activate
pip install -r requirements.txt
deactivate
cd ..
```
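Repeating those steps for all five services can also be scripted. The snippet below is an optional, illustrative helper, not a script shipped with the repository; it assumes each service folder contains a `requirements.txt`, as described above.

```python
import subprocess
import sys
from pathlib import Path

SERVICES = ["Chatbot_Phi2", "Gesture_System", "RAG", "STT", "TTS"]

def bootstrap(service_dir: Path) -> None:
    """Create a venv inside service_dir and install its requirements."""
    venv_dir = service_dir / "venv"
    subprocess.run([sys.executable, "-m", "venv", str(venv_dir)], check=True)
    # Calling the venv's own interpreter installs into that venv without
    # needing to activate it, on both Windows and Mac/Linux.
    py = venv_dir / ("Scripts/python.exe" if sys.platform == "win32" else "bin/python")
    subprocess.run([str(py), "-m", "pip", "install", "-r",
                    str(service_dir / "requirements.txt")], check=True)

# Usage (from the repository root):
#     for name in SERVICES:
#         bootstrap(Path(name))
```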
Navigate to the frontend directory and install the Node packages:
```sh
cd react_avatar
npm install
cd ..
```
Finally, set up the root environment that ties everything together:
```sh
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
```
The entire microservice architecture is fully automated through the central orchestrator. You do not need to manually start each individual component.
To launch the complete Intelligent Holographic AI system:
- Open your terminal in the root directory.
- Ensure your root virtual environment is activated.
- Run the orchestrator:
```sh
python main_orchestrator.py
```
(Note: `dummy_gesture_control.py` and `dummy_no_mic.py` are provided at the root level for testing isolated orchestrator components without the full hardware requirements.)
This project builds upon and significantly modifies concepts from the following academic research:
- RAG & LLM Architecture: The foundational retrieval-augmented generation structure was inspired by TeleOracle: Fine-Tuned Retrieval-Augmented Generation With Long-Context Support for Networks (Alabbasi et al., IEEE Internet of Things Journal, 2025). In this repository, the architecture has been uniquely adapted and improved to support real-time retail microservices using Microsoft Phi-2 and ChromaDB.
- Dynamic Gesture System: The core vision methodology is based on Skeleton-Based Real-Time Hand Gesture Recognition Using Data Fusion and Ensemble Multi-Stream CNN Architecture (Habib, Yusuf, & Moustafa, MDPI Technologies, 2025). The system has been modified and fine-tuned for specialized, real-time interactive avatar control using ResNet.