A real-time voice chatbot that handles multi-turn conversations using ASR (Whisper), LLM (LLaMA 3), and TTS (BentoTTS).

## Features

- Speech-to-text transcription using OpenAI Whisper
- Natural language understanding and response generation using LLaMA 3
- Text-to-speech synthesis using BentoTTS
- 5-turn conversation memory
- RESTful API built with FastAPI

## Requirements

- Python 3.8 or higher
- CUDA-compatible GPU (recommended, but a CPU fallback is available)
- 16 GB+ RAM (for running the LLM models)
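To sanity-check these requirements before installing, a small helper along the following lines can be used. This is illustrative only, not part of the project; the `torch` import is optional and only used here to detect a CUDA GPU:

```python
import sys

def check_environment():
    """Report whether the basic requirements above are met."""
    python_ok = sys.version_info >= (3, 8)
    try:
        import torch  # optional: only used to detect a CUDA-compatible GPU
        cuda_gpu = torch.cuda.is_available()
    except ImportError:
        cuda_gpu = False  # torch not installed yet; the CPU fallback still works
    return {"python_ok": python_ok, "cuda_gpu": cuda_gpu}

print(check_environment())
```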
## Installation

1. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up the BentoTTS server. Follow the instructions in the BentoXTTS repository to set up and run the TTS server:

   ```bash
   # Clone the BentoXTTS repository
   git clone https://github.com/bentoml/BentoXTTS.git
   cd BentoXTTS

   # Install dependencies and run the service
   pip install -r requirements.txt
   bentoml serve service:XTTS
   ```

   The BentoTTS server should now be running on http://localhost:3000.
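Before launching the chatbot, it can be useful to confirm that the TTS server is actually listening. The following stdlib-only probe is a sketch (the URL is the default from above):

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def tts_server_reachable(url="http://localhost:3000", timeout=2.0):
    """Return True if an HTTP server answers at the given URL."""
    try:
        urlopen(url, timeout=timeout)
        return True
    except HTTPError:
        return True  # the server answered, just not with a 2xx status
    except (URLError, OSError):
        return False  # connection refused, timeout, or DNS failure

print(tts_server_reachable())
```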
## Running the Server

```bash
uvicorn main:app --reload
```

The server will start on http://localhost:8000.
## API Endpoints

### POST /chat/

Main endpoint for voice interaction. Upload an audio file and receive a voice response.

Example using curl:

```bash
curl -X POST "http://localhost:8000/chat/" \
  -H "accept: audio/wav" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@input.wav" \
  --output response.wav
```

Example using Python:
```python
import requests

with open("input.wav", "rb") as audio_file:
    response = requests.post(
        "http://localhost:8000/chat/",
        files={"file": audio_file},
    )

with open("response.wav", "wb") as f:
    f.write(response.content)
```

### GET /conversation

View the current conversation history and turn count.
```bash
curl http://localhost:8000/conversation
```

### POST /reset

Reset the conversation history.

```bash
curl -X POST http://localhost:8000/reset
```

### GET /health

Check system health and model status.

```bash
curl http://localhost:8000/health
```
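For convenience, the three endpoints above can be wrapped in a small client class. This is a hypothetical helper, not code shipped with the repository; it uses the same `requests` library as the example above:

```python
import requests

class VoiceChatClient:
    """Thin wrapper around the voice chatbot's HTTP API (sketch)."""

    def __init__(self, base_url="http://localhost:8000"):
        self.base_url = base_url.rstrip("/")

    def chat(self, audio_path, out_path="response.wav"):
        """Send an audio file to /chat/ and save the synthesized reply."""
        with open(audio_path, "rb") as f:
            resp = requests.post(f"{self.base_url}/chat/", files={"file": f})
        resp.raise_for_status()
        with open(out_path, "wb") as out:
            out.write(resp.content)
        return out_path

    def conversation(self):
        """Return the current conversation history and turn count."""
        return requests.get(f"{self.base_url}/conversation").json()

    def reset(self):
        """Clear the conversation history on the server."""
        requests.post(f"{self.base_url}/reset").raise_for_status()
```

With the server running, `VoiceChatClient().chat("input.wav")` performs a full turn and writes the reply to `response.wav`.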
## Project Structure

```
├── main.py                  # Main FastAPI application
├── requirements.txt         # Python dependencies
├── README.md                # This file
├── temp_files/              # Temporary audio files (auto-created)
└── Class 3 Homework.ipynb   # Assignment instructions
```
## How It Works

1. Audio upload: the client sends an audio file via a POST request
2. ASR (Automatic Speech Recognition): Whisper transcribes the audio to text
3. LLM processing: LLaMA 3 generates a contextual response based on the conversation history
4. TTS (Text-to-Speech): BentoTTS converts the response to audio
5. Audio response: the server returns the synthesized speech as a WAV file
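The five steps can be sketched as a single function, with the models passed in as callables. This illustrates the data flow only, and is not the actual code in main.py:

```python
def run_pipeline(audio_bytes, transcribe, generate_reply, synthesize, history):
    """One chat turn: audio in, synthesized audio out."""
    user_text = transcribe(audio_bytes)              # step 2: ASR (Whisper)
    reply_text = generate_reply(user_text, history)  # step 3: LLM (LLaMA 3)
    history.append((user_text, reply_text))          # remember this turn
    return synthesize(reply_text)                    # step 4: TTS (BentoTTS)

# Toy stand-ins so the sketch runs without the real models:
history = []
wav = run_pipeline(
    b"fake-audio",
    transcribe=lambda audio: "hello",
    generate_reply=lambda text, hist: f"you said: {text}",
    synthesize=lambda text: text.encode(),
    history=history,
)
print(wav)  # b'you said: hello'
```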
## Configuration

Key configuration variables in main.py:

- `MAX_CONVERSATION_TURNS`: number of conversation turns to remember (default: 5)
- `TEMP_DIR`: directory for temporary audio files
- Whisper model size: `"small"` (can be changed to `"base"`, `"medium"`, or `"large"`)
- LLM model: `"meta-llama/Llama-3.2-3B-Instruct"` (falls back to the 1B model if memory is limited)
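One simple way to implement a fixed-size turn memory like `MAX_CONVERSATION_TURNS` is a bounded deque. The class below is a sketch under that assumption, not the implementation in main.py:

```python
from collections import deque

MAX_CONVERSATION_TURNS = 5  # matches the default described above

class ConversationMemory:
    """Keep only the most recent conversation turns."""

    def __init__(self, max_turns=MAX_CONVERSATION_TURNS):
        # Each turn is a (user_text, assistant_text) pair; the deque
        # silently drops the oldest turn once max_turns is exceeded.
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, user_text, assistant_text):
        self.turns.append((user_text, assistant_text))

    def as_prompt_history(self):
        """Flatten the turns into role-tagged messages for the LLM."""
        messages = []
        for user_text, assistant_text in self.turns:
            messages.append({"role": "user", "content": user_text})
            messages.append({"role": "assistant", "content": assistant_text})
        return messages

memory = ConversationMemory()
for i in range(7):
    memory.add_turn(f"question {i}", f"answer {i}")
print(len(memory.turns))  # 5: the two oldest turns were dropped
```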
## Troubleshooting

- GPU memory issues: the code automatically falls back to a smaller model (Llama-3.2-1B) or to CPU mode.
- TTS errors: make sure the BentoTTS server is running on http://localhost:3000 before starting the voice assistant.
- Audio format: ensure your input audio is in a compatible format (WAV, MP3, etc.). Whisper supports most common audio formats.
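If uploads fail with format errors, a quick check with the stdlib `wave` module can tell WAV data apart from everything else. This helper is a sketch for debugging, not project code (and Whisper itself accepts many more formats than WAV):

```python
import io
import wave

def is_valid_wav(data: bytes) -> bool:
    """Return True if the given bytes parse as a WAV file."""
    try:
        with wave.open(io.BytesIO(data)) as w:
            return w.getnframes() >= 0
    except (wave.Error, EOFError):
        return False
```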
## Testing

You can test the API using:

- `curl` commands (see the examples above)
- Postman or similar API testing tools
- The Swagger UI at http://localhost:8000/docs
- A custom frontend application
## License

See the LICENSE file for details.