# ConversaVoice: AI-Powered Voice Assistant with Emotional Intelligence
Features • Quick Start • Architecture • API Usage • Tech Stack
ConversaVoice is a context-aware voice assistant that understands emotions and responds with appropriate tone and empathy. It combines cutting-edge AI technologies to create natural, emotionally intelligent conversations.
🎤 You speak → 🧠 AI understands → 💬 Smart response → 🔊 Natural voice
| Traditional Assistants | ConversaVoice |
|---|---|
| Monotone responses | Emotional, expressive speech |
| Forgets context | Remembers conversation history |
| Generic replies | Personalized, context-aware |
| Robotic voice | Natural human-like tone |
## Features

- Speech-to-Text: Groq Whisper API for fast, accurate transcription
- Text-to-Speech: Azure Neural TTS with emotional expressiveness
- Real-time: Low-latency streaming pipeline
- LLM-Powered: Groq API with Llama 3.3 70B for smart replies
- Context-Aware: Remembers conversation history
- Emotion Detection: Adapts tone based on user sentiment
- Sentiment Analysis: Detects frustration, happiness, confusion
- Adaptive Prosody: Changes pitch, rate, and tone dynamically
- Empathetic Responses: De-escalation when user is frustrated
- Redis-Backed: Persistent session storage
- Repetition Detection: Knows when user repeats themselves
- Preference Tracking: Remembers user preferences
- 30+ Emotion Styles: Cheerful, empathetic, calm, excited...
- Word Emphasis: Stress important words naturally
- Prosody Control: Fine-tune pitch, rate, and volume
- Fallback System: Auto-switch to local models if cloud fails
- Ollama Backup: Local LLM fallback
- Piper TTS Backup: Local voice synthesis
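The reliability features boil down to a try-cloud-then-local pattern; here is a minimal sketch of that idea (the function and client names are hypothetical stand-ins, not the project's actual `fallback.py`):

```python
import asyncio


async def generate_with_fallback(prompt: str) -> str:
    """Try the cloud LLM first; fall back to a local model on failure.

    `call_groq` / `call_ollama` are hypothetical stand-ins for the
    project's real clients.
    """
    try:
        return await asyncio.wait_for(call_groq(prompt), timeout=5.0)
    except (asyncio.TimeoutError, ConnectionError):
        return await call_ollama(prompt)


# Stand-in clients so the sketch runs on its own
async def call_groq(prompt: str) -> str:
    raise ConnectionError("cloud unreachable")  # simulate an outage


async def call_ollama(prompt: str) -> str:
    return f"[local] reply to: {prompt}"


result = asyncio.run(generate_with_fallback("Hello!"))
print(result)  # served by the local fallback
```

The same wrapper shape applies to the TTS pair (Azure → Piper): catch the cloud failure, reissue the request against the local backend.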
## Quick Start

### Prerequisites

- Python 3.10+
- Docker (for Redis)
- API keys:
  - Groq API (free)
  - Azure Speech (free tier)
### Installation

```shell
# Clone the repository
git clone https://github.com/Speech-Synthesis/ConversaVoice.git
cd ConversaVoice

# Create a virtual environment
python -m venv venv

# Activate (Windows)
.\venv\Scripts\activate

# Activate (Linux/macOS)
source venv/bin/activate

# Install dependencies
pip install -r backend/requirements.txt

# Copy the environment file and edit it with your API keys
cp .env.example .env
```

### Configuration

```shell
# .env file
GROQ_API_KEY=your_groq_api_key
AZURE_SPEECH_KEY=your_azure_key
AZURE_SPEECH_REGION=eastus
REDIS_HOST=localhost
REDIS_PORT=6379
STT_BACKEND=groq
```

### Running

1. Start Redis:

   ```shell
   docker run -d -p 6379:6379 redis
   ```

2. Start the backend (Terminal 1):

   ```shell
   cd backend
   uvicorn main:app --reload --port 8000
   ```

3. Start the frontend (Terminal 2):

   ```shell
   cd frontend
   streamlit run app.py
   ```

4. Open your browser at http://localhost:8501
## Architecture

```
┌────────────────────────────────────────────────────────────────┐
│                         ConversaVoice                          │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│  │ Frontend │───▶│ Backend  │───▶│   LLM    │───▶│   TTS    │  │
│  │Streamlit │    │ FastAPI  │    │   Groq   │    │  Azure   │  │
│  └──────────┘    └────┬─────┘    └──────────┘    └──────────┘  │
│                       │                                        │
│        ┌──────────────┼──────────────┐                         │
│        ▼              ▼              ▼                         │
│   ┌─────────┐   ┌──────────┐   ┌──────────┐                    │
│   │  Redis  │   │   STT    │   │   NLP    │                    │
│   │ Memory  │   │   Groq   │   │Sentiment │                    │
│   └─────────┘   └──────────┘   └──────────┘                    │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
Conversation flow:

```
    User speaks
         │
         ▼
┌─────────────────┐
│  Groq Whisper   │  ← Speech-to-Text
│     (STT)       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Sentiment +   │  ← Analyze emotion
│  Context Check  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Redis Memory   │  ← Fetch history
│  + Preferences  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Groq LLM     │  ← Generate response
│   (Llama 3.3)   │    with emotion style
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Azure TTS     │  ← Convert to speech
│ (Neural Voice)  │    with prosody
└────────┬────────┘
         │
         ▼
User hears response
```
## API Usage

### Python

```python
import asyncio

from src.orchestrator import Orchestrator


async def main():
    # Initialize
    orch = Orchestrator(session_id="user-123")
    await orch.initialize()

    # Process voice/text
    result = await orch.process_text("I'm frustrated with my order!")
    print(f"Response: {result.assistant_response}")
    print(f"Emotion Style: {result.style}")  # "empathetic"
    print(f"Latency: {result.latency_ms}ms")

    await orch.shutdown()


asyncio.run(main())
```

### REST Endpoints

Health check:

```shell
curl http://localhost:8000/api/health
```

Create a session:

```shell
curl -X POST http://localhost:8000/api/session
```

Chat:

```shell
curl -X POST http://localhost:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello!", "session_id": "your-session-id"}'
```

Transcribe audio:

```shell
curl -X POST http://localhost:8000/api/transcribe \
  -F "audio=@recording.wav" \
  -F "session_id=your-session-id"
```

Synthesize speech:

```shell
curl -X POST http://localhost:8000/api/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello!", "style": "cheerful"}'
```

ConversaVoice adapts voice characteristics based on context:
| Style | When Used | Voice Effect |
|---|---|---|
| `neutral` | Normal conversation | Standard tone |
| `cheerful` | Good news, greetings | Higher pitch, faster |
| `empathetic` | User frustrated/sad | Softer, slower |
| `patient` | Explaining complex topics | Calm, measured |
| `de_escalate` | User very angry | Very soft, slow |
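Picking a style amounts to mapping sentiment scores onto this table; a minimal sketch of that mapping (the thresholds and signature are illustrative, not the project's actual logic, which also weighs conversation context):

```python
def pick_style(sentiment: float, anger: float = 0.0) -> str:
    """Map a sentiment score in [-1, 1] to an emotion style.

    Thresholds are illustrative only; `anger` stands in for a
    separate frustration signal from the sentiment analyzer.
    """
    if anger > 0.8:
        return "de_escalate"   # user very angry
    if sentiment < -0.3:
        return "empathetic"    # user frustrated/sad
    if sentiment > 0.5:
        return "cheerful"      # good news, greetings
    return "neutral"


print(pick_style(-0.7))        # empathetic
print(pick_style(0.9))         # cheerful
print(pick_style(0.1))         # neutral
print(pick_style(-0.9, 0.95))  # de_escalate
```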
Styles are applied through Azure SSML, for example:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:express-as style="empathetic" styledegree="1.3">
      I understand how frustrating this must be.
      <emphasis level="strong">We'll fix this right away.</emphasis>
    </mstts:express-as>
  </voice>
</speak>
```

Project structure:

```
ConversaVoice/
├── backend/                     # FastAPI backend
│   ├── api/
│   │   ├── routes.py            # API endpoints
│   │   └── models.py            # Pydantic models
│   ├── services/
│   │   └── orchestrator_service.py
│   └── main.py                  # App entry point
│
├── frontend/                    # Streamlit UI
│   ├── app.py                   # Main UI
│   └── api_client.py            # Backend client
│
├── src/                         # Core modules
│   ├── llm/                     # LLM clients
│   │   ├── groq_client.py
│   │   └── ollama_client.py
│   ├── tts/                     # Text-to-Speech
│   │   ├── azure_client.py
│   │   ├── piper_client.py
│   │   └── ssml_builder.py
│   ├── stt/                     # Speech-to-Text
│   │   ├── groq_whisper_client.py
│   │   └── whisper_client.py
│   ├── memory/                  # Conversation memory
│   │   ├── redis_client.py
│   │   └── vector_store.py
│   ├── nlp/                     # NLP utilities
│   │   └── sentiment.py
│   ├── orchestrator.py          # Main pipeline
│   └── fallback.py              # Fallback manager
│
├── scripts/                     # CLI tools
│   ├── main.py                  # Interactive CLI
│   └── transcribe.py            # Transcription tool
│
└── .env                         # Configuration
```
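The tree includes an `ssml_builder.py` module; assembling SSML payloads like the example above can be sketched with the standard library alone (the function name and defaults here are assumptions, not the module's real interface):

```python
from xml.sax.saxutils import escape


def build_ssml(text: str, style: str = "neutral", degree: float = 1.0,
               voice: str = "en-US-JennyNeural") -> str:
    """Wrap text in Azure-style SSML with an express-as emotion style.

    `escape` guards against user text breaking the XML.
    """
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}" styledegree="{degree}">'
        f'{escape(text)}'
        '</mstts:express-as></voice></speak>'
    )


ssml = build_ssml("We'll fix this right away.", style="empathetic", degree=1.3)
print(ssml)
```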
## Tech Stack

| Component | Technology | Purpose |
|---|---|---|
| Frontend | Streamlit | Web UI |
| Backend | FastAPI | REST API |
| LLM | Groq (Llama 3.3 70B) | Response generation |
| STT | Groq Whisper | Speech recognition |
| TTS | Azure Neural TTS | Voice synthesis |
| Memory | Redis | Conversation storage |
| Embeddings | Sentence Transformers | Repetition detection |
| Fallback LLM | Ollama | Offline backup |
| Fallback TTS | Piper | Offline backup |
Environment variables:

| Variable | Description | Required |
|---|---|---|
| `GROQ_API_KEY` | Groq API key for LLM & STT | Yes |
| `AZURE_SPEECH_KEY` | Azure Speech Services key | Yes |
| `AZURE_SPEECH_REGION` | Azure region (e.g., `eastus`) | Yes |
| `REDIS_HOST` | Redis server host | Yes |
| `REDIS_PORT` | Redis server port | Yes |
| `STT_BACKEND` | `groq` or `local` | No (default: `groq`) |
| `BACKEND_API_URL` | Backend URL for the frontend | No |
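A minimal sketch of reading these variables at startup, stdlib only (the project may actually use `python-dotenv` or pydantic settings; the `BACKEND_API_URL` default here is an assumption, not documented behavior):

```python
import os


def load_settings() -> dict:
    """Read ConversaVoice settings from the environment.

    The required names come from the table above; defaults for the
    optional ones are illustrative assumptions.
    """
    required = ["GROQ_API_KEY", "AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION",
                "REDIS_HOST", "REDIS_PORT"]
    missing = [name for name in required if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    settings = {name: os.environ[name] for name in required}
    settings["STT_BACKEND"] = os.getenv("STT_BACKEND", "groq")
    settings["BACKEND_API_URL"] = os.getenv("BACKEND_API_URL",
                                            "http://localhost:8000")
    return settings


# Example using the values from the Quick Start .env
os.environ.update({
    "GROQ_API_KEY": "your_groq_api_key",
    "AZURE_SPEECH_KEY": "your_azure_key",
    "AZURE_SPEECH_REGION": "eastus",
    "REDIS_HOST": "localhost",
    "REDIS_PORT": "6379",
})
settings = load_settings()
print(settings["STT_BACKEND"])
```

Failing fast on missing required keys keeps a misconfigured deployment from limping along until the first API call.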
- [x] Voice input with Whisper STT
- [x] Intelligent responses with Llama 3.3
- [x] Emotional TTS with Azure
- [x] Conversation memory with Redis
- [x] Sentiment analysis
- [x] Fallback to local models
- [x] Web UI with Streamlit
- [ ] Multi-language support
- [ ] Voice cloning
- [ ] Mobile app
- [ ] WebSocket real-time streaming
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Made by the ConversaVoice Team