ThemisDB Voice Assistant Guide
Version: 1.0
Status: Enterprise Feature
Author: ThemisDB Team
Date: December 2025
ThemisDB Voice Assistant provides natural language voice interaction capabilities similar to Alexa or Siri, integrated directly into the database. It combines Speech-to-Text (STT), Text-to-Speech (TTS), and Large Language Models (LLM) to enable:
- Voice Commands - Query and control the database using natural language
- Phone Call Recording - Automatic transcription and storage of phone calls
- Meeting Protocol Generation - AI-powered meeting minutes and action items
- Voice Assistant Conversations - Interactive voice-based assistance
All recordings and transcriptions are stored securely in ThemisDB with full revision control and audit trails (Enterprise feature).
┌─────────────────────────────────────────────────────────┐
│ Voice Assistant │
│ ┌───────────┐ ┌───────────┐ ┌─────────────┐ │
│ │ STT │ │ TTS │ │ LLM │ │
│ │ (Whisper) │ │ (Piper) │ │ (llama.cpp) │ │
│ └───────────┘ └───────────┘ └─────────────┘ │
└────────────────────┬────────────────────────────────────┘
│
┌───────────┴───────────┐
│ │
┌────▼────┐ ┌────▼────┐
│ API │ │ WS │
│/api/v1/ │ │ /ws/ │
│ voice │ │ voice │
└─────────┘ └─────────┘
│ │
└───────────┬───────────┘
│
┌───────────▼───────────┐
│ ThemisDB Storage │
│ - Base Entities │
│ - Revision Control │
│ - Audit Logs │
└───────────────────────┘
Powered by Whisper.cpp for high-accuracy transcription:
- Multi-language support (100+ languages with auto-detection)
- Timestamp generation for segments
- Speaker diarization (identify different speakers)
- Word-level confidence scores
- Real-time streaming transcription
Supported Audio Formats:
- MP3, WAV, OGG, FLAC, AAC, M4A, Opus, WMA
Model Sizes:
- tiny: 39M params, fast, good for real-time
- base: 74M params, balanced (default)
- small: 244M params, better accuracy
- medium: 769M params, high accuracy
- large: 1550M params, best accuracy
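The sizes above follow the whisper.cpp model naming convention (`ggml-<size>.bin`, as used by this guide's default config; note that versioned large models may carry a suffix such as `-v3`). As a small sketch, a helper can resolve a configured size to a model path:

```python
# Resolve a whisper.cpp model size to its conventional file name.
# The "./models" directory mirrors the default config in this guide.
WHISPER_SIZES = ("tiny", "base", "small", "medium", "large")

def whisper_model_path(size: str, models_dir: str = "./models") -> str:
    if size not in WHISPER_SIZES:
        raise ValueError(f"unknown whisper model size: {size!r}")
    return f"{models_dir}/ggml-{size}.bin"

print(whisper_model_path("base"))  # ./models/ggml-base.bin
```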
Powered by Piper TTS for natural-sounding voice synthesis:
- Multiple voice profiles (male/female, different accents)
- Adjustable speed and pitch
- Multiple output formats (WAV, MP3, OGG)
- High-quality neural synthesis
- Real-time streaming synthesis
Available Voices:
- English (US, UK, Australian)
- German
- Spanish
- French
- And more...
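The TTS engine is exposed through the HTTP API documented later in this guide. As a sketch (endpoint path and field names taken from the `/api/v1/voice/synthesize` reference below; the server URL is an assumed local default), a client builds the request body and decodes the base64 audio in the response:

```python
import base64
import json
import urllib.request

SYNTH_URL = "http://localhost:8080/api/v1/voice/synthesize"  # assumed local default

def build_synthesize_request(text: str, voice: str = "default") -> bytes:
    # Request body as documented for POST /api/v1/voice/synthesize.
    return json.dumps({
        "text": text,
        "voice": voice,
        "speed": 1.0,
        "format": "wav",
        "return_base64": True,
    }).encode()

def decode_audio(response_json: dict) -> bytes:
    # The response carries the audio as base64; decode back to raw WAV bytes.
    return base64.b64decode(response_json["audio_base64"])

# With a running server (not executed here):
# req = urllib.request.Request(SYNTH_URL, data=build_synthesize_request("Hello"),
#                              headers={"Authorization": "Bearer YOUR_TOKEN",
#                                       "Content-Type": "application/json"})
# audio = decode_audio(json.loads(urllib.request.urlopen(req).read()))
```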
Uses llama.cpp for natural language understanding:
- Conversation context management
- Meeting summary generation
- Key points extraction
- Action items identification
- Natural language query processing
Edit config/voice_assistant.yaml:
voice_assistant:
  enabled: true
  stt:
    model_path: "./models/ggml-base.bin"
    model_size: "base"
    language: "auto"
  tts:
    model_path: "./models/tts-model.bin"
    voice: "default"
  llm:
    model_path: "./models/llama-2-7b-chat.gguf"
    n_ctx: 4096
Start the server with the voice assistant enabled:
./themis_server --config config.yaml --enable-voice-assistant
Send a test command:
curl -X POST http://localhost:8080/api/v1/voice/command \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "What is the total revenue this month?",
    "session_id": "user123"
  }'
All voice endpoints are served under the base URL http://localhost:8080/api/v1/voice.
All endpoints require Bearer token authentication:
Authorization: Bearer YOUR_JWT_TOKEN
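Clients typically set this header once for all voice API calls. A minimal stdlib sketch (the token value is a placeholder, and the base URL is the assumed local default):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/api/v1/voice"  # assumed local default

def voice_api_request(path: str, payload: dict,
                      token: str = "YOUR_JWT_TOKEN") -> urllib.request.Request:
    # Build an authenticated JSON POST request against the voice API.
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = voice_api_request("/command", {"text": "ping", "session_id": "user123"})
print(req.get_header("Authorization"))  # Bearer YOUR_JWT_TOKEN
```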
POST /api/v1/voice/transcribe
Convert audio to text.
Request:
{
"audio_base64": "BASE64_ENCODED_AUDIO",
"language": "auto",
"timestamps": true,
"speaker_diarization": false
}
Response:
{
"success": true,
"text": "Hello, this is a test transcription.",
"language": "en",
"confidence": 0.95,
"duration_ms": 3000,
"segments": [
{
"text": "Hello, this is a test transcription.",
"start_ms": 0,
"end_ms": 3000,
"confidence": 0.95
}
]
}
POST /api/v1/voice/synthesize
Convert text to speech.
Request:
{
"text": "Hello, how can I help you today?",
"voice": "default",
"speed": 1.0,
"format": "wav",
"return_base64": true
}
Response:
{
"success": true,
"audio_base64": "BASE64_ENCODED_AUDIO",
"mime_type": "audio/wav",
"duration_ms": 2500
}
POST /api/v1/voice/command
Process a voice or text command with LLM.
Request (Text):
{
"text": "Show me the top 10 customers by revenue",
"session_id": "user123"
}
Request (Audio):
{
"audio_base64": "BASE64_ENCODED_AUDIO",
"session_id": "user123"
}
Response (Text):
{
"success": true,
"response": "Here are the top 10 customers by revenue...",
"session_id": "user123"
}
Response (Audio):
{
"success": true,
"audio_base64": "BASE64_ENCODED_AUDIO",
"mime_type": "audio/wav",
"session_id": "user123"
}
POST /api/v1/voice/call/record
Record and transcribe a phone call.
Request:
{
"audio_base64": "BASE64_ENCODED_AUDIO",
"call_id": "call-12345",
"caller": "+1234567890",
"callee": "+0987654321",
"start_time": 1703000000000,
"end_time": 1703003600000,
"call_type": "inbound",
"custom_fields": {
"department": "Sales",
"category": "Support"
}
}
Response:
{
"success": true,
"call_id": "call-12345",
"transcript": "Full transcription text...",
"language": "en",
"confidence": 0.95,
"duration_ms": 3600000,
"segments": [...],
"summary": "Customer called regarding...",
"document_id": "recording:abc123",
"metadata": {
"caller": "+1234567890",
"callee": "+0987654321",
"call_type": "inbound"
}
}
POST /api/v1/voice/meeting/protocol
Generate a structured meeting protocol from audio recording.
Request:
{
"audio_base64": "BASE64_ENCODED_AUDIO",
"meeting_id": "meeting-789",
"title": "Q4 Planning Meeting",
"start_time": 1703000000000,
"end_time": 1703007200000,
"organizer": "john.doe@company.com",
"participants": [
"john.doe@company.com",
"jane.smith@company.com",
"bob.jones@company.com"
],
"custom_fields": {
"project": "Phoenix",
"location": "Conference Room A"
}
}
Response:
{
"success": true,
"meeting_id": "meeting-789",
"title": "Q4 Planning Meeting",
"transcript": "Full meeting transcript...",
"summary": "The team discussed Q4 objectives...",
"key_points": [
"Launch new product in Q4",
"Increase marketing budget by 20%",
"Hire 3 new developers"
],
"action_items": [
{
"description": "Prepare product launch plan",
"status": "pending"
},
{
"description": "Submit budget proposal",
"status": "pending"
}
],
"segments": [...],
"document_id": "recording:xyz789",
"participants": [...],
"duration_ms": 7200000
}
GET /api/v1/voice/voices
List available TTS voices.
Response:
{
"voices": [
{
"id": "default",
"name": "Default Voice",
"language": "en",
"gender": "neutral",
"style": "professional"
},
{
"id": "female_en",
"name": "Female English",
"language": "en",
"gender": "female",
"style": "friendly"
}
]
}
GET /api/v1/voice/languages
List supported languages for STT/TTS.
Response:
{
"languages": [
"en", "de", "es", "fr", "it", "pt", "ru", "zh", "ja", "ko"
]
}
GET /api/v1/voice/stats
Get voice assistant statistics.
Response:
{
"stt": {
"transcriptions_completed": 1234,
"total_audio_duration_ms": 3600000,
"real_time_factor": 0.3
},
"tts": {
"syntheses_completed": 567,
"total_audio_duration_ms": 1800000
},
"llm": {
"tokens_processed": 50000,
"cache_hits": 1200,
"avg_latency_ms": 150
},
"active_sessions": 5
}
GET /api/v1/voice/health
Check voice assistant health.
Response:
{
"status": "healthy",
"voice_assistant": "available",
"timestamp": 1703000000000
}
Record and transcribe customer support calls automatically:
import requests
import base64
# Read audio file
with open("call.mp3", "rb") as f:
audio_data = f.read()
audio_base64 = base64.b64encode(audio_data).decode()
# Record call
response = requests.post(
"http://localhost:8080/api/v1/voice/call/record",
headers={"Authorization": "Bearer YOUR_TOKEN"},
json={
"audio_base64": audio_base64,
"call_id": "call-12345",
"caller": "+1234567890",
"callee": "+0987654321",
"call_type": "inbound"
}
)
result = response.json()
print(f"Transcript: {result['transcript']}")
print(f"Summary: {result['summary']}")
print(f"Document ID: {result['document_id']}")
Automatically generate meeting protocols:
import requests
import base64
# Read meeting recording
with open("meeting.wav", "rb") as f:
audio_data = f.read()
audio_base64 = base64.b64encode(audio_data).decode()
# Generate protocol
response = requests.post(
"http://localhost:8080/api/v1/voice/meeting/protocol",
headers={"Authorization": "Bearer YOUR_TOKEN"},
json={
"audio_base64": audio_base64,
"meeting_id": "meeting-789",
"title": "Sprint Planning",
"participants": [
"alice@company.com",
"bob@company.com"
]
}
)
result = response.json()
print(f"Summary: {result['summary']}")
print(f"Key Points: {result['key_points']}")
print(f"Action Items: {result['action_items']}")
Query the database using natural language:
import requests
response = requests.post(
"http://localhost:8080/api/v1/voice/command",
headers={"Authorization": "Bearer YOUR_TOKEN"},
json={
"text": "Show me the total sales for last month",
"session_id": "user123"
}
)
result = response.json()
print(f"Response: {result['response']}")
STT configuration (config/voice_assistant.yaml):
stt:
  model:
    path: "./models/ggml-base.bin"
    size: "base" # tiny, base, small, medium, large
    auto_download: true
  transcription:
    language: "auto"
    timestamps: true
    timestamp_granularity: "segment"
    word_confidence: false
  speaker_diarization:
    enabled: false
    num_speakers: 0 # 0 = auto-detect
  vad:
    enabled: true
    threshold: 0.5
TTS configuration:
tts:
  model:
    path: "./models/tts-model.bin"
    engine: "piper"
  synthesis:
    sample_rate: 22050
    speed: 1.0
    pitch: 1.0
    normalize: true
  output:
    format: "wav"
    quality: "medium"
LLM configuration:
llm:
  model_path: "./models/llama-2-7b-chat.gguf"
  n_ctx: 4096
  n_gpu_layers: 0 # 0 = CPU only
  temperature: 0.7
  top_p: 0.9
All recordings and transcriptions are stored in ThemisDB with:
- Revision Control - Track changes over time
- Audit Logs - Who accessed/modified what and when
- Encryption - At-rest encryption for sensitive data
- Compression - Automatic audio compression (OGG/MP3)
- Metadata - Rich metadata for search and retrieval
Storage Path:
data/voice_recordings/
├── calls/
│ ├── call-12345/
│ │ ├── audio.ogg
│ │ ├── transcript.txt
│ │ └── metadata.json
│ └── ...
└── meetings/
├── meeting-789/
│ ├── audio.ogg
│ ├── protocol.md
│ └── metadata.json
└── ...
- JWT Bearer token required for all API endpoints
- Token validation on every request
- Session-based access control
- PII detection and optional redaction
- Configurable data retention policies
- Automatic cleanup of old recordings
- GDPR-compliant data handling
All voice operations are logged:
- Who initiated the request
- What operation was performed
- When it occurred
- What data was accessed/modified
| Model | Speed | Accuracy | Memory |
|---|---|---|---|
| tiny | 4x RT | Good | ~1 GB |
| base | 1x RT | Better | ~1 GB |
| small | 0.5x RT | High | ~2 GB |
| medium | 0.3x RT | Very High | ~5 GB |
| large | 0.2x RT | Best | ~10 GB |
RT = Real-time (1x RT means 1 minute audio = 1 minute processing)
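The real-time factor translates directly into expected wall-clock time. A quick arithmetic sketch of the table above:

```python
def stt_processing_seconds(audio_seconds: float, speed_x_rt: float) -> float:
    """Wall-clock STT time for audio of a given length.

    speed_x_rt is the table's Speed column: 4x RT processes audio four
    times faster than real time; 0.5x RT runs at half real-time speed.
    """
    return audio_seconds / speed_x_rt

# A 10-minute call with the "tiny" model (4x RT) takes ~2.5 minutes to transcribe:
print(stt_processing_seconds(600, 4.0))  # 150.0 seconds
```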
- ~50-100 characters/second synthesis
- Real-time streaming capable
- Low latency (<100ms for short phrases)
- Depends on model size and hardware
- GPU acceleration recommended
- ~20-50 tokens/second (typical)
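The throughput figures above allow rough latency estimates. The midpoint values in this sketch (75 chars/s for TTS, 35 tokens/s for the LLM) are assumptions chosen from the stated ranges, not measured defaults:

```python
def tts_seconds(n_chars: int, chars_per_sec: float = 75.0) -> float:
    # ~50-100 characters/second synthesis; 75 is an assumed midpoint.
    return n_chars / chars_per_sec

def llm_seconds(n_tokens: int, tokens_per_sec: float = 35.0) -> float:
    # ~20-50 tokens/second typical on CPU; 35 is an assumed midpoint.
    return n_tokens / tokens_per_sec

# A 300-character summary spoken aloud: ~4 s to synthesize.
print(round(tts_seconds(300), 1))  # 4.0
```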
Problem: STT model not found.
Solution: Enable auto-download or download the model manually:
wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin \
  -O models/ggml-base.bin
Problem: Transcription is too slow.
Solution: Use a smaller model (tiny/base) or enable GPU acceleration.
Problem: Transcription accuracy is poor.
Solution: Use a larger model (medium/large) or ensure the input audio quality is good.
- Horizontal Scaling - Distribute voice processing across nodes
- High Availability - Redundant voice assistants
- Advanced Analytics - Call analytics, sentiment analysis
- Custom Voice Training - Train custom voices for your brand
- Integration - Integrate with PBX systems, CRM, etc.
All core libraries used in the Voice Assistant are open source under the MIT License:
- Whisper.cpp (STT) - MIT License
- Piper TTS (TTS) - MIT License
- llama.cpp (LLM) - MIT License
- ONNX Runtime - MIT License
✅ Suitable for commercial and on-premise use
✅ No external API dependencies
✅ Privacy-preserving (all processing local)
→ Complete License Documentation
For issues or questions:
- GitHub Issues: https://github.com/makr-code/ThemisDB/issues
- Documentation: https://makr-code.github.io/ThemisDB/
- Enterprise Support: sales@themisdb.com
Voice Assistant is an Enterprise Feature of ThemisDB.
- Community Edition: Limited to basic STT/TTS functionality
- Enterprise Edition: Full features including phone call recording, meeting protocols, and advanced LLM integration
See LICENSE for details.
ThemisDB v1.3.4 | GitHub | Documentation | Discussions | License
Last synced: January 02, 2026 | Commit: 6add659