
ThemisDB Voice Assistant - Complete Guide

Version: 1.0
Status: Enterprise Feature
Author: ThemisDB Team
Date: December 2025


Overview

ThemisDB Voice Assistant provides natural language voice interaction capabilities similar to Alexa or Siri, integrated directly into the database. It combines Speech-to-Text (STT), Text-to-Speech (TTS), and Large Language Models (LLM) to enable:

  • Voice Commands - Query and control the database using natural language
  • Phone Call Recording - Automatic transcription and storage of phone calls
  • Meeting Protocol Generation - AI-powered meeting minutes and action items
  • Voice Assistant Conversations - Interactive voice-based assistance

All recordings and transcriptions are stored securely in ThemisDB with full revision control and audit trails (Enterprise feature).


Architecture

┌─────────────────────────────────────────────────────────┐
│                   Voice Assistant                        │
│  ┌───────────┐   ┌───────────┐   ┌─────────────┐       │
│  │    STT    │   │    TTS    │   │     LLM     │       │
│  │ (Whisper) │   │  (Piper)  │   │ (llama.cpp) │       │
│  └───────────┘   └───────────┘   └─────────────┘       │
└────────────────────┬────────────────────────────────────┘
                     │
         ┌───────────┴───────────┐
         │                       │
    ┌────▼────┐            ┌────▼────┐
    │   API   │            │   WS    │
    │/api/v1/ │            │  /ws/   │
    │ voice   │            │ voice   │
    └─────────┘            └─────────┘
         │                       │
         └───────────┬───────────┘
                     │
         ┌───────────▼───────────┐
         │  ThemisDB Storage     │
         │  - Base Entities      │
         │  - Revision Control   │
         │  - Audit Logs         │
         └───────────────────────┘

Features

1. Speech-to-Text (STT)

Powered by Whisper.cpp for high-accuracy transcription:

  • Multi-language support (100+ languages with auto-detection)
  • Timestamp generation for segments
  • Speaker diarization (identify different speakers)
  • Word-level confidence scores
  • Real-time streaming transcription

Supported Audio Formats:

  • MP3, WAV, OGG, FLAC, AAC, M4A, Opus, WMA

Model Sizes:

  • tiny - 39M params, fast, good for real-time
  • base - 74M params, balanced (default)
  • small - 244M params, better accuracy
  • medium - 769M params, high accuracy
  • large - 1550M params, best accuracy
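Before base64-encoding and uploading, a client can validate the file extension against the supported formats listed above. A minimal sketch — the helper name and extension set are illustrative, not part of the ThemisDB API:

```python
from pathlib import Path

# Extensions corresponding to the supported audio formats listed above.
SUPPORTED_AUDIO_EXTENSIONS = {
    ".mp3", ".wav", ".ogg", ".flac", ".aac", ".m4a", ".opus", ".wma",
}

def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension matches a supported STT format."""
    return Path(filename).suffix.lower() in SUPPORTED_AUDIO_EXTENSIONS
```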

2. Text-to-Speech (TTS)

Powered by Piper TTS for natural-sounding voice synthesis:

  • Multiple voice profiles (male/female, different accents)
  • Adjustable speed and pitch
  • Multiple output formats (WAV, MP3, OGG)
  • High-quality neural synthesis
  • Real-time streaming synthesis

Available Voices:

  • English (US, UK, Australian)
  • German
  • Spanish
  • French
  • And more...

3. LLM Integration

Uses llama.cpp for natural language understanding:

  • Conversation context management
  • Meeting summary generation
  • Key points extraction
  • Action items identification
  • Natural language query processing

Quick Start

1. Enable Voice Assistant

Edit config/voice_assistant.yaml:

voice_assistant:
  enabled: true
  
  stt:
    model_path: "./models/ggml-base.bin"
    model_size: "base"
    language: "auto"
  
  tts:
    model_path: "./models/tts-model.bin"
    voice: "default"
  
  llm:
    model_path: "./models/llama-2-7b-chat.gguf"
    n_ctx: 4096

2. Start ThemisDB Server

./themis_server --config config.yaml --enable-voice-assistant

3. Test Voice Command

curl -X POST http://localhost:8080/api/v1/voice/command \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "What is the total revenue this month?",
    "session_id": "user123"
  }'

API Reference

Base URL

http://localhost:8080/api/v1/voice

Authentication

All endpoints require Bearer token authentication:

Authorization: Bearer YOUR_JWT_TOKEN

Endpoints

1. Transcribe Audio

POST /api/v1/voice/transcribe

Convert audio to text.

Request:

{
  "audio_base64": "BASE64_ENCODED_AUDIO",
  "language": "auto",
  "timestamps": true,
  "speaker_diarization": false
}

Response:

{
  "success": true,
  "text": "Hello, this is a test transcription.",
  "language": "en",
  "confidence": 0.95,
  "duration_ms": 3000,
  "segments": [
    {
      "text": "Hello, this is a test transcription.",
      "start_ms": 0,
      "end_ms": 3000,
      "confidence": 0.95
    }
  ]
}
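The request body above can be assembled from raw audio bytes with a small helper. A sketch against the documented fields (the helper name and the commented `requests` usage are illustrative):

```python
import base64

def build_transcribe_payload(audio_bytes: bytes,
                             language: str = "auto",
                             timestamps: bool = True,
                             speaker_diarization: bool = False) -> dict:
    """Assemble the /api/v1/voice/transcribe request body shown above."""
    return {
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "language": language,
        "timestamps": timestamps,
        "speaker_diarization": speaker_diarization,
    }

# Usage (server URL and token are placeholders):
# requests.post("http://localhost:8080/api/v1/voice/transcribe",
#               headers={"Authorization": "Bearer YOUR_TOKEN"},
#               json=build_transcribe_payload(open("call.wav", "rb").read()))
```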

2. Synthesize Speech

POST /api/v1/voice/synthesize

Convert text to speech.

Request:

{
  "text": "Hello, how can I help you today?",
  "voice": "default",
  "speed": 1.0,
  "format": "wav",
  "return_base64": true
}

Response:

{
  "success": true,
  "audio_base64": "BASE64_ENCODED_AUDIO",
  "mime_type": "audio/wav",
  "duration_ms": 2500
}
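With `return_base64: true`, decoding the response into a playable file is straightforward. A sketch using the response fields shown above (the function name is illustrative):

```python
import base64

def save_synthesized_audio(response_json: dict, path: str) -> int:
    """Decode audio_base64 from a /synthesize response and write it to disk.

    Returns the number of bytes written.
    """
    audio = base64.b64decode(response_json["audio_base64"])
    with open(path, "wb") as f:
        f.write(audio)
    return len(audio)
```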

3. Process Voice Command

POST /api/v1/voice/command

Process a voice or text command with LLM.

Request (Text):

{
  "text": "Show me the top 10 customers by revenue",
  "session_id": "user123"
}

Request (Audio):

{
  "audio_base64": "BASE64_ENCODED_AUDIO",
  "session_id": "user123"
}

Response (Text):

{
  "success": true,
  "response": "Here are the top 10 customers by revenue...",
  "session_id": "user123"
}

Response (Audio):

{
  "success": true,
  "audio_base64": "BASE64_ENCODED_AUDIO",
  "mime_type": "audio/wav",
  "session_id": "user123"
}
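Since /command accepts either text or audio, a client can branch on the input type when building the request. A sketch (the helper name is illustrative; the field names follow the requests shown above):

```python
import base64
from typing import Optional

def build_command_request(session_id: str,
                          text: Optional[str] = None,
                          audio_bytes: Optional[bytes] = None) -> dict:
    """Build a /api/v1/voice/command body from either text or raw audio.

    Exactly one of text / audio_bytes must be given.
    """
    if (text is None) == (audio_bytes is None):
        raise ValueError("provide exactly one of text or audio_bytes")
    body = {"session_id": session_id}
    if text is not None:
        body["text"] = text
    else:
        body["audio_base64"] = base64.b64encode(audio_bytes).decode("ascii")
    return body
```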

4. Record Phone Call

POST /api/v1/voice/call/record

Record and transcribe a phone call.

Request:

{
  "audio_base64": "BASE64_ENCODED_AUDIO",
  "call_id": "call-12345",
  "caller": "+1234567890",
  "callee": "+0987654321",
  "start_time": 1703000000000,
  "end_time": 1703003600000,
  "call_type": "inbound",
  "custom_fields": {
    "department": "Sales",
    "category": "Support"
  }
}

Response:

{
  "success": true,
  "call_id": "call-12345",
  "transcript": "Full transcription text...",
  "language": "en",
  "confidence": 0.95,
  "duration_ms": 3600000,
  "segments": [...],
  "summary": "Customer called regarding...",
  "document_id": "recording:abc123",
  "metadata": {
    "caller": "+1234567890",
    "callee": "+0987654321",
    "call_type": "inbound"
  }
}

5. Generate Meeting Protocol

POST /api/v1/voice/meeting/protocol

Generate a structured meeting protocol from an audio recording.

Request:

{
  "audio_base64": "BASE64_ENCODED_AUDIO",
  "meeting_id": "meeting-789",
  "title": "Q4 Planning Meeting",
  "start_time": 1703000000000,
  "end_time": 1703007200000,
  "organizer": "john.doe@company.com",
  "participants": [
    "john.doe@company.com",
    "jane.smith@company.com",
    "bob.jones@company.com"
  ],
  "custom_fields": {
    "project": "Phoenix",
    "location": "Conference Room A"
  }
}

Response:

{
  "success": true,
  "meeting_id": "meeting-789",
  "title": "Q4 Planning Meeting",
  "transcript": "Full meeting transcript...",
  "summary": "The team discussed Q4 objectives...",
  "key_points": [
    "Launch new product in Q4",
    "Increase marketing budget by 20%",
    "Hire 3 new developers"
  ],
  "action_items": [
    {
      "description": "Prepare product launch plan",
      "status": "pending"
    },
    {
      "description": "Submit budget proposal",
      "status": "pending"
    }
  ],
  "segments": [...],
  "document_id": "recording:xyz789",
  "participants": [...],
  "duration_ms": 7200000
}

6. Get Available Voices

GET /api/v1/voice/voices

List available TTS voices.

Response:

{
  "voices": [
    {
      "id": "default",
      "name": "Default Voice",
      "language": "en",
      "gender": "neutral",
      "style": "professional"
    },
    {
      "id": "female_en",
      "name": "Female English",
      "language": "en",
      "gender": "female",
      "style": "friendly"
    }
  ]
}

7. Get Supported Languages

GET /api/v1/voice/languages

List supported languages for STT/TTS.

Response:

{
  "languages": [
    "en", "de", "es", "fr", "it", "pt", "ru", "zh", "ja", "ko"
  ]
}

8. Get Statistics

GET /api/v1/voice/stats

Get voice assistant statistics.

Response:

{
  "stt": {
    "transcriptions_completed": 1234,
    "total_audio_duration_ms": 3600000,
    "real_time_factor": 0.3
  },
  "tts": {
    "syntheses_completed": 567,
    "total_audio_duration_ms": 1800000
  },
  "llm": {
    "tokens_processed": 50000,
    "cache_hits": 1200,
    "avg_latency_ms": 150
  },
  "active_sessions": 5
}

9. Health Check

GET /api/v1/voice/health

Check voice assistant health.

Response:

{
  "status": "healthy",
  "voice_assistant": "available",
  "timestamp": 1703000000000
}
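A deployment script can treat the service as ready only when both fields report a healthy state. A minimal sketch against the response shown above (the helper name is illustrative):

```python
def is_ready(health_json: dict) -> bool:
    """Interpret a /api/v1/voice/health response: ready only when the
    service reports healthy and the voice assistant is available."""
    return (health_json.get("status") == "healthy"
            and health_json.get("voice_assistant") == "available")
```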

Use Cases

1. Phone Call Recording System

Record and transcribe customer support calls automatically:

import requests
import base64

# Read audio file
with open("call.mp3", "rb") as f:
    audio_data = f.read()
    audio_base64 = base64.b64encode(audio_data).decode()

# Record call
response = requests.post(
    "http://localhost:8080/api/v1/voice/call/record",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={
        "audio_base64": audio_base64,
        "call_id": "call-12345",
        "caller": "+1234567890",
        "callee": "+0987654321",
        "call_type": "inbound"
    }
)

result = response.json()
print(f"Transcript: {result['transcript']}")
print(f"Summary: {result['summary']}")
print(f"Document ID: {result['document_id']}")

2. Meeting Minutes Generation

Automatically generate meeting protocols:

import requests
import base64

# Read meeting recording
with open("meeting.wav", "rb") as f:
    audio_data = f.read()
    audio_base64 = base64.b64encode(audio_data).decode()

# Generate protocol
response = requests.post(
    "http://localhost:8080/api/v1/voice/meeting/protocol",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={
        "audio_base64": audio_base64,
        "meeting_id": "meeting-789",
        "title": "Sprint Planning",
        "participants": [
            "alice@company.com",
            "bob@company.com"
        ]
    }
)

result = response.json()
print(f"Summary: {result['summary']}")
print(f"Key Points: {result['key_points']}")
print(f"Action Items: {result['action_items']}")

3. Voice-Controlled Database Queries

Query the database using natural language:

import requests

response = requests.post(
    "http://localhost:8080/api/v1/voice/command",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={
        "text": "Show me the total sales for last month",
        "session_id": "user123"
    }
)

result = response.json()
print(f"Response: {result['response']}")

Configuration

STT Configuration

stt:
  model:
    path: "./models/ggml-base.bin"
    size: "base"  # tiny, base, small, medium, large
    auto_download: true
  
  transcription:
    language: "auto"
    timestamps: true
    timestamp_granularity: "segment"
    word_confidence: false
  
  speaker_diarization:
    enabled: false
    num_speakers: 0  # 0 = auto-detect
  
  vad:
    enabled: true
    threshold: 0.5

TTS Configuration

tts:
  model:
    path: "./models/tts-model.bin"
    engine: "piper"
  
  synthesis:
    sample_rate: 22050
    speed: 1.0
    pitch: 1.0
    normalize: true
  
  output:
    format: "wav"
    quality: "medium"

LLM Configuration

llm:
  model_path: "./models/llama-2-7b-chat.gguf"
  n_ctx: 4096
  n_gpu_layers: 0  # 0 = CPU only
  temperature: 0.7
  top_p: 0.9

Storage and Revision Control

All recordings and transcriptions are stored in ThemisDB with:

  • Revision Control - Track changes over time
  • Audit Logs - Who accessed/modified what and when
  • Encryption - At-rest encryption for sensitive data
  • Compression - Automatic audio compression (OGG/MP3)
  • Metadata - Rich metadata for search and retrieval

Storage Path:

data/voice_recordings/
  ├── calls/
  │   ├── call-12345/
  │   │   ├── audio.ogg
  │   │   ├── transcript.txt
  │   │   └── metadata.json
  │   └── ...
  └── meetings/
      ├── meeting-789/
      │   ├── audio.ogg
      │   ├── protocol.md
      │   └── metadata.json
      └── ...
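The layout above maps each recording ID to its own directory. A sketch of that convention (the root path and function are illustrative, derived from the tree above, not part of the ThemisDB API):

```python
from pathlib import Path

VOICE_ROOT = Path("data/voice_recordings")

def recording_dir(kind: str, recording_id: str) -> Path:
    """Directory for a recording, following the layout shown above.

    kind is 'calls' or 'meetings'.
    """
    if kind not in ("calls", "meetings"):
        raise ValueError(f"unknown recording kind: {kind}")
    return VOICE_ROOT / kind / recording_id
```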

Security

Authentication

  • JWT Bearer token required for all API endpoints
  • Token validation on every request
  • Session-based access control

Privacy

  • PII detection and optional redaction
  • Configurable data retention policies
  • Automatic cleanup of old recordings
  • GDPR-compliant data handling

Audit Logging

All voice operations are logged:

  • Who initiated the request
  • What operation was performed
  • When it occurred
  • What data was accessed/modified

Performance

STT Performance

Model    Speed     Accuracy    Memory
tiny     4x RT     Good        ~1 GB
base     1x RT     Better      ~1 GB
small    0.5x RT   High        ~2 GB
medium   0.3x RT   Very High   ~5 GB
large    0.2x RT   Best        ~10 GB

RT = Real-time (1x RT means 1 minute of audio takes 1 minute to process)
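The speed column translates directly into expected processing time. A quick sanity check using the numbers from the table above:

```python
def estimated_processing_seconds(audio_seconds: float, rt_speed: float) -> float:
    """Processing time given a model's real-time speed factor.

    A speed of 4x RT processes audio four times faster than real time;
    0.2x RT takes five times the audio duration.
    """
    return audio_seconds / rt_speed

# For a 10-minute call (600 s):
#   tiny  (4x RT)   -> 150 s
#   base  (1x RT)   -> 600 s
#   large (0.2x RT) -> 3000 s
```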

TTS Performance

  • ~50-100 characters/second synthesis
  • Real-time streaming capable
  • Low latency (<100ms for short phrases)

LLM Performance

  • Depends on model size and hardware
  • GPU acceleration recommended
  • ~20-50 tokens/second (typical)

Troubleshooting

Issue: STT model not found

Solution: Enable auto_download in the STT model configuration, or download the model manually:

wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin \
  -O models/ggml-base.bin

Issue: High latency for transcription

Solution: Use a smaller model (tiny or base) or enable GPU acceleration.

Issue: Poor transcription quality

Solution: Use a larger model (medium or large) and ensure the input audio is clean (low noise, adequate sample rate).


Enterprise Features

  • Horizontal Scaling - Distribute voice processing across nodes
  • High Availability - Redundant voice assistants
  • Advanced Analytics - Call analytics, sentiment analysis
  • Custom Voice Training - Train custom voices for your brand
  • Integration - Integrate with PBX systems, CRM, etc.

License Information

All core libraries used by the Voice Assistant are open source under the MIT License:

  • Whisper.cpp (STT) - MIT License
  • Piper TTS (TTS) - MIT License
  • llama.cpp (LLM) - MIT License
  • ONNX Runtime - MIT License

  • Suitable for commercial and on-premise use
  • No external API dependencies
  • Privacy-preserving (all processing is local)

→ Complete License Documentation


Support

For issues or questions:


License

Voice Assistant is an Enterprise Feature of ThemisDB.

  • Community Edition: Limited to basic STT/TTS functionality
  • Enterprise Edition: Full features including phone call recording, meeting protocols, and advanced LLM integration

See LICENSE for details.
