
ThemisDB Voice Assistant - Complete Guide

Version: 1.0
Status: Enterprise Feature
Author: ThemisDB Team
Date: December 2025


Overview

ThemisDB Voice Assistant provides natural language voice interaction capabilities similar to Alexa or Siri, integrated directly into the database. It combines Speech-to-Text (STT), Text-to-Speech (TTS), and Large Language Models (LLM) to enable:

  • Voice Commands - Query and control the database using natural language
  • Phone Call Recording - Automatic transcription and storage of phone calls
  • Meeting Protocol Generation - AI-powered meeting minutes and action items
  • Voice Assistant Conversations - Interactive voice-based assistance

All recordings and transcriptions are stored securely in ThemisDB with full revision control and audit trails (Enterprise feature).


Architecture

┌─────────────────────────────────────────────────────────┐
│                   Voice Assistant                        │
│  ┌───────────┐   ┌───────────┐   ┌─────────────┐       │
│  │    STT    │   │    TTS    │   │     LLM     │       │
│  │ (Whisper) │   │  (Piper)  │   │ (llama.cpp) │       │
│  └───────────┘   └───────────┘   └─────────────┘       │
└────────────────────┬────────────────────────────────────┘
                     │
         ┌───────────┴───────────┐
         │                       │
    ┌────▼────┐            ┌────▼────┐
    │   API   │            │   WS    │
    │/api/v1/ │            │  /ws/   │
    │ voice   │            │ voice   │
    └─────────┘            └─────────┘
         │                       │
         └───────────┬───────────┘
                     │
         ┌───────────▼───────────┐
         │  ThemisDB Storage     │
         │  - Base Entities      │
         │  - Revision Control   │
         │  - Audit Logs         │
         └───────────────────────┘

Features

1. Speech-to-Text (STT)

Powered by Whisper.cpp for high-accuracy transcription:

  • Multi-language support (100+ languages with auto-detection)
  • Timestamp generation for segments
  • Speaker diarization (identify different speakers)
  • Word-level confidence scores
  • Real-time streaming transcription

Supported Audio Formats:

  • MP3, WAV, OGG, FLAC, AAC, M4A, Opus, WMA

Model Sizes:

  • tiny - 39M params, fast, good for real-time
  • base - 74M params, balanced (default)
  • small - 244M params, better accuracy
  • medium - 769M params, high accuracy
  • large - 1550M params, best accuracy
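Before base64-encoding and uploading, a client can validate the file extension against the supported formats listed above. A minimal sketch — the helper name and extension set are illustrative, not part of the ThemisDB API:

```python
from pathlib import Path

# Extensions corresponding to the supported audio formats listed above.
SUPPORTED_AUDIO_EXTENSIONS = {
    ".mp3", ".wav", ".ogg", ".flac", ".aac", ".m4a", ".opus", ".wma",
}

def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension matches a supported STT format."""
    return Path(filename).suffix.lower() in SUPPORTED_AUDIO_EXTENSIONS
```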

2. Text-to-Speech (TTS)

Powered by Piper TTS for natural-sounding voice synthesis:

  • Multiple voice profiles (male/female, different accents)
  • Adjustable speed and pitch
  • Multiple output formats (WAV, MP3, OGG)
  • High-quality neural synthesis
  • Real-time streaming synthesis

Available Voices:

  • English (US, UK, Australian)
  • German
  • Spanish
  • French
  • And more...

3. LLM Integration

Uses llama.cpp for natural language understanding:

  • Conversation context management
  • Meeting summary generation
  • Key points extraction
  • Action items identification
  • Natural language query processing

Quick Start

1. Enable Voice Assistant

Edit config/voice_assistant.yaml:

voice_assistant:
  enabled: true
  
  stt:
    model_path: "./models/ggml-base.bin"
    model_size: "base"
    language: "auto"
  
  tts:
    model_path: "./models/tts-model.bin"
    voice: "default"
  
  llm:
    model_path: "./models/llama-2-7b-chat.gguf"
    n_ctx: 4096

2. Start ThemisDB Server

./themis_server --config config.yaml --enable-voice-assistant

3. Test Voice Command

curl -X POST http://localhost:8080/api/v1/voice/command \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "What is the total revenue this month?",
    "session_id": "user123"
  }'

API Reference

Base URL

http://localhost:8080/api/v1/voice

Authentication

All endpoints require Bearer token authentication:

Authorization: Bearer YOUR_JWT_TOKEN

Endpoints

1. Transcribe Audio

POST /api/v1/voice/transcribe

Convert audio to text.

Request:

{
  "audio_base64": "BASE64_ENCODED_AUDIO",
  "language": "auto",
  "timestamps": true,
  "speaker_diarization": false
}

Response:

{
  "success": true,
  "text": "Hello, this is a test transcription.",
  "language": "en",
  "confidence": 0.95,
  "duration_ms": 3000,
  "segments": [
    {
      "text": "Hello, this is a test transcription.",
      "start_ms": 0,
      "end_ms": 3000,
      "confidence": 0.95
    }
  ]
}
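The request body above can be assembled from raw audio bytes with a small helper. A sketch against the documented fields (the helper name and the commented `requests` usage are illustrative):

```python
import base64

def build_transcribe_payload(audio_bytes: bytes,
                             language: str = "auto",
                             timestamps: bool = True,
                             speaker_diarization: bool = False) -> dict:
    """Assemble the /api/v1/voice/transcribe request body shown above."""
    return {
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "language": language,
        "timestamps": timestamps,
        "speaker_diarization": speaker_diarization,
    }

# Usage (server URL and token are placeholders):
# requests.post("http://localhost:8080/api/v1/voice/transcribe",
#               headers={"Authorization": "Bearer YOUR_TOKEN"},
#               json=build_transcribe_payload(open("call.wav", "rb").read()))
```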

2. Synthesize Speech

POST /api/v1/voice/synthesize

Convert text to speech.

Request:

{
  "text": "Hello, how can I help you today?",
  "voice": "default",
  "speed": 1.0,
  "format": "wav",
  "return_base64": true
}

Response:

{
  "success": true,
  "audio_base64": "BASE64_ENCODED_AUDIO",
  "mime_type": "audio/wav",
  "duration_ms": 2500
}
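With `return_base64: true`, decoding the response into a playable file is straightforward. A sketch using the response fields shown above (the function name is illustrative):

```python
import base64

def save_synthesized_audio(response_json: dict, path: str) -> int:
    """Decode audio_base64 from a /synthesize response and write it to disk.

    Returns the number of bytes written.
    """
    audio = base64.b64decode(response_json["audio_base64"])
    with open(path, "wb") as f:
        f.write(audio)
    return len(audio)
```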

3. Process Voice Command

POST /api/v1/voice/command

Process a voice or text command with LLM.

Request (Text):

{
  "text": "Show me the top 10 customers by revenue",
  "session_id": "user123"
}

Request (Audio):

{
  "audio_base64": "BASE64_ENCODED_AUDIO",
  "session_id": "user123"
}

Response (Text):

{
  "success": true,
  "response": "Here are the top 10 customers by revenue...",
  "session_id": "user123"
}

Response (Audio):

{
  "success": true,
  "audio_base64": "BASE64_ENCODED_AUDIO",
  "mime_type": "audio/wav",
  "session_id": "user123"
}
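Since /command accepts either text or audio, a client can branch on the input type when building the request. A sketch (the helper name is illustrative; the field names follow the requests shown above):

```python
import base64
from typing import Optional

def build_command_request(session_id: str,
                          text: Optional[str] = None,
                          audio_bytes: Optional[bytes] = None) -> dict:
    """Build a /api/v1/voice/command body from either text or raw audio.

    Exactly one of text / audio_bytes must be given.
    """
    if (text is None) == (audio_bytes is None):
        raise ValueError("provide exactly one of text or audio_bytes")
    body = {"session_id": session_id}
    if text is not None:
        body["text"] = text
    else:
        body["audio_base64"] = base64.b64encode(audio_bytes).decode("ascii")
    return body
```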

4. Record Phone Call

POST /api/v1/voice/call/record

Record and transcribe a phone call.

Request:

{
  "audio_base64": "BASE64_ENCODED_AUDIO",
  "call_id": "call-12345",
  "caller": "+1234567890",
  "callee": "+0987654321",
  "start_time": 1703000000000,
  "end_time": 1703003600000,
  "call_type": "inbound",
  "custom_fields": {
    "department": "Sales",
    "category": "Support"
  }
}

Response:

{
  "success": true,
  "call_id": "call-12345",
  "transcript": "Full transcription text...",
  "language": "en",
  "confidence": 0.95,
  "duration_ms": 3600000,
  "segments": [...],
  "summary": "Customer called regarding...",
  "document_id": "recording:abc123",
  "metadata": {
    "caller": "+1234567890",
    "callee": "+0987654321",
    "call_type": "inbound"
  }
}

5. Generate Meeting Protocol

POST /api/v1/voice/meeting/protocol

Generate a structured meeting protocol from an audio recording.

Request:

{
  "audio_base64": "BASE64_ENCODED_AUDIO",
  "meeting_id": "meeting-789",
  "title": "Q4 Planning Meeting",
  "start_time": 1703000000000,
  "end_time": 1703007200000,
  "organizer": "john.doe@company.com",
  "participants": [
    "john.doe@company.com",
    "jane.smith@company.com",
    "bob.jones@company.com"
  ],
  "custom_fields": {
    "project": "Phoenix",
    "location": "Conference Room A"
  }
}

Response:

{
  "success": true,
  "meeting_id": "meeting-789",
  "title": "Q4 Planning Meeting",
  "transcript": "Full meeting transcript...",
  "summary": "The team discussed Q4 objectives...",
  "key_points": [
    "Launch new product in Q4",
    "Increase marketing budget by 20%",
    "Hire 3 new developers"
  ],
  "action_items": [
    {
      "description": "Prepare product launch plan",
      "status": "pending"
    },
    {
      "description": "Submit budget proposal",
      "status": "pending"
    }
  ],
  "segments": [...],
  "document_id": "recording:xyz789",
  "participants": [...],
  "duration_ms": 7200000
}

6. Get Available Voices

GET /api/v1/voice/voices

List available TTS voices.

Response:

{
  "voices": [
    {
      "id": "default",
      "name": "Default Voice",
      "language": "en",
      "gender": "neutral",
      "style": "professional"
    },
    {
      "id": "female_en",
      "name": "Female English",
      "language": "en",
      "gender": "female",
      "style": "friendly"
    }
  ]
}

7. Get Supported Languages

GET /api/v1/voice/languages

List supported languages for STT/TTS.

Response:

{
  "languages": [
    "en", "de", "es", "fr", "it", "pt", "ru", "zh", "ja", "ko"
  ]
}

8. Get Statistics

GET /api/v1/voice/stats

Get voice assistant statistics.

Response:

{
  "stt": {
    "transcriptions_completed": 1234,
    "total_audio_duration_ms": 3600000,
    "real_time_factor": 0.3
  },
  "tts": {
    "syntheses_completed": 567,
    "total_audio_duration_ms": 1800000
  },
  "llm": {
    "tokens_processed": 50000,
    "cache_hits": 1200,
    "avg_latency_ms": 150
  },
  "active_sessions": 5
}

9. Health Check

GET /api/v1/voice/health

Check voice assistant health.

Response:

{
  "status": "healthy",
  "voice_assistant": "available",
  "timestamp": 1703000000000
}
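A deployment script can treat the service as ready only when both fields report a healthy state. A minimal sketch against the response shown above (the helper name is illustrative):

```python
def is_ready(health_json: dict) -> bool:
    """Interpret a /api/v1/voice/health response: ready only when the
    service reports healthy and the voice assistant is available."""
    return (health_json.get("status") == "healthy"
            and health_json.get("voice_assistant") == "available")
```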

Use Cases

1. Phone Call Recording System

Record and transcribe customer support calls automatically:

import requests
import base64

# Read audio file
with open("call.mp3", "rb") as f:
    audio_data = f.read()
    audio_base64 = base64.b64encode(audio_data).decode()

# Record call
response = requests.post(
    "http://localhost:8080/api/v1/voice/call/record",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={
        "audio_base64": audio_base64,
        "call_id": "call-12345",
        "caller": "+1234567890",
        "callee": "+0987654321",
        "call_type": "inbound"
    }
)

result = response.json()
print(f"Transcript: {result['transcript']}")
print(f"Summary: {result['summary']}")
print(f"Document ID: {result['document_id']}")

2. Meeting Minutes Generation

Automatically generate meeting protocols:

import requests
import base64

# Read meeting recording
with open("meeting.wav", "rb") as f:
    audio_data = f.read()
    audio_base64 = base64.b64encode(audio_data).decode()

# Generate protocol
response = requests.post(
    "http://localhost:8080/api/v1/voice/meeting/protocol",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={
        "audio_base64": audio_base64,
        "meeting_id": "meeting-789",
        "title": "Sprint Planning",
        "participants": [
            "alice@company.com",
            "bob@company.com"
        ]
    }
)

result = response.json()
print(f"Summary: {result['summary']}")
print(f"Key Points: {result['key_points']}")
print(f"Action Items: {result['action_items']}")

3. Voice-Controlled Database Queries

Query the database using natural language:

import requests

response = requests.post(
    "http://localhost:8080/api/v1/voice/command",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={
        "text": "Show me the total sales for last month",
        "session_id": "user123"
    }
)

result = response.json()
print(f"Response: {result['response']}")

Configuration

STT Configuration

stt:
  model:
    path: "./models/ggml-base.bin"
    size: "base"  # tiny, base, small, medium, large
    auto_download: true
  
  transcription:
    language: "auto"
    timestamps: true
    timestamp_granularity: "segment"
    word_confidence: false
  
  speaker_diarization:
    enabled: false
    num_speakers: 0  # 0 = auto-detect
  
  vad:
    enabled: true
    threshold: 0.5

TTS Configuration

tts:
  model:
    path: "./models/tts-model.bin"
    engine: "piper"
  
  synthesis:
    sample_rate: 22050
    speed: 1.0
    pitch: 1.0
    normalize: true
  
  output:
    format: "wav"
    quality: "medium"

LLM Configuration

llm:
  model_path: "./models/llama-2-7b-chat.gguf"
  n_ctx: 4096
  n_gpu_layers: 0  # 0 = CPU only
  temperature: 0.7
  top_p: 0.9

Storage and Revision Control

All recordings and transcriptions are stored in ThemisDB with:

  • Revision Control - Track changes over time
  • Audit Logs - Who accessed/modified what and when
  • Encryption - At-rest encryption for sensitive data
  • Compression - Automatic audio compression (OGG/MP3)
  • Metadata - Rich metadata for search and retrieval

Storage Path:

data/voice_recordings/
  ├── calls/
  │   ├── call-12345/
  │   │   ├── audio.ogg
  │   │   ├── transcript.txt
  │   │   └── metadata.json
  │   └── ...
  └── meetings/
      ├── meeting-789/
      │   ├── audio.ogg
      │   ├── protocol.md
      │   └── metadata.json
      └── ...
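The layout above maps each recording ID to its own directory. A sketch of that convention (the root path and function are illustrative, derived from the tree above, not part of the ThemisDB API):

```python
from pathlib import Path

VOICE_ROOT = Path("data/voice_recordings")

def recording_dir(kind: str, recording_id: str) -> Path:
    """Directory for a recording, following the layout shown above.

    kind is 'calls' or 'meetings'.
    """
    if kind not in ("calls", "meetings"):
        raise ValueError(f"unknown recording kind: {kind}")
    return VOICE_ROOT / kind / recording_id
```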

Security

Authentication

  • JWT Bearer token required for all API endpoints
  • Token validation on every request
  • Session-based access control

Privacy

  • PII detection and optional redaction
  • Configurable data retention policies
  • Automatic cleanup of old recordings
  • GDPR-compliant data handling

Audit Logging

All voice operations are logged:

  • Who initiated the request
  • What operation was performed
  • When it occurred
  • What data was accessed/modified

Performance

STT Performance

Model    Speed     Accuracy    Memory
tiny     4x RT     Good        ~1 GB
base     1x RT     Better      ~1 GB
small    0.5x RT   High        ~2 GB
medium   0.3x RT   Very High   ~5 GB
large    0.2x RT   Best        ~10 GB

RT = Real-time (1x RT means 1 minute of audio takes 1 minute to process)
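The speed column translates directly into expected processing time. A quick sanity check using the numbers from the table above:

```python
def estimated_processing_seconds(audio_seconds: float, rt_speed: float) -> float:
    """Processing time given a model's real-time speed factor.

    A speed of 4x RT processes audio four times faster than real time;
    0.2x RT takes five times the audio duration.
    """
    return audio_seconds / rt_speed

# For a 10-minute call (600 s):
#   tiny  (4x RT)   -> 150 s
#   base  (1x RT)   -> 600 s
#   large (0.2x RT) -> 3000 s
```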

TTS Performance

  • ~50-100 characters/second synthesis
  • Real-time streaming capable
  • Low latency (<100ms for short phrases)

LLM Performance

  • Depends on model size and hardware
  • GPU acceleration recommended
  • ~20-50 tokens/second (typical)

Troubleshooting

Issue: STT model not found

Solution: Enable auto_download in the STT model configuration, or download the model manually:

wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin \
  -O models/ggml-base.bin

Issue: High latency for transcription

Solution: Use a smaller model (tiny or base) or enable GPU acceleration.

Issue: Poor transcription quality

Solution: Use a larger model (medium or large) and ensure the input audio is clean (low noise, adequate sample rate).


Enterprise Features

  • Horizontal Scaling - Distribute voice processing across nodes
  • High Availability - Redundant voice assistants
  • Advanced Analytics - Call analytics, sentiment analysis
  • Custom Voice Training - Train custom voices for your brand
  • Integration - Integrate with PBX systems, CRM, etc.

License Information

All core libraries used by the Voice Assistant are open source under the MIT License:

  • Whisper.cpp (STT) - MIT License
  • Piper TTS (TTS) - MIT License
  • llama.cpp (LLM) - MIT License
  • ONNX Runtime - MIT License

  • Suitable for commercial and on-premise use
  • No external API dependencies
  • Privacy-preserving (all processing is local)

→ Complete License Documentation


Support

For issues or questions:


License

Voice Assistant is an Enterprise Feature of ThemisDB.

  • Community Edition: Limited to basic STT/TTS functionality
  • Enterprise Edition: Full features including phone call recording, meeting protocols, and advanced LLM integration

See LICENSE for details.
