Grafana Integration for ThemisDB

Comprehensive monitoring and visualization for ThemisDB - the LLM subsystem and SIEM security monitoring.

Overview

This Grafana integration provides real-time monitoring for:

LLM/llama.cpp Monitoring

  • Inference performance (latency, throughput, tokens/sec)
  • GPU metrics (memory, utilization, temperature)
  • Model management (loaded models, memory usage)
  • Cache performance (hit rates, efficiency)
  • Scheduler status (queue length, batch size, preemptions)
  • Error tracking (error rate, error types)

SIEM Security Monitoring (NEW)

  • Authentication & authorization (failed logins, privilege escalation, rate limiting)
  • Audit & security events (CRUD, admin actions, policy checks, security incidents)
  • Query & performance (error rate, slow queries, request rate, cache hit rate)
  • Infrastructure (CPU, memory, storage, network, replication)
  • Compliance (SOC2, GDPR, HIPAA)

Directory Structure

grafana/
├── dashboards/
│   ├── themisdb-llm-dashboard.json           # LLM Monitoring Dashboard
│   └── sla-monitoring.json
├── siem-security-monitoring.json             # SIEM Security Dashboard (NEW)
├── alerts/
│   ├── graph_security.yaml                   # Graph Security Alerts
│   └── siem_security_alerts.yaml             # SIEM Security Alerts (NEW)
├── provisioning/
│   ├── datasources/
│   │   └── prometheus.yml                    # Prometheus Datasource
│   ├── dashboards/
│   │   └── dashboards.yml                    # Dashboard Provisioning
│   └── alerts.yml                            # Alert Rules
├── compliance_exporter.py                    # Compliance Report Generator (NEW)
├── COMPLIANCE_EXPORTER_README.md             # Compliance Exporter Documentation (NEW)
├── docker-compose.yml                        # Docker Setup
├── prometheus.yml                            # Prometheus Configuration
└── README.md                                 # This file

Quick Start

Option 1: Docker Compose (Recommended)

cd grafana
docker-compose up -d

Open in your browser: Grafana at http://localhost:3000 (admin / admin), Prometheus at http://localhost:9090.

Option 2: Manual Installation

1. Prometheus Setup

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvfz prometheus-*.tar.gz
cd prometheus-*

# Create config
cat > prometheus.yml <<EOF
global:
  scrape_interval: 5s
  evaluation_interval: 5s

scrape_configs:
  - job_name: 'themisdb-llm'
    static_configs:
      - targets: ['localhost:9091']
    metrics_path: '/metrics'
    
rule_files:
  - 'alerts.yml'
EOF

# Copy alert rules
cp ../provisioning/alerts.yml .

# Start Prometheus
./prometheus --config.file=prometheus.yml

2. Grafana Setup

# Install Grafana
sudo apt-get install -y software-properties-common
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
sudo apt-get update
sudo apt-get install grafana

# Copy provisioning files
sudo cp -r provisioning/* /etc/grafana/provisioning/
sudo cp dashboards/*.json /etc/grafana/provisioning/dashboards/

# Start Grafana
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

3. Enable ThemisDB Metrics

#include "llm/grafana_metrics.h"

using namespace themis::llm::monitoring;

// Initialize
PrometheusExporter exporter;
LLMMetricsCollector metrics(&exporter);

// Start Metrics Server
MetricsServer::ServerConfig config;
config.port = 9091;  // ThemisDB metrics port
MetricsServer server(config, &exporter);
server.start();

// Record metrics
metrics.recordInferenceRequest("mistral-7b");
metrics.recordFirstTokenLatency("mistral-7b", 72.5);
metrics.recordGPUMemoryUsage(4096, 24576);

Dashboard Import

Auto-Import (with Provisioning)

The dashboard is loaded automatically when Grafana starts with the provisioning configuration.

Manual Import

  1. Open Grafana (http://localhost:3000)
  2. Log in: admin / admin
  3. Go to Dashboards → Import
  4. Upload dashboards/themisdb-llm-dashboard.json
  5. Select Prometheus as the datasource
  6. Click Import

Available Metrics

Inference Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| llm_inference_requests_total | Counter | Total inference requests | model_id |
| llm_inference_duration_ms | Histogram | Inference duration in ms | model_id |
| llm_inference_failures_total | Counter | Failed requests | model_id, error |

Latency Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| llm_first_token_latency_ms | Histogram | Time to first token | model_id |
| llm_per_token_latency_ms | Histogram | Latency per token | model_id |

Throughput Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| llm_tokens_generated_total | Counter | Generated tokens | model_id |
| llm_batch_size | Gauge | Current batch size | - |
| llm_concurrent_requests | Gauge | Concurrent requests | - |

GPU Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| llm_gpu_memory_used_mb | Gauge | GPU memory used (MB) | - |
| llm_gpu_memory_total_mb | Gauge | GPU memory total (MB) | - |
| llm_gpu_utilization_pct | Gauge | GPU utilization (%) | - |
| llm_gpu_temperature_celsius | Gauge | GPU temperature (°C) | - |

Model Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| llm_models_loaded | Gauge | Number of loaded models | - |
| llm_model_memory_mb | Gauge | Memory per model (MB) | model_id |

Cache Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| llm_cache_hits_total | Counter | Cache hits | cache_type |
| llm_cache_misses_total | Counter | Cache misses | cache_type |

Scheduler Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| llm_scheduler_queue_length | Gauge | Queue length | - |
| llm_scheduler_preemptions_total | Counter | Preemptions | - |

Error Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| llm_errors_total | Counter | Errors by type | error_type, component |

Dashboard Panels

1. Inference Requests per Second

  • Query: rate(llm_inference_requests_total[1m])
  • Shows: request rate per model
  • Alert: >100 req/s

2. Latency Distribution

  • Query: histogram_quantile(0.95, rate(llm_first_token_latency_ms_bucket[5m]))
  • Shows: P50, P95, P99 first-token latency

3. GPU Memory Usage

  • Query: llm_gpu_memory_used_mb / llm_gpu_memory_total_mb * 100
  • Shows: GPU memory utilization (%)

4. Throughput

  • Query: rate(llm_tokens_generated_total[1m])
  • Shows: tokens/sec per model

5. Cache Hit Rate

  • Query: rate(llm_cache_hits_total[5m]) / (rate(llm_cache_hits_total[5m]) + rate(llm_cache_misses_total[5m])) * 100
  • Shows: cache efficiency

PromQL Examples

Average Latency (5 min)

avg(rate(llm_first_token_latency_ms_sum[5m]))
  /
avg(rate(llm_first_token_latency_ms_count[5m]))

Success Rate

(1 - (rate(llm_inference_failures_total[5m])
      /
      rate(llm_inference_requests_total[5m]))) * 100

GPU Memory Utilization

(llm_gpu_memory_used_mb / llm_gpu_memory_total_mb) * 100

Tokens per Request

rate(llm_tokens_generated_total[1m])
  /
rate(llm_inference_requests_total[1m])
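The ratio queries above are ordinary arithmetic on counter rates. A minimal Python sketch that mirrors the success-rate and tokens-per-request calculations (the sample rate values are hypothetical, not taken from a real scrape):

```python
# Hypothetical per-second rates, as returned by rate(...[5m]) in Prometheus.
failure_rate = 0.5      # rate(llm_inference_failures_total[5m])
request_rate = 25.0     # rate(llm_inference_requests_total[5m])
token_rate = 2000.0     # rate(llm_tokens_generated_total[1m])

# Success rate: share of requests that did not fail, in percent.
success_rate_pct = (1 - failure_rate / request_rate) * 100

# Tokens per request: token throughput divided by request throughput.
tokens_per_request = token_rate / request_rate

print(f"success rate: {success_rate_pct:.1f}%")     # 98.0%
print(f"tokens/request: {tokens_per_request:.0f}")  # 80
```

This is the same math Grafana performs when it evaluates the PromQL expressions, just on fixed numbers instead of time series.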

Alerts

Preconfigured alerts in provisioning/alerts.yml:

Latency Alerts

  • HighFirstTokenLatency: P95 > 100 ms for 5 min (Warning)
  • CriticalFirstTokenLatency: P95 > 200 ms for 2 min (Critical)

Error Alerts

  • HighErrorRate: >5% for 5 min (Warning)
  • CriticalErrorRate: >10% for 2 min (Critical)

GPU Alerts

  • GPUMemoryHigh: >85% for 5 min (Warning)
  • GPUMemoryCritical: >95% for 2 min (Critical)
  • GPUTemperatureHigh: >80°C for 5 min (Warning)
  • GPUTemperatureCritical: >90°C for 1 min (Critical)

System Alerts

  • LowThroughput: <100 tokens/sec for 10 min (Warning)
  • HighQueueLength: >50 requests for 5 min (Warning)
  • NoInferenceRequests: no requests for 10 min (Warning)
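As an illustration of the Prometheus rule format, a rule along the lines of HighFirstTokenLatency could look as follows; the exact expression and labels in provisioning/alerts.yml may differ:

```yaml
groups:
  - name: llm_latency_alerts
    rules:
      - alert: HighFirstTokenLatency
        # Assumed expression: P95 over the standard histogram buckets.
        expr: histogram_quantile(0.95, rate(llm_first_token_latency_ms_bucket[5m])) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 first-token latency above 100 ms"
```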

Integration with Code

LazyModelLoader

void LazyModelLoader::loadModelInternal(...) {
    metrics_->recordModelLoaded(model_id, vram_mb);
}

void LazyModelLoader::unloadModel(...) {
    metrics_->recordModelUnloaded(model_id);
}

ContinuousBatchScheduler

void ContinuousBatchScheduler::scheduleNextBatch() {
    auto start = std::chrono::steady_clock::now();
    // ... scheduling ...
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start
    ).count();
    
    metrics_->recordSchedulingLatency(duration);
    metrics_->recordQueueLength(waiting_queue_.size());
    metrics_->recordBatchSize(batch.size());
}

GPU Memory Manager

void GPUMemoryManager::allocateGPU(...) {
    // ... allocation ...
    auto stats = getStats();
    metrics_->recordGPUMemoryUsage(
        stats.used_vram_bytes / (1024*1024),
        stats.total_vram_bytes / (1024*1024)
    );
}

Troubleshooting

Metrics Not Showing Up

# Check metrics endpoint
curl http://localhost:9091/metrics

# Check Prometheus targets
open http://localhost:9090/targets

# Check ThemisDB logs
docker logs themisdb -f
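When curl does return output, it can help to check programmatically that the expected series are present. A minimal sketch that parses the Prometheus text exposition format (the sample payload below is hypothetical; it also assumes no spaces inside label values):

```python
def parse_metrics(text):
    """Parse Prometheus text exposition format into {sample_name: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE/comment lines
            continue
        # The value is the last space-separated token; everything before it
        # is the metric name plus its label set.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

# Hypothetical scrape payload, e.g. from: curl http://localhost:9091/metrics
payload = """\
# HELP llm_inference_requests_total Total inference requests
# TYPE llm_inference_requests_total counter
llm_inference_requests_total{model_id="mistral-7b"} 1234
llm_gpu_memory_used_mb 4096
"""
samples = parse_metrics(payload)
print(samples["llm_gpu_memory_used_mb"])  # 4096.0
```

If a series you expect is missing here, the problem is on the exporter side, not in Grafana.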

Dashboard Shows No Data

  1. Check the datasource: Configuration → Data Sources → Prometheus
  2. Test the connection: Save & Test should report "Data source is working"
  3. Check the time range: e.g. last 1 hour
  4. Check the query in panel edit mode

Alerts Not Firing

# Check Prometheus rules
open http://localhost:9090/rules

# Check Alertmanager
open http://localhost:9093

# Validate alert syntax
promtool check rules provisioning/alerts.yml

Best Practices

1. Scrape Interval

  • Recommended: 5-10 seconds
  • Config: scrape_interval: 5s in prometheus.yml

2. Retention

  • Recommended: 15 days
  • Config: --storage.tsdb.retention.time=15d
  • Disk: ~50 MB/day → ~750 MB for 15 days

3. Label Cardinality

  • Limit: keep model_id under 100 unique values
  • Avoid: free-form strings as labels (use a fixed set of values)
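One way to enforce this is to map free-form identifiers onto a fixed label set before recording a metric. A hypothetical sketch (the allow-list and the "other" fallback are assumptions, not part of ThemisDB's API):

```python
# Fixed allow-list keeps model_id cardinality bounded (hypothetical values).
ALLOWED_MODEL_IDS = {"mistral-7b", "llama-3-8b", "phi-3-mini"}

def safe_model_label(model_id: str) -> str:
    """Return model_id if allow-listed, otherwise a fixed fallback value,
    so arbitrary input can never create new label values."""
    return model_id if model_id in ALLOWED_MODEL_IDS else "other"

print(safe_model_label("mistral-7b"))         # mistral-7b
print(safe_model_label("user-uploaded-xyz"))  # other
```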

4. Recording Rules

For frequently used queries:

groups:
  - name: llm_recording_rules
    interval: 30s
    rules:
      - record: llm:inference_rate:1m
        expr: rate(llm_inference_requests_total[1m])
      
      - record: llm:success_rate:5m
        expr: (1 - (rate(llm_inference_failures_total[5m]) / rate(llm_inference_requests_total[5m]))) * 100

SIEM Security Monitoring (NEW)

Available SIEM Dashboards

ThemisDB SIEM Security Monitoring (siem-security-monitoring.json)

  • Comprehensive security dashboard for SOC teams
  • 4 main areas: Authentication, Audit Events, Query Performance, Infrastructure
  • Real-time threat detection
  • Compliance mapping (SOC2, GDPR, HIPAA)

Documentation

  • Integration guide: docs/en/observability/siem_integration.md (English)
  • Integration guide: docs/de/observability/siem_integration.md (German)
  • User guide: docs/en/observability/siem_dashboard_user_guide.md
  • Compliance exporter: COMPLIANCE_EXPORTER_README.md
SIEM-Relevant Alerts

All SIEM alerts are defined in alerts/siem_security_alerts.yaml:

Critical alerts:

  • BruteForceAttackDetected - multiple failed logins
  • PrivilegeEscalationDetected - unauthorized privilege changes
  • UnauthorizedDataExport - data exfiltration
  • AuditLogTamperingAttempt - audit log manipulation

Compliance alerts:

  • GDPRDataRetentionViolation - GDPR data retention
  • SOC2AuditLogGapDetected - SOC2 audit gaps
  • EncryptionKeyRotationOverdue - key rotation overdue
  • BackupFailure - backup failures

Compliance Reporting

Generate automated compliance reports:

# SOC2 report for the last 30 days (PDF)
python3 compliance_exporter.py --framework soc2 --period 30d

# GDPR report (JSON)
python3 compliance_exporter.py --framework gdpr --period 7d --format json

# HIPAA report (CSV)
python3 compliance_exporter.py --framework hipaa --period 90d --format csv

See COMPLIANCE_EXPORTER_README.md for details.

Integration with SIEM Systems

Splunk:

pip install prometheus-splunk-exporter
# See docs/en/observability/siem_integration.md for configuration

ELK Stack:

  • Logstash configuration available
  • See docs/en/observability/siem_integration.md

Syslog (RFC 5424):

  • Native support in ThemisDB
  • Structured events with compliance context
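For reference, an RFC 5424 message with structured data has the shape sketched below; the code only formats the string, and the enterprise number, SD-ID, and field names are illustrative assumptions, not ThemisDB's actual output:

```python
def rfc5424_message(pri, timestamp, host, app, msgid, sd, msg):
    """Format an RFC 5424 syslog line: header, structured data, free-text message."""
    sd_part = "[{} {}]".format(
        sd["id"], " ".join(f'{k}="{v}"' for k, v in sd["params"].items())
    )
    # VERSION is always 1; PROCID is left out ("-") in this sketch.
    return f"<{pri}>1 {timestamp} {host} {app} - {msgid} {sd_part} {msg}"

line = rfc5424_message(
    pri=134,  # facility local0 (16*8) + severity informational (6)
    timestamp="2024-01-01T12:00:00Z",
    host="themisdb-host",
    app="themisdb",
    msgid="AUDIT",
    sd={"id": "audit@32473", "params": {"user": "alice", "action": "export"}},
    msg="data export requested",
)
print(line)
```

The structured-data block is what carries the compliance context, so a SIEM can filter on fields like user or action without parsing the free-text part.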

Support