Author: Mr. Watson 🦄 Date: 2026-02-19
Monitor and manage GPU-intensive services (Whisper, RAG, Qwen3-TTS) with automatic lazy-loading and manual control to avoid VRAM exhaustion.
- GPU: NVIDIA RTX 2070 Super 8GB (via eGPU, USB-C/Thunderbolt)
- Total VRAM: 8192 MiB
- Services can NOT run simultaneously: Combined VRAM usage exceeds capacity
VRAM usage per service (approximate):
- Whisper (transcription + diarization): ~3.5 GB
- RAG library (embeddings + reranker): ~1.2 GB
- Qwen3-TTS 1.7B (voice cloning): ~4.3 GB
Combined: ~9 GB → exceeds 8 GB VRAM
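The arithmetic above can be checked with a short sketch. The figures are the approximate per-service values listed above. The dictionary keys, the function names, and treating 8192 MiB as a flat 8.0 GB are illustrative assumptions, not part of the actual tooling.

```python
# Approximate per-service VRAM figures from the list above (GB).
VRAM_GB = {"whisper": 3.5, "rag": 1.2, "tts": 4.3}
CAPACITY_GB = 8.0  # assumption: 8192 MiB treated as 8 GB


def combined_gb(services):
    """Sum the approximate VRAM footprint of the named services."""
    return round(sum(VRAM_GB[name] for name in services), 1)


def fits(services):
    """True if the combined footprint is within the card's capacity."""
    return combined_gb(services) <= CAPACITY_GB


total = combined_gb(VRAM_GB)
print(f"all three: {total} GB, fits: {fits(VRAM_GB)}")
```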
GPU service behavior:
- whisper-web.service (port 8060, /whisper endpoint)
  - ✨ Auto-loading: Frontend always active, GPU model loads on first job
  - Auto-unloads after 120 seconds of inactivity
  - Auto-starts on boot
- qwen3-tts.service (port 8070, /tts endpoint)
  - ✨ Auto-loading: Frontend always active, GPU model loads on first job
  - Auto-unloads after 120 seconds of inactivity
  - Auto-starts on boot
- rag-library-ingest.service (SFTP inbox watcher)
  - ⚠️ Manual control: Must be started/stopped manually with gpu-service
  - Does NOT auto-start on boot
  - Runs continuously when active (no auto-unload)
How it works:
- Service always running: FastAPI frontend available 24/7
- GPU model lazy-loads: Only loaded when first job arrives in queue
- Auto-unload on idle: After 120 seconds with no jobs, model is unloaded and VRAM freed
- Failsafe: If the GPU runs out of memory (OOM) during load, the job fails with a clear error message
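The lazy-load and idle-unload behavior described above could be sketched roughly as follows. This is a minimal illustration, not the services' actual code: `LazyModel`, `loader`, `unloader`, and the injectable `clock` are hypothetical names, and the real workers may be structured differently.

```python
import threading
import time


class LazyModel:
    """Load a model on first use and unload it after an idle period.

    `loader` and `unloader` are hypothetical callables standing in for the
    real setup/teardown (e.g. moving weights onto and off the GPU).
    """

    def __init__(self, loader, unloader, idle_timeout=120.0, clock=time.monotonic):
        self._loader = loader
        self._unloader = unloader
        self._idle_timeout = idle_timeout
        self._clock = clock
        self._lock = threading.Lock()  # jobs are processed one at a time
        self._model = None
        self._last_used = None

    @property
    def loaded(self):
        return self._model is not None

    def run(self, job):
        """Process one job, loading the model first if necessary."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()  # lazy load on first job
            result = self._model(job)
            self._last_used = self._clock()
            return result

    def unload_if_idle(self):
        """Unload the model if no job finished within the idle timeout.

        Meant to be called periodically from the worker loop.
        Returns True if the model was unloaded.
        """
        with self._lock:
            if self._model is None:
                return False
            if self._clock() - self._last_used < self._idle_timeout:
                return False
            self._unloader(self._model)
            self._model = None
            return True
```

The lock serializes jobs, which matches the sequential queue processing described above. Passing the clock in as a parameter makes the 120-second timeout easy to exercise without waiting.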
Example timeline:
00:00 - User visits https://beachlab.org/whisper/
00:01 - User uploads audio and clicks "Transcribe"
00:02 - Worker thread detects queued job
00:03 - GPU model begins loading (~10-20s first time)
00:22 - Model loaded, transcription starts
00:45 - Job completes, marked as 'done'
02:45 - No new jobs for 120s → model unloads, VRAM freed
Benefits:
- No 502 errors (frontend always available)
- No manual service management needed
- Efficient VRAM usage (only allocated when needed)
- Multiple users can queue jobs (processed sequentially)
/usr/local/bin/gpu-service: CLI tool for monitoring and manual control
gpu-service status
Output:
- Service states (active/inactive)
- GPU memory usage per process
- Total VRAM used/available
gpu-service start whisper
gpu-service start rag
gpu-service start tts
Important: Only start ONE service at a time.
gpu-service stop whisper
gpu-service stop rag
gpu-service stop tts
Stop all:
gpu-service stop all
To switch from one GPU service to another:
gpu-service stop whisper
gpu-service start tts
Wait 2-3 seconds between stop and start for VRAM cleanup.
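The stop, wait, start sequence can be wrapped in a small helper. This is a sketch under assumptions: `switch_service` is a hypothetical name, it assumes `gpu-service` is on the PATH and exits non-zero on failure, and the 3-second default simply takes the upper end of the pause suggested above.

```python
import subprocess
import time


def switch_service(old, new, delay=3.0, run=subprocess.run, sleep=time.sleep):
    """Stop `old`, pause for VRAM cleanup, then start `new`.

    `run` and `sleep` are injectable so the sequence can be dry-run.
    """
    run(["gpu-service", "stop", old], check=True)
    sleep(delay)  # give the driver a moment to release VRAM
    run(["gpu-service", "start", new], check=True)
```

Calling `switch_service("whisper", "tts")` would mirror the two commands shown above. With `check=True`, a failed stop raises before the start is attempted.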
Transcription job (automatic):
- Navigate to https://beachlab.org/whisper/
- Upload audio and submit job
- GPU model loads automatically (first job may take 10-20s)
- Wait for job to complete
- Download transcript
- Model auto-unloads after 2 minutes of inactivity
Voice cloning (automatic):
- Navigate to https://beachlab.org/tts/
- Upload reference audio + enter text
- GPU model loads automatically (first job may take 10-20s)
- Wait for generation to complete
- Download wav file
- Model auto-unloads after 2 minutes of inactivity
eBook indexing (manual):
- Check GPU status: gpu-service status
- If Whisper/TTS are idle, proceed. If not, wait or use gpu-service stop all
- gpu-service start rag
- Upload PDFs/EPUBs via SFTP to /home/sftpuser/library_inbox
- Monitor logs: journalctl -u rag-library-ingest -f
- gpu-service stop rag (when inbox is empty)
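As a complement to the status check in the first step, free VRAM can also be read directly from nvidia-smi. The sketch below is an assumption-laden illustration: the function names and the 1229 MiB threshold (about 1.2 GB, the approximate RAG footprint listed earlier) are hypothetical, and a free-memory check is a coarser signal than gpu-service status, which reports per-process usage.

```python
import subprocess

RAG_VRAM_MIB = 1229  # assumption: ~1.2 GB, the approximate RAG footprint


def parse_memory(csv_line):
    """Parse 'used, total' (MiB) from one nvidia-smi CSV line."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total


def query_memory():
    """Return (used_mib, total_mib) for the first GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory(out.strip().splitlines()[0])


def has_room(used_mib, total_mib, needed_mib=RAG_VRAM_MIB):
    """True if the free VRAM covers the requested amount."""
    return total_mib - used_mib >= needed_mib
```

For example, `has_room(*query_memory())` returns False when another service still holds most of the card.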
Automatic (Whisper/TTS):
If you submit a job and GPU memory is full:
- Job will be marked as failed
- Error message: "GPU memory full. Please stop other GPU services (gpu-service stop all) and try again."
- Check gpu-service status to see what's using VRAM
- Stop conflicting service or wait for auto-unload (120s idle)
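The failure path above might look roughly like this inside a worker. It is a sketch, not the services' real code: `process_job`, the dict-based job, and `load_model` are hypothetical. The default `MemoryError` is a stand-in so the example runs without a GPU; with PyTorch the exception to catch would typically be `torch.cuda.OutOfMemoryError`.

```python
GPU_FULL_MESSAGE = (
    "GPU memory full. Please stop other GPU services "
    "(gpu-service stop all) and try again."
)


def process_job(job, load_model, oom_errors=(MemoryError,)):
    """Run one job; on an out-of-memory error, mark it failed instead of crashing.

    `job` is a plain dict with an "input" key and `load_model` is a
    hypothetical callable that returns the model (itself a callable).
    """
    try:
        model = load_model()
        job["result"] = model(job["input"])
        job["status"] = "done"
    except oom_errors:
        job["status"] = "failed"
        job["error"] = GPU_FULL_MESSAGE
    return job
```

Catching only the out-of-memory exception keeps other failures visible rather than mislabeling them as a VRAM conflict.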
Manual (RAG):
Before starting RAG, check for conflicts:
gpu-service status
If Whisper or TTS are using GPU:
- Wait for auto-unload (check logs for "unloading model" message)
- Or force stop: gpu-service stop all
Then start RAG:
gpu-service start rag

To stop all GPU services directly through systemd:
sudo systemctl stop whisper-web rag-library-ingest qwen3-tts
Or kill GPU processes directly (last resort):
sudo pkill -9 -f "whisper-service|rag-library|qwen3-tts"

- VRAM limit: 8 GB is not enough to run all three services simultaneously
- Sporadic use: Whisper, RAG, and TTS are used infrequently, not 24/7
- Resource efficiency: GPU idle when not needed
- User experience: Frontends always accessible, no manual service management needed
Design decisions:
- ✅ Auto-loading (Whisper/TTS): Frontend always available, GPU loads on demand
  - No CUDA OOM on startup (model loads when first job arrives)
  - Auto-unload after idle timeout (frees VRAM for other services)
  - Failsafe: if GPU memory full, job fails with clear message
- ⚠️ Manual control (RAG): Continuous processing when active
  - No auto-unload (watcher runs continuously until stopped)
  - Requires explicit gpu-service start rag before use
  - Prevents unexpected VRAM usage when uploading large batches
Alternative approaches considered but rejected:
- ❌ Smaller models: Qwen3-TTS 0.6B has noticeably lower quality
- ❌ Shared VRAM pool: Not supported by PyTorch/CUDA without full model unloading
- ❌ Always-on all services: Exceeds 8GB VRAM capacity
Context: The Razer Core X PSU died on 2026-03-04. The following services were disabled to avoid continuous errors and freezes.
| Service | State |
|---|---|
| whisper-web | disabled |
| qwen3-tts | disabled |
| comfyui | disabled |
| nvidia-persistenced | disabled |
| egpu-watchdog | disabled |
| Telegraf inputs.nvidia_smi | commented out |
1. Verify the GPU is visible:
lspci | grep -i nvidia
nvidia-smi
If nvidia-smi fails, load the driver manually:
sudo modprobe nvidia
nvidia-smi # should show the GPU without ERR!
2. Re-enable GPU services:
sudo systemctl enable --now nvidia-persistenced
sudo systemctl enable --now whisper-web
sudo systemctl enable --now qwen3-tts
sudo systemctl enable --now comfyui
sudo systemctl enable --now egpu-watchdog.timer
3. Re-enable Telegraf monitoring:
Edit /etc/telegraf/telegraf.d/nuc-timescale.conf and uncomment:
[[inputs.nvidia_smi]]
bin_path = "/usr/bin/nvidia-smi"
timeout = "5s"
Then:
sudo systemctl restart telegraf
sudo journalctl -u telegraf -n 10 --no-pager | grep -E "Error|nvidia"
4. Verify telemetry:
DRY_RUN=true bash /home/pink/.openclaw/workspace/scripts/publish_telemetry.sh | python3 -m json.tool | grep gpu
The gpu field should show real temp/util values instead of null.
5. Quick service test:
curl -s http://localhost:8060/health # whisper-web
curl -s http://localhost:8070/health # qwen3-tts
curl -s http://localhost:8188/ # comfyui
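The three curl checks above can also be run as one scripted pass. This sketch uses only the standard library. The URLs are the ones listed above, while `check`, `check_all`, the 5-second timeout, and treating HTTP 200 as healthy are assumptions rather than documented behavior of the services.

```python
import urllib.request

ENDPOINTS = {
    "whisper-web": "http://localhost:8060/health",
    "qwen3-tts": "http://localhost:8070/health",
    "comfyui": "http://localhost:8188/",
}


def check(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def check_all(opener=urllib.request.urlopen):
    """Map each service name to whether its endpoint responded."""
    return {name: check(url, opener=opener) for name, url in ENDPOINTS.items()}
```

Printing `check_all()` after re-enabling the services gives a quick view of which frontends are back up.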