VoxScribe is a lightweight, unified platform for testing and comparing multiple open-source speech-to-text (STT) models through a single interface. Born from real-world enterprise challenges where proprietary STT solutions become prohibitively expensive at scale, VoxScribe democratizes access to cutting-edge open-source alternatives.
Startups transcribing speech at scale face a common dilemma: cost vs. control. A contact center processing 100,000 hours of calls monthly can easily spend $150,000+ on transcription alone. While open-source STT models like Whisper, Voxtral, Parakeet, and Canary-Qwen now rival proprietary solutions in accuracy, evaluating them has been a nightmare:
- Dependency Hell: Conflicting library versions between models (e.g., transformers version conflicts between Voxtral and NeMo models)
- Inconsistent APIs: Each model requires a different integration approach
- Complex Setup: Hours or days spent managing CUDA drivers, Python environments, and debugging
- Limited Comparison: No unified way to test multiple models against your specific use cases
- Unified Interface: Test 5+ open-source STT models through a single FastAPI backend and clean web UI
- Dependency Management: Handles version conflicts and library incompatibilities automatically
- Side-by-Side Comparison: Upload audio and compare transcriptions across multiple models
- Model Caching: Intelligent caching for faster subsequent runs
- Clean API: RESTful endpoints for easy integration into existing workflows
- Cost Control: Self-hosted solution puts you in control of transcription costs
- OpenAI Whisper - Industry-standard baseline (6 models)
- Mistral Voxtral - Latest transformer-based approach (2 models)
- NVIDIA Parakeet - Enterprise-grade accuracy (1 model)
- Canary-Qwen-2.5B - Multilingual capabilities (1 model)
- IBM-Granite-3.3 - Easy to add new models (2 models)
```
├── backend.py          # FastAPI backend with STT logic
├── public/             # Frontend static files
│   ├── index.html      # Main HTML interface
│   ├── styles.css      # CSS styling with dark/light theme
│   └── app.js          # JavaScript frontend logic
├── run.py              # Startup script
└── requirements.txt    # Python dependencies
```
- RESTful API for all STT operations
- Unified model management for Whisper, Voxtral, Parakeet, Canary
- Automatic dependency handling with version conflict resolution
- File upload and processing with background tasks
- Model comparison endpoint for side-by-side evaluation
- Dependency installation endpoints with subprocess management
- Modern responsive design with dark/light theme toggle
- Drag & drop file upload with audio preview
- Real-time status updates for dependencies and models
- Single model transcription with engine/model selection
- Multi-model comparison with checkbox selection
- Progress tracking and result visualization
- Download options for CSV and text formats
- AWS EC2 g6.xlarge instance with Amazon Linux 2023 6.1, or the Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.8 (Amazon Linux 2023) [Recommended]
- NVIDIA GPU drivers installed
- Install NVIDIA GRID drivers if using Amazon Linux 2023 6.1, otherwise skip this step

  ```bash
  # Follow AWS documentation for GRID driver installation
  # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html#nvidia-GRID-driver
  ```

- Verify CUDA installation

  ```bash
  nvidia-smi
  ```

- Install system dependencies

  ```bash
  sudo dnf update -y
  sudo dnf install git -y
  ```

- Install Miniconda

  ```bash
  wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  bash Miniconda3-latest-Linux-x86_64.sh
  ```

  - Accept the license agreement (type `yes`)
  - Confirm the installation location (default is fine)
  - Update your shell profile to automatically initialize conda (type `yes` when prompted)

- Restart your shell or source your bashrc

  ```bash
  source ~/.bashrc
  ```

- Create and activate the conda environment

  ```bash
  conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
  conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
  conda create -n voxscribe python=3.12 -y
  conda activate voxscribe
  ```

- Install ffmpeg in the conda environment

  ```bash
  conda install ffmpeg -y
  ```

- Clone the repository

  ```bash
  git clone https://github.com/Fraser27/VoxScribe.git
  cd VoxScribe
  ```

- Install Python dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Start the application

  ```bash
  python run.py
  ```

- Open your browser at http://localhost:8000
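Once the server is up, you can also confirm it is reachable from Python via the status endpoint described below; a minimal check (assumes the `requests` package is installed in your environment):

```python
# Minimal health check against the running VoxScribe server.
# Assumes the server is listening on localhost:8000 and that the
# `requests` package is installed.
import requests

resp = requests.get("http://localhost:8000/api/status")
resp.raise_for_status()
print(resp.json())  # system and dependency status reported by the backend
```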
- `GET /api/status` - Get system and dependency status
- `GET /api/models` - Get available models and cache status
- `POST /api/transcribe` - Single model transcription
- `POST /api/compare` - Multi-model comparison
- `POST /api/install-dependency` - Install missing dependencies
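The exact request schema is exposed in the interactive FastAPI docs at `/docs`. As an illustration only, a single-model transcription request from Python might look like this (the form-field names below are assumptions, not confirmed by the project):

```python
# Hypothetical transcription request; the real form-field names are
# defined in backend.py and documented at /docs, so adjust accordingly.
import requests

with open("sample.wav", "rb") as audio:
    resp = requests.post(
        "http://localhost:8000/api/transcribe",
        files={"file": ("sample.wav", audio, "audio/wav")},
        data={"engine": "whisper", "model": "base"},  # assumed field names
    )
resp.raise_for_status()
print(resp.json())
```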
| Engine | Models | Dependencies | Features |
|---|---|---|---|
| Whisper | tiny, base, small, medium, large, large-v2, large-v3 | Built-in | Detailed timestamps, multiple sizes |
| Voxtral | Mini-3B, Small-24B | transformers 4.56.0+ | Advanced audio understanding, multilingual |
| Parakeet | TDT-0.6B-V2 | NeMo toolkit | NVIDIA optimized, fast inference |
| Canary | Qwen-2.5B | NeMo toolkit | State-of-the-art English ASR |
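For reference, the Whisper entries above are backed by the upstream openai-whisper package, which can also be used on its own; a minimal standalone example (requires the openai-whisper package and ffmpeg):

```python
# Standalone openai-whisper usage, independent of VoxScribe's backend.
import whisper

model = whisper.load_model("base")       # any size from the table above
result = model.transcribe("sample.wav")
print(result["text"])                    # full transcript
for segment in result["segments"]:       # per-segment timestamps
    print(segment["start"], segment["end"], segment["text"])
```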
The system automatically handles version conflicts between:
- Voxtral: Requires transformers 4.56.0+
- NeMo models: Require transformers 4.51.3
Installation buttons are provided in the UI for missing dependencies.
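Under the hood, this kind of on-demand installation typically shells out to pip; a rough sketch of the idea (the actual dependency-management code in backend.py may differ):

```python
# Illustrative sketch of switching transformers versions on demand;
# the real dependency-management code in backend.py may differ.
import subprocess
import sys

def install(spec: str) -> None:
    """Install a specific package version into the current environment."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", spec])

# Before loading a Voxtral model:
# install("transformers>=4.56.0")
# Before loading a NeMo-based model (Parakeet, Canary):
# install("transformers==4.51.3")
```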
Supported audio formats: WAV, MP3, FLAC, M4A, OGG
```bash
# Run with auto-reload
uvicorn backend:app --reload --host 0.0.0.0 --port 8000
```

Static files are served from the `public/` directory. Changes to HTML, CSS, or JS files are reflected immediately.
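Serving the frontend this way is usually a one-line static mount in FastAPI; a sketch of how it might be wired up (the actual code in backend.py may differ):

```python
# Hypothetical sketch of serving the public/ frontend from FastAPI;
# the actual mounting code in backend.py may differ.
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

app = FastAPI()

# API routes (e.g. /api/status, /api/transcribe) are registered before
# the catch-all static mount so they are not shadowed by it.
app.mount("/", StaticFiles(directory="public", html=True), name="public")
```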
- Update `MODEL_REGISTRY` in `backend.py`
- Add loading logic in the `load_model()` function
- Add transcription logic in the `transcribe_audio()` function
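A rough sketch of what those three steps could look like; the real `MODEL_REGISTRY` schema and function signatures live in `backend.py` and may differ:

```python
# Hypothetical sketch only; the real MODEL_REGISTRY structure and the
# load_model()/transcribe_audio() signatures are defined in backend.py.
MODEL_REGISTRY: dict = {}   # in backend.py this already holds the built-in engines

# 1. Register the new engine and its model names.
MODEL_REGISTRY["my_engine"] = {"models": ["my-model-0.6b"]}

_cache: dict = {}

# 2. Add loading logic for the new engine.
def load_model(engine: str, model_name: str):
    key = (engine, model_name)
    if key not in _cache:
        if engine == "my_engine":
            _cache[key] = object()   # replace with the engine's real loader
        else:
            raise ValueError(f"Unsupported engine: {engine}")
    return _cache[key]

# 3. Add transcription logic for the new engine.
def transcribe_audio(engine: str, model, audio_path: str) -> str:
    if engine == "my_engine":
        return "transcript"          # replace with the engine's real inference
    raise ValueError(f"Unsupported engine: {engine}")
```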
- No ScriptRunContext warnings - Clean separation eliminates context issues
- Better performance - FastAPI is faster and more efficient
- Modern UI - Custom HTML/CSS/JS with better UX
- API-first design - Can be integrated with other applications
- Easier deployment - Standard web application deployment
- Better error handling - Proper HTTP status codes and error responses
- Scalability - Can handle multiple concurrent requests
```bash
# Development
python run.py

# Production (multiple workers)
uvicorn backend:app --host 0.0.0.0 --port 8000 --workers 4
```

```dockerfile
FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

EXPOSE 8000
CMD ["uvicorn", "backend:app", "--host", "0.0.0.0", "--port", "8000"]
```

- Missing dependencies: Use the install buttons in the UI
- Model download failures: Check internet connection and disk space
- Audio processing errors: Ensure ffmpeg is installed
- CUDA issues: Check PyTorch CUDA installation
Server logs are displayed in the terminal where you run `python run.py`.
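For the CUDA issues item above, a quick way to confirm that PyTorch can see the GPU:

```python
# Quick GPU sanity check for CUDA-related issues.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```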
- Backend changes: Modify `backend.py`
- Frontend changes: Modify files in `public/`
- New features: Add API endpoints and corresponding UI elements
- Testing: Use the built-in FastAPI docs at `/docs`
