VoxScribe 🎙️

VoxScribe: A platform to test open-source Speech-to-Text models

VoxScribe is a lightweight, unified platform for testing and comparing multiple open-source speech-to-text (STT) models through a single interface. Born from real-world enterprise challenges where proprietary STT solutions become prohibitively expensive at scale, VoxScribe democratizes access to cutting-edge open-source alternatives.

The Problem We Solve

Startups transcribing speech at scale face a common dilemma: cost vs. control. A contact center processing 100,000 hours of calls monthly can easily spend $150,000+ on transcription alone. While open-source STT models like Whisper, Voxtral, Parakeet, and Canary-Qwen now rival proprietary solutions in accuracy, evaluating them has been a nightmare:

  • Dependency Hell 🔥: Conflicting library versions between models (e.g., Voxtral and the NeMo models pin incompatible transformers releases)
  • Inconsistent APIs 🔄: Each model requires a different integration approach
  • Complex Setup ⚙️: Hours or days spent on CUDA drivers, Python environments, and debugging
  • Limited Comparison 📊: No unified way to test multiple models against your specific use cases

What VoxScribe Offers

✅ Unified Interface: Test 5+ open-source STT models through a single FastAPI backend and clean web UI
✅ Dependency Management: Handles version conflicts and library incompatibilities automatically
✅ Side-by-Side Comparison: Upload audio and compare transcriptions across multiple models
✅ Model Caching: Intelligent caching for faster subsequent runs
✅ Clean API: RESTful endpoints for easy integration into existing workflows
✅ Cost Control: Self-hosted solution puts you in control of transcription costs

Supported Models

  • OpenAI Whisper - Industry-standard baseline [7 models]
  • Mistral Voxtral - Latest transformer-based approach [2 models]
  • NVIDIA Parakeet - Enterprise-grade accuracy [1 model]
  • Canary-Qwen-2.5B - State-of-the-art English ASR [1 model]
  • IBM Granite 3.3 - IBM's open speech models [2 models]

New engines are straightforward to add; see Adding New Models below.

Architecture

├── backend.py          # FastAPI backend with STT logic
├── public/             # Frontend static files
│   ├── index.html      # Main HTML interface
│   ├── styles.css      # CSS styling with dark/light theme
│   └── app.js          # JavaScript frontend logic
├── run.py              # Startup script
└── requirements.txt    # Python dependencies

Features

Backend (FastAPI)

  • RESTful API for all STT operations
  • Unified model management for Whisper, Voxtral, Parakeet, Canary
  • Automatic dependency handling with version conflict resolution
  • File upload and processing with background tasks (see the sketch after this list)
  • Model comparison endpoint for side-by-side evaluation
  • Dependency installation endpoints with subprocess management
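
As a rough illustration of the upload-and-background-task pattern, here is a minimal, self-contained sketch; the handler body and names such as run_stt are assumptions, not the actual backend.py code.

import os
import shutil
import tempfile

from fastapi import BackgroundTasks, FastAPI, File, UploadFile

app = FastAPI()

def run_stt(path: str, engine: str) -> None:
    # Placeholder for the real model call; the actual backend would store
    # the result somewhere the UI can poll.
    print(f"transcribing {path} with {engine}")
    os.unlink(path)

@app.post("/api/transcribe")
async def transcribe(background_tasks: BackgroundTasks,
                     file: UploadFile = File(...), engine: str = "whisper"):
    # Save the upload to disk so the STT engine can read it as a file.
    suffix = os.path.splitext(file.filename or "audio.wav")[1]
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        shutil.copyfileobj(file.file, tmp)
    background_tasks.add_task(run_stt, tmp.name, engine)
    return {"status": "accepted", "engine": engine}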

Frontend (HTML/CSS/JS)

  • Modern responsive design with dark/light theme toggle
  • Drag & drop file upload with audio preview
  • Real-time status updates for dependencies and models
  • Single model transcription with engine/model selection
  • Multi-model comparison with checkbox selection
  • Progress tracking and result visualization
  • Download options for CSV and text formats

Quick Start

Prerequisites

  • AWS EC2 g6.xlarge instance with Amazon Linux 2023 6.1, or the Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.8 (Amazon Linux 2023) (recommended)
  • NVIDIA GPU drivers installed

Installation Steps

  1. Install NVIDIA GRID drivers if you are using plain Amazon Linux 2023 6.1; skip this step on the Deep Learning AMI

    # Follow AWS documentation for GRID driver installation
    # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html#nvidia-GRID-driver
  2. Verify CUDA installation

    nvidia-smi
  3. Install system dependencies

    sudo dnf update -y
    sudo dnf install git -y
  4. Install Miniconda

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh
    • Accept the license agreement (type yes)
    • Confirm the installation location (the default is fine)
    • When asked whether to update your shell profile to automatically initialize conda, type yes
  5. Restart your shell or source your ~/.bashrc

    source ~/.bashrc
  6. Create and activate conda environment

    conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
    conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
    conda create -n voxscribe python=3.12 -y
    conda activate voxscribe
  7. Install ffmpeg into the conda environment

    conda install ffmpeg -y
  8. Clone the repository

    git clone https://github.com/Fraser27/VoxScribe.git
    cd VoxScribe
  9. Install Python dependencies

    pip install -r requirements.txt
  10. Start the application

    python run.py
  11. Open your browser

    http://localhost:8000

API Endpoints

System Status

  • GET /api/status - Get system and dependency status
  • GET /api/models - Get available models and cache status

Transcription

  • POST /api/transcribe - Single model transcription
  • POST /api/compare - Multi-model comparison (see the sketch below)

Dependencies

  • POST /api/install-dependency - Install missing dependencies
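
A sketch of exercising these endpoints from Python with requests; the form field names (file, engine, model, models) are assumptions, so check the live schema at /docs.

import requests

BASE = "http://localhost:8000"

# Confirm the server and its dependencies are up.
print(requests.get(f"{BASE}/api/status", timeout=10).json())

# Single-model transcription (field names assumed).
with open("sample.wav", "rb") as f:
    resp = requests.post(f"{BASE}/api/transcribe",
                         files={"file": f},
                         data={"engine": "whisper", "model": "base"})
print(resp.json())

# Multi-model comparison (field names assumed).
with open("sample.wav", "rb") as f:
    resp = requests.post(f"{BASE}/api/compare",
                         files={"file": f},
                         data={"models": "whisper-base,voxtral-mini"})
print(resp.json())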

Model Support

Engine   | Models                                               | Dependencies         | Features
Whisper  | tiny, base, small, medium, large, large-v2, large-v3 | ✅ Built-in           | Detailed timestamps, multiple sizes
Voxtral  | Mini-3B, Small-24B                                   | transformers 4.56.0+ | Advanced audio understanding, multilingual
Parakeet | TDT-0.6B-V2                                          | NeMo toolkit         | NVIDIA optimized, fast inference
Canary   | Qwen-2.5B                                            | NeMo toolkit         | State-of-the-art English ASR

Dependency Management

The system automatically handles version conflicts between:

  • Voxtral: Requires transformers 4.56.0+
  • NeMo models: Require transformers 4.51.3

Installation buttons are provided in the UI for missing dependencies.
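
A minimal sketch of what the subprocess-based swap can look like; the helper name and the engine-to-pin mapping are assumptions, and only the version pins come from this README.

import subprocess
import sys

# Pins documented above: Voxtral needs 4.56.0+, NeMo needs exactly 4.51.3.
TRANSFORMERS_PIN = {
    "voxtral": "transformers>=4.56.0",
    "nemo": "transformers==4.51.3",
}

def install_transformers_for(engine_family: str) -> None:
    # Reinstall the pinned version inside the current interpreter's environment.
    spec = TRANSFORMERS_PIN[engine_family]
    subprocess.check_call([sys.executable, "-m", "pip", "install", spec])

install_transformers_for("voxtral")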

File Support

Supported audio formats: WAV, MP3, FLAC, M4A, OGG

Development

Backend Development

# Run with auto-reload
uvicorn backend:app --reload --host 0.0.0.0 --port 8000

Frontend Development

Static files are served from the public/ directory. Changes to HTML, CSS, or JS files are reflected immediately.

Adding New Models

  1. Update MODEL_REGISTRY in backend.py
  2. Add loading logic in load_model() function
  3. Add transcription logic in the transcribe_audio() function (a sketch follows)
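
A hypothetical sketch of those three touch points, including a simple cache for the model-caching feature mentioned earlier; only the names MODEL_REGISTRY, load_model, and transcribe_audio come from backend.py, everything else is assumed.

# Hypothetical registry entry for a new engine.
MODEL_REGISTRY = {
    "mymodel": {
        "models": ["my-model-small"],
        "dependency": "transformers>=4.56.0",
    },
}

_CACHE = {}  # (engine, model_name) -> loaded model, for faster reruns

def load_model(engine: str, model_name: str):
    key = (engine, model_name)
    if key in _CACHE:
        return _CACHE[key]
    if engine != "mymodel":
        raise ValueError(f"unknown engine: {engine}")
    # Replace with the engine's real loader, e.g. MyModel.from_pretrained(...),
    # and store the result in _CACHE[key] before returning it.
    raise NotImplementedError("load your checkpoint here")

def transcribe_audio(model, audio_path: str) -> str:
    # Delegate to the engine's own inference call and return plain text.
    return model.transcribe(audio_path)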

Benefits over Streamlit

  1. No ScriptRunContext warnings - Clean separation eliminates context issues
  2. Better performance - FastAPI avoids Streamlit's full-script rerun on every interaction
  3. Modern UI - Custom HTML/CSS/JS with better UX
  4. API-first design - Can be integrated with other applications
  5. Easier deployment - Standard web application deployment
  6. Better error handling - Proper HTTP status codes and error responses
  7. Scalability - Can handle multiple concurrent requests

Deployment

Local Development

python run.py

Production

uvicorn backend:app --host 0.0.0.0 --port 8000 --workers 4
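
Note that uvicorn workers are separate processes, so each worker loads (and caches) its own copy of any model it serves; size the worker count to available GPU memory rather than CPU cores.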

Docker (Optional)

# Match the Python version used in the conda environment above.
FROM python:3.12-slim
WORKDIR /app
# ffmpeg is required for audio decoding (see Troubleshooting).
RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "backend:app", "--host", "0.0.0.0", "--port", "8000"]
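
Build with docker build -t voxscribe . and run with docker run --gpus all -p 8000:8000 voxscribe. GPU passthrough requires the NVIDIA Container Toolkit on the host, and whether CUDA is usable inside the slim image depends on how PyTorch is installed from requirements.txt.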

Troubleshooting

Common Issues

  1. Missing dependencies: Use the install buttons in the UI
  2. Model download failures: Check internet connection and disk space
  3. Audio processing errors: Ensure ffmpeg is installed
  4. CUDA issues: Check PyTorch CUDA installation

Logs

Server logs are displayed in the terminal where you run python run.py.

Contributing

  1. Backend changes: Modify backend.py
  2. Frontend changes: Modify files in public/
  3. New features: Add API endpoints and corresponding UI elements
  4. Testing: Use the built-in FastAPI docs at /docs
