A high-performance Model Context Protocol (MCP) server providing local speech-to-text transcription using whisper.cpp, optimized for Apple Silicon.
## Features

- 🏠 100% Local Processing: No cloud APIs, complete privacy
- 🚀 Apple Silicon Optimized: 15x+ real-time transcription speed
- 🎤 Speaker Diarization: Identify and separate multiple speakers
- 🎵 Universal Audio Support: Automatic conversion from MP3, M4A, FLAC, and more
- 📝 Multiple Output Formats: txt, json, vtt, srt, csv
- 💾 Low Memory Footprint: <2GB memory usage
- 🔧 TypeScript: Full type safety and modern development
## Prerequisites

- Node.js 18+
- whisper.cpp (`brew install whisper-cpp`)
- ffmpeg (`brew install ffmpeg`) for audio format conversion; automatically handles MP3, M4A, FLAC, OGG, etc.
- Python 3.8+ and a HuggingFace token (free) for speaker diarization
### Supported Audio Formats

- Native whisper.cpp formats: WAV, FLAC
- Auto-converted formats: MP3, M4A, AAC, OGG, WMA, and more
- Automatic conversion: powered by ffmpeg, resampling to 16 kHz mono as whisper.cpp expects
- Format detection: the input format is detected automatically and converted only when needed
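The conversion path can be sketched roughly as follows (a minimal illustration, not the server's actual code; the helper names are hypothetical):

```typescript
// Hypothetical sketch of the format-detection step: whisper.cpp reads WAV and
// FLAC natively, so only other formats are routed through ffmpeg. The flags
// resample to 16 kHz mono 16-bit PCM, which is what whisper.cpp expects.
const NATIVE_FORMATS = new Set(["wav", "flac"]);

function needsConversion(filePath: string): boolean {
  const ext = filePath.split(".").pop()?.toLowerCase() ?? "";
  return !NATIVE_FORMATS.has(ext);
}

function ffmpegArgs(input: string, output: string): string[] {
  // Equivalent to: ffmpeg -i <input> -ar 16000 -ac 1 -c:a pcm_s16le <output>
  return ["-i", input, "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", output];
}
```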
## Installation

```bash
git clone https://github.com/your-username/local-stt-mcp.git
cd local-stt-mcp/mcp-server
npm install
npm run build

# Download whisper models
npm run setup:models

# For speaker diarization, set your HuggingFace token
export HF_TOKEN="your_token_here"  # Get a free token from huggingface.co
```

**Speaker Diarization Note**: Requires a HuggingFace account and accepting the pyannote/speaker-diarization-3.1 license.
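As a sketch, a server can fail fast when diarization is requested without the token set (hypothetical helper, not part of the actual codebase):

```typescript
// Hypothetical guard: speaker diarization needs a HuggingFace token, so fail
// fast with a clear message instead of a cryptic pyannote error later.
function requireHfToken(env: Record<string, string | undefined>): string {
  const token = env.HF_TOKEN;
  if (!token) {
    throw new Error(
      "HF_TOKEN is required for speaker diarization; get a free token at huggingface.co"
    );
  }
  return token;
}
```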
## Usage

Add the following to your MCP client configuration:

```json
{
  "mcpServers": {
    "whisper-mcp": {
      "command": "node",
      "args": ["path/to/local-stt-mcp/mcp-server/dist/index.js"]
    }
  }
}
```

### Available Tools

| Tool | Description |
|---|---|
| `transcribe` | Basic audio transcription with automatic format conversion |
| `transcribe_long` | Long audio file processing with chunking and format conversion |
| `transcribe_with_speakers` | Speaker diarization and transcription with format support |
| `list_models` | Show available whisper models |
| `health_check` | System diagnostics |
| `version` | Server version information |
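The tools above are invoked via the standard MCP `tools/call` JSON-RPC method. For illustration, a `transcribe` request might look like this (the argument names are assumptions, not the server's documented schema):

```typescript
// Hypothetical shape of an MCP tools/call request for the `transcribe` tool.
// `tools/call` is the standard MCP method; the `arguments` keys are assumed.
const request = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "transcribe",
    arguments: {
      audio_path: "/path/to/audio.mp3", // converted automatically if not WAV/FLAC
      output_format: "json",            // one of: txt, json, vtt, srt, csv
    },
  },
};
```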
## Performance

Apple Silicon benchmarks:

- Processing speed: 15.8x real-time (vs. WhisperX at 5.5x)
- Memory usage: <2 GB (vs. ~4 GB for WhisperX)
- GPU acceleration: ✅ Apple Neural Engine
- Setup: medium complexity, but superior performance

See `/benchmarks/` for detailed performance comparisons.
## Project Structure

```
mcp-server/
├── src/          # TypeScript source code
│   ├── tools/    # MCP tool implementations
│   ├── whisper/  # whisper.cpp integration
│   ├── utils/    # Speaker diarization & utilities
│   └── types/    # Type definitions
├── dist/         # Compiled JavaScript
└── python/       # Python dependencies
```
## Development

```bash
# Build
npm run build

# Development mode (watch)
npm run dev

# Linting & formatting
npm run lint
npm run format

# Type checking
npm run type-check
```

## Contributing

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
## License

MIT License - see the LICENSE file for details.
## Acknowledgments

- whisper.cpp for optimized inference
- OpenAI Whisper for the original models
- Model Context Protocol for the framework
- pyannote.audio for speaker diarization