WhisperServer is a lightweight macOS menu bar app that runs in the background.
It exposes a local HTTP server compatible with the OpenAI Whisper API for audio transcription.
- Local HTTP server compatible with the OpenAI Whisper API
- Menu bar application (no Dock icon)
- Streaming via Server‑Sent Events (SSE) with automatic chunked fallback
- Automatic VAD-based chunking for Whisper models to prevent repeated text in long audio files — a common issue with standard whisper.cpp
- Automatically downloads models on first use
- Fast, high‑quality quantized models
- Parakeet model can transcribe ~1 hour of audio in about 1 minute (measured on a MacBook Pro 13, M2, 16 GB)
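Because the server mirrors the OpenAI Whisper API, existing OpenAI client libraries can usually be pointed at it directly. A minimal sketch using the official `openai` Python package (the model ID and file name are placeholders, and the API key is a dummy value that the local server does not check):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local WhisperServer instance.
# The api_key is a dummy value: the local server does not validate it.
client = OpenAI(base_url="http://localhost:12017/v1", api_key="not-needed")

with open("audio.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="parakeet-tdt-0.6b-v3",  # any model ID exposed by the server
        file=audio_file,
    )

print(result.text)
```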
- macOS 14.6 or newer
- Apple Silicon (ARM64) only
| Project | Platform | Key features |
|---|---|---|
| VibeScribe | macOS | Automatic call summarization and transcription for meetings, interviews, and brainstorming; AI-powered summaries and easy export of notes. |
- Go to the Releases page.
- Download the latest `.dmg` file.
- Open the `.dmg` file.
- Drag WhisperServer to your Applications folder.
This app is not signed by Apple. To open it the first time:
- Control‑click (or right‑click) WhisperServer in Applications.
- Choose Open.
- In the warning dialog, click Open.
- Or go to System Settings → Privacy & Security and allow the app.
Example
```
curl -X POST http://localhost:12017/v1/audio/transcriptions \
  -F file=@/path/to/audio.mp3
```

| Parameter | Description | Values | Required |
|---|---|---|---|
| file | Audio file | wav, mp3, m4a | yes |
| model | Model to use | model ID | no |
| prompt | Guide style/tone (Whisper) | string | no |
| response_format | Output format | json, text, srt, vtt, verbose_json | no |
| language | Input language (ISO 639‑1) | 2‑letter code | no |
| diarize | Enable Fluid speaker diarization | true, false (default false) | no |
| stream | Enable streaming (SSE or chunked) | true, false | no |
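As a rough sketch of how several of these parameters combine in a single request, here is an example using Python's `requests` library (the file path, language, and prompt values are placeholders):

```python
import requests

# Combine several optional parameters in one multipart request.
with open("/path/to/audio.mp3", "rb") as audio_file:
    response = requests.post(
        "http://localhost:12017/v1/audio/transcriptions",
        files={"file": audio_file},
        data={
            "response_format": "json",
            "language": "en",
            "prompt": "Vocabulary hints: Whisper, VAD, diarization.",
        },
    )

response.raise_for_status()
print(response.json()["text"])
```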
Available models:

| Model | Relative speed | Quality |
|---|---|---|
| `parakeet-tdt-0.6b-v3` | Fastest | Medium |
| `tiny-q5_1` | Fast | Good (English), Low (other languages) |
| `large-v3-turbo-q5_0` | Slow | Medium–Good |
| `medium-q5_0` | Slowest | Good |
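The valid values for the `model` parameter can be discovered at runtime from the models endpoint. A small sketch that queries `GET /v1/models` (the exact response shape is not documented above, so this just prints the raw JSON):

```python
import requests

# List the model IDs the server currently exposes; any of them can be
# passed as the `model` form field of a transcription request.
models = requests.get("http://localhost:12017/v1/models")
models.raise_for_status()
print(models.json())
```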
The server supports multiple response formats:
```
curl -X POST http://localhost:12017/v1/audio/transcriptions \
  -F file=@/path/to/audio.mp3 \
  -F response_format=json
```

- json (default)

```json
{
  "text": "Transcription text."
}
```

- verbose_json
```json
{
  "task": "transcribe",
  "language": "en",
  "duration": 10.5,
  "text": "Full transcription text.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 5.0,
      "text": "First segment.",
      "tokens": [50364, 13, 11, 263, 6116],
      "temperature": 0.0,
      "avg_logprob": -0.45,
      "compression_ratio": 1.275,
      "no_speech_prob": 0.1
    }
  ]
}
```

- text
```
And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
```

- srt

```
1
00:00:00,240 --> 00:00:07,839
And so, my fellow Americans, ask not what your country can do for you

2
00:00:07,839 --> 00:00:10,640
ask what you can do for your country.
```

- vtt

```
WEBVTT

00:00:00.240 --> 00:00:07.839
And so, my fellow Americans, ask not what your country can do for you

00:00:07.839 --> 00:00:10.640
ask what you can do for your country.
```
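If you need per-segment timestamps programmatically, `verbose_json` is the easiest format to post-process. A short sketch, assuming the field names shown in the `verbose_json` example above (the file path is a placeholder):

```python
import requests

# Ask for verbose_json and print each segment with its start/end times.
with open("/path/to/audio.mp3", "rb") as audio_file:
    response = requests.post(
        "http://localhost:12017/v1/audio/transcriptions",
        files={"file": audio_file},
        data={"response_format": "verbose_json"},
    )

for segment in response.json()["segments"]:
    print(f"[{segment['start']:.2f} -> {segment['end']:.2f}] {segment['text']}")
```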
WhisperServer supports real‑time streaming with automatic protocol detection. Note: timestamped streaming (srt, vtt, verbose_json) requires the Whisper provider; the Fluid provider streams text/JSON only.
If the client sends the header `Accept: text/event-stream`, the server uses SSE:
```
curl -X POST http://localhost:12017/v1/audio/transcriptions \
  -H "Accept: text/event-stream" \
  -F file=@audio.wav \
  -F stream=true \
  --no-buffer
```

Response format:
```
data: First transcribed segment
data:

data: Second transcribed segment
data:

event: end
data:
```
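A sketch of consuming this SSE stream from Python with `requests`; the parsing below only handles the `data:` and `event: end` lines shown above:

```python
import requests

# Stream a transcription over SSE and print segments as they arrive.
with open("audio.wav", "rb") as audio_file:
    response = requests.post(
        "http://localhost:12017/v1/audio/transcriptions",
        headers={"Accept": "text/event-stream"},
        files={"file": audio_file},
        data={"stream": "true"},
        stream=True,  # read the body incrementally instead of buffering it
    )

    for line in response.iter_lines(decode_unicode=True):
        if not line:
            continue  # blank lines separate SSE events
        if line.startswith("event: end"):
            break  # server signals the end of the stream
        if line.startswith("data: "):
            print(line[len("data: "):])
```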
If SSE isn’t supported, the server falls back to HTTP chunked transfer encoding:
```
curl -X POST http://localhost:12017/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F stream=true \
  --no-buffer
```

Add speaker labels (who is talking) when you use the FluidAudio provider. Diarization is off by default to stay compatible with the OpenAI Whisper API.
How to enable:
- Select the Fluid provider in the menu bar (or pass the Fluid model ID), and
- Add `diarize=true` to your request.
Example:
```
curl -X POST http://localhost:12017/v1/audio/transcriptions \
  -F file=@meeting.wav \
  -F model=parakeet-tdt-0.6b-v3 \
  -F response_format=json \
  -F diarize=true
```

What you get:
- For `response_format=json`, the server adds a `speaker_segments` array:

  ```json
  {
    "text": "Good morning everyone...",
    "speaker_segments": [
      { "speaker": "Speaker_1", "start": 0.0, "end": 4.2, "text": "Good morning everyone" },
      { "speaker": "Speaker_2", "start": 4.2, "end": 7.8, "text": "Morning! Shall we begin?" }
    ]
  }
  ```

- For `response_format=verbose_json`, `speaker_segments` is added as well. The existing `segments` field stays unchanged.
Streaming:
- Streaming sends one JSON chunk with `speaker_segments` when diarization completes.
- Then the standard `end` event is sent.
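A sketch of requesting diarization from Python and printing speaker-labelled lines, assuming the `speaker_segments` shape shown above (the file path is a placeholder):

```python
import requests

# Request diarization with the Fluid provider and print "speaker: text" lines.
with open("meeting.wav", "rb") as audio_file:
    response = requests.post(
        "http://localhost:12017/v1/audio/transcriptions",
        files={"file": audio_file},
        data={
            "model": "parakeet-tdt-0.6b-v3",
            "response_format": "json",
            "diarize": "true",
        },
    )

for seg in response.json().get("speaker_segments", []):
    print(f"{seg['speaker']}: {seg['text']}")
```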
If you want to build WhisperServer yourself:
- Clone the repository:

  ```
  git clone https://github.com/pfrankov/whisper-server.git
  cd whisper-server
  ```

- Open the project in Xcode.
- Select your development team:
  - Click the project in Xcode
  - Select the WhisperServer target
  - Go to "Signing & Capabilities"
  - Choose your team
- Build and run:
  - Press `Cmd + R` to build and run
  - Or use the menu: Product → Run
- Run the app, then run the script: `test_api.sh` (complete API test suite)
- In the menu bar, open `Select Model` → `Import Whisper Model…`
- Choose a `.bin` model file (optionally add its `.mlmodelc` bundle in the same dialog)
- The model becomes selectable in the menu and is listed in `GET /v1/models`
MIT