
WhisperServer for macOS

Demo

Recorded in real time (no speedup) on a MacBook Pro 13", M2, 16 GB.

WhisperServer is a lightweight macOS menu bar app that runs in the background.
It exposes a local HTTP server compatible with the OpenAI Whisper API for audio transcription.

menu bar demo

Key features

  • Local HTTP server compatible with the OpenAI Whisper API
  • Menu bar application (no Dock icon)
  • Streaming via Server‑Sent Events (SSE) with automatic chunked fallback
  • Automatic VAD-based chunking for Whisper models to prevent repeated text in long audio files — a common issue with standard whisper.cpp
  • Automatically downloads models on first use
  • Fast, high‑quality quantized models
  • Parakeet model can transcribe ~1 hour of audio in about 1 minute

Requirements

  • macOS 14.6 or newer
  • Apple Silicon (ARM64) only

Recommended by

| Project | Platform | Key features |
| --- | --- | --- |
| VibeScribe | macOS | Automatic call summarization and transcription for meetings, interviews, and brainstorming. AI-powered summaries, easy export of notes, transcription. |

Installation

Download from GitHub Releases

  1. Go to the Releases page.
  2. Download the latest .dmg file.
  3. Open the .dmg file.
  4. Drag WhisperServer to your Applications folder.

🚨 First launch

This app is not signed by Apple. To open it the first time:

  1. Control‑click (or right‑click) WhisperServer in Applications.
  2. Choose Open.
  3. In the warning dialog, click Open.
Alternatively, go to System Settings → Privacy & Security and allow the app there.
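
If macOS still refuses to launch it, a common workaround is to clear the quarantine attribute from the terminal. This is a general macOS technique rather than something specific to WhisperServer; adjust the path if you installed the app elsewhere:

xattr -dr com.apple.quarantine /Applications/WhisperServer.app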

Usage

 Apple Shortcut

From Audio to SRT

Example: demo video.

HTTP

curl -X POST http://localhost:12017/v1/audio/transcriptions \
  -F file=@/path/to/audio.mp3

Supported parameters

| Parameter | Description | Values | Required |
| --- | --- | --- | --- |
| file | Audio file | wav, mp3, m4a | yes |
| model | Model to use | model ID | no |
| prompt | Guide style/tone (Whisper) | string | no |
| response_format | Output format | json, text, srt, vtt, verbose_json | no |
| language | Input language (ISO 639-1) | 2-letter code | no |
| diarize | Enable Fluid speaker diarization | true, false (default false) | no |
| stream | Enable streaming (SSE or chunked) | true, false | no |
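
For example, a request that combines several of these parameters (the model ID, language, and prompt below are purely illustrative):

curl -X POST http://localhost:12017/v1/audio/transcriptions \
  -F file=@/path/to/audio.mp3 \
  -F model=large-v3-turbo-q5_0 \
  -F language=en \
  -F prompt="Punctuated, technical vocabulary." \
  -F response_format=verbose_json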

Models

| Model | Relative speed | Quality |
| --- | --- | --- |
| parakeet-tdt-0.6b-v3 | Fastest | Medium |
| tiny-q5_1 | Fast | Good (English), Low (other languages) |
| large-v3-turbo-q5_0 | Slow | Medium–Good |
| medium-q5_0 | Slowest | Good |
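
To pick a specific model, pass its ID in the model field; on first use the server downloads that model automatically. For example:

curl -X POST http://localhost:12017/v1/audio/transcriptions \
  -F file=@/path/to/audio.mp3 \
  -F model=tiny-q5_1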

Response formats

The server supports multiple response formats:

curl -X POST http://localhost:12017/v1/audio/transcriptions \
  -F file=@/path/to/audio.mp3 \
  -F response_format=json

  1. json (default)
{
  "text": "Transcription text."
}

  2. verbose_json
{
  "task": "transcribe",
  "language": "en",
  "duration": 10.5,
  "text": "Full transcription text.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 5.0,
      "text": "First segment.",
      "tokens": [50364, 13, 11, 263, 6116],
      "temperature": 0.0,
      "avg_logprob": -0.45,
      "compression_ratio": 1.275,
      "no_speech_prob": 0.1
    }
  ]
}

  3. text
And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

  4. srt
1
00:00:00,240 --> 00:00:07,839
And so, my fellow Americans, ask not what your country can do for you

2
00:00:07,839 --> 00:00:10,640
ask what you can do for your country.

  5. vtt
WEBVTT

00:00:00.240 --> 00:00:07.839
And so, my fellow Americans, ask not what your country can do for you

00:00:07.839 --> 00:00:10.640
ask what you can do for your country.
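
To write subtitles straight to a file, request srt (or vtt) and save the response; the output path here is just an example:

curl -X POST http://localhost:12017/v1/audio/transcriptions \
  -F file=@/path/to/audio.mp3 \
  -F response_format=srt \
  -o subtitles.srt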

Streaming support

WhisperServer supports real‑time streaming with automatic protocol detection. Note: timestamped streaming (srt, vtt, verbose_json) requires the Whisper provider; the Fluid provider streams text/JSON only.

Server‑Sent Events (SSE)

If the client sends the header Accept: text/event-stream, the server uses SSE:

curl -X POST http://localhost:12017/v1/audio/transcriptions \
  -H "Accept: text/event-stream" \
  -F file=@audio.wav \
  -F stream=true \
  --no-buffer

Response format:

data: First transcribed segment
data:

data: Second transcribed segment
data:

event: end
data:

Chunked response

If SSE isn’t supported, the server falls back to HTTP chunked transfer encoding:

curl -X POST http://localhost:12017/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F stream=true \
  --no-buffer

FluidAudio diarization

Add speaker labels (who is talking) when you use the FluidAudio provider. Diarization is off by default to stay compatible with the OpenAI Whisper API.

How to enable:

  • Select the Fluid provider in the menu bar (or pass the Fluid model ID), and
  • Add diarize=true to your request.

Example:

curl -X POST http://localhost:12017/v1/audio/transcriptions \
  -F file=@meeting.wav \
  -F model=parakeet-tdt-0.6b-v3 \
  -F response_format=json \
  -F diarize=true

What you get:

  • For response_format=json, the server adds a speaker_segments array:
    {
      "text": "Good morning everyone...",
      "speaker_segments": [
        {
          "speaker": "Speaker_1",
          "start": 0.0,
          "end": 4.2,
          "text": "Good morning everyone"
        },
        {
          "speaker": "Speaker_2",
          "start": 4.2,
          "end": 7.8,
          "text": "Morning! Shall we begin?"
        }
      ]
    }
  • For response_format=verbose_json, speaker_segments is added as well. The existing segments field stays unchanged.

Streaming:

  • Streaming sends one JSON chunk with speaker_segments when diarization completes.
  • Then the standard end event is sent.
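
A sketch of a streaming request with diarization, combining the options shown above (the file name is illustrative):

curl -X POST http://localhost:12017/v1/audio/transcriptions \
  -H "Accept: text/event-stream" \
  -F file=@meeting.wav \
  -F model=parakeet-tdt-0.6b-v3 \
  -F diarize=true \
  -F stream=true \
  --no-buffer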

Build from Source

If you want to build WhisperServer yourself:

  1. Clone the repository:
git clone https://github.com/pfrankov/whisper-server.git
cd whisper-server
  2. Open the project in Xcode.
  3. Select your development team:
    • Click the project in Xcode
    • Select the WhisperServer target
    • Go to "Signing & Capabilities"
    • Choose your team
  4. Build and run:
    • Press Cmd + R to build and run
    • Or use the menu: Product → Run
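
If you prefer the command line, a build along these lines may work; the project and scheme names below are assumptions based on the repository name, so adjust them to whatever Xcode shows:

xcodebuild -project WhisperServer.xcodeproj \
  -scheme WhisperServer \
  -configuration Release \
  build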

Testing

  • Run the app, then run the script: test_api.sh (complete API test suite)
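
A minimal run, assuming the app is already listening on its default port and you are in the repository root:

chmod +x test_api.sh  # only needed if the script is not already executable
./test_api.sh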

Importing Custom Models

  • In the menu bar, open Select Model → Import Whisper Model…
  • Choose a .bin model file (optionally add its .mlmodelc bundle in the same dialog)
  • The model becomes selectable in the menu and is listed in GET /v1/models
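
To confirm an imported model is available, list the models the server exposes:

curl http://localhost:12017/v1/models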

License

MIT
