Local voice dictation for macOS. Hold the fn key (configurable) to speak, release to transcribe. Works with any application.
100% on-device using WhisperKit or Parakeet - no cloud services, no data leaves your Mac.
User Guide — Complete documentation for setup, dictation, models, dictionary, AI refinement, and more.
- Push-to-talk or toggle dictation - Hold a hotkey to record, or double-press to toggle on/off
- Live transcription overlay - See your words appear in real time as you speak
- Multiple speech engines - WhisperKit (5 model sizes) and Parakeet v3
- AI text refinement - Built-in LLM or Ollama cleans up filler words and false starts
- Personal dictionary - Phonetic matching corrects names, jargon, and technical terms
- Transcription history - Browse, search, and export your last 500 transcriptions
- Custom hotkeys - Fn, Right Option, Right Command, Hyper Key, or any custom key combo
- Clipboard restoration - Automatically restores your clipboard after pasting
- 100% private - Everything runs locally on Apple Silicon, nothing leaves your Mac
Speak2 supports multiple Whisper model sizes plus Parakeet for multilingual use:
| Model | Size | Languages | Best For |
|---|---|---|---|
| Whisper tiny.en | ~75 MB | English only | Fastest, lowest resource usage |
| Whisper base.en | ~140 MB | English only | Recommended balance of speed/accuracy |
| Whisper small.en | ~460 MB | English only | Better accuracy |
| Whisper large-v3 | ~3 GB | 100+ languages | Best accuracy, multilingual |
| Whisper large-v3 turbo | ~954 MB | 100+ languages | Fast + accurate, multilingual |
| Parakeet v3 | ~600 MB | 25 languages | Alternative multilingual option |
You can download multiple models and switch between them from the menu bar. Only one model is loaded at a time to conserve memory.
By default, models are stored in ~/Library/Application Support/Speak2/Models/. You can change this location in Settings > Models if you prefer to store large models on an external drive or different location.
When changing the storage location, you'll be prompted to either move existing models to the new location or start fresh.
- macOS 14.0 or later
- Apple Silicon Mac (M1/M2/M3/M4)
Download the latest .dmg from the releases page and install.
git clone https://github.com/zachswift615/speak2.git
cd speak2/Speak2
# Install Metal toolchain (required once for MLX GPU shaders)
xcodebuild -downloadComponent MetalToolchain
# Build
xcodebuild build -scheme Speak2 -configuration Release -destination 'platform=macOS' \
-derivedDataPath .derivedData
Note: You must use xcodebuild (not swift build) because the MLX dependency requires Xcode's build system to compile Metal shaders. swift build will compile, but the app will crash at runtime when the built-in LLM feature is used.
.derivedData/Build/Products/Release/Speak2
Tests can still use the Swift CLI since they don't exercise Metal:
swift test
Before building a release, update the version number in two places:
- Sources/Models/AppState.swift - update AppState.appVersion (displayed in Settings > General > About)
- Sources/Info.plist - update CFBundleShortVersionString and CFBundleVersion (used in the .app bundle for DMG distribution)
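On first launch, a setup window appears with three items to complete: Accessibility permission, Microphone permission, and a speech model download.
Accessibility permission is required for global fn key detection.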
Click "Grant" next to Accessibility in the first-launch window.
Then click Open System Settings.
Find Speak2 in the list, toggle the permission switch on, and authenticate with your password or fingerprint. If Speak2 is not in the list, click the + button, navigate to the Applications folder where you installed it, and add Speak2 to the list of apps.
Option A: Add Speak2 directly
- Open System Settings > Privacy & Security > Accessibility
- Click the + button
- Press Cmd+Shift+G and navigate to the built binary (e.g. .derivedData/Build/Products/Release/Speak2)
- Select the Speak2 executable and enable it
Option B: Enable Terminal (easier for development)
- Open System Settings > Privacy & Security > Accessibility
- Find Terminal in the list and toggle it ON
- This allows any app run from Terminal to use accessibility features
Click "Grant" next to Microphone, then click "Allow" in the permission dialog that appears.
Choose a model and click "Download". For most users, Whisper base.en (~140MB) is recommended as a good balance of speed and accuracy.
See the Speech Recognition Models section above for all available options.
Note: Large models (large-v3, large-v3 turbo) will prompt for confirmation before downloading due to their size. Parakeet takes longer to load initially (~20-30 seconds) as it compiles the neural engine model. Subsequent loads are faster. The menu bar icon will show a spinning indicator while loading.
Once all three items show checkmarks, the setup window will indicate completion and you can close it.
Note: Speak2 automatically detects when permissions are granted and will start the hotkey listener without a restart. In rare cases, macOS may not register the permission change immediately - if the hotkey doesn't respond after granting permissions, quit and relaunch Speak2.
- Hold the fn key - Recording starts (menu bar icon turns red, audio start sound plays)
- Speak - Say what you want to type (live transcription overlay shows your words in real time)
- Release fn key - Final transcription happens (icon shows spinner), text is refined if enabled, then pasted
The transcribed text is automatically pasted into whatever application text field has focus.
Speak2 supports two recording modes, configurable in Settings > General:
| Mode | How It Works |
|---|---|
| Hold (default) | Hold the hotkey to record, release to transcribe |
| Toggle | Press the hotkey twice to start recording, press twice again to stop and transcribe |
Toggle mode uses a 400ms window to detect the double-press. This is useful if you don't want to hold a key down for long dictations.
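The double-press check can be pictured with a small timing helper. This is a minimal sketch, not Speak2's actual HotkeyManager: the `DoublePressDetector` type and its method names are hypothetical, and only the 400ms window comes from the description above.

```swift
import Foundation

/// Hypothetical helper: reports a double-press when two presses
/// land within `window` seconds of each other.
struct DoublePressDetector {
    let window: TimeInterval
    private var lastPress: TimeInterval?

    init(window: TimeInterval = 0.4) {
        self.window = window
    }

    /// Feed the timestamp (in seconds) of each hotkey press.
    /// Returns true when this press completes a double-press.
    mutating func registerPress(at time: TimeInterval) -> Bool {
        if let previous = lastPress, time - previous <= window {
            lastPress = nil  // consume both presses so a third press starts over
            return true
        }
        lastPress = time
        return false
    }
}

// Two quick presses start recording; a lone press does nothing;
// two more quick presses stop it again.
var detector = DoublePressDetector()
var isRecording = false
for pressTime in [0.00, 0.25, 3.00, 5.00, 5.30] {
    if detector.registerPress(at: pressTime) {
        isRecording.toggle()
    }
}
```

Resetting the stored timestamp after a match keeps a rapid triple press from being read as two overlapping double-presses.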
When enabled (Settings > General), a floating overlay appears at the bottom of your screen while recording. It shows your words in real time as you speak:
- Confirmed text appears in normal weight as the engine locks in words
- Unconfirmed text appears in italic as the engine processes your speech
- A pulsing red dot indicates active recording
- The panel auto-sizes and stays centered on screen
- Works across all spaces and full-screen apps
Live transcription uses streaming recognition - the same audio is also transcribed as a complete pass when you stop recording for maximum accuracy.
Speak2 runs as a menu bar app (no dock icon). Look for the microphone icon:
- White/Black (depending on macOS theme) - Idle, ready to record
- Yellow spinning arrows - Loading model
- Red mic - Recording in progress
- Cyan spinner - Transcribing
- Purple sparkles - AI refinement in progress
The menu shows a status line at the top indicating the current state (e.g., "Ready - Whisper (base.en)").
Click the menu bar icon and select Model to switch between downloaded models. Models not yet downloaded show a ↓ indicator - clicking them opens the setup window to download.
You can choose from several hotkey options in Settings > General or the menu bar Hotkey submenu:
| Hotkey | Description |
|---|---|
| Fn (default) | Function key |
| Right Option | Right Option/Alt key |
| Right Command | Right Command key |
| Hyper Key | Ctrl+Option+Cmd+Shift (all four modifiers) |
| Ctrl+Option+Space | Three-key combo |
| Custom | Any key or key+modifier combo you define |
Custom hotkeys let you record any key combination using a capture interface. You can save multiple custom combos and switch between them. Supports single keys, modifier-only triggers, and key+modifier combinations.
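One way to model the three kinds of trigger (single key, modifier-only, key plus modifiers) is an option set for modifiers and an optional key. This is an illustrative sketch only: `Modifiers`, `Hotkey`, and the string key names are hypothetical and are not taken from the Speak2 source, which may well store key codes instead.

```swift
/// Hypothetical modifier flags; the raw values are arbitrary.
struct Modifiers: OptionSet {
    let rawValue: Int
    static let control = Modifiers(rawValue: 1 << 0)
    static let option  = Modifiers(rawValue: 1 << 1)
    static let command = Modifiers(rawValue: 1 << 2)
    static let shift   = Modifiers(rawValue: 1 << 3)
    /// All four modifiers together, as in the Hyper Key row above.
    static let hyper: Modifiers = [.control, .option, .command, .shift]
}

/// Hypothetical hotkey description.
struct Hotkey: Equatable {
    var key: String?          // nil means a modifier-only trigger
    var modifiers: Modifiers

    /// True when the pressed key and held modifiers match exactly.
    func matches(key pressedKey: String?, modifiers held: Modifiers) -> Bool {
        pressedKey == key && held == modifiers
    }
}

let hyper = Hotkey(key: nil, modifiers: .hyper)
let ctrlOptionSpace = Hotkey(key: "space", modifiers: [.control, .option])
let singleKey = Hotkey(key: "f13", modifiers: [])
```

Requiring an exact modifier match means Ctrl+Option+Space does not fire while Shift is also held, which keeps saved combos from shadowing each other.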
Some external keyboards don't send the fn key reliably. If yours doesn't, choose one of the alternative hotkeys.
Click Settings... (⌘,) from the menu bar to open the unified settings window with five tabs:
- General - Hotkey configuration, recording mode (hold/toggle), live transcription toggle, permissions, launch at login
- Models - Download, manage, and delete speech recognition models; configure storage location
- Dictionary - Manage your personal dictionary (add, edit, import/export words)
- History - Browse, search, and export your transcription history
- AI Refine - Configure optional AI post-processing to clean up transcriptions
Click Add Word... from the menu bar for quick dictionary word addition without opening the full settings window.
Speak2 includes a personal dictionary feature that helps improve transcription accuracy for names, technical terms, industry jargon, and unique spellings.
Accessing the Dictionary:
- Click the menu bar icon → Add Word... for quick word addition
- Open Settings > Dictionary for full dictionary management
Adding Words:
Each dictionary entry can include:
| Field | Required | Description |
|---|---|---|
| Word | Yes | The correct spelling you want |
| Aliases | No | Common misspellings or mishearings (comma-separated) |
| Pronunciation | No | Phonetic hint for words spelled differently than pronounced |
| Category | No | Organization (Names, Technical, Medical, etc.) |
| Language | Yes | Which language this word belongs to (25 languages supported) |
How It Works:
When you speak, the transcription is post-processed using your dictionary:
- Alias matching - Direct replacement of known misspellings (exact match, case-insensitive)
- Phonetic matching - Multiple algorithms catch similar-sounding words:
- Soundex - Traditional phonetic encoding
- Metaphone - Better handling of English pronunciation rules
- Fuzzy matching - Catches words with 70%+ similarity
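Alias matching and fuzzy matching can be sketched as follows. This assumes the 70% similarity is a normalized Levenshtein score, which the README does not state; the function names and the `[word: aliases]` dictionary shape are hypothetical.

```swift
/// Levenshtein edit distance, computed one row at a time.
func editDistance(_ a: String, _ b: String) -> Int {
    let s = Array(a), t = Array(b)
    if s.isEmpty { return t.count }
    if t.isEmpty { return s.count }
    var previous = Array(0...t.count)
    for i in 1...s.count {
        var current = [i] + Array(repeating: 0, count: t.count)
        for j in 1...t.count {
            let substitution = previous[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1)
            current[j] = min(previous[j] + 1, current[j - 1] + 1, substitution)
        }
        previous = current
    }
    return previous[t.count]
}

/// Case-insensitive similarity in 0...1, where 1.0 means identical.
func similarity(_ a: String, _ b: String) -> Double {
    let x = a.lowercased(), y = b.lowercased()
    let longest = max(x.count, y.count)
    if longest == 0 { return 1.0 }
    return 1.0 - Double(editDistance(x, y)) / Double(longest)
}

/// Tries exact alias matches first, then the most similar dictionary word.
/// `dictionary` maps each correct word to its aliases (hypothetical shape).
func correct(_ heard: String, dictionary: [String: [String]], threshold: Double = 0.7) -> String {
    let needle = heard.lowercased()
    for (word, aliases) in dictionary {
        if aliases.contains(where: { $0.lowercased() == needle }) { return word }
    }
    let best = dictionary.keys.max { similarity(heard, $0) < similarity(heard, $1) }
    if let best = best, similarity(heard, best) >= threshold { return best }
    return heard  // nothing close enough: leave the text untouched
}

let sampleDictionary = [
    "Kubernetes": ["Cooper Netties", "Kubernetties"],
    "Anthropic": ["Antropik", "Anthropik"],
]
let fromAlias = correct("cooper netties", dictionary: sampleDictionary)
let fromFuzzy = correct("Kubernets", dictionary: sampleDictionary)
```

Checking aliases before similarity means an explicit alias always wins, even when another dictionary word happens to score higher.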
Using the Pronunciation Field:
The pronunciation field helps when a word is spelled very differently from how it sounds. When set, phonetic matching uses the pronunciation hint instead of the word's spelling.
| Word | Pronunciation | Why |
|---|---|---|
| Nguyen | "Win" | Vietnamese name pronounced differently than spelled |
| Siobhan | "Shivon" | Irish name with non-obvious pronunciation |
| GIF | "Jif" | If you prefer the soft G pronunciation |
| SQL | "Sequel" | Matches the spoken acronym |
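To see why the hint matters, here is a sketch using classic American Soundex, one of the phonetic algorithms named earlier. The `PhoneticEntry` type is hypothetical, and Speak2's own encoder may differ in detail; the point is only that the key is computed from the pronunciation when one is set.

```swift
/// Classic American Soundex: the first letter plus three digits.
func soundex(_ word: String) -> String {
    let codes: [Character: Character] = [
        "b": "1", "f": "1", "p": "1", "v": "1",
        "c": "2", "g": "2", "j": "2", "k": "2",
        "q": "2", "s": "2", "x": "2", "z": "2",
        "d": "3", "t": "3",
        "l": "4",
        "m": "5", "n": "5",
        "r": "6",
    ]
    let letters = word.lowercased().filter { $0 >= "a" && $0 <= "z" }
    guard let first = letters.first else { return "" }

    var result = String(first).uppercased()
    var previousCode = codes[first]
    for letter in letters.dropFirst() {
        if letter == "h" || letter == "w" { continue }  // h and w do not break a run
        let code = codes[letter]                        // nil for vowels and y
        if let code = code, code != previousCode {
            result.append(code)
            if result.count == 4 { break }
        }
        previousCode = code
    }
    return result + String(repeating: "0", count: 4 - result.count)
}

/// Hypothetical entry shape with only the fields needed here.
struct PhoneticEntry {
    let word: String
    let pronunciation: String?

    /// Encode the pronunciation hint when present, otherwise the spelling.
    var phoneticKey: String { soundex(pronunciation ?? word) }
}

let heardKey = soundex("win")
let plain = PhoneticEntry(word: "Nguyen", pronunciation: nil)
let hinted = PhoneticEntry(word: "Nguyen", pronunciation: "Win")
let matchesWithoutHint = plain.phoneticKey == heardKey
let matchesWithHint = hinted.phoneticKey == heardKey
```

Without the hint, the spelling "Nguyen" and the heard word "win" encode to different keys, so the entry would never match on sound alone.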
Examples:
| Scenario | Word | Aliases | Pronunciation |
|---|---|---|---|
| Company name | Anthropic | Antropik, Anthropik | (not needed - sounds like spelling) |
| Technical term | Kubernetes | Cooper Netties, Kubernetties | (not needed) |
| Person's name | Siobhan | Shivon, Shavon | Shivon |
| Acronym | AWS | Amazon Web Services | |
For most words, you won't need the pronunciation field - phonetic matching will handle common mishearings automatically. Use it only when the spelling is very different from the sound.
Right-Click Service:
You can also add words directly from any application:
- Select/highlight any text
- Right-click → Services → Add to Speak2 Dictionary
- Choose to add as a new word or as an alias to an existing word
Note: The service may require logging out and back in to appear after first install.
Import/Export:
The dictionary can be exported to JSON and imported on another machine via Settings > Dictionary.
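A round trip through JSON might look like the sketch below. The `DictionaryEntry` schema is an assumption assembled from the field table above; the README does not document the real export format, so key names and language codes may differ.

```swift
import Foundation

/// Hypothetical schema built from the documented fields.
struct DictionaryEntry: Codable, Equatable {
    var word: String
    var aliases: [String] = []
    var pronunciation: String? = nil
    var category: String? = nil
    var language: String
}

let entries = [
    DictionaryEntry(word: "Siobhan", aliases: ["Shivon", "Shavon"],
                    pronunciation: "Shivon", category: "Names", language: "en"),
    DictionaryEntry(word: "Kubernetes", aliases: ["Cooper Netties", "Kubernetties"],
                    language: "en"),
]

// Export: pretty-printed with sorted keys so the file is stable and easy to diff.
let encoder = JSONEncoder()
encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
let exported = try encoder.encode(entries)

// Import: decode the same bytes back into entries.
let imported = try JSONDecoder().decode([DictionaryEntry].self, from: exported)
```

Optional fields that are nil are simply omitted from the encoded JSON and decode back to nil.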
Speak2 keeps a history of your last 500 transcriptions, grouped by date (Today, Yesterday, Last 7 Days, etc.).
- Open Settings > History to browse, search, and export your transcription history
- Click the copy icon on any entry to copy it to your clipboard
- Use the model filter dropdown to show only transcriptions from a specific model
- Long transcriptions show a "Show More" toggle to expand the full text
- Each entry records the text, timestamp, model used, language, and audio length
History is stored locally at ~/Library/Application Support/Speak2/transcription_history.json.
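The 500-entry cap can be sketched as a newest-first list that drops its oldest items. The `HistoryEntry` and `History` types are hypothetical; the field names follow the list above, but the actual JSON layout of transcription_history.json is not documented here.

```swift
import Foundation

/// Hypothetical record with the fields each entry is said to keep.
struct HistoryEntry: Codable {
    let text: String
    let timestamp: Date
    let model: String
    let language: String
    let audioSeconds: Double
}

/// Newest-first history that never grows past `limit` entries.
struct History {
    let limit: Int
    private(set) var entries: [HistoryEntry] = []

    init(limit: Int = 500) {
        self.limit = limit
    }

    mutating func add(_ entry: HistoryEntry) {
        entries.insert(entry, at: 0)  // newest first
        if entries.count > limit {
            entries.removeLast(entries.count - limit)  // drop the oldest
        }
    }
}

var history = History()
for i in 0..<505 {
    history.add(HistoryEntry(text: "entry \(i)", timestamp: Date(),
                             model: "base.en", language: "en", audioSeconds: 2.5))
}
```

After the loop, the five oldest entries have been discarded and the most recent one sits at index 0.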
Speak2 can optionally clean up transcribed text using an LLM before pasting. This removes filler words, false starts, repetitions, and verbal noise - entirely on-device, no cloud required.
Open Settings > AI Refine to choose a mode:
| Mode | Description |
|---|---|
| Off | No refinement - raw transcription is pasted directly |
| Built-in (recommended) | Downloads a small LLM (~1.1 GB) that runs locally via MLX |
| External Server (Ollama) | Sends text to a local Ollama instance for processing |
Built-in mode:
- Select Built-in (recommended) in Settings > AI Refine
- Click Download Model - downloads Qwen 2.5 1.5B Instruct (~1.1 GB) once
- A green "Ready" checkmark appears when the model is cached
No additional software required. The model downloads to ~/Library/Caches/huggingface/hub/ and runs on Apple Silicon GPU via MLX.
External Server (Ollama) mode:
For users who prefer to use their own model via Ollama:
- Install and run Ollama, pull a model (e.g. ollama pull gemma3:4b)
- Select External Server (Ollama) in Settings > AI Refine
- Set the Server URL (default: http://localhost:11434) and Model Name
- Click Test Connection to verify
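For orientation, a request to a local Ollama server could be assembled roughly as below. This is a sketch under assumptions: it targets Ollama's `/api/generate` endpoint with the `model`, `prompt`, and `stream` fields, but which endpoint Speak2 calls and how it combines the cleanup prompt with the transcription are not stated in this README. `GenerateRequest` and `makeGenerateBody` are hypothetical names.

```swift
import Foundation

/// Request body for Ollama's generate endpoint (non-streaming).
struct GenerateRequest: Codable {
    let model: String
    let prompt: String
    let stream: Bool
}

/// Builds the JSON body; joining prompt and transcription this way is an assumption.
func makeGenerateBody(model: String, cleanupPrompt: String, transcription: String) throws -> Data {
    let request = GenerateRequest(
        model: model,
        prompt: "\(cleanupPrompt)\n\n\(transcription)",
        stream: false  // ask for one complete response instead of chunks
    )
    return try JSONEncoder().encode(request)
}

// Default server URL from the settings above, plus the endpoint path.
let endpoint = URL(string: "http://localhost:11434")!.appendingPathComponent("api/generate")
let body = try makeGenerateBody(
    model: "gemma3:4b",
    cleanupPrompt: "Remove filler words. Return only the cleaned text.",
    transcription: "um so I think we should uh ship it"
)
```

The body would be sent as an HTTP POST to `endpoint`; the networking call itself is left out to keep the sketch short.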
How it works:
After transcription (and dictionary post-processing), the text is sent to the selected LLM with a cleanup prompt. The refined result is pasted instead of the raw transcription. If refinement fails for any reason, Speak2 silently falls back to the original transcription so dictation is never interrupted.
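The fallback behavior amounts to wrapping the refiner call so that any failure returns the input unchanged. The sketch below uses a hypothetical `TextRefiner` protocol and two stub refiners standing in for the LLM; it is not the app's actual MLXRefiner or OllamaRefiner code, and treating an empty reply as a failure is an added assumption.

```swift
import Foundation

/// Hypothetical interface for anything that can clean up text.
protocol TextRefiner {
    func refine(_ text: String) throws -> String
}

struct RefinementError: Error {}

/// Returns the refined text, or the original if refinement is off, fails,
/// or comes back empty.
func refineOrFallback(_ text: String, using refiner: TextRefiner?) -> String {
    guard let refiner = refiner else { return text }  // refinement is off
    do {
        let refined = try refiner.refine(text)
            .trimmingCharacters(in: .whitespacesAndNewlines)
        return refined.isEmpty ? text : refined
    } catch {
        return text  // silent fallback: dictation is never interrupted
    }
}

/// Stub that always fails, like an unreachable server.
struct FailingRefiner: TextRefiner {
    func refine(_ text: String) throws -> String { throw RefinementError() }
}

/// Stub standing in for an LLM: drops two common filler words.
struct FillerStripper: TextRefiner {
    func refine(_ text: String) throws -> String {
        let fillers: Set<String> = ["um", "uh"]
        return text.split(separator: " ")
            .filter { !fillers.contains($0.lowercased()) }
            .joined(separator: " ")
    }
}

let raw = "um so I think uh we should ship it"
let cleaned = refineOrFallback(raw, using: FillerStripper())
let untouched = refineOrFallback(raw, using: FailingRefiner())
```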
During refinement the menu bar icon shows a purple sparkles symbol and the status reads "Refining with AI...".
Custom prompt:
The default prompt instructs the model to clean up transcription without adding commentary. You can replace it with anything - for example a prompt that formats output as bullet points, translates to another language, or applies domain-specific corrections. Leave the field empty to restore the default. The prompt is shared between both built-in and external modes.
Toggle launch at login in Settings > General.
Click the menu bar icon and click "Quit Speak2".
- HotkeyManager - Detects hotkey press/release (hold mode) or double-press (toggle mode) using CGEvent tap
- AudioRecorder - Captures microphone audio at 16kHz mono PCM via AVAudioEngine
- ModelManager - Handles model downloading, loading, switching, and dispatching to the active engine
- WhisperTranscriber - Runs WhisperKit on-device for speech-to-text (supports streaming and file-based transcription)
- ParakeetTranscriber - Runs FluidAudio/Parakeet on-device for speech-to-text (supports streaming and file-based transcription)
- DictationController - Orchestrates the full record → stream → transcribe → refine → paste pipeline
- DictionaryProcessor - Post-processes transcription using personal dictionary (alias replacement + phonetic matching)
- MLXRefiner - Built-in LLM refinement using MLX (Qwen 2.5 1.5B Instruct); downloads and runs on-device
- OllamaRefiner - External LLM refinement via a local Ollama server; falls back to original text on any error
- LiveTranscriptionPanel - Floating overlay that displays streaming transcription results in real time
- TranscriptionHistoryStorage - Persists transcription history to local JSON (up to 500 entries)
- TextInjector - Copies transcription to clipboard, simulates Cmd+V to paste, then restores original clipboard contents
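The save, paste, restore sequence described for TextInjector can be sketched against a small pasteboard abstraction. Everything here is hypothetical: `Pasteboard`, `InMemoryPasteboard`, and `inject` are illustration names, and in the app the protocol would presumably be backed by the system pasteboard with a synthesized Cmd+V in place of the closure.

```swift
/// Hypothetical abstraction over a clipboard holding plain text.
protocol Pasteboard: AnyObject {
    var string: String? { get set }
}

/// In-memory stand-in so the sequence can be exercised without the system clipboard.
final class InMemoryPasteboard: Pasteboard {
    var string: String?
    init(_ initial: String?) { self.string = initial }
}

/// Puts `text` on the pasteboard, triggers a paste, then restores what was there before.
func inject(_ text: String, via pasteboard: Pasteboard, paste: () -> Void) {
    let saved = pasteboard.string   // 1. remember the user's clipboard
    pasteboard.string = text        // 2. swap in the transcription
    paste()                         // 3. paste into the focused app
    pasteboard.string = saved       // 4. put the original contents back
}

let clipboard = InMemoryPasteboard("previously copied text")
var pastedText = ""
inject("hello from dictation", via: clipboard) {
    pastedText = clipboard.string ?? ""
}
```

A real implementation would likely need a short delay before step 4, since the target app reads the clipboard asynchronously after the paste keystroke; that timing is not modeled here.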
Both transcription engines implement a StreamingTranscriptionEngine protocol for live transcription, using a custom AVAudioEngine-based sliding-window approach for real-time results. A word-level diff algorithm separates confirmed from unconfirmed text in the overlay.
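One simple form of word-level diff treats the words shared by two consecutive streaming results as confirmed and everything after the first change as unconfirmed. The README does not spell out the algorithm, so this common-prefix variant and the `splitConfirmed` name are assumptions for illustration.

```swift
/// Splits the current hypothesis into the part that was stable since the
/// previous hypothesis (confirmed) and the part that is still changing.
func splitConfirmed(previous: String, current: String) -> (confirmed: String, unconfirmed: String) {
    let old = previous.split(separator: " ")
    let new = current.split(separator: " ")

    // Count how many leading words are identical in both hypotheses.
    var stable = 0
    while stable < old.count, stable < new.count, old[stable] == new[stable] {
        stable += 1
    }

    return (
        confirmed: new[..<stable].joined(separator: " "),
        unconfirmed: new[stable...].joined(separator: " ")
    )
}

let update = splitConfirmed(
    previous: "send the report to",
    current: "send the report two Alice"
)
// update.confirmed would be drawn in normal weight, update.unconfirmed in italic.
```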
The selected model stays loaded in memory (~300-600MB RAM depending on model) for instant transcription.
- Speak naturally with punctuation inflection - Whisper handles periods, commas, and question marks based on your tone
- Keep recordings under 30 seconds for best performance
- First transcription may be slightly slower as the model warms up
- Enable live transcription to see your words as you speak - useful for catching errors early
- Add frequently used names and technical terms to your personal dictionary for better accuracy
- Use aliases for words that are commonly misheard (e.g., add "Kubernetes" with alias "Cooper Netties")
- Try the built-in AI refinement to automatically clean up filler words like "um", "uh", and false starts
- Use toggle mode for longer dictations so you don't have to hold the key down
If you upgraded from an earlier version of Speak2, your models may have been stored at a legacy location (~/Documents/huggingface). The app attempts to migrate these automatically, but if you experience issues:
Quick fix:
- Open Settings > Models
- Delete the affected model (trash icon)
- Re-download it
Manual cleanup (if needed):
# Remove any orphaned model files at the old location
rm -rf ~/Documents/huggingface
# Remove incorrectly migrated files (if present)
rm -rf ~/Library/Application\ Support/Speak2/Models/huggingface
# Reset migration flag to trigger fresh migration on next launch
defaults delete com.zachswift.speak2 didAttemptLegacyMigrationV2
Then restart Speak2. If you had models at the legacy location, they'll be migrated to the correct path.
Try clicking on the model in Settings > Models to reload it. If that doesn't work, delete and re-download the model.
Speak2 automatically detects permission changes and starts the hotkey listener. If the hotkey still doesn't respond after granting both Accessibility and Microphone permissions, quit and relaunch Speak2. macOS occasionally requires a restart for CGEvent tap permissions to take effect.
If a multilingual Whisper model (large-v3, large-v3 turbo) translates your speech to English instead of transcribing it in the original language, update to the latest version. This was fixed in v1.5.0.
- Parakeet model takes ~20-30 seconds to load on first use (compiling neural engine model)
- Uses clipboard for text injection (briefly swaps clipboard contents, then restores them)
- fn key detection requires Accessibility permission
- Only tested on Apple Silicon Macs
- Live transcription streaming accuracy may differ slightly from the final transcription pass
- Swift + SwiftUI
- WhisperKit - Argmax's Whisper implementation, optimized for Apple Silicon
- FluidAudio - Parakeet speech recognition for Apple Silicon
- MLX Swift - On-device LLM inference for built-in text refinement
- AVFoundation / AVAudioEngine for audio capture and streaming
- CGEvent for global hotkey detection
- Accelerate (vDSP) for audio level analysis
- Listen2 — On-device text-to-speech for macOS. The companion to Speak2.
MIT