-
Notifications
You must be signed in to change notification settings - Fork 0
Open
0 / 10 of 1 issue completedLabels
enhancementNew feature or requestNew feature or request
Description
Dictation Plan
Goal
Implement fast, accurate, local dictation on macOS for VS Code, Obsidian, and terminal workflows (Ghostty + tmux). Prefer streaming; hold-to-talk acceptable for MVP.
- Hardware: Apple Silicon (M1 Max, M4)
- English only
- Local-only (low-latency, no network calls, no subscription costs)
Approach
- Engine: whisper.cpp (Metal-accelerated, open source)
- MVP: hold-to-talk → transcribe → clipboard → smart paste
- Hotkey: Hammerspoon (global hotkey + per-app paste logic)
- Later: streaming via CLI if needed
Phase 0 — Setup
# 1. Install deps (adds to .Brewfile)
brew install whisper-cpp hammerspoon sox
# 2. Run setup (checks deps, downloads model)
dictate --setup
# 3. Grant Hammerspoon permissions
# System Settings → Privacy → Accessibility
# System Settings → Privacy → Microphone
# 4. Reload HammerspoonPhase 1 — MVP (hold-to-talk)
Deliverables
-
bin/dictate--setup: check/install deps, download model (interactive)--check: quick deps check (exit 0 if ready, 1 if not)- Default: record 16kHz mono WAV, transcribe, output plaintext to stdout
- Exit codes: 0 success, 1 not ready, 2 recording failed, 3 transcription failed
- Model:
ggml-large-v3-turbo-q5_0.bin(quantized, ~600MB, best accuracy/speed) - Temp files:
~/.cache/dictate/(delete after transcription) - Model files:
~/.cache/dictate/models/ - Minimum recording: 0.5s (skip transcription if shorter, still play sounds)
-
.hammerspoon/init.lua- Hold-to-talk: hold Fn to record (matches Wispr Flow PTT style)
- On key down: play start sound, start sox recording, show menu bar indicator
- On key up: play stop sound, stop sox, run transcription, copy to clipboard, smart paste
- Menu bar icon: idle (⎯) / recording (●) / processing (⋯)
- Audio feedback via
hs.sound.getByName(): start=Tink, stop=Pop, error=Basso - On error (exit 1): play error sound, show notification "Run: dictate --setup"
- Fn detection via
hs.eventtap+flagsChanged(raw flag0x800000) - Note: If using Wispr Flow, disable it or configure different hotkey in Phase 2
- Future: check deps on Hammerspoon load, show ⚠ if not ready
Smart paste logic
Detect frontmost app and use appropriate paste method:
| App | Paste method |
|---|---|
| Ghostty | Cmd+Shift+V (bracketed paste, works in tmux) |
| Terminal.app | Cmd+V |
| iTerm2 | Cmd+V |
| VS Code, Obsidian, others | Cmd+V |
Note: Bracketed paste works correctly in tmux. Defer tmux set-buffer detection to Phase 2 only if users report issues.
Acceptance criteria
- Works reliably in:
- VS Code, Obsidian (GUI apps)
- Ghostty + tmux (terminal)
- Claude Code, Gemini CLI, Copilot CLI, Codex (AI CLI tools in terminal)
- Audio feedback on start/stop/error (no menu bar required for MVP)
- Transcript pasted after releasing hotkey
- Graceful failure if sox/whisper-cpp unavailable
Phase 2 — Quality pass
Enhancements
- Post-processing substitutions:
- "backtick" → `, "open paren" → (, "underscore" → _
- Optional: "snake case" / "camel case" transforms
- Technical vocabulary prompting (if whisper.cpp supports initial prompt)
- Silence trim before transcription (reduce latency)
- Config file:
~/.config/dictate/config.ymlfor substitutions, per-app rules, hotkey - Configurable hotkey (Fn, Hyper+D, custom)
- tmux
set-bufferpaste if bracketed paste proves problematic
Acceptance criteria
- Technical tokens improved vs baseline
- Config file supports substitutions and hotkey customization
Phase 3 — Streaming evaluation
Goal
Validate real-time streaming via whisper.cpp --stream mode.
Tasks
- Test streaming with
whisper-cpp --stream - Evaluate partials vs final output format
- Strategy: show partials in overlay, paste final on silence/endpoint
Acceptance criteria
- Streaming observed live (or documented limitations)
- Clear decision: proceed with streaming or stay with PTT
Phase 4 — Native app (only if needed)
Pursue only if Phase 3 streaming insufficient.
Implementation outline
- Swift app with AVAudioEngine capture
- whisper.cpp via C++ bindings or whisper-cpp-kit
- VAD/endpointing (silence detection)
- UI: menu bar + optional overlay for partials
- Alternative: Hammerspoon
hs.canvasoverlay (lighter weight)
Acceptance criteria
- Low-latency streaming dictation
- Final transcript pasted once, no "partial spam"
Decisions made
- Engine — whisper.cpp (open source, Metal-accelerated, no subscription)
- Hold-to-talk (not toggle) — simpler, no state tracking
- Fn-hold hotkey — matches Wispr Flow PTT, configurable in Phase 2
- Smart paste — per-app detection, Cmd+Shift+V for Ghostty (bracketed paste works in tmux)
- Model —
ggml-large-v3-turbo-q5_0.binfor English + technical terms - Paths — temp in
~/.cache/dictate/, config in~/.config/dictate/(XDG compliant) - Audio — Hammerspoon
hs.sound.getByName(): Tink/Pop/Basso - Min duration — 0.5s, skip transcription if shorter
- Menu bar indicator — required for recording feedback
- Phase 4 conditional — only if streaming insufficient
Reactions are currently unavailable
Sub-issues
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request