
Implement local dictation with whisper.cpp + Hammerspoon #1

@ansonhoyt

Description

Dictation Plan

Goal

Implement fast, accurate, local dictation on macOS for VS Code, Obsidian, and terminal workflows (Ghostty + tmux). Prefer streaming; hold-to-talk acceptable for MVP.

  • Hardware: Apple Silicon (M1 Max, M4)
  • English only
  • Local-only (low-latency, no network calls, no subscription costs)

Approach

  • Engine: whisper.cpp (Metal-accelerated, open source)
  • MVP: hold-to-talk → transcribe → clipboard → smart paste
  • Hotkey: Hammerspoon (global hotkey + per-app paste logic)
  • Later: streaming via CLI if needed

Phase 0 — Setup

```shell
# 1. Install deps (adds to .Brewfile); Hammerspoon ships as a cask
brew install whisper-cpp sox
brew install --cask hammerspoon

# 2. Run setup (checks deps, downloads model)
dictate --setup

# 3. Grant Hammerspoon permissions
#    System Settings → Privacy & Security → Accessibility
#    System Settings → Privacy & Security → Microphone

# 4. Reload Hammerspoon (menu bar icon → Reload Config)
```
Phase 1 — MVP (hold-to-talk)

Deliverables

  1. bin/dictate

    • --setup: check/install deps, download model (interactive)
    • --check: quick deps check (exit 0 if ready, 1 if not)
    • Default: record 16kHz mono WAV, transcribe, output plaintext to stdout
    • Exit codes: 0 success, 1 not ready, 2 recording failed, 3 transcription failed
    • Model: ggml-large-v3-turbo-q5_0.bin (quantized, ~600MB, best accuracy/speed)
    • Temp files: ~/.cache/dictate/ (delete after transcription)
    • Model files: ~/.cache/dictate/models/
    • Minimum recording: 0.5s (skip transcription if shorter, still play sounds)
  2. .hammerspoon/init.lua

    • Hold-to-talk: hold Fn to record (matches Wispr Flow PTT style)
    • On key down: play start sound, start sox recording, show menu bar indicator
    • On key up: play stop sound, stop sox, run transcription, copy to clipboard, smart paste
    • Menu bar icon: idle (⎯) / recording (●) / processing (⋯)
    • Audio feedback via hs.sound.getByName(): start=Tink, stop=Pop, error=Basso
    • On error (exit 1): play error sound, show notification "Run: dictate --setup"
    • Fn detection via hs.eventtap + flagsChanged (raw flag 0x800000)
    • Note: If using Wispr Flow, disable it or configure different hotkey in Phase 2
    • Future: check deps on Hammerspoon load, show ⚠ if not ready
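The `bin/dictate` contract above might be sketched roughly as follows. This is a hypothetical sketch, not the final script: the `whisper-cli` binary name and its `-m`/`-f`/`--no-timestamps` flags are assumptions about the Homebrew `whisper-cpp` install, and `soxi` ships with sox.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of bin/dictate (exit codes per the plan).
set -u

CACHE="${XDG_CACHE_HOME:-$HOME/.cache}/dictate"
MODEL="$CACHE/models/ggml-large-v3-turbo-q5_0.bin"

# --check: exit 0 if ready, 1 if not
deps_ready() {
  command -v sox >/dev/null 2>&1 &&
    command -v whisper-cli >/dev/null 2>&1 &&
    [ -f "$MODEL" ]
}

# Default mode: record 16 kHz mono WAV, transcribe, print plaintext to stdout.
# Exit codes: 0 success, 1 not ready, 2 recording failed, 3 transcription failed.
dictate() {
  deps_ready || return 1
  mkdir -p "$CACHE"
  local wav="$CACHE/rec-$$.wav"
  trap 'rm -f "$wav"' RETURN                  # temp file deleted after use
  # Record from the default input until the caller stops us (key-up sends SIGINT).
  sox -d -r 16000 -c 1 -b 16 "$wav" || return 2
  # Skip transcription (but exit cleanly) for clips shorter than 0.5 s.
  local dur; dur=$(soxi -D "$wav")
  awk -v d="$dur" 'BEGIN { exit !(d >= 0.5) }' || return 0
  whisper-cli -m "$MODEL" -f "$wav" --no-timestamps 2>/dev/null || return 3
}

# Demo: report readiness without touching the microphone.
deps_ready && echo ready || echo "not ready"
```

Hammerspoon can then call `dictate` on key-down/key-up and branch on the exit code alone, which keeps the Lua side free of any transcription logic.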

Smart paste logic

Detect frontmost app and use appropriate paste method:

| App | Paste method |
| --- | --- |
| Ghostty | Cmd+Shift+V (bracketed paste, works in tmux) |
| Terminal.app | Cmd+V |
| iTerm2 | Cmd+V |
| VS Code, Obsidian, others | Cmd+V |

Note: Bracketed paste works correctly in tmux. Defer tmux set-buffer detection to Phase 2 only if users report issues.
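The real version of this mapping lives in `.hammerspoon/init.lua` (Lua, using `hs.eventtap.keyStroke`), but the per-app decision itself is a pure function, sketched here in shell for illustration. App names match the table above.

```shell
# Map the frontmost app's name to the key chord to synthesize (sketch only;
# the real dispatch happens in Hammerspoon, not a shell script).
paste_chord() {
  case "$1" in
    Ghostty) echo "cmd+shift+v" ;;  # bracketed paste, safe inside tmux
    *)       echo "cmd+v" ;;        # Terminal.app, iTerm2, VS Code, Obsidian, ...
  esac
}

paste_chord Ghostty   # → cmd+shift+v
paste_chord iTerm2    # → cmd+v
```

Defaulting to plain Cmd+V keeps the table short: only apps that need a special chord get an explicit case.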

Acceptance criteria

  • Works reliably in:
    • VS Code, Obsidian (GUI apps)
    • Ghostty + tmux (terminal)
    • Claude Code, Gemini CLI, Copilot CLI, Codex (AI CLI tools in terminal)
  • Audio feedback on start/stop/error (audio alone is sufficient for MVP acceptance; the menu bar indicator is a Phase 1 deliverable but not an acceptance gate)
  • Transcript pasted after releasing hotkey
  • Graceful failure if sox/whisper-cpp unavailable

Phase 2 — Quality pass

Enhancements

  • Post-processing substitutions:
    • "backtick" → `, "open paren" → (, "underscore" → _
    • Optional: "snake case" / "camel case" transforms
  • Technical vocabulary prompting (if whisper.cpp supports initial prompt)
  • Silence trim before transcription (reduce latency)
  • Config file: ~/.config/dictate/config.yml for substitutions, per-app rules, hotkey
  • Configurable hotkey (Fn, Hyper+D, custom)
  • tmux set-buffer paste if bracketed paste proves problematic
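The spoken-symbol substitutions can start as a simple ordered find-and-replace pass over the transcript. A minimal sketch (the "close paren" mapping is an illustrative addition, not from the plan above):

```shell
# Post-process a transcript: spoken symbol names → literal characters.
subst() {
  sed -e 's/open paren/(/g' \
      -e 's/close paren/)/g' \
      -e 's/underscore/_/g' \
      -e 's/backtick/`/g'
}

echo "print open paren user underscore id close paren" | subst
# → print ( user _ id )
```

Once the config file lands, these pairs move out of the script and into `~/.config/dictate/config.yml`.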

Acceptance criteria

  • Technical tokens improved vs baseline
  • Config file supports substitutions and hotkey customization
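A possible shape for `~/.config/dictate/config.yml`. This schema is hypothetical; key names are illustrative, not final.

```yaml
# ~/.config/dictate/config.yml — hypothetical Phase 2 schema
hotkey: fn                 # fn | hyper+d | custom
substitutions:
  "open paren": "("
  "underscore": "_"
  "backtick": "`"
apps:
  Ghostty:
    paste: cmd+shift+v     # bracketed paste, works in tmux
  default:
    paste: cmd+v
```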

Phase 3 — Streaming evaluation

Goal

Validate real-time streaming via whisper.cpp's stream example (whisper-stream).

Tasks

  • Test streaming with whisper-stream (whisper.cpp's SDL2-based stream example)
  • Evaluate partials vs final output format
  • Strategy: show partials in overlay, paste final on silence/endpoint

Acceptance criteria

  • Streaming observed live (or documented limitations)
  • Clear decision: proceed with streaming or stay with PTT

Phase 4 — Native app (only if needed)

Pursue only if Phase 3 streaming insufficient.

Implementation outline

  • Swift app with AVAudioEngine capture
  • whisper.cpp via C++ bindings or whisper-cpp-kit
  • VAD/endpointing (silence detection)
  • UI: menu bar + optional overlay for partials
  • Alternative: Hammerspoon hs.canvas overlay (lighter weight)

Acceptance criteria

  • Low-latency streaming dictation
  • Final transcript pasted once, no "partial spam"

Decisions made

  • Engine — whisper.cpp (open source, Metal-accelerated, no subscription)
  • Hold-to-talk (not toggle) — simpler, no state tracking
  • Fn-hold hotkey — matches Wispr Flow PTT, configurable in Phase 2
  • Smart paste — per-app detection, Cmd+Shift+V for Ghostty (bracketed paste works in tmux)
  • Model — ggml-large-v3-turbo-q5_0.bin for English + technical terms
  • Paths — temp in ~/.cache/dictate/, config in ~/.config/dictate/ (XDG compliant)
  • Audio — Hammerspoon hs.sound.getByName(): Tink/Pop/Basso
  • Min duration — 0.5s, skip transcription if shorter
  • Menu bar indicator — required for recording feedback
  • Phase 4 conditional — only if streaming insufficient
