
feat: add voice dictation via whisper.cpp #28

Closed
gapmiss wants to merge 1 commit into my-claude-utils:main from gapmiss:feat/voice-dictation

Conversation


@gapmiss gapmiss commented Mar 18, 2026

Summary

  • Hold-to-talk mic button on the context strip — records audio on the phone, transcribes locally via whisper.cpp, injects text into the terminal as stdin
  • Fully local — no cloud APIs, audio never leaves the machine
  • Optional feature — only activates when WHISPER_MODEL env var is set; zero impact otherwise

Changes

Server (packages/agent):

  • POST /api/transcribe endpoint — JWT-authenticated, accepts multipart audio via multer, converts to WAV via ffmpeg, runs whisper-cli, returns transcribed text
  • config.ts — added WHISPER_CPP_PATH and WHISPER_MODEL env vars
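The route implementation isn't included in this excerpt, but the described pipeline (multipart upload → 16 kHz WAV via ffmpeg → whisper-cli → text) can be sketched as pure helpers. The function and field names below are illustrative assumptions, not the PR's actual code:

```typescript
// Hypothetical sketch of the /api/transcribe pipeline described above.
// Only the flow (upload -> ffmpeg -> whisper-cli) comes from the PR;
// names and shapes here are illustrative.

interface WhisperConfig {
  whisperCliPath: string; // derived from WHISPER_CPP_PATH
  modelPath: string;      // derived from WHISPER_MODEL
}

// ffmpeg argument list: downmix to mono, resample to the 16 kHz WAV
// input that whisper.cpp expects.
function ffmpegArgs(input: string, output: string): string[] {
  return ["-i", input, "-ar", "16000", "-ac", "1", "-f", "wav", output];
}

// Full whisper-cli command: model path, input WAV, timestamps suppressed
// so stdout is plain transcribed text.
function whisperCommand(cfg: WhisperConfig, wavPath: string): string[] {
  return [cfg.whisperCliPath, "-m", cfg.modelPath, "-f", wavPath, "--no-timestamps"];
}
```

On the server, these argument arrays would be handed to something like `child_process.execFile` after multer has written the upload to a temp file.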

Client (packages/web):

  • useDictation hook — MediaRecorder with webm/opus + mp4/aac iOS fallback
  • Mic button in ContextStrip with hold-to-talk, recording/processing toast indicators, haptic feedback
  • Wired into TerminalView to send transcribed text as stdin
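The hook's source isn't shown here, but the container fallback the useDictation bullet describes can be sketched as a small pure function. The candidate list and helper name are assumptions for illustration:

```typescript
// Hypothetical sketch of the recording-format fallback: prefer webm/opus,
// fall back to mp4/aac where webm recording is unsupported (iOS Safari).
const MIME_CANDIDATES = ["audio/webm;codecs=opus", "audio/mp4"];

// `isSupported` stands in for MediaRecorder.isTypeSupported so the
// selection logic can be exercised outside a browser.
function pickRecordingMime(isSupported: (t: string) => boolean): string | undefined {
  return MIME_CANDIDATES.find(isSupported);
}
```

In the browser this would be called as `pickRecordingMime(t => MediaRecorder.isTypeSupported(t))` and the result passed as `{ mimeType }` to the `MediaRecorder` constructor.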

Other:

  • Added .ngrok-free.app to Vite allowedHosts (was missing, only had .ngrok-free.dev)
  • README updated with voice dictation setup instructions and configuration
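The allowedHosts change would look roughly like this in vite.config.ts; only the `.ngrok-free.app` entry is new per the PR, and the surrounding config is assumed context:

```typescript
// Sketch of the Vite dev-server allowedHosts fix (assumed config shape).
import { defineConfig } from "vite";

export default defineConfig({
  server: {
    // A leading dot allows the host and all of its subdomains.
    allowedHosts: [".ngrok-free.dev", ".ngrok-free.app"],
  },
});
```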

Dependencies

  • multer (new server dependency for multipart upload)
  • Requires ffmpeg and whisper-cpp on the host (both available via brew)

Test plan

  • Tap mic button to start → "Recording..." toast appears, mic access granted
  • Tap mic button to stop → "Transcribing..." toast, then text appears in terminal
  • Tested on iOS Safari (PWA) and desktop Chrome
  • Verified no impact when WHISPER_MODEL is not set (mic button still shows, server returns 503 gracefully)
  • Typecheck passes on both packages
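The graceful-degradation behavior verified above (503 when the feature is unconfigured) can be sketched as a tiny guard; the function name and shape are illustrative, not the PR's code:

```typescript
// Hypothetical sketch of the availability check the test plan verifies:
// with WHISPER_MODEL unset, the endpoint reports 503 Service Unavailable
// ("feature not configured") rather than crashing.
function transcribeAvailability(env: Record<string, string | undefined>): number {
  return env.WHISPER_MODEL ? 200 : 503;
}
```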

🤖 Generated with Claude Code

Hold-to-talk mic button on the context strip records audio on the phone,
sends it to the agent server, and transcribes it locally using whisper.cpp.
Transcribed text is injected into the terminal as stdin. Everything stays
local — no cloud APIs.

Server:
- POST /api/transcribe endpoint (JWT-authenticated, multipart audio via multer)
- Converts audio to 16kHz WAV via ffmpeg, runs whisper-cli, returns text
- Configurable via WHISPER_CPP_PATH and WHISPER_MODEL env vars

Client:
- useDictation hook (MediaRecorder with webm/opus + mp4/aac fallback)
- Mic button with hold-to-talk, recording/processing toast indicators
- Haptic feedback on recording start

Also fixes .ngrok-free.app missing from Vite allowedHosts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
gapmiss force-pushed the feat/voice-dictation branch from 9c8b97c to 795bef0 on March 18, 2026 at 02:47
@my-claude-utils (Owner) commented

Hey @gapmiss, thanks for the PR and for taking the time to build this out! Voice dictation via whisper.cpp is a genuinely cool idea, and the local-only approach fits well with clsh's philosophy.

That said, I'm going to close this one. The CLI is still very early and I want to keep the core lean and focused on the terminal experience for now. Adding external system dependencies (ffmpeg, whisper-cpp) and a new API surface is more than I'm comfortable merging without community discussion first.

I've opened a feature request issue so the community can weigh in. If there's enough interest, we can figure out the right way to integrate it (maybe as a plugin system down the road).

If you want to use this yourself in the meantime, a fork would work great. Appreciate the contribution!

