feat(soniox): add Soniox real-time streaming STT provider #418

Open
DamianPala wants to merge 15 commits into OpenWhispr:main from DamianPala:feat/soniox-streaming

Conversation

@DamianPala
Contributor

@DamianPala DamianPala commented Mar 12, 2026

Summary

Adds Soniox as a fifth cloud STT provider. Soniox offers strong accuracy on English as well as Slavic and Eastern European languages, competitive pricing (significantly cheaper than Deepgram/AssemblyAI for comparable quality), and sub-second cold start (~250ms, no warmup connection needed).

Key additions:

  • Secondary language hints for mixed-language transcription (e.g. Polish + English in the same session), useful for multilingual users who code-switch
  • Full integration matching existing provider patterns: settings UI, onboarding, API key management, BYOK detection, icon, i18n (10 locales)

Also introduces the project's first unit tests (25 tests, Node built-in runner, zero new deps).

Changes

Core streaming (src/helpers/sonioxStreaming.js): New 375-line module. WebSocket connection to Soniox RT API, cold-start PCM buffering (3s at 16kHz), keepalive with 30s idle timeout, graceful finalization with drain. Includes text-level filler word cleanup to handle Soniox BPE tokenization artifacts.
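To make the cold-start buffering concrete, here is a minimal sketch of a bounded PCM queue. It is not the code from this PR: the class name `ColdStartBuffer`, the 16-bit sample width, and the drop-oldest policy are all assumptions made for illustration. Only the 3 s window and the 16 kHz rate come from the description above.

```javascript
// Hypothetical sketch: hold PCM chunks while the WebSocket is still connecting.
// Assumes 16-bit mono PCM; the sample width is not stated in the PR.
const SAMPLE_RATE_HZ = 16000;
const BYTES_PER_SAMPLE = 2; // assumption
const MAX_BUFFER_SECONDS = 3;
const MAX_BUFFER_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * MAX_BUFFER_SECONDS;

class ColdStartBuffer {
  constructor(maxBytes = MAX_BUFFER_BYTES) {
    this.maxBytes = maxBytes;
    this.chunks = [];
    this.size = 0;
  }

  // Queue a chunk; once the cap is exceeded, drop the oldest audio (assumed policy).
  push(chunk) {
    this.chunks.push(chunk);
    this.size += chunk.length;
    while (this.size > this.maxBytes && this.chunks.length > 1) {
      this.size -= this.chunks.shift().length;
    }
  }

  // Hand every buffered chunk to `send` in arrival order, then reset.
  flush(send) {
    const pending = this.chunks;
    this.chunks = [];
    this.size = 0;
    for (const chunk of pending) send(chunk);
    return pending.length;
  }
}
```

Under these assumptions the cap works out to 96,000 bytes. A caller would `push()` audio until the socket's open event fires, then call `flush((chunk) => socket.send(chunk))`.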

IPC & audio (ipcHandlers.js, audioManager.js): Soniox handlers mirroring existing providers. isDestroyed() guards, cleanupAllStreaming() on app quit, defensive trim before paste.

UI (TranscriptionModelPicker.tsx, SettingsPage.tsx, OnboardingFlow.tsx): Soniox tab with API key input, model selection via registry, secondary language selector for mixed-language transcription. Unified with existing provider card pattern.

Tests (tests/helpers/sonioxStreaming.test.js): 25 tests for text processing using Node built-in test runner (zero new dependencies).

Test plan

  • npm test — 25 unit tests pass
  • Manual: Add Soniox API key in Settings → Soniox tab, select stt-rt-v4 model
  • Manual: Record speech with fillers ("uh", "um", "hmm") → verify they are stripped from transcript
  • Manual: Record speech starting with a filler → verify first letter is capitalized
  • Manual: Set secondary language (e.g. English + Polish), speak mixed-language → verify transcription
  • Manual: Verify no WebSocket leak after multiple start/stop cycles (check DevTools Network tab)
  • CI: Linux and Windows builds pass (build run)

@DamianPala DamianPala force-pushed the feat/soniox-streaming branch from 221b476 to 9d02380 on March 12, 2026 12:07
@DamianPala DamianPala marked this pull request as ready for review March 12, 2026 12:34
@DamianPala DamianPala force-pushed the feat/soniox-streaming branch from 9d02380 to 86990c2 on March 13, 2026 10:18
@gabrielste1n
Collaborator

very cool thanks @DamianPala - will aim to review asap

@gabrielste1n gabrielste1n self-requested a review March 13, 2026 16:14
@DamianPala DamianPala marked this pull request as draft March 13, 2026 21:36
@alumpe
Contributor

alumpe commented Mar 14, 2026

Soniox looks really great and the quality of their speech recognition is crazy good, I'd use this asap once it's merged in!

@DamianPala
Contributor Author

I am still testing this daily and ran into one issue: latency. Using the Soniox backend from Europe, it takes 400-500ms to open a WebSocket each time. That delay is noticeable compared to other providers that pre-open connections.

I added a configurable warm-connection setting, "Stay connected for", in the Soniox tab. When enabled, the WebSocket stays open between recordings, so the next one starts instantly instead of after ~500ms. Since Soniox charges for connection time (not just audio), each option shows the estimated cost increase. Default is Off.

Still testing real-world costs. Should be ready to merge in the next several days.

PS. I contacted Soniox with a feature request, but I don't think they will change it quickly.

Add Soniox as a fourth cloud streaming provider alongside Deepgram,
AssemblyAI, and OpenAI Realtime. Includes WebSocket streaming core with
cold-start buffering, full Electron IPC pipeline, settings UI with API
key management, onboarding validation, and BYOK detection.

- Remove Soniox-specific render branch in TranscriptionModelPicker,
  use same ModelCardList + API key maps as OpenAI/Groq/Mistral
- Replace hardcoded "stt-rt-v4" in UI with registry-based model selection
- Add Soniox "S" icon SVG (from official wordmark)
- Translate soniox_stt_rt_v4 model description in 9 locale files

When audioManager calls finalize() before disconnect(), the server has
already received it. Sending it again in drainFinalTokens() caused a 3s
timeout waiting for a response that would never come. Track finalize
state with _finalizeSent flag and skip the redundant call.
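The guard described in this commit can be sketched as follows. Only the `_finalizeSent` flag and the method names `finalize()` and `drainFinalTokens()` come from the commit message. The wrapper class `FinalizeGuard` is hypothetical, the `{ type: "finalize" }` payload is an assumption about the wire format, and the real drain logic (waiting for the server's response) is left out.

```javascript
// Hypothetical sketch of the double-finalize guard; not the PR's actual code.
class FinalizeGuard {
  constructor(socket) {
    this.socket = socket;
    this._finalizeSent = false;
  }

  // Send the finalize message at most once per session.
  finalize() {
    if (this._finalizeSent) return false;
    // Assumed message shape; check the provider's API reference.
    this.socket.send(JSON.stringify({ type: "finalize" }));
    this._finalizeSent = true;
    return true;
  }

  // On disconnect: only sends finalize if the caller has not already done so.
  drainFinalTokens() {
    return this.finalize();
  }
}
```

With this shape, calling `finalize()` from the audio manager and then `drainFinalTokens()` during disconnect puts exactly one message on the socket.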
Soniox connects in ~250ms, no benefit from keeping an idle WebSocket
between dictation sessions. Avoids unnecessary Soniox session usage
and potential idle timeout issues.

- Remove closeResolve (never assigned, close handler check unreachable)
- Use getFullTranscript() instead of inline .map().join() duplicate
- Remove soniox special-case in handleCloudProviderChange (generic path handles it)

Soniox supports multi-language transcription via language_hints array.
Add a secondary language selector in the Soniox provider tab so users
can hint a second language (e.g. Polish + English) for code-switching.

- New sonioxSecondaryLanguage setting in store/hook
- LanguageSelector dropdown in Soniox tab (inline layout)
- Disabled when primary language is auto (no bias needed)
- Language codes normalized to base form (en-US → en)
- i18n keys added for all 10 locales
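A rough sketch of how the hint list might be assembled from the two settings. The function names `toBaseLanguage` and `buildLanguageHints` are hypothetical, and returning an empty list when the primary language is `auto` is an assumption based on the "no bias needed" note above.

```javascript
// Hypothetical helpers; names and the empty-list behavior are assumptions.

// Reduce a locale tag to its base language code, e.g. "en-US" -> "en".
function toBaseLanguage(code) {
  return String(code).trim().toLowerCase().split(/[-_]/)[0];
}

// Build the hint list: primary first, then the secondary if set and distinct.
function buildLanguageHints(primary, secondary) {
  if (!primary || primary === "auto") return [];
  const hints = [toBaseLanguage(primary)];
  if (secondary && secondary !== "auto") {
    const base = toBaseLanguage(secondary);
    if (!hints.includes(base)) hints.push(base);
  }
  return hints;
}
```

For example, `buildLanguageHints("pl", "en-US")` yields `["pl", "en"]`, while two variants of the same language collapse to a single hint.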
- Add 30s idle timeout to Soniox keepalive to prevent zombie WebSocket
  connections surviving renderer hot-reload or crash
- Add cleanupAllStreaming() to close all streaming backends on app quit
- Add isDestroyed() guards to Soniox and dictation IPC callbacks,
  matching the pattern used by Deepgram and AssemblyAI
- Prefer cleanupAll() over cleanup() for backends that support it
  (Deepgram, AssemblyAI) to also clean warm connections and timers
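One way to picture the 30s idle timeout is a small watchdog that is reset on every audio chunk. This is a sketch under assumptions: the class `IdleWatchdog` and the injectable `timers` parameter are hypothetical, and the PR may wire the timeout into its keepalive loop differently.

```javascript
// Hypothetical helper: fire `onIdle` if no activity is seen for `timeoutMs`.
class IdleWatchdog {
  constructor(onIdle, timeoutMs = 30000, timers = { setTimeout, clearTimeout }) {
    this.onIdle = onIdle;
    this.timeoutMs = timeoutMs;
    this.timers = timers; // injectable so the logic can be tested without waiting
    this.handle = null;
  }

  // Call on every audio chunk to restart the countdown.
  touch() {
    this.stop();
    this.handle = this.timers.setTimeout(() => {
      this.handle = null;
      this.onIdle();
    }, this.timeoutMs);
  }

  // Cancel the countdown, e.g. on a normal disconnect or on app quit.
  stop() {
    if (this.handle !== null) {
      this.timers.clearTimeout(this.handle);
      this.handle = null;
    }
  }
}
```

In use, `onIdle` would close the WebSocket, so a renderer that crashed or hot-reloaded mid-session cannot leave the connection open indefinitely.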
Soniox sends a U+FFFD replacement character as a final token when
recording silence, which gets pasted as garbage. Filter out empty,
whitespace-only, and replacement character tokens in Soniox handler.
Also trim finalText before the paste guard in audioManager as a
defensive check for all streaming providers.

Strip hesitation fillers (uh, um, yyy, eee, mmm, hmm) from assembled
transcript text. Soniox BPE tokenization splits fillers across sub-word
tokens, so removal works on joined text using word boundaries.

Capitalizes first letter after filler removal at sentence boundaries
(.!?) and at text start, with full Unicode support (Polish ć/ó/ś,
accented Latin, Cyrillic). Preserves real exclamations (Oh, Ah) and
words containing filler substrings (umbrella, human, summer).

Adds first test infrastructure (node:test, zero deps) with 25 tests.

Extract _drainCallback helper to eliminate near-identical
drainFinalTokens/drainSessionEnd methods. Add isValidToken predicate
for clearer token filtering, extract isExplicitLang to simplify
nested ternary, and log errors in cleanup catch blocks.

Pre-opens WebSocket between recordings to eliminate ~500ms
cold-start delay. Configurable idle timeout (30s-5min) with
cost estimates in UI. Falls back to cold-start on config
mismatch or connection loss.
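The fallback rule in this commit can be illustrated with a small decision function. Everything here is hypothetical: the names `sameConfig` and `pickConnection`, the shape of the `warm` object, and the choice of `model` and `languageHints` as the fields compared are assumptions, not details from the PR.

```javascript
// Hypothetical sketch: reuse a warm socket only if it is open and was
// configured the same way as the requested session.
const WS_OPEN = 1; // numeric value of WebSocket.OPEN

function sameConfig(a, b) {
  return (
    a.model === b.model &&
    a.languageHints.join(",") === b.languageHints.join(",")
  );
}

// `openCold` is a caller-supplied function that opens a fresh connection.
function pickConnection(warm, requested, openCold) {
  if (warm && warm.socket.readyState === WS_OPEN && sameConfig(warm.config, requested)) {
    return { socket: warm.socket, reused: true };
  }
  // Config mismatch or lost connection: discard the warm socket, start cold.
  if (warm) warm.socket.close();
  return { socket: openCold(requested), reused: false };
}
```

The caller still pays the cold-start delay on a mismatch, but never sends audio over a socket that was opened for a different model or language setup.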
The filler regex consumed periods after fillers ("word, uh. Next"
became "word Next"), merging sentences and losing capitalization.
The replacement function now checks whether a consumed period is
a sentence boundary or part of a standalone filler.

Also stops treating "hmm" as a filler since it carries intentional
meaning ("Hmm, interesting" vs hesitation noise like "uh" or "eee").
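For readers who want to see the shape of the text-level cleanup, here is a simplified sketch. It is not the regex from this PR: the function name `stripFillers`, the private-use marker character, and the exact punctuation rules are assumptions. It uses the filler list from the commits (without "hmm") and treats "word. Next" as the intended result for the "word, uh. Next" case.

```javascript
// Hypothetical sketch of filler removal on joined transcript text.
// Unicode-aware lookarounds stand in for \b so "umbrella", "human",
// "summer" and hyphenated words like "uh-huh" are left alone.
const FILLER_RE =
  /(?<![\p{L}\p{N}'’-])(?:uh|um|yyy|eee|mmm)(?![\p{L}\p{N}'’-]),?\s*/giu;

// A removed filler leaves this private-use marker behind, so that only
// sentence starts that actually lost a filler get re-capitalized.
const MARK = "\uE000";
const SENTENCE_START_RE = /(^|[.!?]\s+)\uE000+(?:[.!?]\s*)?(\p{L})?/gu;

function stripFillers(text) {
  return text
    .replace(FILLER_RE, MARK)
    // Filler at text start or after .!? : drop it (plus its own period, if
    // it stood alone) and capitalize the next letter.
    .replace(SENTENCE_START_RE, (_m, pre, ch) => pre + (ch ? ch.toUpperCase() : ""))
    // Mid-sentence fillers: just drop the marker.
    .replace(/\uE000/g, "")
    // "word, . Next" -> "word. Next": keep the sentence boundary.
    .replace(/[,\s]+([.!?])/g, "$1")
    .replace(/\s{2,}/g, " ")
    .replace(/^[\s,.!?]+|[\s,]+$/g, "");
}
```

With this sketch, `stripFillers("Uh, hello there")` returns `"Hello there"`, `stripFillers("word, uh. Next")` returns `"word. Next"`, and `"Hmm, interesting"` passes through unchanged. Chains of standalone fillers such as "Uh. Um. hello" are not fully handled here.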
setSonioxKeepAliveTimeout was implemented in the store but missing
from the SettingsState interface. Also adds sonioxKeepAliveTimeout
to NUMERIC_SETTINGS so cross-window sync preserves the number type.
@DamianPala DamianPala force-pushed the feat/soniox-streaming branch from 86990c2 to f00cd6c on March 18, 2026 07:37
@DamianPala
Contributor Author

Branch is rebased, tested, ready for review. Keep-alive works well in practice - cold start 400-800 ms (at my location), warm ~50-100 ms. I've been using it daily for a week without issues. Also fixed filler removal edge cases and added a few TS consistency fixes along the way.

Re CodeQL: false positive. It flags debugLogger.js:189 where meta (with WebSocket error details) is passed to console.log. Every other streaming backend does the same thing - they're just in the CodeQL baseline because the files are older. My file is new so it shows as a "new alert".

@DamianPala DamianPala marked this pull request as ready for review March 18, 2026 07:59
Soniox BPE emits spaces as standalone tokens (e.g. after punctuation).
The .trim() check in isValidToken rejected them, merging words across
punctuation: "No,ładne" instead of "No, ładne".
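A sketch of a token predicate that keeps standalone space tokens while still dropping empty and U+FFFD tokens. Only the name `isValidToken` comes from the commits; its body here, the helper `joinTokens`, and the `{ text }` token shape are assumptions, and the actual fix may differ.

```javascript
// Hypothetical sketch; the `{ text }` token shape is an assumption.

// Reject empty tokens and tokens made only of U+FFFD replacement characters.
// Whitespace-only tokens are kept on purpose: they carry the spaces that
// separate words after punctuation.
function isValidToken(token) {
  const text = token && token.text;
  if (typeof text !== "string" || text.length === 0) return false;
  return text.replace(/\uFFFD/g, "").length > 0;
}

// Join valid tokens and trim once, at the end, instead of per token.
function joinTokens(tokens) {
  return tokens
    .filter(isValidToken)
    .map((t) => t.text)
    .join("")
    .trim();
}
```

Trimming the joined text rather than each token means a lone space token from a silent recording still produces an empty transcript, while "No", ",", " ", "ładne" joins to "No, ładne".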

3 participants