Description
A user reported issues when using OpenAI `gpt-4o-transcribe` for live transcription.
Analysis
The default live model for OpenAI is `gpt-4o-transcribe` (`crates/owhisper-client/src/providers.rs:275`).
Missing word-level timestamps in live mode
In the live adapter (`crates/owhisper-client/src/adapter/openai/live.rs`), `build_transcript_response` creates words by splitting the transcript on whitespace, but sets `start: 0.0` and `end: 0.0` for every word, because OpenAI's Realtime API does not provide word-level timestamps for transcription. As a result, every response has `start: 0.0, duration: 0.0`, which could cause issues with:
- Subtitle display and word highlighting
- Transcript alignment
- Diarization / speaker assignment
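Since the Realtime API only returns the full transcript text, per-word times cannot be recovered exactly. One possible mitigation (a sketch, not the adapter's current behavior; `Word` and `interpolate_words` are hypothetical names) is to interpolate word boundaries evenly across the segment's known audio span instead of emitting zeros:

```rust
/// Hypothetical word type mirroring what a transcript response might carry.
#[derive(Debug, PartialEq)]
struct Word {
    text: String,
    start: f64,
    end: f64,
}

/// Distribute word timestamps evenly across [seg_start, seg_end].
/// This is an approximation: the Realtime API gives no per-word timing,
/// so evenly spaced spans are a guess, but they at least preserve ordering
/// and keep downstream alignment/diarization from seeing zero-length words.
fn interpolate_words(transcript: &str, seg_start: f64, seg_end: f64) -> Vec<Word> {
    let tokens: Vec<&str> = transcript.split_whitespace().collect();
    let n = tokens.len();
    if n == 0 {
        return Vec::new();
    }
    let step = (seg_end - seg_start) / n as f64;
    tokens
        .iter()
        .enumerate()
        .map(|(i, t)| Word {
            text: t.to_string(),
            start: seg_start + i as f64 * step,
            end: seg_start + (i + 1) as f64 * step,
        })
        .collect()
}

fn main() {
    // A 3-word transcript over a 3-second segment gets 1-second word spans.
    let words = interpolate_words("hello world again", 10.0, 13.0);
    assert_eq!(words.len(), 3);
    assert_eq!(words[0].start, 10.0);
    assert_eq!(words[2].end, 13.0);
    println!("{:?}", words);
}
```

This keeps the monotonicity that subtitle highlighting and speaker assignment rely on, at the cost of accuracy within the segment.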
Other potential issues
- Server-side VAD settings (`threshold = 0.5`, `silence_duration = 500ms`) may be too aggressive or not aggressive enough for certain environments
- The `channel_index` is hardcoded to `[0, 1]` regardless of the actual channel mode
- The `include` field requests `item.input_audio_transcription.logprobs`, which may not be supported for all models
- For batch transcription, `gpt-4o-transcribe` correctly uses the `json` format instead of `verbose_json`, but this means no word-level timing data
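For reference, the settings listed above correspond to fields in the Realtime transcription session configuration. A sketch of such a payload (field names follow OpenAI's Realtime API; the specific values are the ones quoted in this issue, and this is not necessarily the exact payload the adapter builds):

```json
{
  "type": "transcription_session.update",
  "session": {
    "input_audio_transcription": { "model": "gpt-4o-transcribe" },
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "silence_duration_ms": 500
    },
    "include": ["item.input_audio_transcription.logprobs"]
  }
}
```

Exposing `threshold` and `silence_duration_ms` as user-tunable settings, and gating the `include` entry by model, would address two of the bullets above.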
Relevant Files
- `crates/owhisper-client/src/adapter/openai/live.rs` — OpenAI Realtime API live adapter
- `crates/owhisper-client/src/adapter/openai/batch.rs` — OpenAI batch transcription adapter
- `crates/owhisper-client/src/providers.rs` — provider configuration and default models
- `crates/owhisper-client/src/adapter/parsing.rs` — word builder and time span calculation