Add experimental Windows desktop client #5569

Open
WizardOfCommits wants to merge 1 commit into BasedHardware:main from WizardOfCommits:feat/windows-desktop-client

@WizardOfCommits

Summary
This PR adds an experimental Windows desktop client (WPF / .NET 9) that integrates with the existing Omi backend and Firebase-based auth, without modifying the existing mobile or macOS flows.

  • New windows/ directory with a WPF app (Omi.Windows.App) and a contributor‑oriented windows/README.md.
  • The Windows client authenticates via the existing /v1/auth endpoints (Python backend) and uses use_custom_token=true on /v1/auth/token to obtain a Firebase custom token.
  • It then exchanges that custom token for a Firebase ID token using the public Identity Toolkit API, and uses that ID token for all secured HTTP and WebSocket calls, including /v4/listen.

Auth flow (Windows)

  1. The app opens GET /v1/auth/authorize?provider=apple&redirect_uri=omi://auth/callback in the system browser.
  2. After the user signs in with Apple (or Google in the future), the Python backend handles the provider callback and renders auth_callback.html, which redirects to the app using omi://auth/callback?code=....
  3. The user pastes the code into the Windows app (current UX), which calls:
    • POST /v1/auth/token with grant_type=authorization_code, code, redirect_uri, and use_custom_token=true.
    • This returns a custom_token generated via firebase_admin.auth.create_custom_token on the server.
  4. The app calls https://identitytoolkit.googleapis.com/v1/accounts:signInWithCustomToken?key=<FIREBASE_API_KEY> to obtain a standard Firebase idToken.
  5. The Windows client uses that idToken as Authorization: Bearer <idToken> for:
    • HTTP endpoints like /v1/users/profile, /v1/conversations/*
    • WebSocket /v4/listen (real-time transcription), which uses verify_id_token server‑side.
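The request shapes in steps 1, 3 and 4 can be sketched as below. The sketch uses Rust (the language of desktop/Backend-Rust) with only the standard library. The function names are hypothetical. Sending the /v1/auth/token fields as a form-encoded body is an assumption, because the PR lists the fields but not the content type. The sketch percent-encodes redirect_uri, whereas step 1 shows it raw. The Identity Toolkit call in step 4 is a POST whose JSON body has the shape {"token": "<custom_token>", "returnSecureToken": true}.

```rust
/// Percent-encodes every byte outside the RFC 3986 unreserved set, which is
/// safe both in URL query values and in form-encoded bodies.
pub fn percent_encode(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for b in s.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => out.push(b as char),
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Step 1: the URL opened in the system browser (hypothetical helper).
pub fn authorize_url(base: &str, provider: &str, redirect_uri: &str) -> String {
    format!(
        "{}/v1/auth/authorize?provider={}&redirect_uri={}",
        base.trim_end_matches('/'),
        percent_encode(provider),
        percent_encode(redirect_uri)
    )
}

/// Step 3: body for POST /v1/auth/token. Form encoding is an assumption;
/// the PR only names the fields.
pub fn token_request_body(code: &str, redirect_uri: &str) -> String {
    let fields = [
        ("grant_type", "authorization_code"),
        ("code", code),
        ("redirect_uri", redirect_uri),
        ("use_custom_token", "true"),
    ];
    fields
        .iter()
        .map(|(k, v)| format!("{}={}", k, percent_encode(v)))
        .collect::<Vec<_>>()
        .join("&")
}

/// Step 4: Identity Toolkit endpoint that exchanges the custom token for an
/// idToken (the POST body is JSON: {"token": ..., "returnSecureToken": true}).
pub fn sign_in_with_custom_token_url(api_key: &str) -> String {
    format!(
        "https://identitytoolkit.googleapis.com/v1/accounts:signInWithCustomToken?key={}",
        percent_encode(api_key)
    )
}
```

For example, authorize_url("https://api.omi.me/", "apple", "omi://auth/callback") yields https://api.omi.me/v1/auth/authorize?provider=apple&redirect_uri=omi%3A%2F%2Fauth%2Fcallback.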

This mirrors the trust model of the existing mobile/desktop clients: the backend still verifies a real Firebase ID token, and the client does not need any privileged service account credentials (only the public Web API key).

Windows client behavior

  • “Connect to Omi” dialog:
    • Button to open the /v1/auth/authorize URL in the browser.
    • Textbox to paste the code from the omi://auth/callback?... URL.
  • Main window:
    • “Start recording” / “Stop” wired to a WASAPI/NAudio-based microphone capture.
    • Textbox showing raw JSON messages from /v4/listen (segments, events, or errors) for debugging and verification.
  • Hotkey + floating bar:
    • Global hotkey (Ctrl+Alt+O) toggles a simple floating bar window, similar in spirit to the macOS floating control bar but intentionally minimal for now.
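The file overview further down says the capture path records 16 kHz PCM16 mono. The review also notes that NAudio raises DataAvailable roughly every 100 ms. Each chunk streamed to /v4/listen is therefore 16,000 samples/s × 2 bytes × 1 channel × 0.1 s = 3,200 bytes. The Rust sketch below shows that arithmetic. It also includes a float-to-PCM16 conversion, which would only be needed if the device delivered 32-bit float samples (common in WASAPI shared mode). Whether the client converts at all is an assumption, and the helper names are hypothetical.

```rust
/// Bytes in one capture chunk of interleaved PCM audio.
pub fn bytes_per_chunk(sample_rate_hz: u32, bits_per_sample: u16, channels: u16, chunk_ms: u32) -> usize {
    let bytes_per_second = sample_rate_hz as u64 * (bits_per_sample as u64 / 8) * channels as u64;
    (bytes_per_second * chunk_ms as u64 / 1000) as usize
}

/// Converts float samples in [-1.0, 1.0] to 16-bit little-endian PCM,
/// clamping out-of-range input instead of letting it wrap.
pub fn f32_to_pcm16_le(samples: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(samples.len() * 2);
    for &s in samples {
        let v = (s.clamp(-1.0, 1.0) * i16::MAX as f32).round() as i16;
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}
```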

Configuration

  • The Windows client reads:
    • OMI_API_BASE_URL (default https://api.omi.me/).
    • OMI_FIREBASE_API_KEY or FIREBASE_API_KEY (the public Firebase Web API key for the based-hardware Firebase project).
  • No secrets or service account JSON are embedded in the client. All privileged operations (provider token exchange, custom token generation, ID token verification) remain on the backend.
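A minimal sketch of the lookup order follows. It assumes OMI_FIREBASE_API_KEY takes precedence over FIREBASE_API_KEY, since the PR lists both without stating an order, and that empty values count as unset. The environment read is injected as a closure so that the resolution logic can be exercised without touching process state. In the app, that closure would be |k| std::env::var(k).ok(). The type and function names are hypothetical.

```rust
/// Default API base URL named in the PR description.
pub const DEFAULT_API_BASE_URL: &str = "https://api.omi.me/";

#[derive(Debug, PartialEq)]
pub struct ClientConfig {
    pub api_base_url: String,
    /// Public Firebase Web API key; None means sign-in cannot proceed.
    pub firebase_api_key: Option<String>,
}

/// Resolves configuration through `lookup`; blank values are treated as unset.
pub fn resolve_config(lookup: impl Fn(&str) -> Option<String>) -> ClientConfig {
    let non_empty = |k: &str| lookup(k).filter(|v| !v.trim().is_empty());
    ClientConfig {
        api_base_url: non_empty("OMI_API_BASE_URL")
            .unwrap_or_else(|| DEFAULT_API_BASE_URL.to_string()),
        // Assumed precedence: the Omi-specific name wins over the generic one.
        firebase_api_key: non_empty("OMI_FIREBASE_API_KEY").or_else(|| non_empty("FIREBASE_API_KEY")),
    }
}
```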

Known limitations / open questions

  • On https://api.omi.me, /v4/listen currently returns HTTP 403 for this flow, even though the client presents a Firebase ID token obtained via signInWithCustomToken. This suggests a configuration or deployment detail on the backend or Firebase side rather than a client‑side issue.
  • The omi://auth/callback scheme is handled manually by asking the user to copy the code from the URL. A future iteration should register a proper Windows URI handler and parse the code automatically.
  • Only Apple is wired in the UI; Google is supported by the backend and can be exposed similarly.
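Until a URI handler exists, the paste dialog could accept either the bare code or the whole omi://auth/callback?code=... URL. The Rust sketch below uses a hypothetical helper name. It extracts the code parameter from the callback URL and percent-decodes it. It also rejects URLs with any other scheme or path. A registered Windows protocol handler typically receives the activating URL as a command-line argument, so the same function could serve that future handler.

```rust
/// Decodes %XX escapes; returns None on malformed escapes or invalid UTF-8.
fn percent_decode(s: &str) -> Option<String> {
    let bytes = s.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' {
            let hex = bytes.get(i + 1..i + 3)?;
            if !hex.iter().all(|c| c.is_ascii_hexdigit()) {
                return None;
            }
            out.push(u8::from_str_radix(std::str::from_utf8(hex).ok()?, 16).ok()?);
            i += 3;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    String::from_utf8(out).ok()
}

/// Hypothetical helper: accepts the full omi://auth/callback?code=... URL or a
/// bare code, and returns the decoded code.
pub fn extract_auth_code(input: &str) -> Option<String> {
    let input = input.trim();
    let Some((base, query)) = input.split_once('?') else {
        // No query string: treat anything that is not a URL as the code itself.
        return if input.is_empty() || input.contains("://") {
            None
        } else {
            Some(input.to_string())
        };
    };
    if base != "omi://auth/callback" {
        return None; // reject callbacks for other schemes or paths
    }
    let query = query.split('#').next().unwrap_or("");
    query.split('&').find_map(|pair| {
        let (key, value) = pair.split_once('=')?;
        if key == "code" && !value.is_empty() {
            percent_decode(value)
        } else {
            None
        }
    })
}
```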

Why this is safe

  • The Windows client only ever uses:
    • The existing public HTTP/WS endpoints (/v1/auth/*, /v4/listen, /v1/conversations, etc.).
    • The public Firebase Web API key (already used by the web frontend).
  • All privileged operations (OAuth token exchange, Firebase custom token generation, ID token verification) are done entirely on the server.
  • No secrets or service account credentials are stored in the repo or shipped in the client.

@greptile-apps

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR adds an experimental WPF/.NET 9 Windows desktop client for Omi, along with an implementation of Firebase custom token generation in the existing Rust desktop backend. The Windows client follows the same Firebase ID token trust model as the mobile/macOS clients — the backend verifies the token server-side — and introduces a new windows/ directory that is entirely additive and does not touch the existing mobile or macOS flows.

Key changes and issues found:

  • WebSocket fragmented messages not handled (SttWebSocketClient.cs): The receive loop processes each ReceiveAsync call immediately as a complete message without checking result.EndOfMessage. Large transcript segments delivered across multiple WebSocket frames will produce corrupt/truncated JSON.
  • Unobserved fire-and-forget audio sends (AudioCaptureService.cs): Audio frames are dispatched via _ = handler(buffer) with no error observation or backpressure. Concurrent sends can pile up silently if the WebSocket is slow.
  • No token refresh (AuthService.cs): Firebase ID tokens expire after 1 hour. The RefreshToken returned by Identity Toolkit is never stored or used, so the app silently breaks for long sessions with no user-visible error or re-auth prompt.
  • ID token stored as plain text in the Windows Registry (AuthService.cs): The raw Firebase ID token is written to HKCU\Software\Omi\WindowsApp\IdToken without DPAPI/Credential Locker protection.
  • Service account read on every token request (auth.rs): The service account JSON file is read from disk and deserialized on every call to generate_custom_token; the credentials should be cached at startup.
  • PropertyChanged raised from a background thread (CaptureViewModel.cs): LastTranscript is updated from the WebSocket receive loop thread; the setter should marshal to the UI dispatcher.
  • All user-facing strings are in French (XAML and code-behind throughout windows/), which is inconsistent with the English-first convention of the rest of the codebase.

Confidence Score: 2/5

  • The PR is not safe to merge in its current form due to functional bugs that will cause the core transcription feature to break under normal usage.
  • The fragmented-WebSocket issue will silently corrupt transcript data for any message larger than a single frame; the lack of token refresh means the app stops working after 1 hour; and the fire-and-forget audio sends have no error handling. The Rust backend change (service account read on every call) is a performance regression on a shared endpoint. Together these represent multiple functionality and reliability gaps in the PR's core feature path.
  • Pay close attention to windows/App/Services/Transcription/SttWebSocketClient.cs (fragmented messages), windows/App/Services/Auth/AuthService.cs (token expiry + plain-text registry storage), and windows/App/Services/Audio/AudioCaptureService.cs (fire-and-forget sends).

Important Files Changed

  • windows/App/Services/Transcription/SttWebSocketClient.cs: WebSocket client for /v4/listen. The receive loop does not reassemble fragmented messages (missing EndOfMessage check), and the receive task is fire-and-forget, so its exceptions go unhandled.
  • windows/App/Services/Auth/AuthService.cs: Implements the auth code → custom token → Firebase ID token flow. Critical gaps: no token refresh (tokens expire after 1 h), and the ID token is stored as plain text in the Windows Registry with no DPAPI protection.
  • windows/App/Services/Audio/AudioCaptureService.cs: WASAPI/NAudio microphone capture at 16 kHz PCM16 mono. Audio frames are dispatched via a fire-and-forget async callback with no backpressure or error observation.
  • desktop/Backend-Rust/src/routes/auth.rs: Implements Firebase custom token generation in Rust using service account credentials. The service account JSON is read from disk and parsed on every request rather than being cached at startup.
  • windows/App/ViewModels/CaptureViewModel.cs: MVVM ViewModel wiring AudioCaptureService to SttWebSocketClient. PropertyChanged for LastTranscript is raised from a background thread without Dispatcher marshalling.
  • windows/App/MainWindow.xaml.cs: Main window wires services manually and shows the token dialog on first run. The floating bar is lazily instantiated. All button labels are in French.
  • windows/App/Services/Api/HttpApiClient.cs: Generic HTTP client that attaches Bearer token, platform, and version headers. Clean implementation; no issues found.
  • windows/App/Omi.Windows.App.csproj: WPF project targeting net9.0-windows with NAudio, System.Net.WebSockets.Client, and Microsoft.Toolkit.Uwp.Notifications dependencies.

Sequence Diagram

sequenceDiagram
    participant User
    participant WinApp as Windows App
    participant Browser
    participant Backend as Omi Backend
    participant Firebase as Firebase Identity Toolkit

    User->>WinApp: Click Open login page
    WinApp->>Browser: Open /v1/auth/authorize
    Browser->>Backend: GET /v1/auth/authorize
    Backend-->>Browser: Redirect to Apple OAuth
    User->>Browser: Sign in with Apple
    Browser->>Backend: OAuth provider callback
    Backend-->>Browser: Render callback page with auth code
    User->>WinApp: Paste auth code manually

    WinApp->>Backend: POST /v1/auth/token (grant_type=authorization_code, use_custom_token=true)
    Backend->>Backend: Generate Firebase custom token (RS256 JWT)
    Backend-->>WinApp: custom_token

    WinApp->>Firebase: POST accounts:signInWithCustomToken
    Firebase-->>WinApp: idToken plus refreshToken

    WinApp->>WinApp: Persist idToken to Windows Registry

    WinApp->>Backend: WSS /v4/listen with idToken
    Backend-->>WinApp: Transcript JSON frames

    WinApp->>Backend: HTTP /v1/users/profile with idToken
    Backend-->>WinApp: Profile data

Comments Outside Diff (7)

  1. windows/App/Services/Transcription/SttWebSocketClient.cs, lines 949-965

    Fragmented WebSocket messages not reassembled

    ReceiveAsync returns a partial message (result.EndOfMessage == false) whenever a WebSocket message is fragmented across multiple frames or is larger than the receive buffer. The current loop treats each partial read as a complete JSON message, so JsonDocument.Parse fails and only the raw bytes of the last fragment end up shown in the transcript.

    The loop should accumulate bytes until EndOfMessage == true before dispatching:

    private async Task ReceiveLoopAsync(ClientWebSocket socket, CancellationToken ct)
    {
        var buffer = new byte[64 * 1024];
        // Accumulates fragments until the frame with EndOfMessage == true arrives.
        using var messageBuffer = new System.IO.MemoryStream();
        while (!ct.IsCancellationRequested && socket.State == WebSocketState.Open)
        {
            WebSocketReceiveResult result;
            try
            {
                result = await socket.ReceiveAsync(buffer, ct).ConfigureAwait(false);
            }
            catch (Exception) when (ct.IsCancellationRequested)
            {
                break; // normal shutdown: don't report cancellation as an error
            }
            catch (Exception ex)
            {
                if (OnTranscriptJson is not null)
                {
                    // Serialize rather than interpolate so quotes in ex.Message cannot produce invalid JSON.
                    var error = System.Text.Json.JsonSerializer.Serialize(
                        new { type = "error", message = $"WebSocket receive failed: {ex.Message}" });
                    await OnTranscriptJson.Invoke(error).ConfigureAwait(false);
                }
                break;
            }
    
            if (result.MessageType == WebSocketMessageType.Close)
                break;
    
            // A single message may span several ReceiveAsync calls; append this piece.
            messageBuffer.Write(buffer, 0, result.Count);
    
            if (result.EndOfMessage)
            {
                var json = System.Text.Encoding.UTF8.GetString(messageBuffer.ToArray());
                messageBuffer.SetLength(0); // reset for the next message
                if (OnTranscriptJson is not null)
                    await OnTranscriptJson.Invoke(json).ConfigureAwait(false);
            }
        }
    }
  2. windows/App/Services/Audio/AudioCaptureService.cs, line 538

    Unobserved fire-and-forget audio send tasks

    _ = handler(buffer) fires off an async send task without awaiting it. WaveInEvent.DataAvailable fires roughly every 100 ms (per BufferMilliseconds). If SttWebSocketClient.SendAudioAsync is slower than that (e.g. due to network backpressure or a closed socket), tasks pile up silently. Any WebSocketException thrown by SendAsync will also be swallowed because the task is unobserved.

    Since DataAvailable is a synchronous event handler, you cannot await directly, but you should at minimum observe failures:

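    // Accessing t.Exception marks the fault as observed; swap Debug.WriteLine for real logging.
    _ = handler(buffer).ContinueWith(
        t => System.Diagnostics.Debug.WriteLine($"Audio send failed: {t.Exception?.GetBaseException().Message}"),
        CancellationToken.None, TaskContinuationOptions.OnlyOnFaulted, TaskScheduler.Default);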

    Or, better, enqueue the buffer to a System.Threading.Channels.Channel<byte[]> and drain it from a single dedicated async loop so ordering and backpressure are respected.
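    The single-consumer pattern can be sketched as follows, in Rust rather than C#. Here std::sync::mpsc::sync_channel plays the role of a bounded Channel<byte[]>. In .NET, the counterpart is Channel.CreateBounded with BoundedChannelOptions, whose FullMode property selects the overflow policy. The capture callback only calls push(), which never blocks. A single consumer thread sends chunks in order and returns the first send error instead of losing it. Dropping the newest chunk when the queue is full is an assumed policy. Blocking would stall the audio callback, but dropped audio leaves gaps in the transcript, so the drop counter should be surfaced. All names are hypothetical.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::thread::{self, JoinHandle};

/// Producer handle used from the capture callback (the DataAvailable analogue).
/// push() never blocks, so a slow socket cannot stall audio capture.
pub struct AudioQueue {
    tx: SyncSender<Vec<u8>>,
    pub dropped: u64,
}

impl AudioQueue {
    /// Returns false if the chunk was dropped (queue full) or the sender has stopped.
    pub fn push(&mut self, chunk: Vec<u8>) -> bool {
        match self.tx.try_send(chunk) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) => {
                self.dropped += 1;
                false
            }
            Err(TrySendError::Disconnected(_)) => false,
        }
    }
}

/// Spawns the single consumer. `send` stands in for the WebSocket send; chunks
/// are sent strictly in order, and the first error ends the loop and is
/// returned from join() instead of being silently lost.
pub fn start_sender<F>(capacity: usize, mut send: F) -> (AudioQueue, JoinHandle<Result<usize, String>>)
where
    F: FnMut(Vec<u8>) -> Result<(), String> + Send + 'static,
{
    let (tx, rx): (SyncSender<Vec<u8>>, Receiver<Vec<u8>>) = sync_channel(capacity);
    let handle = thread::spawn(move || -> Result<usize, String> {
        let mut sent = 0usize;
        // The loop ends once the AudioQueue (the only producer) is dropped.
        for chunk in rx {
            send(chunk)?;
            sent += 1;
        }
        Ok(sent)
    });
    (AudioQueue { tx, dropped: 0 }, handle)
}
```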

  3. windows/App/Services/Auth/AuthService.cs, lines 588-597

    Firebase ID token never refreshed — silent auth failures after 1 hour

    Firebase ID tokens expire after 3,600 seconds (1 hour). The FirebaseSignInWithCustomTokenResponse includes a RefreshToken field, but it is never stored or used. GetIdTokenAsync always returns the cached (potentially expired) token.

    Once the token expires, the WebSocket connection (/v4/listen) and any authenticated HTTP calls will start receiving 401/403 errors silently — there is no automatic refresh, no user-visible error, and no prompt to re-authenticate.

    At minimum, track the token's expiry timestamp and have GetIdTokenAsync detect expiry. It should then either refresh the token through Firebase's Secure Token API (a form-encoded POST of grant_type=refresh_token&refresh_token=<refreshToken> to https://securetoken.googleapis.com/v1/token?key=<FIREBASE_API_KEY>) or prompt the user to sign in again.
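    A minimal sketch of the expiry bookkeeping, in Rust with hypothetical names. The session records when the ID token expires; signInWithCustomToken returns expiresIn as a string number of seconds, typically "3600". It treats the token as stale five minutes early, which is an arbitrary margin, and builds the Secure Token API refresh request. A successful refresh response carries id_token, refresh_token and expires_in, which would replace the stored values.

```rust
use std::time::{Duration, Instant};

/// Hypothetical in-memory session built from the signInWithCustomToken response.
pub struct FirebaseSession {
    pub id_token: String,
    pub refresh_token: String,
    expires_at: Instant,
}

/// Refresh this long before the real expiry to absorb clock skew and latency.
const REFRESH_MARGIN: Duration = Duration::from_secs(5 * 60);

impl FirebaseSession {
    /// `expires_in` is the seconds value Firebase returns as a string (e.g. "3600").
    pub fn new(id_token: String, refresh_token: String, expires_in: &str, now: Instant) -> Option<Self> {
        let secs: u64 = expires_in.trim().parse().ok()?;
        Some(Self { id_token, refresh_token, expires_at: now + Duration::from_secs(secs) })
    }

    /// True once `now` is within REFRESH_MARGIN of (or past) the expiry.
    pub fn needs_refresh(&self, now: Instant) -> bool {
        now + REFRESH_MARGIN >= self.expires_at
    }

    /// URL and form-encoded body for the Secure Token API refresh call.
    pub fn refresh_request(&self, api_key: &str) -> (String, String) {
        (
            format!("https://securetoken.googleapis.com/v1/token?key={}", form_encode(api_key)),
            format!("grant_type=refresh_token&refresh_token={}", form_encode(&self.refresh_token)),
        )
    }
}

/// Minimal percent-encoding that keeps only RFC 3986 unreserved bytes.
fn form_encode(s: &str) -> String {
    s.bytes()
        .map(|b| match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => (b as char).to_string(),
            _ => format!("%{:02X}", b),
        })
        .collect()
}
```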

  4. windows/App/Services/Auth/AuthService.cs, lines 582-584

    Firebase ID token stored in plain text in the Windows Registry

    The raw Firebase ID token (a signed JWT that grants full API access) is persisted as a plain registry string value under HKCU\Software\Omi\WindowsApp\IdToken. Any process running under the same user account, and registry-scanning tools, can read this value directly.

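    Windows provides DPAPI (the ProtectedData class, which on .NET 9 ships in the System.Security.Cryptography.ProtectedData NuGet package) and the Windows Credential Locker (Windows.Security.Credentials.PasswordVault) for storing secrets. Either keeps the token encrypted at rest rather than in a plain registry value without adding significant complexity. Note that DPAPI's CurrentUser scope protects against other users and offline access, not against other processes running as the same user.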

  5. desktop/Backend-Rust/src/routes/auth.rs, lines 31-32

    Service account file is read from disk on every token request

    tokio::fs::read_to_string(sa_path).await? and the subsequent serde_json::from_str are executed on every call to generate_custom_token. Under any meaningful load, this is repeated disk I/O and JSON deserialization that is entirely avoidable.

    The parsed service-account credentials (or the derived EncodingKey) should be loaded once at startup and stored in AppState (or similar) so that the hot path only signs the JWT in memory.
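    The sketch below uses only the Rust standard library, whereas the real handler uses tokio::fs and serde_json. The file is read once when the state is built. Every request then borrows the shared, already-loaded value through an Arc. ServiceAccount here just holds the raw JSON; in practice the parsed fields or the derived signing key would be stored instead. The type and function names are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::sync::Arc;

/// Stand-in for the parsed service account (raw JSON here; real code would
/// store the deserialized fields or the derived signing key).
#[derive(Debug)]
pub struct ServiceAccount {
    pub raw_json: String,
}

/// Router state; cloning it only bumps the Arc reference count.
#[derive(Clone)]
pub struct AppState {
    pub service_account: Arc<ServiceAccount>,
}

impl AppState {
    /// Called once at startup; a missing or unreadable file fails fast here
    /// instead of on the first login request.
    pub fn load(sa_path: &Path) -> io::Result<Self> {
        let raw_json = fs::read_to_string(sa_path)?;
        Ok(Self { service_account: Arc::new(ServiceAccount { raw_json }) })
    }
}

/// Hot-path stand-in for generate_custom_token: no disk I/O, no re-parsing.
pub fn credentials_for_request(state: &AppState) -> &ServiceAccount {
    &state.service_account
}
```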

  6. windows/App/ViewModels/CaptureViewModel.cs, lines 1171-1184

    LastTranscript updated from a background thread

    HandleTranscriptJsonAsync is invoked from ReceiveLoopAsync, which runs on a thread-pool thread via Task.Run. Setting LastTranscript raises PropertyChanged from that background thread. While WPF's binding engine marshals simple property-change notifications internally, relying on this is fragile and not guaranteed across WPF versions or when the binding changes.

    The safe pattern is to dispatch the update explicitly:

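    private Task HandleTranscriptJsonAsync(string json)
    {
        // Parse off the UI thread; fall back to the raw text if it is not valid JSON.
        string display;
        try { using var doc = JsonDocument.Parse(json); display = doc.RootElement.ToString(); }
        catch (JsonException) { display = json; }
    
        // Marshal the property change onto the UI thread (Application.Current is null during shutdown).
        var dispatcher = Application.Current?.Dispatcher;
        if (dispatcher is null) return Task.CompletedTask;
        return dispatcher.InvokeAsync(() => LastTranscript = display).Task;
    }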
  7. windows/App/FloatingBar/FloatingBarWindow.xaml, line 146

    UI strings are in French throughout the app

    The placeholder text "Posez une question à Omi..." ("Ask Omi a question..."), along with button labels ("Démarrer l'enregistrement" / Start recording, "Arrêter" / Stop, "Valider" / Confirm, "Annuler" / Cancel) and inline dialog messages in TokenInputWindow.xaml, MainWindow.xaml, and MainWindow.xaml.cs, are all in French. The rest of the Omi codebase uses English for user-facing strings.

    This would affect any non-French speaker who tries the client. All user-facing strings should be in English (with optional i18n support added later), matching the project convention.

Last reviewed commit: d896e69
