feat: Latency imporvements for buddy #455

swaroopvarma1 · 2025-12-29T07:32:13Z

Summary by CodeRabbit

Release Notes

New Features
- Added comprehensive latency tracking for voice conversations, measuring speech-to-text, language model, and text-to-speech performance metrics with statistical analysis and session summaries.
- Implemented LLM buffer streaming to enable parallel language model generation and speech synthesis, delivering faster response times.
- Added new configuration options to enable/disable latency tracking and buffer streaming features.
Documentation
- Added latency optimization guides and implementation roadmaps for improving voice agent responsiveness.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2025-12-29T07:32:25Z

Walkthrough

This pull request introduces a comprehensive latency tracking and optimization system for the Breeze Buddy voice agent. It adds latency tracking infrastructure (LatencyTracker, data models), frame processors for STT/LLM/TTS, buffered LLM streaming for parallel synthesis, configuration flags, and documentation.

Changes

Cohort / File(s)	Summary
Latency Tracking Core `app/ai/voice/agents/breeze_buddy/utils/latency_tracker.py`	New LatencyTracker class with TurnLatency and ComponentLatency data models. Tracks per-turn and per-component metrics (TTFB, total duration), computes percentiles (P50/P95/P99), exports to Langfuse, and logs summaries. ~250 lines.
Frame Processors `app/ai/voice/agents/breeze_buddy/processors/latency_tracking.py`, `app/ai/voice/agents/breeze_buddy/processors/__init__.py`	Three FrameProcessor subclasses (STTLatencyProcessor, LLMLatencyProcessor, TTSLatencyProcessor) that instrument voice pipeline stages. Factory function create_latency_processors returns all three. Module exports via all.
LLM Buffer Streaming `app/ai/voice/agents/breeze_buddy/utils/llm_buffer_streaming.py`	BufferedLLMStreamWrapper class enables buffered streaming with configurable thresholds and word-boundary alignment. LLMBufferConfig provides AGGRESSIVE/BALANCED/CONSERVATIVE presets for latency/quality tradeoffs.
LLM Services Wrapper `app/ai/voice/agents/breeze_buddy/services/llm_wrapper.py`, `app/ai/voice/agents/breeze_buddy/services/__init__.py`	BreezeBuddyLLMWrapper extends AzureLLMService, conditionally routes streaming through BufferedLLMStreamWrapper based on ENABLE_BREEZE_BUDDY_LLM_BUFFER_STREAMING flag. Re-exported via services init.
Configuration `app/core/config/static.py`	Three new env-driven configuration flags: ENABLE_BREEZE_BUDDY_LATENCY_TRACKING (default true), ENABLE_BREEZE_BUDDY_LLM_BUFFER_STREAMING (default false), BREEZE_BUDDY_LLM_BUFFER_SIZE (default 40).
Documentation `docs/LATENCY_OPTIMIZATION.md`, `docs/BOLNA_VS_BREEZE_BUDDY_OPTIMIZATION_GAP_ANALYSIS.md`	Two markdown files: phased implementation guide for latency optimization features, and detailed gap analysis comparing Bolna vs. Breeze Buddy with prioritized roadmap and code recommendations.

Sequence Diagram(s)

sequenceDiagram
    participant AudioFrame as AudioRawFrame
    participant STTProc as STTLatencyProcessor
    participant LLMProc as LLMLatencyProcessor
    participant TTSProc as TTSLatencyProcessor
    participant Tracker as LatencyTracker
    
    Note over STTProc,Tracker: STT Phase
    AudioFrame->>STTProc: First frame arrives
    STTProc->>Tracker: Track STT start
    
    Note over STTProc,Tracker: Transcription arrives
    STTProc->>STTProc: Count frames, measure TTFB
    STTProc->>Tracker: track_component(STT, TTFB, duration, metadata)
    
    Note over LLMProc,Tracker: LLM Phase
    LLMProc->>Tracker: Track LLM start on LLMRunFrame
    LLMProc->>LLMProc: Capture first token time, count tokens
    LLMProc->>Tracker: track_component(LLM, TTFB, duration, metadata)
    
    Note over TTSProc,Tracker: TTS Phase
    TTSProc->>Tracker: Track TTS start on TTSStartedFrame
    TTSProc->>TTSProc: Measure audio chunks, bytes, TTFB
    TTSProc->>Tracker: track_component(TTS, TTFB, duration, audio_stats)
    TTSProc->>Tracker: end_turn() finalizes and computes total latency
    Tracker->>Tracker: Log summary with percentiles & export to Langfuse

sequenceDiagram
    participant Client as LLM Client
    participant LLMWrapper as BreezeBuddyLLMWrapper
    participant BaseStream as Base LLM Stream
    participant Buffer as BufferedLLMStreamWrapper
    participant TTS as TTS Consumer
    
    Note over Client,Buffer: LLM Streaming Flow (Buffer Enabled)
    Client->>LLMWrapper: _stream_chat_completions(context)
    LLMWrapper->>BaseStream: Request stream
    BaseStream-->>Buffer: Raw token chunks
    
    Note over Buffer: Buffer accumulation
    Buffer->>Buffer: Accumulate text in buffer
    Buffer->>Buffer: Check thresholds (buffer_size/min_buffer_size)
    Buffer->>Buffer: Align to word boundaries (optional)
    
    par Parallel Synthesis
        Buffer-->>TTS: Emit buffered chunk (TTFB improved)
        TTS->>TTS: Start TTS synthesis in parallel
    end
    
    Buffer->>Buffer: Yield remaining on completion
    Buffer-->>LLMWrapper: Final content
    LLMWrapper-->>Client: Streamed response

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

badri-singhal
murdore

Poem

🐰 Hops excitedly

Latency tracked from start to end,
Frame processors, buffers blend,
STT, LLM, TTS unite,
Optimization shines so bright! ✨
Parallel streams and metrics flow,
Breeze Buddy steals the show! 🎉

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title contains a typo ('imporvements' instead of 'improvements') and is vague/generic, failing to clearly convey the substantial changes including latency tracking, LLM buffer streaming, configuration updates, and documentation.	Fix the typo and make the title more specific. Consider: 'feat: Add latency tracking and LLM buffer streaming for Breeze Buddy' or similar to better reflect the actual implementation scope.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	Docstring coverage is 91.30% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch breeze-buddy-latency-improvements

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2025-12-29T07:36:07Z

app/ai/voice/agents/breeze_buddy/utils/llm_buffer_streaming.py

+        if len(split_result) == 2:
+            # Found a word boundary
+            chunk_to_yield = split_result[0]
+            self.buffer = split_result[1]
+
+            # Only yield if chunk is substantial enough
+            if len(chunk_to_yield) >= self.min_buffer_size:
+                return chunk_to_yield


Preserve buffer when boundary chunk is too small

The word-boundary extraction mutates self.buffer before it knows whether it will emit a chunk. If the prefix is shorter than min_buffer_size, the method returns None after already assigning self.buffer = split_result[1], which permanently discards the prefix (and the separating space). Any stream where the last space falls early in the buffer will lose text and produce concatenated/incorrect speech. Consider only slicing the buffer once you’re sure the chunk will be emitted, or reattach the dropped prefix when skipping.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2025-12-29T07:36:08Z

app/ai/voice/agents/breeze_buddy/processors/latency_tracking.py

+                logger.info(
+                    f"[LLM Latency] Turn {self.current_turn_id}: "
+                    f"TTFB={first_byte_latency:.0f}ms, "
+                    f"total={total_duration:.0f}ms, "


Guard LLM latency logging when no first token arrives

The log message formats first_byte_latency with .0f unconditionally. If no TTSSpeakFrame arrives before LLMFullResponseEndFrame (e.g., empty LLM response, TTS disabled, or a downstream error), first_byte_latency remains None and this f-string raises TypeError, interrupting frame processing. A fallback string/0 or a conditional log avoids crashing the pipeline in these cases.

Useful? React with 👍 / 👎.

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (7)

app/ai/voice/agents/breeze_buddy/processors/__init__.py (1)
1-19: LGTM!

The package initialization correctly re-exports the latency tracking processors and factory function. The module structure is clean and follows standard Python packaging conventions.

The static analysis hint about sorting __all__ is a minor style preference that can be addressed optionally.
Optional: Sort __all__ alphabetically
 __all__ = [
+    "create_latency_processors",
     "STTLatencyProcessor",
     "LLMLatencyProcessor",
     "TTSLatencyProcessor",
-    "create_latency_processors",
 ]
app/ai/voice/agents/breeze_buddy/utils/llm_buffer_streaming.py (1)
143-178: Consider using ClassVar for mutable class attributes.

The static analysis correctly identifies that mutable class attributes (dicts) should be annotated with ClassVar to clarify they are class-level, not instance-level.
Suggested fix
+from typing import ClassVar, Dict
+
 class LLMBufferConfig:
     """Configuration for LLM buffer-based streaming."""

     # Aggressive (lowest latency, may cut words)
-    AGGRESSIVE = {
+    AGGRESSIVE: ClassVar[Dict[str, int | bool]] = {
         "buffer_size": 30,
         "min_buffer_size": 15,
         "enable_word_boundary": True
     }

     # Balanced (good latency, preserves words)
-    BALANCED = {
+    BALANCED: ClassVar[Dict[str, int | bool]] = {
         "buffer_size": 40,
         "min_buffer_size": 20,
         "enable_word_boundary": True
     }

     # Conservative (higher quality, slightly more latency)
-    CONSERVATIVE = {
+    CONSERVATIVE: ClassVar[Dict[str, int | bool]] = {
         "buffer_size": 60,
         "min_buffer_size": 30,
         "enable_word_boundary": True
     }
app/ai/voice/agents/breeze_buddy/processors/latency_tracking.py (3)
68-68: Remove extraneous f prefix from string without placeholders.
-                logger.trace(f"[STT Latency] Audio input started")
+                logger.trace("[STT Latency] Audio input started")
263-263: Remove extraneous f prefix from string without placeholders.
-            logger.trace(f"[TTS Latency] TTS started")
+            logger.trace("[TTS Latency] TTS started")
178-181: Potential IndexError if frame.text is shorter than 50 characters.

String slicing in Python is safe for short strings (returns available characters), but the log message appends ... which may be misleading for short texts.
Suggested improvement
text_preview = frame.text[:50] + "..." if len(frame.text) > 50 else frame.text
logger.debug(
    f"[LLM Latency] First token received: {ttfb:.0f}ms, "
    f"text='{text_preview}'"
)
app/ai/voice/agents/breeze_buddy/utils/latency_tracker.py (1)
224-226: Use descriptive variable names instead of l.

The variable l is ambiguous and can be confused with 1 (one) in some fonts.
Suggested fix
         # Extract TTFB and total duration
-        ttfb_values = [l.first_byte_latency_ms for l in latencies if l.first_byte_latency_ms is not None]
-        total_values = [l.total_duration_ms for l in latencies if l.total_duration_ms is not None]
+        ttfb_values = [lat.first_byte_latency_ms for lat in latencies if lat.first_byte_latency_ms is not None]
+        total_values = [lat.total_duration_ms for lat in latencies if lat.total_duration_ms is not None]
docs/LATENCY_OPTIMIZATION.md (1)
91-103: Add language specifiers to fenced code blocks for proper syntax highlighting.

Several code blocks in the document are missing language specifiers, which affects readability when rendered.
Suggested fix for lines 91-103
-```
+```text
 User speaks → VAD detects end → STT finalizes → LLM starts → Response
                                   ↑
                             Waiting for complete transcript
With interim results:
- +text
User speaks → Interim results → LLM starts early → Response
↑
Processing begins while user still speaking
Similar fixes needed for lines 298, 327, 332, and 490.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef71043 and f0dcdc3.

📒 Files selected for processing (9)

app/ai/voice/agents/breeze_buddy/processors/__init__.py
app/ai/voice/agents/breeze_buddy/processors/latency_tracking.py
app/ai/voice/agents/breeze_buddy/services/__init__.py
app/ai/voice/agents/breeze_buddy/services/llm_wrapper.py
app/ai/voice/agents/breeze_buddy/utils/latency_tracker.py
app/ai/voice/agents/breeze_buddy/utils/llm_buffer_streaming.py
app/core/config/static.py
docs/BOLNA_VS_BREEZE_BUDDY_OPTIMIZATION_GAP_ANALYSIS.md
docs/LATENCY_OPTIMIZATION.md

🧰 Additional context used

🧬 Code graph analysis (4)

app/ai/voice/agents/breeze_buddy/services/__init__.py (1)

app/ai/voice/agents/breeze_buddy/services/llm_wrapper.py (1)

BreezeBuddyLLMWrapper (22-48)

app/ai/voice/agents/breeze_buddy/processors/latency_tracking.py (1)

app/ai/voice/agents/breeze_buddy/utils/latency_tracker.py (4)

LatencyTracker (49-362)

track_component (131-186)

start_turn (82-95)

end_turn (97-129)

app/ai/voice/agents/breeze_buddy/services/llm_wrapper.py (2)

app/ai/voice/agents/breeze_buddy/utils/llm_buffer_streaming.py (4)

BufferedLLMStreamWrapper (16-140)

LLMBufferConfig (143-178)

get_config (168-178)

stream_with_buffer (45-106)

app/ai/voice/agents/breeze_buddy/template/context.py (1)

context (72-74)

app/ai/voice/agents/breeze_buddy/processors/__init__.py (1)

app/ai/voice/agents/breeze_buddy/processors/latency_tracking.py (4)

STTLatencyProcessor (30-121)

LLMLatencyProcessor (124-220)

TTSLatencyProcessor (223-317)

create_latency_processors (320-357)

🪛 LanguageTool

docs/LATENCY_OPTIMIZATION.md

[grammar] ~3-~3: Ensure spelling is correct
Context: ...ucing voice conversation latency by 600-700ms** --- ## 📋 Table of Contents 1. [Overview](#ove...