Enhance ASR services configuration and provider management#289

Closed
AnkushMalaker wants to merge 1 commit into dev from feat/vibevoice-asr-2

Conversation

@AnkushMalaker
Collaborator

@AnkushMalaker AnkushMalaker commented Feb 6, 2026

  • Updated config.yml.template to include capabilities for ASR providers, detailing features like word timestamps and speaker segments.
  • Added a new vibevoice provider configuration for Microsoft VibeVoice ASR, supporting speaker diarization.
  • Enhanced .env.template with clearer provider selection and model configuration options, including CUDA settings and voice activity detection.
  • Improved docker-compose.yml to support multiple ASR providers with detailed service configurations.
  • Introduced common utilities for audio processing and ASR service management in the common module, enhancing code reusability and maintainability.
  • Updated README.md to reflect the new provider-based architecture and usage instructions for starting different ASR services.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for multiple ASR providers (Faster-Whisper, Transformers, NeMo, VibeVoice)
    • Introduced VibeVoice provider with native speaker diarization capabilities
    • Unified ASR API across all providers with standardized /transcribe, /health, and /info endpoints
    • Added interactive provider and model selection wizard
  • Configuration

    • Redesigned environment variable management for provider-specific settings
    • Added pre-configured deployment profiles for each ASR provider
    • Updated Docker Compose setup to support multi-provider ASR services
  • Documentation

    • Added comprehensive provider documentation with quick-start guides and integration examples
    • Included API endpoint documentation with usage examples
    • Added provider-specific configuration and performance notes
  • Tests

    • Added ASR provider test suites covering error handling, protocol validation, and GPU integration
    • Implemented mock ASR server for CI/CD testing

@coderabbitai
Contributor

coderabbitai bot commented Feb 6, 2026

📝 Walkthrough

This pull request introduces a provider-based ASR architecture replacing the monolithic offline setup. It adds support for multiple speech-to-text providers (faster-whisper, NeMo, Transformers, VibeVoice), shared infrastructure components, configuration-driven provider selection, Docker services for each provider, and comprehensive test coverage with protocol validation and GPU-based integration tests.

Changes

Cohort / File(s) Summary
Configuration Templates
config/config.yml.template, extras/asr-services/.env.template
Added provider model configurations (STT capabilities: word_timestamps, segments, diarization) and environment variables for compute type, device, VAD, and per-provider settings; VibeVoice provider introduced with diarization support.
Common ASR Infrastructure
extras/asr-services/common/...
New shared modules: response_models.py (Pydantic data contracts), audio_utils.py (audio I/O and processing), base_service.py (abstract ASR service and FastAPI factory), __init__.py (centralized exports).
Faster-Whisper Provider
extras/asr-services/providers/faster_whisper/...
New provider implementation with Dockerfile, service orchestration, and transcriber using CTranslate2 backend; supports language detection, VAD filtering, and word timestamps.
NeMo Provider
extras/asr-services/providers/nemo/...
New provider implementation for NVIDIA NeMo models (Parakeet, Canary) with async warmup, word-level timestamps, and chunked processing support.
Transformers Provider
extras/asr-services/providers/transformers/...
New HuggingFace Transformers-based provider supporting Whisper variants and custom models; includes flash attention and multi-language support.
VibeVoice Provider
extras/asr-services/providers/vibevoice/...
New provider for Microsoft VibeVoice with native diarization and speaker identification; includes repository cloning logic and segment/speaker output mapping.
Docker & Build Configuration
extras/asr-services/docker-compose.yml, extras/asr-services/.dockerignore
Added docker-compose services for each ASR provider (nemo-asr, faster-whisper-asr, vibevoice-asr, transformers-asr) with GPU reservations and environment configuration; updated dockerignore to include common/ and providers/.
Project & Dependency Management
extras/asr-services/pyproject.toml, extras/asr-services/__init__.py, extras/asr-services/providers/__init__.py
Updated project version to 0.2.0, reorganized dependencies into provider-specific groups (faster-whisper, transformers, nemo, vibevoice), added CUDA variant groups; package structure updated to expose common and provider modules.
Setup & Configuration Tools
extras/asr-services/init.py, wizard.py
Refactored init.py to support interactive provider/model selection with rich tables, provider-aware CUDA setup, and conditional config generation; wizard.py now includes VibeVoice ASR option.
Documentation
extras/asr-services/README.md
Comprehensive rewrite from Wyoming-based to provider-centric architecture; includes provider overview, quick start options, API endpoints, configuration profiles, and development guidance.
Test Framework & Keywords
tests/resources/asr_keywords.robot, tests/libs/mock_asr_server.py
New ASR-specific Robot Framework keywords for service lifecycle management and API testing; mock ASR server supporting multiple provider configurations with capability-based response generation.
ASR Test Suites
tests/asr/*_tests.robot
New test files: error_handling_tests.robot (HTTP error cases), gpu_integration_tests.robot (GPU-accelerated provider validation), protocol_tests.robot (API contract validation), vibevoice_diarization_test.robot (diarization-specific validation).
Mock Services & Utilities
tests/libs/mock_streaming_stt_server.py, tests/scripts/mock_transcription_server.py, tests/scripts/verify_mock_servers.py
Enhanced streaming STT mock with cumulative timestamp tracking; new WebSocket mock transcription server; verification script for mock service health checks across LLM and STT endpoints.
Test Infrastructure & Documentation
tests/Makefile, tests/.gitignore, tests/CONFIG_GUIDE.md, tests/DEBUG_GUIDE.md, tests/Dockerfile.mock-asr, tests/tags.md, tests/show_results.py, tests/run_failed_tests.sh
Added ASR test targets and GPU-specific test filtering; new test configuration guide, debug guide, and result analysis tools; mock ASR Docker image; requires-gpu tag for GPU test categorization; per-test result organization and summary reporting.
Diarization Mock Infrastructure
tests/MOCK_SPEAKER_IMPLEMENTATION.md, tests/configs/mock-vibevoice.yml, tests/mocks/...
Added mock speaker recognition for CI testing without ML inference; includes mock client with precomputed segments, vibevoice test configuration, and integration guidance with environment variable control.
WebSocket E2E Testing
tests/integration/websocket_transcription_e2e_test.robot
New end-to-end WebSocket transcription tests validating monotonic timestamp ordering, segment timing accuracy, and streaming completion signals via Redis.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly reflects the main objective of the changeset, which centers on enhancing ASR service configuration and provider management across multiple files and modules.
Docstring Coverage ✅ Passed Docstring coverage is 86.61% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@github-actions

github-actions bot commented Feb 6, 2026

⚠️ Robot Framework Test Results (No API Keys)

Status: ❌ Some tests failed

ℹ️ Note: This run excludes tests requiring external API keys (Deepgram, OpenAI).
Tests tagged with requires-api-keys will run on dev/main branches.

Metric Count
✅ Passed 87
❌ Failed 33
📊 Total 120


@AnkushMalaker
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Feb 7, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/tags.md (1)

200-200: ⚠️ Potential issue | 🟡 Minor

Stale tag count: "14" should be "15".

Line 200 still says "any tags other than the 14 approved tags above" but the document now defines 15 approved tags (updated on Line 272). Similarly, Line 236 references "existing 11 tags" which is also outdated.

📝 Proposed fix

At line 200:

-**DO NOT create or use any tags other than the 14 approved tags above.**
+**DO NOT create or use any tags other than the 15 approved tags above.**

At line 236:

-1. Can I use one of the existing 11 tags?
+1. Can I use one of the existing 15 tags?
🤖 Fix all issues with AI agents
In `@extras/asr-services/init.py`:
- Around line 155-167: The current comparison uses string `cuda_ver` (e.g.,
"12.10") and lexicographic comparisons which are incorrect for multi-digit
minors; instead convert `major` and `minor` to integers (e.g., major_i, minor_i)
and compare numeric tuples like (major_i, minor_i) against (12,8), (12,6),
(12,1) in the same branching where `match = re.search(...)` and `cuda_ver` is
built; replace the string comparisons (`if cuda_ver >= "12.8": ...`) with tuple
comparisons (e.g., `if (major_i, minor_i) >= (12, 8): return "cu128"`, etc.) and
keep the same fallback behavior and exception handling.
- Around line 290-329: The provider-specific configs in setup_provider_config
hardcode DEVICE="cuda" for "faster-whisper", "vibevoice", and "transformers"
which ignores the detected device (macOS/CPU) set by setup_cuda_version; change
these branches to use the detected device instead of "cuda" (e.g., set
self.config["DEVICE"] = self.config.get("DEVICE") or call/ensure
setup_cuda_version() was run and then assign self.config["DEVICE"] accordingly)
and remove the unconditional "cuda" literal so providers fall back to CPU/macOS
when no GPU is present.
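The numeric-tuple comparison suggested for init.py can be sketched as follows. This is an illustrative stand-in, not the actual init.py code; the function name and the fallback to "cpu" are assumptions, while the cu128/cu126/cu121 thresholds come from the review comment above:

```python
import re

def cuda_tag_from_version(version_line: str) -> str:
    """Map a CUDA version string to a wheel tag, comparing numerically
    so that e.g. 12.10 correctly sorts above 12.8."""
    match = re.search(r"(\d+)\.(\d+)", version_line)
    if not match:
        return "cpu"  # assumed fallback when no version is found
    major_i, minor_i = int(match.group(1)), int(match.group(2))
    # Tuple comparison avoids the lexicographic pitfall: "12.10" < "12.8"
    # as strings, but (12, 10) > (12, 8) as integers.
    if (major_i, minor_i) >= (12, 8):
        return "cu128"
    if (major_i, minor_i) >= (12, 6):
        return "cu126"
    if (major_i, minor_i) >= (12, 1):
        return "cu121"
    return "cpu"

print(cuda_tag_from_version("CUDA Version: 12.10"))  # cu128
print(cuda_tag_from_version("release 12.4"))         # cu121
```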

In `@extras/asr-services/providers/faster_whisper/service.py`:
- Around line 48-55: Replace deprecated asyncio.get_event_loop() calls with
asyncio.get_running_loop() in async contexts: update the warmup() method in
faster_whisper.service (where loop = asyncio.get_event_loop() is used before
await loop.run_in_executor(...) and FasterWhisperTranscriber.load_model is
invoked), any transcribe() async methods in service classes, and async timing
functions like run_mic_transcription() in client.py; ensure you call
asyncio.get_running_loop() and keep using loop.run_in_executor(...) for
offloading blocking work so the code runs without DeprecationWarnings or future
errors.
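A minimal sketch of the suggested pattern. The function names here are placeholders standing in for the real faster_whisper service code, with sleep simulating a blocking model load:

```python
import asyncio
import time

def load_model_blocking() -> str:
    """Stand-in for a blocking model load (e.g. CTranslate2 init)."""
    time.sleep(0.1)  # simulate blocking I/O or GPU setup
    return "model-ready"

async def warmup() -> str:
    # get_running_loop() is the non-deprecated way to obtain the loop
    # from inside a coroutine; get_event_loop() emits a
    # DeprecationWarning in this context on modern Python.
    loop = asyncio.get_running_loop()
    # Offload the blocking call so the event loop stays responsive.
    return await loop.run_in_executor(None, load_model_blocking)

print(asyncio.run(warmup()))  # model-ready
```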

In `@extras/asr-services/providers/nemo/transcriber.py`:
- Around line 92-96: The current async transcribe method in transcriber.py is
calling the blocking NeMo call self.model.transcribe(...) inside async with
self._lock, which blocks the event loop; change it to offload the blocking call
to a thread using asyncio.get_event_loop().run_in_executor (or
loop.run_in_executor) so the event loop stays responsive. Keep the async lock
for concurrency control but use await loop.run_in_executor(None, lambda:
self.model.transcribe([audio_file_path], batch_size=1, timestamps=True)) (or
make transcribe synchronous and let NemoService call run_in_executor); ensure
the torch.no_grad() context and any GPU device setup are executed inside the
executor task so GPU/CPU work does not run on the event loop.

In `@extras/asr-services/providers/transformers/Dockerfile`:
- Around line 19-43: The runtime Dockerfile is running as root and installs an
unnecessary git package; create and switch to a non-root user in the runtime
stage (e.g., add a user/group like "app" and use USER app) and remove "git" from
the apt-get install list in the runtime stage where curl and ffmpeg are
installed; ensure the created user's home and /app ownership match ENV PATH
usage and any copied files so the app can run without root, and apply the same
USER + removal of git change consistently in the other provider Dockerfiles
(transformers, nemo, faster_whisper, vibevoice).

In `@extras/asr-services/providers/vibevoice/Dockerfile`:
- Around line 41-44: The Dockerfile currently runs RUN git clone
https://github.com/microsoft/VibeVoice.git /models/vibevoice in the final image
(ENV HF_HOME="/models") which makes builds non-reproducible and forces git into
the runtime image; modify the Dockerfile to clone a specific commit or tag in
the builder stage (after the uv sync step) and replace the runtime-stage git
clone with a COPY --from=builder /models/vibevoice /models/vibevoice, remove git
from the runtime apt-get install, and ensure the clone uses a pinned commit SHA
or tag instead of the default branch to make builds reproducible and reduce
attack surface.

In `@extras/asr-services/providers/vibevoice/transcriber.py`:
- Around line 260-274: Change the two logger.info calls that print raw_output
previews to logger.debug and limit the preview size to avoid excessive logging;
then replace the greedy regexes used to locate the JSON array (the patterns
around raw_output and json_match) with non-greedy patterns and safer matching:
use a non-greedy quantifier (e.g., .*?) with re.DOTALL when searching for the
array of objects (the json_match for the assistant response) and similarly for
the fallback array search, or implement a short-circuit that finds the first '['
and then scans for the matching ']' to avoid catastrophic backtracking; update
references to raw_output and json_match in this block accordingly.
- Around line 285-298: The fallback parsing in _parse_vibevoice_output currently
sets "speaker" to a prefixed string using f"Speaker {seg.get('Speaker', 0)}",
which causes double-prefixing when _map_to_result later wraps it again; change
the fallback to store the raw speaker value (e.g., seg.get("Speaker", 0)
converted to int or kept as numeric) under the segment dict (use a key like
"speaker" or "speaker_raw" consistent with _map_to_result's expectations) so
that _map_to_result (and its speaker_id path) is solely responsible for
formatting the "Speaker {id}" label. Ensure types match what _map_to_result
expects to avoid further conversions.
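The short-circuit bracket scan suggested above can be sketched like this. It is a simplified stand-in for the VibeVoice output parser, covering only the array-extraction step, and for brevity it ignores brackets inside JSON string values:

```python
import json

def extract_first_json_array(raw_output: str):
    """Find the first balanced [...] span by scanning for the matching
    bracket instead of using a greedy regex, which can backtrack
    catastrophically on large model outputs."""
    start = raw_output.find("[")
    if start == -1:
        return None
    depth = 0
    for i, ch in enumerate(raw_output[start:], start):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(raw_output[start : i + 1])
                except json.JSONDecodeError:
                    return None
    return None  # unbalanced brackets

text = 'Assistant: here are the segments [{"Speaker": 0, "Text": "hi"}] done'
print(extract_first_json_array(text))  # [{'Speaker': 0, 'Text': 'hi'}]
```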

In `@extras/asr-services/pyproject.toml`:
- Around line 97-103: Remove the unused "diffusers>=0.30.0" entry from both
dependency groups in pyproject.toml: delete the "diffusers>=0.30.0" line under
the transformers array and likewise remove it from the vibevoice dependency
group so neither the transformers nor vibevoice lists include diffusers.

In `@tests/asr/protocol_tests.robot`:
- Around line 159-174: The test "ASR Capabilities Are From Known Set" is using
an incomplete known capabilities list in the variable @{known_caps}, causing
valid provider capabilities to be treated as unknown; update @{known_caps} to
include the additional capabilities returned by providers (add
language_detection, vad_filter, translation, chunked_processing,
speaker_identification, long_form) so the assertion over ${info}[capabilities]
only fails for truly unknown entries. Ensure the list in the test matches all
supported capability strings returned by FasterWhisperService,
TransformersService, NemoService, and VibeVoiceService.
- Around line 90-126: The tests unconditionally assert on ${json}[words] but
must skip when the provider lacks the word_timestamps capability; add a guard at
the start of both test cases (ASR Transcribe Response Words Have Timestamps and
ASR Transcribe Response Words Have Confidence Scores) that checks the provider
capability (e.g., inspect ${capabilities}[word_timestamps] or call an existing
capability-check keyword) and uses "Skip Test" or "Run Keyword Unless" to skip
the test when word_timestamps is false, then keep the existing assertions on
${json}[words], ${first_word}, start/end and confidence as-is; apply the same
conditional pattern used by the segment/diarization tests to both test cases.

In `@tests/libs/mock_asr_server.py`:
- Around line 43-67: The mock response models (HealthResponse, InfoResponse,
WordTimestamp, TranscriptionResult) must match the canonical models: either
import and reuse the classes from common/response_models.py or update the mock
definitions so WordTimestamp.confidence is Optional[float],
TranscriptionResult.segments is typed as List[Segment] (not list[dict]), and
include optional fields speakers: Optional[List[Speaker]], language:
Optional[str], and duration: Optional[float]; update the mock module to
reference the canonical Segment and Speaker types (or import them) so tests
exercise the same schema as production.
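The field shapes the comment asks for can be sketched with stdlib dataclasses. The real models in common/response_models.py are Pydantic classes, and these names follow the review comment rather than the actual source, so treat this as a schema illustration only:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class WordTimestamp:
    word: str
    start: float
    end: float
    confidence: Optional[float] = None  # optional, per the review comment

@dataclass
class Speaker:
    id: int
    name: Optional[str] = None

@dataclass
class Segment:
    text: str
    start: float
    end: float
    speaker: Optional[str] = None

@dataclass
class TranscriptionResult:
    text: str
    segments: List[Segment] = field(default_factory=list)  # typed, not list[dict]
    words: List[WordTimestamp] = field(default_factory=list)
    speakers: Optional[List[Speaker]] = None
    language: Optional[str] = None
    duration: Optional[float] = None

# A mock server returning this shape exercises the same optional fields
# the real providers populate.
result = TranscriptionResult(
    text="hello world",
    segments=[Segment(text="hello world", start=0.0, end=1.2)],
)
print(result.language)  # None
```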

In `@tests/resources/asr_keywords.robot`:
- Around line 88-103: The subprocess is setting ASR_MODEL to an empty string
which overrides docker-compose defaults and the @{env_vars} list is unused;
update the Start GPU ASR Service invocation so you remove the dead @{env_vars}
construction and only include ASR_MODEL in the env passed to Run Process when
"${model}" is not empty (keep env:ASR_PORT always), i.e., conditionally add the
ASR_MODEL environment entry in the Run Process call instead of unconditionally
using env:ASR_MODEL=${model} and delete the unused @{env_vars} lines.
🟡 Minor comments (23)
extras/asr-services/providers/__init__.py-1-8 (1)

1-8: ⚠️ Potential issue | 🟡 Minor

Add vibevoice provider to the documentation.

The PR objectives mention adding "a new vibevoice provider configuration for Microsoft VibeVoice ASR with support for speaker diarization," but this provider is not listed in the module docstring. This omission may confuse developers looking for available providers.

📝 Proposed fix to include vibevoice
 """
 ASR Provider implementations.
 
 Available providers:
 - faster_whisper: Fast Whisper inference using CTranslate2
 - transformers: HuggingFace Transformers for general ASR models
 - nemo: NVIDIA NeMo ASR for Parakeet, Canary, etc.
+- vibevoice: Microsoft VibeVoice ASR with speaker diarization
 """
tests/integration/websocket_transcription_e2e_test.robot-294-297 (1)

294-297: ⚠️ Potential issue | 🟡 Minor

Misleading comment: this allows small overlaps, not gaps.

${start} >= ${prev_end} - 0.1 permits the next segment's start to be up to 0.1s before the previous segment's end (i.e., overlap). Gaps (start > prev_end) would pass naturally. Update the comment to match the actual intent.

-        # Start time must be >= previous end time (allowing small gaps between segments)
+        # Start time must be >= previous end time (allowing up to 0.1s overlap between segments)
         Should Be True    ${start} >= ${prev_end} - 0.1
wizard.py-501-504 (1)

501-504: ⚠️ Potential issue | 🟡 Minor

"vibevoice".capitalize() returns "Vibevoice", not "VibeVoice".

str.capitalize() will produce "Vibevoice" which doesn't match the product branding "VibeVoice". Consider using a display-name mapping instead.

Proposed fix
+    _PROVIDER_DISPLAY = {"parakeet": "Parakeet", "vibevoice": "VibeVoice"}
+
     # Auto-add asr-services if local ASR was chosen (Parakeet or VibeVoice)
     if transcription_provider in ("parakeet", "vibevoice") and 'asr-services' not in selected_services:
-        console.print(f"[blue][INFO][/blue] Auto-adding ASR services for {transcription_provider.capitalize()} transcription")
+        console.print(f"[blue][INFO][/blue] Auto-adding ASR services for {_PROVIDER_DISPLAY.get(transcription_provider, transcription_provider)} transcription")
         selected_services.append('asr-services')
tests/DEBUG_GUIDE.md-167-169 (1)

167-169: ⚠️ Potential issue | 🟡 Minor

Add language specifier to fenced code block.

Static analysis (MD040) flags fenced code blocks without a language. Use a plain text identifier for output blocks.

Proposed fix
 Expected output:
-```
+```text
 CONFIG_FILE=/app/test-configs/mock-services.yml

tests/MOCK_SPEAKER_IMPLEMENTATION.md-204-204 (1)

204-204: ⚠️ Potential issue | 🟡 Minor

Hardcoded absolute path leaks local developer environment.

/home/ankush/workspaces/friend-lite/tests/TODO_MOCK_SPEAKER_RECOGNITION.md is a machine-specific path that won't work for any other contributor. Use a repository-relative path or remove the reference if the file isn't committed.

tests/show_results.py-189-191 (1)

189-191: ⚠️ Potential issue | 🟡 Minor

relative_to(Path.cwd()) raises ValueError if the path is not under cwd.

If the script is invoked from a directory outside the test results tree, this will crash. Consider wrapping it or falling back to the absolute path.

Proposed fix
```diff
-    print(f"Results from: {colorize(str(xml_file.relative_to(Path.cwd())), Colors.BLUE)}")
+    try:
+        display_path = str(xml_file.relative_to(Path.cwd()))
+    except ValueError:
+        display_path = str(xml_file)
+    print(f"Results from: {colorize(display_path, Colors.BLUE)}")
```
extras/asr-services/.env.template-4-8 (1)

4-8: ⚠️ Potential issue | 🟡 Minor

vibevoice is missing from the provider selection comment.

Line 7 lists faster-whisper, transformers, nemo but the PR introduces a dedicated vibevoice provider (with its own docker-compose service vibevoice-asr, config profile, and README section). Users referencing this template won't know vibevoice is an option.

Proposed fix
-# Choose one of: faster-whisper, transformers, nemo
+# Choose one of: faster-whisper, transformers, nemo, vibevoice
tests/run_failed_tests.sh-33-33 (1)

33-33: ⚠️ Potential issue | 🟡 Minor

Hardcoded test count is fragile.

The total 6 is hardcoded. If tests are added or removed from the tests array, this will be wrong.

Proposed fix
-    echo "Test $test_count/6: $test_name"
+    echo "Test $test_count/${#tests[@]}: $test_name"
tests/scripts/mock_transcription_server.py-170-174 (1)

170-174: ⚠️ Potential issue | 🟡 Minor

KeyboardInterrupt will not be caught inside an async coroutine.

asyncio.run() intercepts KeyboardInterrupt at the event-loop level and cancels running tasks — it never propagates into the coroutine as a KeyboardInterrupt. This means server.stop() on line 174 is effectively dead code. Move the interrupt handling to the synchronous call site (around asyncio.run(main())), consistent with how the sibling mock in tests/libs/mock_streaming_stt_server.py handles it.

Proposed fix
 async def main():
     """Main entry point."""
     import argparse

     parser = argparse.ArgumentParser(description="Mock WebSocket Transcription Server")
     parser.add_argument("--host", default="0.0.0.0", help="Host to bind to (default: 0.0.0.0)")
     parser.add_argument("--port", type=int, default=9999, help="Port to bind to (default: 9999)")
     args = parser.parse_args()

     server = MockTranscriptionServer(host=args.host, port=args.port)
-
-    try:
-        await server.run_forever()
-    except KeyboardInterrupt:
-        logger.info("⏹️ Keyboard interrupt received")
-        await server.stop()
+    await server.run_forever()


 if __name__ == "__main__":
-    asyncio.run(main())
+    try:
+        asyncio.run(main())
+    except KeyboardInterrupt:
+        logger.info("⏹️ Keyboard interrupt received, shutting down")
tests/show_results.py-152-162 (1)

152-162: ⚠️ Potential issue | 🟡 Minor

find_parent_suite may return a grandparent instead of the immediate parent.

suite.findall(".//test") matches descendants at any depth, so for a nested suite structure like Root > SubSuite > Test, this function could return Root instead of SubSuite. This produces incorrect suite paths in the output.

A reliable approach is to build a parent map once before traversal:

Proposed fix

Replace the find_parent_suite function and its usage with a pre-built parent map:

-def find_parent_suite(root, element):
-    """Find parent suite of an element"""
-    for suite in root.findall(".//suite"):
-        for child in suite:
-            if child == element:
-                return suite
-            # Check nested tests
-            for test in suite.findall(".//test"):
-                if test == element:
-                    return suite
-    return None
+def build_parent_map(root):
+    """Build a mapping from child elements to their parent elements."""
+    parent_map = {}
+    for parent in root.iter():
+        for child in parent:
+            parent_map[child] = parent
+    return parent_map

Then in parse_results, build the map once and use it:

parent_map = build_parent_map(root)
# ...
# Replace the while loop (lines 116-122):
parent = parent_map.get(test)
while parent is not None and parent.tag == "suite":
    suite_name = parent.get("name")
    if suite_name:
        suite_path.insert(0, suite_name)
    parent = parent_map.get(parent)
extras/asr-services/README.md-212-218 (1)

212-218: ⚠️ Potential issue | 🟡 Minor

Clarify NeMo configuration documentation.

The README documents CHUNKING_ENABLED, MIN_AUDIO_FOR_CHUNKING, and CHUNK_DURATION_SECONDS as NeMo configuration options, but the .env.template (lines 69–73) explicitly states "No additional configuration needed" for NeMo. These variables are actively used in the codebase (parakeet-offline.py, nemo/Dockerfile, docker-compose.yml). Either document them in .env.template with their defaults, or clarify in the README that they are internal defaults and not user-configurable.

extras/asr-services/init.py-376-391 (1)

376-391: ⚠️ Potential issue | 🟡 Minor

Hardcoded legacy references in next steps regardless of selected provider.

  1. Step 2 always suggests the parakeet pre-configured profile (cp configs/parakeet.env .env && docker compose up --build -d nemo-asr), even when the user selected a different provider.
  2. Step 5 references PARAKEET_ASR_URL regardless of provider, which is a legacy naming artifact.

Consider making step 2 provider-aware (or omitting it for non-nemo providers) and using a provider-neutral env var name like ASR_URL.

extras/asr-services/init.py-253-258 (1)

253-258: ⚠️ Potential issue | 🟡 Minor

Empty custom model identifier is silently accepted.

If the user selects "Enter custom model URL" and presses Enter without typing anything, prompt_value returns "". This empty string flows into setup_provider_config and gets written to .env as ASR_MODEL=, which will likely cause a startup failure downstream.

🛡️ Proposed fix: validate non-empty input
         if choice == str(len(models) + 1):
             # Custom model
             custom_model = self.prompt_value("Enter model identifier (HuggingFace repo or path)")
-            return custom_model
+            if not custom_model.strip():
+                self.console.print("[yellow]No model entered, using provider default[/yellow]")
+                return provider_info["default_model"]
+            return custom_model
extras/asr-services/providers/vibevoice/__init__.py-1-5 (1)

1-5: ⚠️ Potential issue | 🟡 Minor

Missing public exports — inconsistent with all other provider packages.

Every other provider package (faster_whisper, nemo, transformers) re-exports their Service and Transcriber classes via __all__, enabling clean imports like from providers.nemo import NemoService. This file should follow the same pattern for VibeVoiceService and VibeVoiceTranscriber.

♻️ Proposed fix: add exports
 """
 VibeVoice ASR provider.
 
 Dedicated provider for Microsoft's VibeVoice-ASR model with speaker diarization.
 """
+
+from providers.vibevoice.service import VibeVoiceService
+from providers.vibevoice.transcriber import VibeVoiceTranscriber
+
+__all__ = ["VibeVoiceService", "VibeVoiceTranscriber"]
config/config.yml.template-158-181 (1)

158-181: ⚠️ Potential issue | 🟡 Minor

VibeVoice and Parakeet both default to port 8767 — potential conflict.

stt-vibevoice defaults to host.docker.internal:8767 and stt-parakeet-batch defaults to 172.17.0.1:8767. While the hosts differ, running both on the same machine could still cause a port collision since host.docker.internal and 172.17.0.1 both resolve to the Docker host. Consider using a different default port for one of the providers (e.g., 8768 for VibeVoice).

tests/configs/mock-vibevoice.yml-62-66 (1)

62-66: ⚠️ Potential issue | 🟡 Minor

Mock config capabilities don't match the real VibeVoice provider config.

The comment on line 62 says "no word-level timestamps," yet timestamps is listed in capabilities (line 66). Meanwhile, the real stt-vibevoice entry in config/config.yml.template (lines 169–171) only declares segments and diarization — no timestamps. This mismatch means protocol tests may pass against the mock but fail against the actual provider.

Proposed fix: align with the real provider config
     # VibeVoice capabilities: segments with speakers, but no word-level timestamps
     capabilities:
       - segments
       - diarization
-      - timestamps
extras/asr-services/providers/faster_whisper/transcriber.py-29-50 (1)

29-50: ⚠️ Potential issue | 🟡 Minor

DEVICE_INDEX parsing can raise an unhandled ValueError.

Line 41: int(os.getenv("DEVICE_INDEX", "0")) will throw a ValueError if the env var is set to a non-numeric string. Consider wrapping with a fallback or validating at init.

This is a minor risk since it's operator-configured, but it would produce an unhelpful traceback.

🛡️ Proposed fix
-        self.device_index = int(os.getenv("DEVICE_INDEX", "0"))
+        try:
+            self.device_index = int(os.getenv("DEVICE_INDEX", "0"))
+        except ValueError:
+            logger.warning("Invalid DEVICE_INDEX, defaulting to 0")
+            self.device_index = 0
extras/asr-services/providers/vibevoice/transcriber.py-270-278 (1)

270-278: ⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefix on line 272.

Line 272 is an f-string with no placeholders (Ruff F541).

-            logger.warning(f"Output does not match pattern [{{...}}], checking for other formats...")
+            logger.warning("Output does not match pattern [{...}], checking for other formats...")
tests/scripts/verify_mock_servers.py-146-146 (1)

146-146: ⚠️ Potential issue | 🟡 Minor

Use Pythonic truthiness checks instead of == False / == True.

Per Ruff E712, prefer not initial_data["is_final"] and if final_data["is_final"].

🔧 Proposed fix
-            assert initial_data["is_final"] == False
+            assert not initial_data["is_final"]
             ...
-            assert final_data["is_final"] == True
+            assert final_data["is_final"]

Also applies to: 163-163

tests/scripts/verify_mock_servers.py-24-24 (1)

24-24: ⚠️ Potential issue | 🟡 Minor

Network calls lack timeouts — script may hang if servers are down.

urllib.request.urlopen and websockets.connect have no explicit timeouts. If the target server isn't running, the script will hang indefinitely instead of failing fast.

🔧 Proposed fix — add timeouts
-        with urllib.request.urlopen('http://localhost:11435/v1/models') as response:
+        with urllib.request.urlopen('http://localhost:11435/v1/models', timeout=10) as response:
-        async with websockets.connect("ws://localhost:9999") as ws:
+        async with websockets.connect("ws://localhost:9999", open_timeout=10) as ws:

Apply similarly to all urlopen calls.

Also applies to: 56-56, 87-87, 142-142

extras/asr-services/common/base_service.py-156-157 (1)

156-157: ⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefix on string without placeholders.

Line 157 has an f-string with no interpolation expressions.

-        logger.info(f"Transcription request started")
+        logger.info("Transcription request started")
extras/asr-services/common/base_service.py-192-197 (1)

192-197: ⚠️ Potential issue | 🟡 Minor

Chain the re-raised exception with from e for proper traceback.

When wrapping an exception inside an except block, use raise ... from e to preserve the exception chain and clearly distinguish it from an error in the handler itself. Also, logging.exception already includes the exception info, so passing e in the format string is redundant.

🐛 Proposed fix
         except Exception as e:
             error_time = time.time() - request_start
-            logger.exception(f"Error after {error_time:.3f}s: {e}")
-            raise HTTPException(status_code=500, detail="Transcription failed")
+            logger.exception(f"Error after {error_time:.3f}s")
+            raise HTTPException(status_code=500, detail="Transcription failed") from e
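The effect of `from e` can be seen in a minimal, self-contained sketch (names here are illustrative, not from the PR):

```python
# Demonstrates that `raise ... from e` preserves the original error as __cause__
def risky() -> None:
    raise ValueError("bad input")

cause_type = None
try:
    try:
        risky()
    except ValueError as e:
        raise RuntimeError("transcription failed") from e
except RuntimeError as err:
    cause_type = type(err.__cause__).__name__

print(cause_type)  # → ValueError
```

Without `from e`, `__cause__` would be `None` and the traceback would show only an implicit "during handling of the above exception" note.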
extras/asr-services/common/audio_utils.py-50-70 (1)

50-70: ⚠️ Potential issue | 🟡 Minor

Missing clipping before integer conversion can cause overflow.

If input values exceed the [-1.0, 1.0] range (even slightly, due to floating-point arithmetic during resampling), multiplying by int16.max/int32.max and casting will silently wrap around, producing audio glitches.

🛡️ Proposed fix: clip before conversion
     if sample_width == 2:
-        audio_int = (audio_array * np.iinfo(np.int16).max).astype(np.int16)
+        audio_clipped = np.clip(audio_array, -1.0, 1.0)
+        audio_int = (audio_clipped * np.iinfo(np.int16).max).astype(np.int16)
         return audio_int.tobytes()
     elif sample_width == 4:
-        audio_int = (audio_array * np.iinfo(np.int32).max).astype(np.int32)
+        audio_clipped = np.clip(audio_array, -1.0, 1.0)
+        audio_int = (audio_clipped * np.iinfo(np.int32).max).astype(np.int32)
         return audio_int.tobytes()
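The wraparound is easy to reproduce in isolation — a sketch assuming only NumPy:

```python
import numpy as np

# A float sample just above 1.0 — as can happen after resampling — overflows
# int16 when scaled directly, while clipping first keeps it at full scale.
sample = np.array([1.001], dtype=np.float32)
unclipped = (sample * np.iinfo(np.int16).max).astype(np.int16)
clipped = (np.clip(sample, -1.0, 1.0) * np.iinfo(np.int16).max).astype(np.int16)
print(int(clipped[0]))  # → 32767; the unclipped value wraps on common platforms
```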
🧹 Nitpick comments (34)
tests/Makefile (2)

293-316: Consider adding test-asr and test-asr-gpu to the test-all-with-slow-and-sdk target or documenting that GPU ASR tests require a separate invocation.

The test-all-with-slow-and-sdk target (line 285) runs everything without exclusions over $(TEST_DIR) which now includes asr. This means it will pick up requires-gpu tagged tests, potentially failing on machines without a GPU. If that target is intended to be the "run absolutely everything" option, this may be acceptable, but it's worth confirming the intent — especially since all explicitly excludes requires-gpu.
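One hedged sketch of an opt-in approach — the runner invocation and variable names here are assumptions about the Makefile, not copied from it:

```make
# Hypothetical: skip GPU-tagged tests unless RUN_GPU=1 is passed explicitly
GPU_EXCLUDES := $(if $(RUN_GPU),,--exclude requires-gpu)

test-all-with-slow-and-sdk:
	$(TEST_RUNNER) $(GPU_EXCLUDES) $(TEST_DIR)
```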


308-316: Hardcoded port 8767 in the informational note could drift.

Line 310 references port 8767. If the ASR service port is configurable or changes, this echo message will become misleading. Consider referencing a variable or the relevant config file instead.
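For example, driving the message from a single overridable variable keeps it honest (the variable name is an assumption):

```make
ASR_PORT ?= 8767

test-asr:
	@echo "Note: requires an ASR service on port $(ASR_PORT)"
```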

extras/asr-services/providers/__init__.py (1)

1-8: Consider exposing providers via explicit imports or __all__.

The package initializer contains only documentation without exposing any public API. Consider adding explicit imports from submodules or defining __all__ to improve provider discoverability and establish a clear public interface for the package.

Example: Define __all__ for explicit package exports
"""
ASR Provider implementations.

Available providers:
- faster_whisper: Fast Whisper inference using CTranslate2
- transformers: HuggingFace Transformers for general ASR models
- nemo: NVIDIA NeMo ASR for Parakeet, Canary, etc.
- vibevoice: Microsoft VibeVoice ASR with speaker diarization
"""

__all__ = [
    "faster_whisper",
    "transformers",
    "nemo",
    "vibevoice",
]

Alternatively, if provider classes should be directly importable from the package root, add explicit imports.

wizard.py (1)

45-49: Stale service description — mentions only "Parakeet" but now also supports VibeVoice.

The asr-services description at line 48 still reads 'Offline speech-to-text (Parakeet)'. Since VibeVoice is now a supported provider, consider updating this.

Proposed fix
         'asr-services': {
             'path': 'extras/asr-services',
             'cmd': ['uv', 'run', '--with-requirements', '../../setup-requirements.txt', 'python', 'init.py'],
-            'description': 'Offline speech-to-text (Parakeet)'
+            'description': 'Offline speech-to-text (Parakeet / VibeVoice)'
         },
extras/asr-services/providers/nemo/Dockerfile (3)

27-29: build-essential in the runtime stage adds ~200 MB+ of build tools.

Unless NeMo or texterrors compiles C extensions at import/runtime, build-essential should stay only in the builder stage. If it is genuinely needed at runtime, add a comment explaining why.


31-34: COPY --from=builder /app /app copies build artifacts (pyproject.toml, uv.lock) into runtime.

The VibeVoice Dockerfile only copies .venv from the builder (COPY --from=builder /app/.venv /app/.venv). Consider aligning for a leaner image, unless additional files from /app are needed at runtime.

Proposed fix
-# Copy virtual environment and application
-COPY --from=builder /app /app
+# Copy virtual environment from builder
+COPY --from=builder /app/.venv /app/.venv
+
+# Copy application code
 COPY common/ ./common/
 COPY providers/nemo/ ./providers/nemo/

1-46: Container runs as root — consider adding a non-root USER.

Trivy DS002 flags the absence of a non-root user. While GPU/ML containers often run as root, adding a dedicated user improves the security posture. This is common practice for production images.

Proposed fix (at end of runtime stage, before CMD)
+RUN useradd --create-home appuser
+USER appuser
+
 CMD ["python", "-m", "providers.nemo.service", "--port", "8765"]
extras/asr-services/providers/vibevoice/Dockerfile (1)

1-56: Container runs as root — consider adding a non-root USER.

Same concern as the NeMo Dockerfile — Trivy DS002 flags the absence of a non-root user. Adding a dedicated user is recommended for production images.

Proposed fix (before CMD)
+RUN useradd --create-home appuser
+USER appuser
+
 CMD ["python", "-m", "providers.vibevoice.service", "--port", "8765"]
tests/run_failed_tests.sh (1)

64-72: Script always exits 0, even when tests fail.

If this script is used in CI, the lack of a non-zero exit code on failure means the pipeline won't detect regressions. Consider exiting with a failure code when failed_count > 0.

Proposed fix
 echo "Individual test results saved in: $OUTPUTDIR/"
+
+if [ "$failed_count" -gt 0 ]; then
+    exit 1
+fi
tests/scripts/mock_transcription_server.py (3)

16-21: Use sys.exit(1) instead of bare exit(1).

The built-in exit() is intended for the interactive interpreter; sys.exit() is the correct call in scripts, though it requires sys to be imported at the top. Since import sys isn't present in this file, either add it or use raise SystemExit(1) directly.

127-128: Use logger.exception to preserve the traceback.

logging.error discards the traceback. logger.exception logs at ERROR level and includes the exception info automatically, which is valuable for debugging unexpected failures even in a mock server.

Proposed fix
         except Exception as e:
-            logger.error(f"❌ Error handling connection from {client_id}: {e}")
+            logger.exception(f"❌ Error handling connection from {client_id}")

53-54: duration is tracked but never used.

The duration variable is incremented on every audio chunk but never referenced anywhere. Either remove it or use it (e.g., in log messages or response data).
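If the counter is kept, one natural use is deriving elapsed audio time from PCM chunk sizes — the constants below are illustrative assumptions about the mock's audio format:

```python
SAMPLE_RATE = 16000   # Hz, assumed mock-server format
SAMPLE_WIDTH = 2      # bytes per sample (16-bit PCM, mono)

def chunk_seconds(chunk: bytes) -> float:
    """Duration of a mono PCM chunk in seconds."""
    return len(chunk) / (SAMPLE_RATE * SAMPLE_WIDTH)

duration = sum(chunk_seconds(c) for c in (b"\x00" * 32000, b"\x00" * 16000))
print(duration)  # → 1.5
```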

extras/asr-services/init.py (1)

146-148: Partial executable path in subprocess call.

As flagged by Ruff S607, "nvidia-smi" is resolved via PATH, which could be exploited if the environment is compromised. This is low-risk in a developer setup script, but for defense-in-depth consider using the full path or documenting the assumption.

🛡️ Optional: use absolute path
-                ["nvidia-smi"],
+                ["/usr/bin/nvidia-smi"],

Alternatively, use shutil.which("nvidia-smi") to resolve and validate the path before invocation.

tests/Dockerfile.mock-asr (1)

1-19: Consider adding a non-root user and pinning dependency versions.

Two improvements for this test Dockerfile:

  1. Root user (Trivy DS002): The container runs as root. While acceptable for a test mock server, adding a non-root user is a good habit and prevents accidental misuse as a base for production images.

  2. Unpinned pip dependencies: fastapi, uvicorn, and python-multipart are installed without version constraints. A future breaking change in any of these could silently break the mock server in CI.

♻️ Proposed improvements
 FROM python:3.12-slim
 
 WORKDIR /app
 
 # Install dependencies (FastAPI + uvicorn + multipart for file uploads)
-RUN pip install --no-cache-dir fastapi uvicorn python-multipart
+RUN pip install --no-cache-dir "fastapi>=0.115.12" "uvicorn>=0.34.2" "python-multipart>=0.0.6"
 
 # Copy mock server script
 COPY tests/libs/mock_asr_server.py .
 
 # Expose HTTP port
 EXPOSE 8765
 
+# Run as non-root user
+RUN useradd --create-home appuser
+USER appuser
+
 # Provider mode can be set via environment variable
 # Options: mock, parakeet, vibevoice, deepgram
 ENV MOCK_ASR_PROVIDER=mock
 
 # Run server (provider mode read from env var)
 CMD ["sh", "-c", "python mock_asr_server.py --host 0.0.0.0 --port 8765 --provider $MOCK_ASR_PROVIDER"]
extras/asr-services/providers/vibevoice/impl.md (1)

1-17: This reads like raw conversation notes, not documentation.

This file appears to be an unedited transcript or voice note pasted verbatim into the repo. It contains valuable architectural context (VibeVoice combined ASR+diarization vs. Parakeet+pyannote pipeline), but in its current form it has grammatical issues, missing punctuation, and an informal conversational tone that isn't suitable for committed documentation.

Consider either:

  1. Rewriting it as structured implementation notes (e.g., "Architecture Decision: Combined vs. Separate ASR+Diarization") with clear sections.
  2. Moving it to an issue or discussion thread and removing it from the codebase.
  3. At minimum, adding a disclaimer at the top (e.g., <!-- Internal brainstorming notes — not user-facing docs -->).
extras/asr-services/providers/transformers/__init__.py (1)

1-11: Docstring misleadingly lists VibeVoice as a supported model.

Line 5 says "Supports models like VibeVoice-ASR, Whisper variants, and custom fine-tuned models," but the TransformersService class itself (in service.py, line ~33) explicitly states: "For VibeVoice-ASR with speaker diarization, use the dedicated vibevoice provider." Consider removing VibeVoice from this docstring to avoid confusion.

Proposed fix
-Supports models like VibeVoice-ASR, Whisper variants, and custom fine-tuned models.
+Supports Whisper variants and custom fine-tuned models.
extras/asr-services/providers/faster_whisper/Dockerfile (1)

19-43: Container runs as root — add a non-root user.

Trivy flagged DS002: the container has no USER directive, so it runs as root. This is a security best practice for production containers.

Proposed fix: add non-root user to the runtime stage
 # Copy virtual environment and application
 COPY --from=builder /app /app
 COPY common/ ./common/
 COPY providers/faster_whisper/ ./providers/faster_whisper/
 
+RUN useradd --create-home --shell /bin/bash appuser
+USER appuser
+
 ENV PATH="/app/.venv/bin:$PATH"
tests/asr/error_handling_tests.robot (1)

25-29: Redundant status code assertion.

POST On Session with expected_status=422 already fails the test if the status doesn't match, making the Should Be Equal As Integers on line 29 redundant. The same pattern repeats in the 404 test (line 38) and 405 test (line 67). Not a bug, but it's unnecessary noise.

extras/asr-services/providers/faster_whisper/service.py (1)

108-128: Environment mutation via os.environ in main() is a code smell.

Line 118 sets os.environ["ASR_MODEL"] from the CLI arg, then line 121 reads it back. This indirect flow is fragile—if the FasterWhisperTranscriber constructor reads the env var before main() sets it, the value could be stale. Consider passing model_id directly without the env mutation.

♻️ Proposed fix
 def main():
     """Main entry point for faster-whisper service."""
     parser = argparse.ArgumentParser(description="Faster-Whisper ASR Service")
     parser.add_argument("--host", default="0.0.0.0", help="Host to bind to")
     parser.add_argument("--port", type=int, default=8765, help="Port to bind to")
     parser.add_argument("--model", help="Model identifier", required=False)
     args = parser.parse_args()
 
-    # Set model via environment if provided
-    if args.model:
-        os.environ["ASR_MODEL"] = args.model
-
     # Get model ID
-    model_id = os.getenv("ASR_MODEL", "Systran/faster-whisper-large-v3")
+    model_id = args.model or os.getenv("ASR_MODEL", "Systran/faster-whisper-large-v3")
 
     # Create service and app
     service = FasterWhisperService(model_id)
extras/asr-services/providers/transformers/transcriber.py (2)

164-173: Only a single segment is produced regardless of audio length.

When text is non-empty, lines 165–173 create exactly one Segment spanning the entire transcription. This is a consequence of the word-level timestamp mode not providing segment boundaries. Downstream consumers expecting per-sentence or per-utterance segments (e.g., for subtitle generation) will get a single monolithic segment.

This is worth documenting — either in the docstring or as a comment — so consumers of the TransformersTranscriber understand this limitation.


92-103: Inconsistent device type passed to pipeline().

Line 102 passes self.device (a string like "cuda") when device is not CPU, but passes -1 (an integer) for CPU. Modern transformers pipeline() accepts device="cpu" directly; mixing string and integer types is unnecessarily inconsistent.

♻️ Proposed fix
         self.pipeline = pipeline(
             "automatic-speech-recognition",
             model=self.model,
             tokenizer=self.processor.tokenizer,
             feature_extractor=self.processor.feature_extractor,
             torch_dtype=self.torch_dtype,
-            device=self.device if self.device != "cpu" else -1,
+            device=self.device,
         )
extras/asr-services/providers/vibevoice/transcriber.py (2)

132-147: Chain the exception with raise ... from e for better tracebacks.

When re-raising as RuntimeError on line 144, chain it from the original ImportError to preserve the full traceback context. Also, logging.exception is preferred over logging.error in except blocks (Ruff B904, TRY400).

♻️ Proposed fix
         except ImportError as e:
-            logger.error(f"Failed to import VibeVoice modules: {e}")
+            logger.exception("Failed to import VibeVoice modules")
             raise RuntimeError(
                 f"Failed to import VibeVoice modules. "
                 f"Ensure the VibeVoice repository is properly cloned. Error: {e}"
-            )
+            ) from e

156-164: trust_remote_code=True is necessary for VibeVoice but warrants a security comment.

Line 163 enables execution of arbitrary code from the model repository. This is required for VibeVoice's custom model classes, but should be documented with a comment explaining why it's needed and accepted.

         self.model = VibeVoiceASRForConditionalGeneration.from_pretrained(
             self.model_id,
             torch_dtype=self.torch_dtype,
             device_map="auto" if self.device == "cuda" else None,
             attn_implementation=self.attn_impl,
-            trust_remote_code=True,
+            trust_remote_code=True,  # Required: VibeVoice uses custom model architecture
         )
extras/asr-services/providers/nemo/transcriber.py (1)

116-121: Hardcoded confidence=1.0 is misleading when NeMo provides no confidence scores.

NeMo's Parakeet models don't produce per-word confidence. Setting confidence=1.0 implies perfect confidence, which is inaccurate. The Word model accepts None (see response_models.py), and the Transformers provider correctly uses confidence=None when unavailable. Consider aligning to the same convention.

♻️ Proposed fix
                 word = Word(
                     word=word_data["word"],
                     start=word_data["start"],
                     end=word_data["end"],
-                    confidence=1.0,
+                    confidence=None,
                 )
extras/asr-services/providers/nemo/service.py (2)

47-54: Use asyncio.get_running_loop() instead of deprecated get_event_loop().

asyncio.get_event_loop() is deprecated since Python 3.10 when there's no running loop, and emits a DeprecationWarning. Since warmup is called from an async context (FastAPI startup), asyncio.get_running_loop() is the correct replacement.

Same applies to TransformersService.warmup and TransformersService.transcribe in providers/transformers/service.py.

♻️ Proposed fix
-        loop = asyncio.get_event_loop()
+        loop = asyncio.get_running_loop()
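A minimal sketch of the pattern (function names are illustrative):

```python
import asyncio

def blocking_load() -> str:
    # Stand-in for a synchronous model load
    return "model-loaded"

async def warmup() -> str:
    loop = asyncio.get_running_loop()  # safe: called from a running loop
    return await loop.run_in_executor(None, blocking_load)

result = asyncio.run(warmup())
print(result)  # → model-loaded
```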

82-87: Capability "chunked_processing" may be misleading.

The NeMo transcriber docstring states "no custom chunking required" and the code does not implement chunking. Advertising "chunked_processing" as a capability could confuse consumers of the /info endpoint. Consider removing it or renaming it to something like "long_audio" that reflects NeMo's native long-audio support.

extras/asr-services/docker-compose.yml (1)

28-29: All services bind to the same default host port — docker compose up will fail.

All four new services default to ${ASR_PORT:-8767}:8765. Running docker compose up without specifying a service name will attempt to start all of them, causing port-binding failures. While the header comment says "run ONE provider at a time," this is easy to miss.

Consider using Docker Compose profiles to prevent accidental multi-service startup:

♻️ Example: add profiles to each provider
   nemo-asr:
+    profiles: ["nemo"]
     build:
       ...

   faster-whisper-asr:
+    profiles: ["faster-whisper"]
     build:
       ...

   vibevoice-asr:
+    profiles: ["vibevoice"]
     build:
       ...

   transformers-asr:
+    profiles: ["transformers"]
     build:
       ...

Then start with: docker compose --profile nemo up -d

Also applies to: 62-63, 94-95, 125-126

extras/asr-services/common/base_service.py (2)

26-99: Consider adding a setter or method for _is_ready instead of mutating it externally.

Line 123 in create_asr_app directly sets service._is_ready = True, breaking the encapsulation of the private attribute. A simple public method would keep the invariant within the class.

♻️ Suggested refactor

Add to BaseASRService:

def mark_ready(self) -> None:
    """Mark the service as ready to handle requests."""
    self._is_ready = True

Then in create_asr_app:

     await service.warmup()
-    service._is_ready = True
+    service.mark_ready()

145-190: /transcribe endpoint returns JSONResponse manually instead of using response_model.

The /health and /info endpoints declare response_model and return Pydantic model instances directly, but /transcribe manually calls result.to_dict() and wraps it in JSONResponse. This skips Pydantic's response validation and creates inconsistency. You could declare response_model=TranscriptionResult and return the model instance directly (FastAPI will handle serialization, including exclude_none).

Alternatively, if you need the custom to_dict exclusion behavior, this is fine — but the inconsistency is worth noting.

tests/asr/gpu_integration_tests.robot (1)

79-93: Consider caching the transcription response across test cases to reduce GPU inference time.

Three separate test cases (GPU ASR Transcription Returns Text, GPU ASR Transcription Has Word Timestamps, GPU ASR Transcription Duration Reasonable) each independently upload and transcribe the same audio file. With real GPU inference, each call could take significant time.

You could transcribe once in a suite-level keyword (or a shared test setup) and store the result in a suite variable, then validate different aspects in each test case. This would cut GPU inference from 3× to 1× for these tests.

That said, independent tests are more robust for isolation, so this is a trade-off.

Also applies to: 95-125, 127-149

extras/asr-services/common/response_models.py (2)

63-76: to_dict() includes confidence: null for words but excludes top-level None fields.

model_dump() on Word objects (Line 67) will include "confidence": null in the output when no confidence is available, while top-level optional fields (speakers, language, duration) are explicitly excluded when None. This inconsistency could surprise API consumers.

If you want consistent None-exclusion, use model_dump(exclude_none=True) for nested models too:

-            "words": [w.model_dump() for w in self.words],
-            "segments": [s.model_dump() for s in self.segments],
+            "words": [w.model_dump(exclude_none=True) for w in self.words],
+            "segments": [s.model_dump(exclude_none=True) for s in self.segments],

Similarly for speakers on Line 71. If confidence: null is intentionally part of the contract, this is fine as-is.


79-98: HealthResponse and InfoResponse look good and align with the mock server's shapes.

Note: The mock server in tests/libs/mock_asr_server.py redefines these models without Field descriptors and with slightly different typing (list[str] vs List[str]). While functionally equivalent, having the mock import from this module (or use a shared contract) would prevent future drift.

extras/asr-services/common/audio_utils.py (1)

22-47: sample_rate and channels parameters are accepted but unused.

convert_audio_to_numpy takes sample_rate and channels arguments but never references them in the function body. This misleads callers into thinking these parameters affect the conversion. Since load_audio_file handles mono conversion separately via convert_to_mono, either remove these unused params or document that they're present for future use / interface uniformity.

♻️ Option A: Remove unused params
 def convert_audio_to_numpy(
     audio_bytes: bytes,
-    sample_rate: int,
     sample_width: int = 2,
-    channels: int = 1,
 ) -> np.ndarray:

Then update the call site in load_audio_file (Line 149):

         audio_array = convert_audio_to_numpy(
-            audio_bytes, sample_rate, sample_width, channels
+            audio_bytes, sample_width=sample_width
         )
tests/libs/mock_asr_server.py (1)

103-105: Silent fallback to "mock" for unknown providers.

get_provider_config() silently returns the "mock" config when an unrecognized provider is specified. This could mask configuration errors. Consider logging a warning when falling back.

♻️ Proposed fix
 def get_provider_config():
     """Get configuration for the current provider mode."""
-    return PROVIDER_CONFIGS.get(PROVIDER_MODE, PROVIDER_CONFIGS["mock"])
+    config = PROVIDER_CONFIGS.get(PROVIDER_MODE)
+    if config is None:
+        logger.warning(f"Unknown provider '{PROVIDER_MODE}', falling back to 'mock'")
+        config = PROVIDER_CONFIGS["mock"]
+    return config

Comment on lines +155 to +167
            match = re.search(r'CUDA Version:\s*(\d+)\.(\d+)', output)
            if match:
                major, minor = match.groups()
                cuda_ver = f"{major}.{minor}"
                if cuda_ver >= "12.8":
                    return "cu128"
                elif cuda_ver >= "12.6":
                    return "cu126"
                elif cuda_ver >= "12.1":
                    return "cu121"
        except (subprocess.SubprocessError, FileNotFoundError):
            pass
        return "cu126"  # Default fallback to cu126

⚠️ Potential issue | 🔴 Critical

Bug: String comparison for CUDA version produces incorrect results for multi-digit minor versions.

cuda_ver is a string like "12.8", and >= performs lexicographic comparison. This works by coincidence for the current values but will break for versions like "12.10" (since "12.10" < "12.8" lexicographically). Convert to a numeric tuple for correct ordering.

🐛 Proposed fix
                 if match:
                     major, minor = match.groups()
-                    cuda_ver = f"{major}.{minor}"
-                    if cuda_ver >= "12.8":
+                    cuda_ver = (int(major), int(minor))
+                    if cuda_ver >= (12, 8):
                         return "cu128"
-                    elif cuda_ver >= "12.6":
+                    elif cuda_ver >= (12, 6):
                         return "cu126"
-                    elif cuda_ver >= "12.1":
+                    elif cuda_ver >= (12, 1):
                         return "cu121"
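The failure mode is easy to confirm directly:

```python
# Lexicographic string comparison misorders multi-digit minor versions,
# while integer tuples compare numerically and correctly.
print("12.10" >= "12.8")    # → False (wrong: 12.10 is the newer version)
print((12, 10) >= (12, 8))  # → True  (correct)
```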
🤖 Prompt for AI Agents
In `@extras/asr-services/init.py` around lines 155 - 167, The current comparison
uses string `cuda_ver` (e.g., "12.10") and lexicographic comparisons which are
incorrect for multi-digit minors; instead convert `major` and `minor` to
integers (e.g., major_i, minor_i) and compare numeric tuples like (major_i,
minor_i) against (12,8), (12,6), (12,1) in the same branching where `match =
re.search(...)` and `cuda_ver` is built; replace the string comparisons (`if
cuda_ver >= "12.8": ...`) with tuple comparisons (e.g., `if (major_i, minor_i)
>= (12, 8): return "cu128"`, etc.) and keep the same fallback behavior and
exception handling.

Comment on lines +290 to +329
def setup_provider_config(self, provider: str, model: str):
"""Configure provider-specific settings"""
self.print_section("Provider Configuration")

self.config["ASR_PROVIDER"] = provider
self.config["ASR_MODEL"] = model
self.config["ASR_PORT"] = "8767"

if provider == "faster-whisper":
self.config["COMPUTE_TYPE"] = "float16"
self.config["DEVICE"] = "cuda"
self.config["VAD_FILTER"] = "true"

# Ask about language
if Confirm.ask("Force specific language?", default=False):
lang = self.prompt_value("Language code (e.g., en, hi, es)", default="")
if lang:
self.config["LANGUAGE"] = lang

elif provider == "vibevoice":
# VibeVoice uses transformers backend with specific optimizations
self.config["TORCH_DTYPE"] = "float16"
self.config["DEVICE"] = "cuda"
self.config["USE_FLASH_ATTENTION"] = "true"
self.console.print("[blue][INFO][/blue] Enabled Flash Attention for VibeVoice")
self.console.print("[blue][INFO][/blue] VibeVoice provides built-in speaker diarization (no pyannote needed)")

elif provider == "transformers":
self.config["TORCH_DTYPE"] = "float16"
self.config["DEVICE"] = "cuda"
self.config["USE_FLASH_ATTENTION"] = "false"

# Hindi model-specific
if "hindi" in model.lower():
self.config["LANGUAGE"] = "hi"
self.console.print("[blue][INFO][/blue] Set language to Hindi for Hindi Whisper model")

elif provider == "nemo":
# NeMo's transcribe() handles long audio natively - no extra config needed
pass

⚠️ Potential issue | 🟠 Major

vibevoice and faster-whisper set DEVICE=cuda unconditionally — no macOS/CPU fallback.

setup_cuda_version (called for nemo/transformers) handles macOS detection and sets cpu, but the provider-specific config for vibevoice, faster-whisper, and transformers hardcodes DEVICE = "cuda". On a system without a GPU (or on macOS), this will produce an unusable configuration.

🐛 Proposed fix: use detected device
     def setup_provider_config(self, provider: str, model: str):
         """Configure provider-specific settings"""
         self.print_section("Provider Configuration")
 
         self.config["ASR_PROVIDER"] = provider
         self.config["ASR_MODEL"] = model
         self.config["ASR_PORT"] = "8767"
 
+        is_macos = platform.system() == 'Darwin'
+        default_device = "cpu" if is_macos else "cuda"
+
         if provider == "faster-whisper":
             self.config["COMPUTE_TYPE"] = "float16"
-            self.config["DEVICE"] = "cuda"
+            self.config["DEVICE"] = default_device
             self.config["VAD_FILTER"] = "true"

Apply the same pattern for vibevoice and transformers.

🤖 Prompt for AI Agents
In `@extras/asr-services/init.py` around lines 290 - 329, The provider-specific
configs in setup_provider_config hardcode DEVICE="cuda" for "faster-whisper",
"vibevoice", and "transformers" which ignores the detected device (macOS/CPU)
set by setup_cuda_version; change these branches to use the detected device
instead of "cuda" (e.g., set self.config["DEVICE"] = self.config.get("DEVICE")
or call/ensure setup_cuda_version() was run and then assign
self.config["DEVICE"] accordingly) and remove the unconditional "cuda" literal
so providers fall back to CPU/macOS when no GPU is present.
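A minimal, torch-free detection sketch consistent with the suggested fix (the helper name is hypothetical):

```python
import platform
import shutil

def default_device() -> str:
    """Prefer CUDA only when not on macOS and an NVIDIA driver is visible."""
    if platform.system() == "Darwin":
        return "cpu"
    return "cuda" if shutil.which("nvidia-smi") else "cpu"

device = default_device()
print(device in {"cpu", "cuda"})  # → True
```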

Comment on lines +48 to +55
async def warmup(self) -> None:
"""Initialize and warm up the model."""
logger.info(f"Initializing faster-whisper with model: {self.model_id}")

# Load model (runs in thread pool to not block)
loop = asyncio.get_event_loop()
self.transcriber = FasterWhisperTranscriber(self.model_id)
await loop.run_in_executor(None, self.transcriber.load_model)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all uses of get_event_loop in the extras/asr-services directory
rg -n 'get_event_loop' extras/asr-services/

Repository: SimpleOpenSoftware/chronicle

Length of output: 1541


🏁 Script executed:

#!/bin/bash
# Check Python version requirements
find . -type f \( -name 'requirements*.txt' -o -name 'setup.py' -o -name 'pyproject.toml' -o -name '.python-version' \) | head -20

Repository: SimpleOpenSoftware/chronicle

Length of output: 820


🏁 Script executed:

#!/bin/bash
# Get the context around get_event_loop calls - show the method they're in
rg -B 5 -A 2 'get_event_loop' extras/asr-services/

Repository: SimpleOpenSoftware/chronicle

Length of output: 10341


🏁 Script executed:

#!/bin/bash
# Check Python version requirement
cat extras/asr-services/pyproject.toml | grep -A 5 "python"

Repository: SimpleOpenSoftware/chronicle

Length of output: 508


🏁 Script executed:

#!/bin/bash
# Check the context of client.py to confirm it's all async
head -100 extras/asr-services/client.py | grep -E "^async def|^def|^class"



🌐 Web query:

asyncio.get_event_loop() deprecated Python 3.10 get_running_loop()

💡 Result:

asyncio.get_event_loop() itself isn’t fully removed in Python 3.10+, but calling it when there is no running loop is what triggers the DeprecationWarning (“there is no current event loop …”). The docs recommend asyncio.get_running_loop() inside async code, and asyncio.run() (or explicit loop creation) from sync code. In a future Python release, this “no running loop” case is expected to become an error, and get_event_loop() is intended to behave like get_running_loop() (i.e., raise RuntimeError when no loop is running). [1][2]

What to use instead

1) Inside a coroutine / callback (async context)

Use get_running_loop() (guarantees you’re on an active loop):

import asyncio

async def main():
    loop = asyncio.get_running_loop()
    # use loop...

asyncio.run(main())

[1]

2) From synchronous “top-level” code (starting your program)

Prefer asyncio.run():

import asyncio

async def main():
    ...

if __name__ == "__main__":
    asyncio.run(main())

[1]

3) If you truly need manual loop management (advanced / special cases)

Create and set a loop explicitly:

import asyncio

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
    loop.run_until_complete(main())
finally:
    loop.close()

(Using new_event_loop() avoids relying on the deprecated “implicit loop creation” behavior of get_event_loop().) [1][2]

Why the warning appears in 3.10+

Python tightened behavior around “no running loop”: in 3.10+ get_event_loop() can emit a DeprecationWarning when there’s no running loop (with details varying by patch releases), and “in some future Python release this will become an error.” [1]


Sources

[1] Python docs: asyncio — Event Loop (get_running_loop, get_event_loop, deprecation warning notes)
[2] CPython issue discussion referencing asyncio docs + intended future behavior (aliasing/behavior change)


Use asyncio.get_running_loop() instead of asyncio.get_event_loop() across the codebase.

asyncio.get_event_loop() is deprecated since Python 3.10 and triggers a DeprecationWarning when called without a running loop. Since this project requires Python ≥ 3.10, all users are affected. The function will become an error in future Python releases. Replace all instances with get_running_loop(), which is the correct call for async methods and functions that are guaranteed to run within an event loop.

This issue appears in:

  • Service files (in warmup() and transcribe() methods): faster_whisper, vibevoice, transformers, nemo
  • client.py (timing calls in async functions like run_mic_transcription())

All 15 occurrences should be updated consistently.

♻️ Example fix pattern
 async def warmup(self) -> None:
     """Initialize and warm up the model."""
     logger.info(f"Initializing faster-whisper with model: {self.model_id}")
 
     # Load model (runs in thread pool to not block)
-    loop = asyncio.get_event_loop()
+    loop = asyncio.get_running_loop()
     self.transcriber = FasterWhisperTranscriber(self.model_id)
     await loop.run_in_executor(None, self.transcriber.load_model)

For timing calls in async context:

-    recording_start_time = asyncio.get_event_loop().time()
+    recording_start_time = asyncio.get_running_loop().time()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    async def warmup(self) -> None:
        """Initialize and warm up the model."""
        logger.info(f"Initializing faster-whisper with model: {self.model_id}")

        # Load model (runs in thread pool to not block)
-       loop = asyncio.get_event_loop()
+       loop = asyncio.get_running_loop()
        self.transcriber = FasterWhisperTranscriber(self.model_id)
        await loop.run_in_executor(None, self.transcriber.load_model)
🤖 Prompt for AI Agents
In `@extras/asr-services/providers/faster_whisper/service.py` around lines 48 -
55, Replace deprecated asyncio.get_event_loop() calls with
asyncio.get_running_loop() in async contexts: update the warmup() method in
faster_whisper.service (where loop = asyncio.get_event_loop() is used before
await loop.run_in_executor(...) and FasterWhisperTranscriber.load_model is
invoked), any transcribe() async methods in service classes, and async timing
functions like run_mic_transcription() in client.py; ensure you call
asyncio.get_running_loop() and keep using loop.run_in_executor(...) for
offloading blocking work so the code runs without DeprecationWarnings or future
errors.

Comment on lines +92 to +96
async with self._lock:
    with torch.no_grad():
        results = self.model.transcribe(
            [audio_file_path], batch_size=1, timestamps=True
        )

⚠️ Potential issue | 🔴 Critical

Blocking GPU inference on the asyncio event loop.

self.model.transcribe() is a synchronous, CPU/GPU-bound NeMo call. Wrapping it in async with self._lock does not offload it — it blocks the entire event loop during inference, making the FastAPI service unresponsive (health checks, concurrent requests all stall).

Compare with TransformersService.transcribe (in providers/transformers/service.py, lines 88-96) which correctly uses loop.run_in_executor(None, lambda: self.transcriber.transcribe(...)).

Either make transcribe synchronous and have the caller (NemoService) wrap it in run_in_executor, or offload the blocking call here with asyncio.get_running_loop().run_in_executor().

🐛 Proposed fix: offload blocking inference to a thread
     async def transcribe(self, audio_file_path: str) -> TranscriptionResult:
         ...
         if not self._is_loaded or self.model is None:
             raise RuntimeError("Model not loaded. Call load_model() first.")
 
         logger.info(f"Transcribing: {audio_file_path}")
 
+        loop = asyncio.get_running_loop()
         async with self._lock:
-            with torch.no_grad():
-                results = self.model.transcribe(
-                    [audio_file_path], batch_size=1, timestamps=True
-                )
+            results = await loop.run_in_executor(
+                None,
+                lambda: self._transcribe_sync(audio_file_path),
+            )
 
         ...
+
+    def _transcribe_sync(self, audio_file_path: str):
+        """Run NeMo transcription synchronously (intended for executor)."""
+        with torch.no_grad():
+            return self.model.transcribe(
+                [audio_file_path], batch_size=1, timestamps=True
+            )
🤖 Prompt for AI Agents
In `@extras/asr-services/providers/nemo/transcriber.py` around lines 92 - 96, The
current async transcribe method in transcriber.py is calling the blocking NeMo
call self.model.transcribe(...) inside async with self._lock, which blocks the
event loop; change it to offload the blocking call to a thread using
asyncio.get_running_loop().run_in_executor (or loop.run_in_executor) so the event
loop stays responsive. Keep the async lock for concurrency control but use await
loop.run_in_executor(None, lambda: self.model.transcribe([audio_file_path],
batch_size=1, timestamps=True)) (or make transcribe synchronous and let
NemoService call run_in_executor); ensure the torch.no_grad() context and any
GPU device setup are executed inside the executor task so GPU/CPU work does not
run on the event loop.
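The lock-plus-executor shape can be sketched end to end; `Transcriber` and `_transcribe_sync` below are illustrative stand-ins for the NeMo classes, not the real API, and the ticker task shows that the event loop keeps running while inference is offloaded:

```python
import asyncio
import time

class Transcriber:
    """Illustrative stand-in for the NeMo transcriber (not the real class)."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()

    def _transcribe_sync(self, path: str) -> str:
        # Blocking stand-in for model.transcribe(...); runs in a worker thread.
        time.sleep(0.05)
        return f"text from {path}"

    async def transcribe(self, path: str) -> str:
        loop = asyncio.get_running_loop()
        async with self._lock:  # still serializes access to the (shared) model
            # Offload the blocking call so the event loop stays responsive.
            return await loop.run_in_executor(None, self._transcribe_sync, path)

async def main() -> None:
    t = Transcriber()
    ticks = []

    async def ticker() -> None:
        # These ticks can only fire while inference runs if the loop is free.
        for _ in range(3):
            ticks.append("tick")
            await asyncio.sleep(0.01)

    text, _ = await asyncio.gather(t.transcribe("a.wav"), ticker())
    print(text, len(ticks))  # -> text from a.wav 3

asyncio.run(main())
```

Keeping the async lock outside the executor call preserves the one-request-at-a-time GPU discipline while health checks and other requests continue to be served.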

Comment on lines +19 to +43
FROM python:3.12-slim-bookworm AS runtime
ENV PYTHONUNBUFFERED=1
WORKDIR /app

# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        curl ffmpeg git \
    && rm -rf /var/lib/apt/lists/*

# Copy virtual environment and application
COPY --from=builder /app /app
COPY common/ ./common/
COPY providers/transformers/ ./providers/transformers/

ENV PATH="/app/.venv/bin:$PATH"

# Default environment variables
ENV ASR_MODEL="openai/whisper-large-v3"
ENV USE_FLASH_ATTENTION="false"
ENV DEVICE="cuda"
ENV TORCH_DTYPE="float16"

EXPOSE 8765

CMD ["python", "-m", "providers.transformers.service", "--port", "8765"]

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all Dockerfiles in asr-services providers
echo "=== All provider Dockerfiles ==="
fd Dockerfile extras/asr-services/providers/ 2>/dev/null || echo "No Dockerfiles found"

echo ""
echo "=== Checking for USER directive ==="
fd Dockerfile extras/asr-services/providers/ -x sh -c '
  echo "File: {}"
  if grep -q "^USER" {}; then
    echo "  ✓ Has USER directive"
    grep "^USER" {}
  else
    echo "  ✗ Missing USER directive"
  fi
  echo ""
' 2>/dev/null



🏁 Script executed:

#!/bin/bash
cat -n extras/asr-services/providers/transformers/Dockerfile



🏁 Script executed:

#!/bin/bash
# Check service code for git dependencies or runtime model downloads
echo "=== Service entry point ==="
cat -n extras/asr-services/providers/transformers/service.py 2>/dev/null | head -50

echo ""
echo "=== Check for git-related imports or huggingface_hub clone usage ==="
rg -l "subprocess|git|GIT_CLONE|clone_repo" extras/asr-services/providers/transformers/ 2>/dev/null || echo "No direct git usage found"

echo ""
echo "=== Check pyproject.toml for huggingface_hub or transformers extras ==="
rg "huggingface|git|transformers" pyproject.toml 2>/dev/null | head -20



🏁 Script executed:

#!/bin/bash
echo "=== Transcriber implementation ==="
cat -n extras/asr-services/providers/transformers/transcriber.py 2>/dev/null | head -80

echo ""
echo "=== Full pyproject.toml ==="
cat pyproject.toml 2>/dev/null | grep -A 50 "tool.uv"



🏁 Script executed:

#!/bin/bash
echo "=== Check pyproject.toml dependencies ==="
cat pyproject.toml | grep -E "(transformers|huggingface|torch|git)" -A 1 -B 1

echo ""
echo "=== Check if HuggingFace models require git in transformers provider ==="
# transformers library uses HTTP by default, not git
# Check if any model in the provider would need git (e.g., custom git-based models)
rg "git|clone|subprocess" extras/asr-services/providers/transformers/ --type py



🏁 Script executed:

#!/bin/bash
# Find pyproject.toml
find . -name "pyproject.toml" -type f 2>/dev/null

echo ""
echo "=== Check if git is actually needed in runtime ==="
# HuggingFace Transformers uses HTTP downloads by default, not git
# Let's verify the model loading code more thoroughly
cat -n extras/asr-services/providers/transformers/transcriber.py | sed -n '65,150p'



Add a non-root USER to the runtime stage and remove unnecessary git dependency.

Running the container as root is flagged by Trivy (DS002) and violates security best practices. Additionally, git is installed in the runtime stage (line 25) but is unnecessary—HuggingFace Transformers downloads models via HTTP by default using from_pretrained(), not git clone.

This issue is systemic across all provider Dockerfiles: transformers, nemo, faster_whisper, and vibevoice all lack the USER directive. Consider fixing consistently across all of them.

🛡️ Proposed fix
 # Copy virtual environment and application
 COPY --from=builder /app /app
 COPY common/ ./common/
 COPY providers/transformers/ ./providers/transformers/
 
 ENV PATH="/app/.venv/bin:$PATH"
 
+# Run as non-root user
+RUN useradd --create-home appuser
+USER appuser
+
 # Default environment variables
 ENV ASR_MODEL="openai/whisper-large-v3"

And simplify lines 24–26:

 RUN apt-get update && apt-get install -y --no-install-recommends \
-        curl ffmpeg git \
+        curl ffmpeg \
     && rm -rf /var/lib/apt/lists/*
🤖 Prompt for AI Agents
In `@extras/asr-services/providers/transformers/Dockerfile` around lines 19 - 43,
The runtime Dockerfile is running as root and installs an unnecessary git
package; create and switch to a non-root user in the runtime stage (e.g., add a
user/group like "app" and use USER app) and remove "git" from the apt-get
install list in the runtime stage where curl and ffmpeg are installed; ensure
the created user's home and /app ownership match ENV PATH usage and any copied
files so the app can run without root, and apply the same USER + removal of git
change consistently in the other provider Dockerfiles (transformers, nemo,
faster_whisper, vibevoice).

Comment on lines +97 to 103
transformers = [
    "transformers>=4.40.0",
    "accelerate>=0.30.0",
    "diffusers>=0.30.0",
    "torch>=2.3",
    "torchaudio>=2.3",
]

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Search for diffusers imports in Python files within extras/asr-services
find extras/asr-services -type f -name "*.py" | head -20



🏁 Script executed:

# Search for any reference to diffusers in the entire extras/asr-services directory
rg 'diffusers' extras/asr-services/ -i



🏁 Script executed:

# Get the structure of the asr-services directory
tree -L 3 extras/asr-services/ 2>/dev/null || find extras/asr-services -type f -name "*.py" | head -20



🏁 Script executed:

# Check transformers provider
cat -n extras/asr-services/providers/transformers/transcriber.py | head -30



🏁 Script executed:

# Check vibevoice provider
cat -n extras/asr-services/providers/vibevoice/service.py | head -30



🏁 Script executed:

# Check vibevoice transcriber implementation
cat -n extras/asr-services/providers/vibevoice/transcriber.py | head -50



🏁 Script executed:

# Search for actual imports in the transformers provider
grep -n "^import\|^from" extras/asr-services/providers/transformers/transcriber.py



🏁 Script executed:

# Search for actual imports in the vibevoice provider
grep -n "^import\|^from" extras/asr-services/providers/vibevoice/transcriber.py



🏁 Script executed:

# Check full transformers transcriber for any diffusers usage
cat extras/asr-services/providers/transformers/transcriber.py | grep -i "diffusers\|import"



🏁 Script executed:

# Check full vibevoice transcriber for any diffusers usage
cat extras/asr-services/providers/vibevoice/transcriber.py | grep -i "diffusers\|import"



🏁 Script executed:

# Check if transformers library depends on diffusers by looking at PyPI
web_search "does transformers library depend on diffusers as transitive dependency"



🌐 Web query:

transformers package dependencies does it include diffusers

💡 Result:

No—installing transformers does not install diffusers as a required dependency.

  • In transformers, diffusers is not part of the core install_requires (required dependencies). (fossies.org)
  • diffusers only shows up under an optional extra (e.g., an “agents” extra in the setup configuration), meaning you only get it if you explicitly install that extra. (fossies.org)
  • The PyPI project metadata for transformers lists many Provides-Extra options, but diffusers is not a base requirement. (pypi.org)

Citations:


Remove diffusers>=0.30.0 from both the transformers and vibevoice dependency groups—it is unused.

diffusers is not imported or used anywhere in the transformers or vibevoice provider implementations. The transformers library does not list diffusers as a required or transitive dependency (it only appears in optional extras). Removing this from both groups (lines 100 and 114) will eliminate unnecessary bloat from the installation.

🤖 Prompt for AI Agents
In `@extras/asr-services/pyproject.toml` around lines 97 - 103, Remove the unused
"diffusers>=0.30.0" entry from both dependency groups in pyproject.toml: delete
the "diffusers>=0.30.0" line under the transformers array and likewise remove it
from the vibevoice dependency group so neither the transformers nor vibevoice
lists include diffusers.

Comment on lines +90 to +126
ASR Transcribe Response Words Have Timestamps
    [Documentation]    Verify word-level timestamps in transcription response
    [Tags]    audio-upload

    ${response}=    Upload Audio For ASR Transcription    ${TEST_AUDIO_FILE}
    ${json}=    Set Variable    ${response.json()}

    # Verify words list is not empty
    Should Not Be Empty    ${json}[words]

    # Verify first word has required timestamp fields
    ${first_word}=    Get From List    ${json}[words]    0
    Dictionary Should Contain Key    ${first_word}    word
    Dictionary Should Contain Key    ${first_word}    start
    Dictionary Should Contain Key    ${first_word}    end

    # Verify timestamps are numeric and ordered correctly
    Should Be True    isinstance($first_word['start'], (int, float))
    Should Be True    isinstance($first_word['end'], (int, float))
    Should Be True    ${first_word}[start] < ${first_word}[end]
    Should Be True    ${first_word}[start] >= 0

ASR Transcribe Response Words Have Confidence Scores
    [Documentation]    Verify confidence scores in word-level transcription
    [Tags]    audio-upload

    ${response}=    Upload Audio For ASR Transcription    ${TEST_AUDIO_FILE}
    ${json}=    Set Variable    ${response.json()}

    # Verify first word has confidence
    ${first_word}=    Get From List    ${json}[words]    0
    Dictionary Should Contain Key    ${first_word}    confidence

    # Verify confidence is between 0 and 1
    Should Be True    isinstance($first_word['confidence'], (int, float))
    Should Be True    ${first_word}[confidence] >= 0
    Should Be True    ${first_word}[confidence] <= 1

⚠️ Potential issue | 🟠 Major

Word-level tests will fail for providers without word_timestamps capability (e.g., VibeVoice).

Tests "ASR Transcribe Response Words Have Timestamps" (line 90) and "ASR Transcribe Response Words Have Confidence Scores" (line 112) unconditionally assert that the words list is non-empty. However, VibeVoice explicitly does not provide word-level timestamps. The segment and diarization tests (lines 176–224) correctly guard on capability first — these should follow the same pattern.

Proposed fix for the timestamps test (apply similarly to confidence test)
 ASR Transcribe Response Words Have Timestamps
     [Documentation]    Verify word-level timestamps in transcription response
     [Tags]    audio-upload
 
+    ${info}=    Get ASR Service Info    ${ASR_URL}
+    ${has_word_timestamps}=    Evaluate    'word_timestamps' in $info['capabilities']
+    Skip If    not ${has_word_timestamps}    Provider does not have word_timestamps capability
+
     ${response}=    Upload Audio For ASR Transcription    ${TEST_AUDIO_FILE}
     ${json}=    Set Variable    ${response.json()}
🤖 Prompt for AI Agents
In `@tests/asr/protocol_tests.robot` around lines 90 - 126, The tests
unconditionally assert on ${json}[words] but must skip when the provider lacks
the word_timestamps capability; add a guard at the start of both test cases (ASR
Transcribe Response Words Have Timestamps and ASR Transcribe Response Words Have
Confidence Scores) that checks the provider capability (e.g., inspect
${capabilities}[word_timestamps] or call an existing capability-check keyword)
and uses "Skip Test" or "Run Keyword Unless" to skip the test when
word_timestamps is false, then keep the existing assertions on ${json}[words],
${first_word}, start/end and confidence as-is; apply the same conditional
pattern used by the segment/diarization tests to both test cases.

Comment on lines +159 to +174
ASR Capabilities Are From Known Set
    [Documentation]    Verify reported capabilities are valid known capabilities
    ...    Known capabilities: timestamps, word_timestamps, segments, diarization
    [Tags]    infra

    ${info}=    Get ASR Service Info    ${ASR_URL}

    # Define known capabilities
    @{known_caps}=    Create List    timestamps    word_timestamps    segments    diarization

    # All reported capabilities should be known
    FOR    ${cap}    IN    @{info}[capabilities]
        ${is_known}=    Evaluate    $cap in $known_caps
        Should Be True    ${is_known}
        ...    Unknown capability reported: ${cap}. Known: ${known_caps}
    END

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find service.py file
find . -name "service.py" -o -name "*service*.py" | head -20



🏁 Script executed:

#!/bin/bash
# Search for get_capabilities method implementations across the codebase
rg -n "def get_capabilities" --type=py -A 10



🏁 Script executed:

#!/bin/bash
# Also search for classes that might implement ASR services
rg -n "class.*Service" --type=py | grep -i "asr\|whisper\|transformer"



Known capabilities set is incomplete — will reject valid provider capabilities.

The known set on line 167 is [timestamps, word_timestamps, segments, diarization], but the providers return additional capabilities:

  • FasterWhisperService returns: language_detection, vad_filter, translation
  • TransformersService returns: language_detection
  • NemoService returns: chunked_processing
  • VibeVoiceService returns: speaker_identification, long_form

The test will fail for any of these providers unless the known capabilities list includes all their supported capabilities.

Proposed fix: extend the known capabilities list
     # Define known capabilities
-    @{known_caps}=    Create List    timestamps    word_timestamps    segments    diarization
+    @{known_caps}=    Create List    timestamps    word_timestamps    segments    diarization    language_detection    vad_filter    translation    chunked_processing    speaker_identification    long_form
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ASR Capabilities Are From Known Set
    [Documentation]    Verify reported capabilities are valid known capabilities
    ...    Known capabilities: timestamps, word_timestamps, segments, diarization
    [Tags]    infra

    ${info}=    Get ASR Service Info    ${ASR_URL}

    # Define known capabilities
-   @{known_caps}=    Create List    timestamps    word_timestamps    segments    diarization
+   @{known_caps}=    Create List    timestamps    word_timestamps    segments    diarization    language_detection    vad_filter    translation    chunked_processing    speaker_identification    long_form

    # All reported capabilities should be known
    FOR    ${cap}    IN    @{info}[capabilities]
        ${is_known}=    Evaluate    $cap in $known_caps
        Should Be True    ${is_known}
        ...    Unknown capability reported: ${cap}. Known: ${known_caps}
    END
🤖 Prompt for AI Agents
In `@tests/asr/protocol_tests.robot` around lines 159 - 174, The test "ASR
Capabilities Are From Known Set" is using an incomplete known capabilities list
in the variable @{known_caps}, causing valid provider capabilities to be treated
as unknown; update @{known_caps} to include the additional capabilities returned
by providers (add language_detection, vad_filter, translation,
chunked_processing, speaker_identification, long_form) so the assertion over
${info}[capabilities] only fails for truly unknown entries. Ensure the list in
the test matches all supported capability strings returned by
FasterWhisperService, TransformersService, NemoService, and VibeVoiceService.
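The intended check is easy to express in plain Python, which makes the semantics explicit; the capability names below are the ones listed in the review above:

```python
# Union of capability strings reported by the four providers, per the review.
KNOWN_CAPS = {
    "timestamps", "word_timestamps", "segments", "diarization",
    "language_detection", "vad_filter", "translation",
    "chunked_processing", "speaker_identification", "long_form",
}

def unknown_capabilities(reported: list[str]) -> list[str]:
    # Anything a provider reports that the test suite does not recognize.
    return sorted(set(reported) - KNOWN_CAPS)

print(unknown_capabilities(["timestamps", "vad_filter"]))   # -> []
print(unknown_capabilities(["timestamps", "mystery_cap"]))  # -> ['mystery_cap']
```

Set difference keeps the test order-independent and makes the failure message list exactly the unrecognized entries.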

Comment on lines +43 to +67
# Response models
class HealthResponse(BaseModel):
    status: str
    model: str
    provider: str


class InfoResponse(BaseModel):
    model_id: str
    provider: str
    capabilities: list[str]
    supported_languages: Optional[list[str]] = None


class WordTimestamp(BaseModel):
    word: str
    start: float
    end: float
    confidence: float


class TranscriptionResult(BaseModel):
    text: str
    words: list[WordTimestamp]
    segments: list[dict]

🛠️ Refactor suggestion | 🟠 Major

Mock response models diverge from production models in common/response_models.py.

The mock defines its own HealthResponse, InfoResponse, WordTimestamp, and TranscriptionResult that differ from the canonical models:

Field                         | Production (common/response_models.py) | Mock
Word.confidence               | Optional[float]                        | float (required)
TranscriptionResult.segments  | List[Segment] (typed model)            | list[dict] (untyped)
TranscriptionResult.speakers  | Optional[List[Speaker]]                | absent
TranscriptionResult.language  | Optional[str]                          | absent
TranscriptionResult.duration  | Optional[float]                        | absent

Tests passing against this mock may not catch schema mismatches with the real service (e.g., missing speakers/language/duration fields, or confidence being null). Consider importing and reusing the models from common/response_models.py, or at minimum aligning the mock's TranscriptionResult shape to include optional fields that the production API returns.

🤖 Prompt for AI Agents
In `@tests/libs/mock_asr_server.py` around lines 43 - 67, The mock response models
(HealthResponse, InfoResponse, WordTimestamp, TranscriptionResult) must match
the canonical models: either import and reuse the classes from
common/response_models.py or update the mock definitions so
WordTimestamp.confidence is Optional[float], TranscriptionResult.segments is
typed as List[Segment] (not list[dict]), and include optional fields speakers:
Optional[List[Speaker]], language: Optional[str], and duration: Optional[float];
update the mock module to reference the canonical Segment and Speaker types (or
import them) so tests exercise the same schema as production.
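As a sketch of what the aligned mock shape could look like — assuming the field names tabulated above match common/response_models.py, and simplifying Speaker to a dict for brevity:

```python
from typing import Optional
from pydantic import BaseModel

class Word(BaseModel):
    word: str
    start: float
    end: float
    confidence: Optional[float] = None  # providers may omit confidence

class Segment(BaseModel):
    text: str
    start: float
    end: float

class TranscriptionResult(BaseModel):
    text: str
    words: list[Word]
    segments: list[Segment]          # typed, not list[dict]
    speakers: Optional[list[dict]] = None  # simplified stand-in for List[Speaker]
    language: Optional[str] = None
    duration: Optional[float] = None

# A provider response that omits the optional fields still validates:
r = TranscriptionResult(text="hi", words=[], segments=[])
print(r.language, r.duration)  # -> None None
```

Importing the canonical classes would be stronger still, since any production schema change then fails the mock immediately.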

Comment on lines +88 to +103
@{env_vars}=    Create List
IF    "${model}" != "${EMPTY}"
    Append To List    ${env_vars}    ASR_MODEL=${model}
END
Append To List    ${env_vars}    ASR_PORT=${port}

Log To Console    \n🚀 Starting ${service} on port ${port}...
IF    "${model}" != "${EMPTY}"
    Log To Console    Model: ${model}
END

# Start the service
${result}=    Run Process    docker    compose    up    -d    --build    ${service}
...    cwd=${asr_dir}
...    env:ASR_MODEL=${model}
...    env:ASR_PORT=${port}

⚠️ Potential issue | 🟠 Major

env:ASR_MODEL=${model} passes an empty string when no model is specified, which can override docker-compose defaults.

When Start GPU ASR Service is called without ${model} (defaults to ${EMPTY}), line 102 sets ASR_MODEL to an empty string in the subprocess environment. Compose only falls back to a default for an empty variable with the ${ASR_MODEL:-default} form; plain ${ASR_MODEL} or ${ASR_MODEL-default} interpolation passes the empty string through — so the service could receive an empty model ID, likely causing a startup failure.

Additionally, the @{env_vars} list built on lines 88-92 is never passed to Run Process — it's dead code.

🐛 Proposed fix: only pass ASR_MODEL when non-empty
     # Start the service
+    IF    "${model}" != "${EMPTY}"
     ${result}=    Run Process    docker    compose    up    -d    --build    ${service}
     ...    cwd=${asr_dir}
     ...    env:ASR_MODEL=${model}
     ...    env:ASR_PORT=${port}
+    ELSE
+    ${result}=    Run Process    docker    compose    up    -d    --build    ${service}
+    ...    cwd=${asr_dir}
+    ...    env:ASR_PORT=${port}
+    END

And remove the unused @{env_vars} construction (lines 88-92).

🤖 Prompt for AI Agents
In `@tests/resources/asr_keywords.robot` around lines 88 - 103, The subprocess is
setting ASR_MODEL to an empty string which overrides docker-compose defaults and
the @{env_vars} list is unused; update the Start GPU ASR Service invocation so
you remove the dead @{env_vars} construction and only include ASR_MODEL in the
env passed to Run Process when "${model}" is not empty (keep env:ASR_PORT
always), i.e., conditionally add the ASR_MODEL environment entry in the Run
Process call instead of unconditionally using env:ASR_MODEL=${model} and delete
the unused @{env_vars} lines.
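For reference, the shell-style interpolation rules Compose follows can be checked directly; note that the :- form does fall back on an empty value, while the hyphen-only form does not (whisper-large is just a placeholder default):

```shell
#!/bin/sh
# ${VAR-default}  -> default only when VAR is unset
# ${VAR:-default} -> default when VAR is unset OR empty
unset ASR_MODEL
echo "unset  :- ${ASR_MODEL:-whisper-large}"
ASR_MODEL=""
echo "empty  :- ${ASR_MODEL:-whisper-large}"
echo "empty  -  [${ASR_MODEL-whisper-large}]"
# -> unset  :- whisper-large
# -> empty  :- whisper-large
# -> empty  -  []
```

Either way, not exporting ASR_MODEL at all when no model is requested is the safest path, since it works regardless of which interpolation form the compose file uses.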

@github-actions

github-actions bot commented Feb 7, 2026

⚠️ Robot Framework Test Results (Label-Triggered Full Suite)

Status: ❌ Some tests failed

🏷️ Note: This run was triggered by the test-with-api-keys label.
All tests including API-dependent tests have been executed.

Metric Count
✅ Passed 124
❌ Failed 19
📊 Total 143


@AnkushMalaker AnkushMalaker deleted the feat/vibevoice-asr-2 branch February 22, 2026 12:42