Conversation

manas-narra (Collaborator) commented Dec 31, 2025

  • Prioritize regional languages when mixed with English words

Summary by CodeRabbit

  • Bug Fixes

    • Improved text-to-speech language detection to prioritize regional Indian language scripts when present in mixed-language input.
  • Chores

    • Enhanced logging filters to exclude additional binary data patterns from application logs.

coderabbitai bot commented Dec 31, 2025

Walkthrough

Two targeted changes were made. The script detection function now prioritizes regional Indian scripts over English and returns a regional script whenever one is detected. The logging spam filters were also expanded to exclude additional binary data patterns from specific loggers.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Script Detection Enhancement (app/ai/voice/tts/sarvam.py) | Modified detect_script() function to prioritize regional Indian scripts (Telugu, Devanagari, Tamil, etc.) over English. Now returns a regional script immediately upon detection, rather than selecting based on highest character count. Only returns English when no regional scripts are present. Improves handling of mixed-language input. |
| Logging Spam Filter Expansion (app/core/logger/__init__.py) | Extended spam-filter logic to block additional logging patterns: filters records where logger name starts with "logging" and message begins with "> BINARY" or "< BINARY". Applied across both production and development paths. |
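
The regional-first behavior described in the table can be sketched as below. This is a minimal illustration, not the code from sarvam.py: the SCRIPT_RANGES table, its key names, and the per-character counting are assumptions, and only the names detect_script, regional_script, and max_regional_count come from the review thread.

```python
# Hypothetical table of Unicode block ranges for nine Indic scripts.
# The real table in app/ai/voice/tts/sarvam.py is not shown in this thread.
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),
    "bengali": (0x0980, 0x09FF),
    "gurmukhi": (0x0A00, 0x0A7F),
    "gujarati": (0x0A80, 0x0AFF),
    "odia": (0x0B00, 0x0B7F),
    "tamil": (0x0B80, 0x0BFF),
    "telugu": (0x0C00, 0x0C7F),
    "kannada": (0x0C80, 0x0CFF),
    "malayalam": (0x0D00, 0x0D7F),
}


def detect_script(text: str) -> str:
    """Return the dominant regional script if any is present, else "english"."""
    # Count how many characters fall into each regional block.
    counts = dict.fromkeys(SCRIPT_RANGES, 0)
    for ch in text:
        code_point = ord(ch)
        for script, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[script] += 1
                break

    # Pick the regional script with the most characters.
    regional_script = None
    max_regional_count = 0
    for script, count in counts.items():
        if count > max_regional_count:
            max_regional_count = count
            regional_script = script

    # Any regional character wins over English, however many Latin letters follow.
    if max_regional_count > 0:
        return regional_script
    return "english"
```

With this sketch, mixed input such as "హలో Manas sir!" resolves to "telugu" even though Latin letters outnumber Telugu ones, and English is returned only when no regional character appears.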

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A script dance unfolds so bright,
Regional voices take their flight!
Telugu, Tamil, standing tall,
No more English drowns them all—
And filters hush the binary chatter,
Making logs that truly matter!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: prioritizing regional scripts over English in Sarvam TTS detection logic, which aligns with the primary modification in app/ai/voice/tts/sarvam.py. |

Copilot AI left a comment

Pull request overview

This PR enhances the Sarvam TTS language detection to prioritize regional Indian languages over English when both scripts are present in the text, addressing the common pattern of mixing English words with regional languages in Indian speech.

Key Changes:

  • Modified script detection logic to prioritize any regional language characters over English
  • Added logging filters to suppress verbose binary websocket messages

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| app/ai/voice/tts/sarvam.py | Updated detect_script() function to return regional language when any regional script is detected, even if mixed with English words |
| app/core/logger/__init__.py | Added filters to suppress "> BINARY" and "< BINARY" log messages from websocket logging spam |


# If ANY regional language is detected, use it (prioritize over English)
# This handles mixed text like "హలో Manas sir!" → Telugu
if regional_script and max_regional_count > 0:

Copilot AI Dec 31, 2025

The condition if regional_script and max_regional_count > 0 is redundant. If max_regional_count > 0, then regional_script will always be set to a non-None value based on the loop logic in lines 78-81. The check for regional_script being truthy is unnecessary and can be simplified to just if max_regional_count > 0.

Suggested change
- if regional_script and max_regional_count > 0:
+ if max_regional_count > 0:

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
app/ai/voice/tts/sarvam.py (1)

85-85: Optional: Simplify the conditional check.

The condition regional_script and max_regional_count > 0 includes a redundant check. If max_regional_count > 0, then regional_script must be non-None (it was set in the loop). The simplified condition max_regional_count > 0 would be sufficient.

🔎 Proposed simplification
-        if regional_script and max_regional_count > 0:
+        if max_regional_count > 0:
             return regional_script
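
The redundancy claim rests on how the selection loop updates both variables together. The sketch below assumes the loop has the usual "track the maximum" shape; the actual loop body in sarvam.py is not shown in this thread, and pick_regional is a hypothetical helper used only for illustration.

```python
from typing import Dict, Optional, Tuple


def pick_regional(counts: Dict[str, int]) -> Tuple[Optional[str], int]:
    """Assumed shape of the selection loop: both variables change together."""
    regional_script = None
    max_regional_count = 0
    for script, count in counts.items():
        # The count can only rise above 0 in the same branch that sets the script.
        if count > max_regional_count:
            max_regional_count = count
            regional_script = script
    return regional_script, max_regional_count
```

Under that assumption, max_regional_count > 0 holds exactly when regional_script is not None, so the shorter condition is equivalent. If the real loop assigns the two variables in different branches, the extra check would still be needed.
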
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef71043 and a18aef7.

📒 Files selected for processing (2)
  • app/ai/voice/tts/sarvam.py
  • app/core/logger/__init__.py
🧰 Additional context used
🪛 Ruff (0.14.10)
app/ai/voice/tts/sarvam.py

93-93: Consider moving this statement to an else block

(TRY300)

🔇 Additional comments (3)
app/core/logger/__init__.py (1)

81-82: LGTM: Binary data spam filters added correctly.

The new filter conditions for "> BINARY" and "< BINARY" message prefixes are consistent with existing spam filtering patterns and will effectively reduce logging noise.
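
A filter of this kind can be sketched with the standard library's logging.Filter. The class name BinarySpamFilter is hypothetical, and the project's logger module may implement the check differently (for example as a plain function); only the logger-name prefix and the two message prefixes come from the summary above.

```python
import logging


class BinarySpamFilter(logging.Filter):
    """Drop "> BINARY" / "< BINARY" frame dumps from loggers named "logging*"."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False tells the logging machinery to discard the record.
        if record.name.startswith("logging"):
            message = record.getMessage()
            if message.startswith(("> BINARY", "< BINARY")):
                return False
        return True
```

Attaching it with handler.addFilter(BinarySpamFilter()) applies it to every record the handler receives. A filter attached to a logger, by contrast, is not consulted for records created by that logger's descendants.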

app/ai/voice/tts/sarvam.py (2)

54-59: Clear documentation of the new prioritization behavior.

The updated docstring accurately describes the regional-first detection logic and provides a helpful example.


75-93: No breaking changes identified.

The behavioral change is safe. The detect_script function has only one consumer (_switch_language_if_needed), which uses SCRIPT_TO_SARVAM_LANG.get(detected_script, "en-IN") with a fallback, so it's protected against any return value. The mapping covers all 10 possible script return values (9 regional scripts + "english"), and the implementation is correct and well-documented.
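
The fallback described above can be illustrated as follows. Only the SCRIPT_TO_SARVAM_LANG name, the .get(..., "en-IN") call, and the "english" key come from the review; the other entries and the resolve_language helper are assumptions for illustration, and the real mapping covers all ten scripts.

```python
# Illustrative subset of the mapping; entries other than the "en-IN"
# fallback are assumed values, not copied from sarvam.py.
SCRIPT_TO_SARVAM_LANG = {
    "telugu": "te-IN",
    "devanagari": "hi-IN",
    "tamil": "ta-IN",
    "english": "en-IN",
}


def resolve_language(detected_script: str) -> str:
    """Map a detected script to a language code, defaulting to en-IN."""
    # dict.get with a default never raises, so unknown scripts degrade to English.
    return SCRIPT_TO_SARVAM_LANG.get(detected_script, "en-IN")
```

Because the lookup has a default, a script name missing from the mapping degrades to English output rather than raising KeyError.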

Note: There are currently no unit tests for detect_script, so there are no existing tests to break.

@manas-narra
Collaborator Author

Not needed anymore; we're using Cartesia TTS for regional languages now.
