
feat(openai): support TTS and Audio models #189

Draft
CorieW wants to merge 6 commits into genkit-ai:main from invertase:openai-add-tts-and-audio-models

Conversation

CorieW (Collaborator) commented Mar 2, 2026

Resolves #168

This PR adds support for TTS models (e.g. gpt-4o-mini-tts) and audio models (e.g. gpt-4o-audio-preview) in the OpenAI plugin.
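Based on the PR description and the README examples it adds, usage is expected to look roughly like the sketch below. The exact Genkit Dart API surface shown here (`Genkit(...)`, `ai.generate(...)`, `response.media`) and the `alloy` voice name are assumptions, not code from this PR:

```dart
import 'package:genkit/genkit.dart';
import 'package:genkit_openai/genkit_openai.dart';

Future<void> main() async {
  // Hypothetical setup; plugin registration details may differ.
  final ai = Genkit(plugins: [openAI()]);

  // TTS model: text in, audio out (gpt-4o-mini-tts per the PR description).
  final response = await ai.generate(
    model: 'openai/gpt-4o-mini-tts',
    prompt: 'Hello from Genkit!',
    config: OpenAIOptions(
      audioVoice: 'alloy', // voice name is an assumption
      audioFormat: 'mp3',
    ),
  );

  // The synthesized speech is expected to come back as a media part.
  print(response.media?.contentType);
}
```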

Testing:

  • TTS model (screenshot)
  • Audio model (screenshot)

gemini-code-assist (Contributor) commented:

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the OpenAI plugin's capabilities by integrating support for Text-to-Speech (TTS) and chat models that can generate audio responses. Users can now leverage OpenAI's speech synthesis features to convert text into spoken audio and engage with chat models that provide audio as part of their output. This enhancement provides a richer, more interactive experience by enabling new forms of multimodal communication with OpenAI models.

Highlights

  • OpenAI Text-to-Speech (TTS) Support: Implemented full support for OpenAI's Text-to-Speech models, allowing users to generate audio from text input with configurable voices and formats. This includes a new dedicated handler for TTS requests.
  • OpenAI Chat Audio Model Support: Added support for OpenAI chat models that can return audio responses, enabling multimodal interactions where models can speak back to the user. This involves new responseModalities options and audio aggregation logic for streaming responses.
  • New Configuration Options: Introduced new OpenAIOptions fields: responseModalities to specify desired output types (text, audio), audioVoice for selecting the speech voice, and audioFormat for choosing the audio output format.
  • Enhanced Model Information and Type Detection: Updated model information helpers to correctly identify and categorize 'audio' and 'tts' model types, ensuring appropriate handling and capabilities are exposed for these new model categories.
  • Streaming Audio Aggregation: Developed new internal logic to correctly aggregate fragmented audio data received from streaming chat completion responses, ensuring complete audio output is reconstructed.
  • Documentation and Examples: Updated the README.md with usage examples for both TTS and chat audio models, and added new example flows in example.dart and dedicated sample applications to demonstrate these features.
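The highlights above name three new `OpenAIOptions` fields. A minimal illustration of how they might be combined for a chat audio model follows; field types and accepted values are assumptions inferred from the highlights, not the PR's actual signatures:

```dart
// Hypothetical illustration of the new audio-related OpenAIOptions fields.
// String-typed fields and the 'alloy'/'wav' values are assumptions.
final options = OpenAIOptions(
  responseModalities: ['text', 'audio'], // request both text and spoken audio
  audioVoice: 'alloy',                   // which OpenAI voice to synthesize
  audioFormat: 'wav',                    // encoding of the returned audio part
);
```

For a dedicated TTS model, `responseModalities` would presumably be unnecessary, since audio is the only output such a model produces.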
Changelog
  • packages/genkit_openai/README.md
    • Updated documentation with new sections and code examples for Text-to-Speech and Chat Audio Models.
    • Expanded the OpenAIOptions description to include new audio-related parameters.
  • packages/genkit_openai/example/example.dart
    • Added new example flows for textToSpeech and chatAudioModels to demonstrate new functionalities.
  • packages/genkit_openai/lib/genkit_openai.dart
    • Exported new model info functions (audioModelInfo, ttsModelInfo).
    • Added responseModalities, audioVoice, and audioFormat properties to OpenAIOptions.
  • packages/genkit_openai/lib/genkit_openai.g.dart
    • Updated the generated OpenAIOptions class to include new audio-related fields and their serialization/deserialization logic.
  • packages/genkit_openai/lib/src/aggregation.dart
    • Introduced _AudioAccumulator to aggregate audio chunks from streaming responses.
    • Modified aggregateStreamResponses to process and accumulate audio deltas.
  • packages/genkit_openai/lib/src/converters.dart
    • Modified fromOpenAIAssistantMessage to convert audio payloads into MediaPart objects.
    • Added a helper function _audioMimeType to determine the correct MIME type for audio formats.
  • packages/genkit_openai/lib/src/models.dart
    • Added audioModelInfo and ttsModelInfo functions to define capabilities for audio generation models.
    • Updated model support checks to include new audio model types.
  • packages/genkit_openai/lib/src/openai_plugin.dart
    • Imported dart:convert and openai_dart/src/generated/client.dart for audio handling.
    • Added utility functions resolveOpenAIModalities, isSpeechSynthesisModel, _parseOpenAIModality, _extractSpeechInputText, _speechFormatToApiValue, _speechFormatToMimeType, and _resolveSpeechContentType for audio processing.
    • Updated getModelType to differentiate between 'audio', 'tts', and 'stt' models.
    • Modified _getModelInfo to return appropriate ModelInfo for audio and TTS models.
    • Adjusted model registration to include 'audio' and 'tts' types.
    • Implemented _handleSpeechSynthesis for dedicated TTS model requests.
    • Modified _handleStreaming and _handleNonStreaming to support audio responses from chat models, including passing audioFormat for conversion.
  • packages/genkit_openai/test/aggregation_test.dart
    • Added _audioChunk helper for creating audio stream chunks in tests.
    • Added a test case to verify aggregation of streamed audio fragments.
  • packages/genkit_openai/test/integration_test.dart
    • Added an integration test for text-to-speech generation.
  • packages/genkit_openai/test/openai_plugin_test.dart
    • Imported isSpeechSynthesisModel and resolveOpenAIModalities for testing.
    • Added tests for OpenAIOptions parsing of audio-related fields.
    • Added tests for resolveOpenAIModalities covering default behavior, normalization, and error handling.
    • Added tests for isSpeechSynthesisModel to correctly identify TTS models.
    • Added a test for GenkitConverter.fromOpenAIAssistantMessage to verify audio payload conversion.
    • Added tests for audioModelInfo and ttsModelInfo to confirm correct model support settings.
  • testapps/openai_sample/lib/chat_audio.dart
    • Added new file defining a sample flow for generating speech audio using OpenAI chat audio models.
  • testapps/openai_sample/lib/text_to_speech.dart
    • Added new file defining a sample flow for generating speech audio using OpenAI text-to-speech models.
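The changelog entry for `aggregation.dart` describes accumulating fragmented audio from streaming chat responses. A self-contained sketch of that idea is below; the class name, method names, and field shapes are illustrative and do not mirror the PR's private `_AudioAccumulator`:

```dart
import 'dart:convert';
import 'dart:typed_data';

// Sketch of streaming audio aggregation: each stream delta carries a
// base64-encoded audio fragment (and possibly a transcript fragment).
// This assumes each fragment is individually valid base64, so we decode
// per-delta and concatenate the raw bytes.
class AudioAccumulator {
  final BytesBuilder _bytes = BytesBuilder();
  String transcript = '';

  void addDelta({String? data, String? transcriptDelta}) {
    if (data != null) {
      _bytes.add(base64Decode(data)); // decode this fragment's bytes
    }
    if (transcriptDelta != null) {
      transcript += transcriptDelta; // transcripts concatenate as text
    }
  }

  /// Complete audio payload once the stream has finished.
  Uint8List takeBytes() => _bytes.takeBytes();
}
```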

gemini-code-assist bot left a comment
Code Review

This pull request integrates support for OpenAI's Text-to-Speech (TTS) and chat audio models. A critical security vulnerability has been identified where the TTS handler hardcodes the OpenAI API base URL, bypassing custom baseUrl configurations and potentially leading to unintended data leakage. Additionally, there are opportunities for general code improvement and test corrections to enhance robustness and maintainability.
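The fix the reviewer is asking for would presumably route the TTS request through the client's configured base URL rather than a hardcoded one. A minimal sketch, with illustrative identifiers only:

```dart
// Illustrative only: honor a custom baseUrl (proxy, regional endpoint,
// gateway) instead of hardcoding api.openai.com. The function name and
// default URL here are assumptions, not the plugin's actual code.
Uri speechEndpoint(String? configuredBaseUrl) {
  final base = configuredBaseUrl ?? 'https://api.openai.com/v1';
  return Uri.parse('$base/audio/speech');
}
```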

final responseFormat = _speechFormatToApiValue(format);
final requestedMimeType = _speechFormatToMimeType(format);

// ignore: invalid_use_of_protected_member
CorieW (Collaborator, Author) replied:

Should we do this?



Development

Successfully merging this pull request may close these issues.

Support TTS on openai

1 participant