support timed transcripts from tts #2580
Conversation
✅ Changeset File Detected
The following changeset entries were found:
Change description: |
tasks.append(tts_task)
if (
    (tts := self.tts)
    and (tts.capabilities.timed_transcript or not tts.capabilities.streaming)
I have a concern here: if a user creates a new AudioFrame in a customized tts_node but doesn't forward the timed transcripts, we may miss the text response. wdyt? @theomonnom
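To illustrate the concern, a minimal sketch of what a customized tts_node would need to do to keep the transcripts flowing. Agent.default.tts_node is the documented fallback, but the exact import paths and the my_audio_filter helper here are assumptions, and the quoted diff uses both user_data and userdata spellings, so adjust to whichever lands:

```python
from livekit.agents import Agent, ModelSettings  # import paths assumed

def my_audio_filter(frame):
    """Hypothetical DSP step that builds and returns a brand-new AudioFrame."""
    return frame  # placeholder

class MyAgent(Agent):
    async def tts_node(self, text, model_settings: ModelSettings):
        async for frame in Agent.default.tts_node(self, text, model_settings):
            new_frame = my_audio_filter(frame)
            # Without this, the transcripts attached to `frame` are silently lost:
            new_frame.user_data["timed_transcripts"] = frame.user_data.get(
                "timed_transcripts", []
            )
            yield new_frame
```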
def transcription_node(
-    self, text: AsyncIterable[str], model_settings: ModelSettings
+    self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings
) -> AsyncIterable[str] | Coroutine[Any, Any, AsyncIterable[str]] | Coroutine[Any, Any, None]:
TimedString is a str; to be fair, I'm not even sure we should encourage people to use the timed transcripts here.
Suggested change:
-    self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings
+    self, text: AsyncIterable[str], model_settings: ModelSettings
do you have any suggestions for an alternative way for people to access the timed transcripts?
IMO we should just keep it implicit for now
I understand the concern that this may change in the future, but IMO we should expose the timed transcripts to users in some way; this has been asked for by some folks for a while. Any alternatives for this?
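One way to expose them without new API surface, sketched under assumptions (the TimedString import path and its start_time/end_time attributes are taken from this PR, not a settled interface): since TimedString subclasses str, an overridden transcription_node can inspect timings opportunistically and still yield plain text downstream.

```python
from typing import AsyncIterable

from livekit.agents import Agent, ModelSettings
from livekit.agents.tts import TimedString  # import path is an assumption

class MyAgent(Agent):
    async def transcription_node(
        self, text: AsyncIterable[str], model_settings: ModelSettings
    ) -> AsyncIterable[str]:
        async for chunk in text:
            if isinstance(chunk, TimedString):
                # Attribute names assumed from this PR, not a settled API.
                print(f"{chunk!r}: {chunk.start_time} -> {chunk.end_time}")
            yield chunk  # TimedString is a str, so downstream code is unaffected
```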
return

if last_frame is not None:
    last_frame.user_data["timed_transcripts"] = timed_transcripts
Is the idea to send the timed transcripts at the end of the segment?
If so, maybe this could just be a new field on SynthesizedAudio. (This also means that we don't have synchronized transcripts until the whole generation is done?)
It's not at the end of the segment; it's usually at the start of the TTS output.
The buffered timed_transcripts are always added to the next audio frame, not only the last frame.
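A self-contained sketch of that buffering behavior, using a stand-in Frame type rather than the real rtc.AudioFrame:

```python
from dataclasses import dataclass, field
from typing import Any, AsyncIterable

@dataclass
class Frame:
    """Stand-in for rtc.AudioFrame carrying a user_data dict."""
    pcm: bytes
    user_data: dict[str, Any] = field(default_factory=dict)

async def attach_transcripts(
    frames: AsyncIterable[Frame], buffer: list[str]
) -> AsyncIterable[Frame]:
    """Flush buffered transcripts onto the next frame that goes out,
    not only the last frame of the segment."""
    async for frame in frames:
        if buffer:
            frame.user_data["timed_transcripts"] = buffer.copy()
            buffer.clear()
        yield frame
```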
just adding my support for this one, excited to see you hopefully get this out soon!
timed_texts_fut.set_result(timed_text_ch)

async for audio_frame in tts_node:
    for text in audio_frame.userdata.get("timed_transcripts", []):
Let's move the key to a constant somewhere, like in https://github.com/livekit/agents/blob/main/livekit-agents/livekit/agents/types.py (with an lk. prefix).
return url

def _to_timed_words(
It's a bit difficult to understand this code without actually running it. Not urgent, but it would be ideal to have some tests in place at some point.
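As a starting point for those tests, here is a standalone illustration of the idea behind a _to_timed_words-style helper (not the PR's actual implementation): merge per-character timings into per-word spans, with an inline check.

```python
def to_timed_words(chars: list[tuple[str, float, float]]) -> list[tuple[str, float, float]]:
    """Merge (char, start, end) timings into (word, start, end) spans."""
    words: list[tuple[str, float, float]] = []
    buf, start, end = "", 0.0, 0.0
    for ch, s, e in chars:
        if not buf:
            start = s
        buf += ch
        end = e
        if ch.isspace():  # word boundary: flush the buffer
            words.append((buf, start, end))
            buf = ""
    if buf:  # trailing word without a separator
        words.append((buf, start, end))
    return words

# Inline check: "hi " and "yo" keep their original time spans.
assert to_timed_words(
    [("h", 0.0, 0.1), ("i", 0.1, 0.2), (" ", 0.2, 0.2), ("y", 0.2, 0.3), ("o", 0.3, 0.4)]
) == [("hi ", 0.0, 0.2), ("yo", 0.2, 0.4)]
```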
Should we release a new version of livekit-rtc before merging?
class TTSCapabilities:
    streaming: bool
    """Whether this TTS supports streaming (generally using websockets)"""
    timed_transcript: bool = False
We use aligned_transcript in other parts of the code (AgentSession has use_tts_aligned_transcript). Let's use aligned here too?
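With the suggested rename applied, the quoted dataclass would read as below; the docstring on the new field is illustrative wording, not from the PR:

```python
from dataclasses import dataclass

@dataclass
class TTSCapabilities:
    streaming: bool
    """Whether this TTS supports streaming (generally using websockets)"""
    aligned_transcript: bool = False
    """Whether this TTS returns transcripts aligned with the audio timing"""
```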
This reverts commit b4104c3.
what's the ETA for this to be added?
def __init__(self):
    super().__init__(instructions="You are a helpful assistant.")

self._closing_task: asyncio.Task[None] | None = None
self._wrapped_tts = tts
-self._sentence_tokenizer = sentence_tokenizer or tokenize.blingfire.SentenceTokenizer()
+self._sentence_tokenizer = sentence_tokenizer or tokenize.blingfire.SentenceTokenizer(
+    retain_format=True
Oh, actually we were not using retain_format for the StreamAdapter before, since it was only used to generate sentences.
In the PR I did, I was actually keeping the basic.SentenceTokenizer inside the transcription synchronization code.
It was used in the agent's tts_node.
Or maybe I added it in this PR; we need to retain the formatting if we use the timed transcript from the StreamAdapter.
Ah ok, the synchronizer also needs the exact same formatting?
No, it can be different. They process the sentences separately.
I see, but I thought the aligned transcripts returned by the TTSs did not include newlines/special characters, so I assumed retain_format was not needed.
When using the StreamAdapter with OpenAI, the transcription_node input is coming from the llm_node, right?
In this case I really don't think we should wait for the TTS, since we have the opt-in flag use_tts_aligned_transcript.
If use_tts_aligned_transcript is enabled, the input of the transcription_node comes from the TTS.
wdym by "we shouldn't wait for the TTS" when using the StreamAdapter?
Ok, that makes sense. So by default, even if we use the StreamAdapter, it'll use the LLM output for the transcription_node.
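Summing up the thread as a hedged sketch: the flag name comes from this discussion, but treating it as an AgentSession constructor argument (and constructing the session with everything else defaulted) is an assumption.

```python
from livekit.agents import AgentSession

# Default: transcription_node consumes the llm_node output, even when the
# TTS sits behind a StreamAdapter.
session = AgentSession()

# Opt-in: transcription_node consumes the TTS-aligned transcript instead.
session = AgentSession(use_tts_aligned_transcript=True)
```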
Fix #2607, #2326 (comment)
needs livekit/python-sdks#456