support timed transcripts from tts #2580
Conversation
✅ Changeset File Detected
The following changeset entries were found:
Change description: |
tasks.append(tts_task)
if (
    (tts := self.tts)
    and (tts.capabilities.timed_transcript or not tts.capabilities.streaming)
I have a concern here: if a user creates a new AudioFrame in a customized tts_node but doesn't forward the timed transcripts, we may miss the text response. wdyt? @theomonnom
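To illustrate the concern, a minimal sketch of what a customized tts_node would need to do to keep the transcripts flowing. Agent.default.tts_node is the documented fallback, but the exact import paths and the my_audio_filter helper here are assumptions, and the quoted diff uses both user_data and userdata spellings, so adjust to whichever lands:

```python
from livekit.agents import Agent, ModelSettings  # import paths assumed

def my_audio_filter(frame):
    """Hypothetical DSP step that builds and returns a brand-new AudioFrame."""
    return frame  # placeholder

class MyAgent(Agent):
    async def tts_node(self, text, model_settings: ModelSettings):
        async for frame in Agent.default.tts_node(self, text, model_settings):
            new_frame = my_audio_filter(frame)
            # Without this, the transcripts attached to `frame` are silently lost:
            new_frame.user_data["timed_transcripts"] = frame.user_data.get(
                "timed_transcripts", []
            )
            yield new_frame
```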
def transcription_node(
-    self, text: AsyncIterable[str], model_settings: ModelSettings
+    self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings
) -> AsyncIterable[str] | Coroutine[Any, Any, AsyncIterable[str]] | Coroutine[Any, Any, None]:
TimedString is a str; to be fair, I'm not even sure we should encourage people to use the timed transcripts here.
Suggested change:
-    self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings
+    self, text: AsyncIterable[str], model_settings: ModelSettings
do you have any suggestions for an alternative way for people to access the timed transcripts?
IMO we should just keep it implicit for now
I understand the concern that this may change in the future, but IMO we should expose the timed transcripts to users in some way; this has been asked for by some folks for a while. Any alternatives for this?
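One way to expose them without new API surface, sketched under assumptions (the TimedString import path and its start_time/end_time attributes are taken from this PR, not a settled interface): since TimedString subclasses str, an overridden transcription_node can inspect timings opportunistically and still yield plain text downstream.

```python
from typing import AsyncIterable

from livekit.agents import Agent, ModelSettings
from livekit.agents.tts import TimedString  # import path is an assumption

class MyAgent(Agent):
    async def transcription_node(
        self, text: AsyncIterable[str], model_settings: ModelSettings
    ) -> AsyncIterable[str]:
        async for chunk in text:
            if isinstance(chunk, TimedString):
                # Attribute names assumed from this PR, not a settled API.
                print(f"{chunk!r}: {chunk.start_time} -> {chunk.end_time}")
            yield chunk  # TimedString is a str, so downstream code is unaffected
```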
return

if last_frame is not None:
    last_frame.user_data["timed_transcripts"] = timed_transcripts
Is the idea to send the timed transcripts at the end of the segment?
If so, maybe this could just be a new field on SynthesizedAudio. (This also means that we don't have synchronized transcripts until the whole generation is done?)
It's not at the end of the segment; it's usually at the start of the TTS output.
The buffered timed_transcripts are always added to the next audio frame, not only the last frame.
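A self-contained sketch of that buffering behavior, using a stand-in Frame type rather than the real rtc.AudioFrame:

```python
from dataclasses import dataclass, field
from typing import Any, AsyncIterable

@dataclass
class Frame:
    """Stand-in for rtc.AudioFrame carrying a user_data dict."""
    pcm: bytes
    user_data: dict[str, Any] = field(default_factory=dict)

async def attach_transcripts(
    frames: AsyncIterable[Frame], buffer: list[str]
) -> AsyncIterable[Frame]:
    """Flush buffered transcripts onto the next frame that goes out,
    not only the last frame of the segment."""
    async for frame in frames:
        if buffer:
            frame.user_data["timed_transcripts"] = buffer.copy()
            buffer.clear()
        yield frame
```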
just adding my support for this one, excited to see you hopefully get this out soon!
timed_texts_fut.set_result(timed_text_ch)

async for audio_frame in tts_node:
    for text in audio_frame.userdata.get("timed_transcripts", []):
Let's move the key to a constant somewhere, like in https://github.com/livekit/agents/blob/main/livekit-agents/livekit/agents/types.py (with an lk. prefix).
return url

def _to_timed_words(
It's a bit difficult to understand this code without actually running it. Not urgent, but it would be ideal to have some tests in place at some point.
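As a starting point for those tests, here is a standalone illustration of the idea behind a _to_timed_words-style helper (not the PR's actual implementation): merge per-character timings into per-word spans, with an inline check.

```python
def to_timed_words(chars: list[tuple[str, float, float]]) -> list[tuple[str, float, float]]:
    """Merge (char, start, end) timings into (word, start, end) spans."""
    words: list[tuple[str, float, float]] = []
    buf, start, end = "", 0.0, 0.0
    for ch, s, e in chars:
        if not buf:
            start = s
        buf += ch
        end = e
        if ch.isspace():  # word boundary: flush the buffer
            words.append((buf, start, end))
            buf = ""
    if buf:  # trailing word without a separator
        words.append((buf, start, end))
    return words

# Inline check: "hi " and "yo" keep their original time spans.
assert to_timed_words(
    [("h", 0.0, 0.1), ("i", 0.1, 0.2), (" ", 0.2, 0.2), ("y", 0.2, 0.3), ("o", 0.3, 0.4)]
) == [("hi ", 0.0, 0.2), ("yo", 0.2, 0.4)]
```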
Should we release a new version of livekit-rtc before merging?
class TTSCapabilities:
    streaming: bool
    """Whether this TTS supports streaming (generally using websockets)"""
    timed_transcript: bool = False
We use aligned_transcript in other parts of the code (AgentSession has use_tts_aligned_transcript). Let's use aligned here too?
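With the suggested rename applied, the quoted dataclass would read as below; the docstring on the new field is illustrative wording, not from the PR:

```python
from dataclasses import dataclass

@dataclass
class TTSCapabilities:
    streaming: bool
    """Whether this TTS supports streaming (generally using websockets)"""
    aligned_transcript: bool = False
    """Whether this TTS returns transcripts aligned with the audio timing"""
```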
This reverts commit b4104c3.
what's the ETA for this to be added?
def __init__(self):
    super().__init__(instructions="You are a helpful assistant.")

self._closing_task: asyncio.Task[None] | None = None
self._wrapped_tts = tts
-self._sentence_tokenizer = sentence_tokenizer or tokenize.blingfire.SentenceTokenizer()
+self._sentence_tokenizer = sentence_tokenizer or tokenize.blingfire.SentenceTokenizer(
+    retain_format=True
Oh, actually we were not using retain_format for the StreamAdapter before, since it was only used to generate sentences.
In the PR I did, I was actually keeping the basic.SentenceTokenizer inside the transcription synchronization code.
It was used in the agent's tts_node.
Or maybe I added it in this PR; we need to retain the formatting if we use the timed transcript from the StreamAdapter.
Ah ok, the synchronizer also needs the exact same formatting?
No, it can be different. They process the sentences separately.
I see, but I thought the aligned transcripts returned by the TTSs did not include newlines/special characters, so I assumed retain_format was not needed.
When using the StreamAdapter with OpenAI, the transcription_node input is coming from the llm_node, right?
In this case I really don't think we should wait for the TTS, since we have the opt-in flag use_tts_aligned_transcript.
If use_tts_aligned_transcript is enabled, the input of the transcription_node comes from the TTS.
wdym by "we shouldn't wait for the TTS" when using the StreamAdapter?
Ok, that makes sense. So by default, even if we use the StreamAdapter, it'll use the LLM output for the transcription_node.
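Summing up the thread as a hedged sketch: the flag name comes from this discussion, but treating it as an AgentSession constructor argument (and constructing the session with everything else defaulted) is an assumption.

```python
from livekit.agents import AgentSession

# Default: transcription_node consumes the llm_node output, even when the
# TTS sits behind a StreamAdapter.
session = AgentSession()

# Opt-in: transcription_node consumes the TTS-aligned transcript instead.
session = AgentSession(use_tts_aligned_transcript=True)
```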
Fix #2607, #2326 (comment)
needs livekit/python-sdks#456