Replies: 14 comments 16 replies
-
Do you mean audio streaming decoupled from the main text-generation-webui wav handling? A TTS-engine-independent solution inside the webui would be best, but as a workaround it could be implemented in alltalk_tts or any TTS extension, without returning any audio chunks to the webui and instead streaming them directly with a library like sounddevice. One disadvantage would be that the user has no control to pause, stop, or resume playback.
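To illustrate the sounddevice route, here is a minimal sketch, assuming the engine yields raw 16-bit mono PCM chunks at 24 kHz (the chunk source and names are illustrative, not alltalk_tts code):

```python
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 24000  # assumption: engine outputs 24 kHz audio

def play_pcm_chunks(chunk_iter):
    """Play raw PCM chunks as they arrive, without waiting for the full clip."""
    with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as out:
        for chunk in chunk_iter:  # bytes from the TTS engine
            out.write(np.frombuffer(chunk, dtype=np.int16))  # blocks until consumed
```

As noted, playback then lives entirely in the extension's process, so pause/stop/resume would need extra plumbing.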
-
I can implement it into text-generation-webui later. The main thing I am trying to achieve is generating an instantly playable .wav file that can be streamed in chunks, so we can achieve real-time TTS. The key part is streaming the raw audio to stdout as it's produced.
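A sketch of the stdout idea, with a stub in place of the real engine call (`synthesize_chunks` is hypothetical):

```python
import sys

def synthesize_chunks(text):
    # stub for the real TTS engine; yields raw 16-bit PCM chunks as generated
    yield b"\x00\x00" * 2400  # 0.1 s of silence at 24 kHz

for chunk in synthesize_chunks("some text"):
    sys.stdout.buffer.write(chunk)  # emit audio the moment it exists
    sys.stdout.buffer.flush()       # don't let buffering delay playback
```

That can then be piped the same way as the Piper example further down, e.g. `python tts_stream.py | aplay -r 24000 -f S16_LE -t raw -` (script name and sample rate are illustrative).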
-
Streaming is possible with https://github.com/KoljaB/RealtimeTTS, though that is another step down the line. My current workload is re-tidying all the documentation, both on this GitHub and within the app, and catching a few minor bugs/issues. Then I'm working on the new API for 3rd party/standalone use, which is 70-80% complete. From there, I'll look at options for other TTS engines and features such as the above. However, it's worth noting there is a memory overhead for this, and there will be coding around certain things like the LowVRAM option: the two aren't incompatible as such, but you'd be shuffling the TTS model between VRAM and system RAM all the time, resulting in zero gain and probably a lot of complaints about speed.
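For anyone following along, basic RealtimeTTS usage is only a few lines; this sketch follows its README (engine choice and text are placeholders):

```python
# Minimal RealtimeTTS sketch, per https://github.com/KoljaB/RealtimeTTS
from RealtimeTTS import TextToAudioStream, SystemEngine

engine = SystemEngine()              # swap in CoquiEngine() for XTTS-quality audio
stream = TextToAudioStream(engine)   # wraps the engine with chunked playback
stream.feed("This plays while the rest is still being synthesized.")
stream.play_async()                  # non-blocking; use play() to block instead
```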
-
There are a few more things to consider:
-
@Sascha353 Great overview, and great find on https://docs.coqui.ai/en/dev/models/xtts.html#streaming-manually
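For reference, the manual-streaming example on that page boils down to roughly this (paths and the reference wav are placeholders; see the linked docs for the authoritative version):

```python
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load XTTS from a local checkpoint (paths are placeholders)
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True)
model.cuda()

# Voice conditioning from a reference clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

# inference_stream yields audio chunks as they are generated
chunks = model.inference_stream(
    "This sentence is synthesized while earlier chunks are already playable.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)
wav_chunks = [chunk for chunk in chunks]  # play or forward each chunk here instead
wav = torch.cat(wav_chunks, dim=0)
```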
-
@Sascha353 @mercuryyy Regarding "should be easy to implement into the addon": is that out of scope, due to the LowVRAM "incompatibility"?
-
@Sascha353 @mercuryyy Let me ask you both a question, as this also has considerations. Where would you both want the streaming output to be played? e.g.
- Within Text-generation-webui's interface as it generates content?
- Over the API and back to your own player of some kind?
- Over the API and through a built-in Python-based player that runs within the AllTalk Python process?
-
@erew123 First choice would be "Over the API and back to your own player of some kind", sort of a live-streamed .wav.
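One way to get a "live stream .wav" over the API: send a WAV header with an open-ended data size, then raw PCM as it is synthesized. This is a hedged sketch, not AllTalk's actual API; the route name, sample rate, and synth stub are all assumptions:

```python
import struct
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
SAMPLE_RATE = 24000  # assumed engine output rate

def wav_header(sr=SAMPLE_RATE, bits=16, channels=1):
    # RIFF header with a near-maximal data size so players treat it as open-ended
    data_size = 0xFFFFFFFF - 36
    return (b"RIFF" + struct.pack("<I", data_size + 36) + b"WAVE"
            + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels, sr,
                                    sr * channels * bits // 8,
                                    channels * bits // 8, bits)
            + b"data" + struct.pack("<I", data_size))

def synthesize_chunks(text):
    yield b"\x00\x00" * 2400  # stub: replace with real PCM chunks from the engine

@app.get("/api/tts-stream")  # hypothetical route
def tts_stream(text: str):
    def gen():
        yield wav_header()
        yield from synthesize_chunks(text)
    return StreamingResponse(gen(), media_type="audio/wav")
```

A browser, VLC, or any player pointed at the URL can start playing while generation continues.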
-
I tend to aim for the "best" option first and reduce the scope if needed, based on feasibility and resources. In my opinion, it would be best if audio output is sent to TG-webui, as there is already a player in gradio where the user can interact with the audio file/stream. This makes it generally more accessible and understandable for the user, as no other output/player is introduced. I know the gradio player is capable of supporting a wav stream (it's utilized in the Coqui space). However, I don't think TG-webui is ready to receive and handle audio chunks, as receiving and working with one full wav file coming from the TTS engine is obviously vastly different from handling a stream of incoming audio chunks. I described some of the challenges already in my FR here.

As a proof of concept, probably the easiest approach would be to stream directly using a library like sounddevice or PyAudio. In that case the TTS extension should have auto-play disabled, so that the stream is not played once by the streaming feature and then again in the webui after the full wav is generated and transferred.

It's just my opinion, but I would not introduce another UI to control the stream. The user is working inside the webui, and usability and immersion drop if you have to switch apps, windows, tabs etc.
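On the gradio side, streaming output does exist. A proof-of-concept sketch, assuming a recent gradio where an output `gr.Audio(streaming=True)` accepts a generator of (sample_rate, chunk) tuples; the fake synth is a placeholder and behavior may vary by gradio version:

```python
import numpy as np
import gradio as gr

SAMPLE_RATE = 24000

def fake_tts(text):
    # placeholder synthesis: one short silent chunk per word
    for _ in text.split():
        yield SAMPLE_RATE, np.zeros(SAMPLE_RATE // 10, dtype=np.int16)

demo = gr.Interface(
    fn=fake_tts,
    inputs=gr.Textbox(label="Text"),
    outputs=gr.Audio(streaming=True, autoplay=True),
)
demo.launch()
```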
-
FYI, I've moved this over to Discussions, just so it's not sat in the issues queue. I am thinking about the suggestions/ideas and will get back to you all at some point soon.
-
Are you open to making some custom PAID modifications to our own code to add real-time streaming?
-
@mercuryyy I'm happy to take a look, and if I feel it's something I can help with, then sure! If it's not something I can help with, I'll happily hold my hand up and say no! Do you want to tell me/show me what you have? Or do you need an NDA or something first?
-
Hello, I've been exploring your project and I'm highly impressed by the work you've done. The features and implementation are genuinely ingenious, and I wanted to extend my congratulations on such a brilliant job!

I have a request and I'm wondering if it might align with your future development plans. Would you consider creating a Docker image for the streaming server with deepspeed integration and .json speaker options, similar to the streaming server setup previously developed for Coqui? Such an addition would be incredibly beneficial and useful for many users.

Thank you for your fantastic work on this project, and I look forward to any possibility of this enhancement. Best regards.
-
Hi @gboross. Thanks for your kind comments. Although I've not made any official mention of it yet, https://hub.docker.com/r/erew123/alltalk_tts is probably what you are looking for, though I have to admit I've not spent a lot of time testing it. A question for you though: what do you mean by ".json speaker options"? As standard, whatever is in the voices directory is available as speakers to generate from, but I'm happy to look at something else if I'm missing something. Thanks
-
Is it possible, at exec of the TTS command, to stream the results in chunks to something like a Temp_stream.wav file that is playable immediately, while it is still being created?
So for example, if I am synthesizing 100 words and it takes 4 seconds, but I want to play the file in real time: at the point of exec of the TTS command I want the audio to start playing, so you can essentially do real time.
Piper does this - https://github.com/rhasspy/piper
```sh
echo 'This sentence is spoken first. This sentence is synthesized while the first sentence is spoken.' |
  ./piper --model en_US-lessac-medium.onnx --output-raw |
  aplay -r 22050 -f S16_LE -t raw -
```
But the Coqui TTS models are better, just slower to exec; if we can stream, that wouldn't matter and we can do real time.