Replies: 14 comments 16 replies
-
Do you mean audio streaming decoupled from the main text-generation-webui wav handling? A TTS-engine-independent solution inside the webui would be best, but as a workaround it could be implemented in alltalk_tts or any TTS extension, without returning any audio chunks to the webui and instead streaming them directly with a library like sounddevice. One disadvantage would be that the user has no control to pause, stop, or resume playback.
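To illustrate the sounddevice route, here is a minimal sketch, assuming the engine yields raw 16-bit mono PCM chunks at 24 kHz (the chunk source and names are illustrative, not alltalk_tts code):

```python
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 24000  # assumption: engine outputs 24 kHz audio

def play_pcm_chunks(chunk_iter):
    """Play raw PCM chunks as they arrive, without waiting for the full clip."""
    with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as out:
        for chunk in chunk_iter:  # bytes from the TTS engine
            out.write(np.frombuffer(chunk, dtype=np.int16))  # blocks until consumed
```

As noted, playback then lives entirely in the extension's process, so pause/stop/resume would need extra plumbing.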
-
I can implement it into text-generation-webui later. The main thing I am trying to achieve is generating an instantly playable .wav file that can be streamed in chunks, so we can achieve real-time TTS. The key part is streaming the raw audio to stdout as it's produced.
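A sketch of the stdout idea, with a stub in place of the real engine call (`synthesize_chunks` is hypothetical):

```python
import sys

def synthesize_chunks(text):
    # stub for the real TTS engine; yields raw 16-bit PCM chunks as generated
    yield b"\x00\x00" * 2400  # 0.1 s of silence at 24 kHz

for chunk in synthesize_chunks("some text"):
    sys.stdout.buffer.write(chunk)  # emit audio the moment it exists
    sys.stdout.buffer.flush()       # don't let buffering delay playback
```

That can then be piped the same way as the Piper example further down, e.g. `python tts_stream.py | aplay -r 24000 -f S16_LE -t raw -` (script name and sample rate are illustrative).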
-
Streaming is possible with https://github.com/KoljaB/RealtimeTTS, though that is another step down the line. My current workload is re-tidying all the documentation, both on this GitHub and within the app, and catching a few minor bugs/issues. Then I'm working on the new API for 3rd party/standalone use, which is 70-80% complete. From there, I'll look at options for other TTS engines and features such as the above. However, it's worth noting there is a memory overhead for this, and there will be coding around certain things like the LowVRAM option: the two aren't incompatible as such, but you'd be shuffling the TTS model between VRAM and system RAM all the time, resulting in zero gain and probably a lot of complaints about speed.
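For anyone following along, basic RealtimeTTS usage is only a few lines; this sketch follows its README (engine choice and text are placeholders):

```python
# Minimal RealtimeTTS sketch, per https://github.com/KoljaB/RealtimeTTS
from RealtimeTTS import TextToAudioStream, SystemEngine

engine = SystemEngine()              # swap in CoquiEngine() for XTTS-quality audio
stream = TextToAudioStream(engine)   # wraps the engine with chunked playback
stream.feed("This plays while the rest is still being synthesized.")
stream.play_async()                  # non-blocking; use play() to block instead
```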
-
There are a few more things to consider:
-
@Sascha353 Great overview, and great find on https://docs.coqui.ai/en/dev/models/xtts.html#streaming-manually
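For reference, the manual-streaming example on that page boils down to roughly this (paths and the reference wav are placeholders; see the linked docs for the authoritative version):

```python
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load XTTS from a local checkpoint (paths are placeholders)
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True)
model.cuda()

# Voice conditioning from a reference clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

# inference_stream yields audio chunks as they are generated
chunks = model.inference_stream(
    "This sentence is synthesized while earlier chunks are already playable.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)
wav_chunks = [chunk for chunk in chunks]  # play or forward each chunk here instead
wav = torch.cat(wav_chunks, dim=0)
```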
-
@Sascha353 @mercuryyy Regarding "should be easy to implement into the addon": is that out of scope, due to the LowVRAM "incompatibility"?
-
@Sascha353 @mercuryyy Let me ask you both a question, as this also has considerations. Where would you both want the streaming output to be played? e.g.
- Within Text-generation-webui's interface as it generates content?
- Over the API and back to your own player of some kind?
- Over the API and through a built-in Python-based player that runs within the AllTalk Python process?
-
@erew123 First choice would be "Over the API and back to your own player of some kind", sort of a live-streamed .wav.
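One way to get a "live stream .wav" over the API: send a WAV header with an open-ended data size, then raw PCM as it is synthesized. This is a hedged sketch, not AllTalk's actual API; the route name, sample rate, and synth stub are all assumptions:

```python
import struct
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
SAMPLE_RATE = 24000  # assumed engine output rate

def wav_header(sr=SAMPLE_RATE, bits=16, channels=1):
    # RIFF header with a near-maximal data size so players treat it as open-ended
    data_size = 0xFFFFFFFF - 36
    return (b"RIFF" + struct.pack("<I", data_size + 36) + b"WAVE"
            + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels, sr,
                                    sr * channels * bits // 8,
                                    channels * bits // 8, bits)
            + b"data" + struct.pack("<I", data_size))

def synthesize_chunks(text):
    yield b"\x00\x00" * 2400  # stub: replace with real PCM chunks from the engine

@app.get("/api/tts-stream")  # hypothetical route
def tts_stream(text: str):
    def gen():
        yield wav_header()
        yield from synthesize_chunks(text)
    return StreamingResponse(gen(), media_type="audio/wav")
```

A browser, VLC, or any player pointed at the URL can start playing while generation continues.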
-
I tend to aim for the "best" option first and reduce the scope if needed, based on feasibility and resources. In my opinion, it would be best if audio output is sent to TG-webui, as there is already a player in gradio where the user can interact with the audio file/stream. This makes it generally more accessible and understandable for the user, as no other output/player is introduced. I know the gradio player is capable of supporting a wav stream (it's utilized in the Coqui space). However, I don't think TG-webui is ready to receive and handle audio chunks, as receiving and working with one full wav file coming from the TTS engine is obviously vastly different from handling a stream of incoming audio chunks. I described some of the challenges already in my FR here.

As a proof of concept, probably the easiest approach would be to stream directly using a library like sounddevice or PyAudio. In that case the TTS extension should have auto-play disabled, so that the stream is not played once by the streaming feature and then again in the webui after the full wav is generated and transferred.

It's just my opinion, but I would not introduce another UI to control the stream. The user is working inside the webui, and usability and immersion drop if you have to switch apps, windows, tabs etc.
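On the gradio side, streaming output does exist. A proof-of-concept sketch, assuming a recent gradio where an output `gr.Audio(streaming=True)` accepts a generator of (sample_rate, chunk) tuples; the fake synth is a placeholder and behavior may vary by gradio version:

```python
import numpy as np
import gradio as gr

SAMPLE_RATE = 24000

def fake_tts(text):
    # placeholder synthesis: one short silent chunk per word
    for _ in text.split():
        yield SAMPLE_RATE, np.zeros(SAMPLE_RATE // 10, dtype=np.int16)

demo = gr.Interface(
    fn=fake_tts,
    inputs=gr.Textbox(label="Text"),
    outputs=gr.Audio(streaming=True, autoplay=True),
)
demo.launch()
```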
-
FYI, I've moved this over to Discussions, just so it's not sat in the issues queue. I am thinking about the suggestions/ideas and will get back to you all at some point soon.
-
Are you open to making some custom PAID modifications to our own code to add real-time streaming?
-
@mercuryyy I'm happy to take a look, and if I feel it's something I can help with, then sure! If it's not something I can help with, I'll happily hold my hand up and say no! Do you want to tell me/show me what you have? Or do you need an NDA or something first?
-
Hello, I've been exploring your project and I'm highly impressed by the work you've done. The features and implementation are genuinely ingenious, and I wanted to extend my congratulations on such a brilliant job!

I have a request and I'm wondering if it might align with your future development plans. Would you consider creating a Docker image for the streaming server with deepspeed integration and .json speaker options, similar to the streaming server setup previously developed for Coqui? Such an addition would be incredibly beneficial and useful for many users.

Thank you for your fantastic work on this project, and I look forward to any possibility of this enhancement. Best regards.
-
Hi @gboross. Thanks for your kind comments. Although I've not made any official mention of it yet, https://hub.docker.com/r/erew123/alltalk_tts is probably what you are looking for, though I have to admit I've not spent a lot of time testing it. A question for you though: what do you mean by ".json speaker options"? As standard, whatever is in the voices directory is available as speakers to generate from, but I'm happy to look at something else if I'm missing something. Thanks
-
Is it possible, at exec of the TTS command, to stream the results in chunks to something like a Temp_stream.wav file that is playable immediately, while it is still being created?
So for example, if I am synthesizing 100 words and it takes 4 seconds, but I want to play the file in real time: at the point of exec of the TTS command I want the audio to start playing, so you can essentially do real time.
Piper does this - https://github.com/rhasspy/piper
```sh
echo 'This sentence is spoken first. This sentence is synthesized while the first sentence is spoken.' |
  ./piper --model en_US-lessac-medium.onnx --output-raw |
  aplay -r 22050 -f S16_LE -t raw -
```
But the Coqui TTS models are better, just slower to exec; if we can stream, that wouldn't matter and we can do real time.