-
Hi Florian, Thanks for writing up this proposal 🙂 Interoperability between our projects would be great, so I'll do what I can to help. Using websockets and PCM audio as you described sounds good to me. This should allow almost anything to participate as an audio source 👍 Have you considered modeling the API after Home Assistant's WebSocket API? They have an "auth" step very similar to what you described. Their event structure also seems pretty simple, something like:

```json
{
  "id": 5,
  "type": "event",
  "event": {
    "data": {},
    "event_type": "onaudiostart"
  }
}
```
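For reference, the auth phase there is a short exchange (paraphrased from the Home Assistant WebSocket API docs, so double-check the details): the server sends `auth_required`, the client answers with `auth` plus a token, and the server replies with `auth_ok` or `auth_invalid`:

```json
{"type": "auth_required", "ha_version": "2021.5.4"}
{"type": "auth", "access_token": "ABCDEFGH"}
{"type": "auth_ok", "ha_version": "2021.5.4"}
```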
Adding some kind of "session id" into the messages might be worth considering as well.
The Web Speech API result structure seems reasonable to me. Extra properties could always be added for things like:
Training
I think this is where things are going to get complicated. For the non-dictation case, something like Kaldi Active Grammars would be great to be able to use from a client. The author of Jaco Assistant (Daniel) has also proposed a Markdown format for sharing skills between assistants (riddle example). This includes:
Using rhasspy-nlu, I can transform these Markdown files into n-gram language models or directly into finite-state transducers for Kaldi. Since most of our STT engines use n-gram models or FSTs, we might consider this as a basic training format.
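As a rough illustration of that pipeline (based on the rhasspy-nlu README; the sentence template below is just an example):

```python
# Sketch: parse Rhasspy-style sentence templates and build an intent graph
# that can later be turned into an n-gram LM or FST for the STT engine.
import rhasspynlu

TEMPLATES = """
[GetTime]
what time is it
tell me the time
"""

intents = rhasspynlu.parse_ini(TEMPLATES)     # parse the sentence templates
graph = rhasspynlu.intents_to_graph(intents)  # build a finite-state intent graph

# Sanity check: recognize a sentence directly against the graph
print(rhasspynlu.recognize("what time is it", graph))
```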
-
Hi Michael, thanks for stopping by 😃
I have not, but I will definitely take a look. Compatibility with HA is probably not the worst idea ^^.
This can be done for sure, but my concern was more about the underlying STT engine's capability to transcribe multiple streams at the same time ... if the machine has enough resources anyway ^^. I think Nickolay from Vosk said that this is no problem, so I guess I'm just going to try and see :-)
👍
I would love to see that. I think this needs language models that are trained in a specific way, right? As far as I remember there was only one model on the Vosk page that mentions it specifically.
This is probably something I could adapt for SEPIA. There is a similar format in SEPIA that's used for the Teach-UI, and all the sentences trained with it can be loaded from the server via a REST endpoint. I actually implemented this for the purpose of language model training 🙂.
I later refined this to JSON objects stored in the database with a more complex structure, but the idea remains the same and I always wanted to improve it further. Well, the training part is certainly a very interesting topic but maybe a bit out of scope for version 1 of the STT server 😅.
-
Hi Florian, this is an interesting idea. Recently I updated Jaco's STT framework Scribosermo with new models. The network now performs better than the DeepSpeech network I used before, while being just as fast. The repository provides pretrained models in English, German, French and Spanish, which are exported to tflite format for easy usage. Today I also added support for streaming audio input. Greetings,
-
Great @DanBmh, it certainly sounds very interesting 😎
-
Hi Florian, everyone!
My experiments with offline ASRs
My simplistic/temporary comparisons:
Temporary conclusion: I vote vosk :)
BTW, chapeau to Michael for his dissemination work and for the Rhasspy project, and chapeau to your work as well, Florian!!
Websockets?
I tend to agree that websockets are a possible winner. Nevertheless, considering that your clients are, if I understood correctly, mobile phone web apps or native apps, I want to share a possible warning about using raw websockets. Question: are your clients every possible device? I'm asking because, in my modest experience, every time I used raw websockets I had a lot of communication issues; for example, you have to manage re-connections and errors with your own protocol, compatibility, the data structure when mixing binary (audio) with other data types, etc. So in client-server architectures where the clients are remote web apps (in the browser) I recently preferred to use Socket.IO. For example, this is an architecture I built (see slide 21): See a comparison of features here: Another problem I see with raw websockets is that you may want the client to submit an "object", a data structure containing not only the streamed binary audio data but also some attributes, e.g. the language model (or the language), the grammar, an ID, the data format, etc. If so, you have to send this object with a binary serialization protocol on top of websockets, like protocol buffers, etc. BTW, Socket.IO seems to be able to send objects containing binary/blobs, see: BTW, below is a rough pseudocode-style sketch of a server that ingests audio data using Socket.IO. To be honest I'm not sure that a websocket/Socket.IO interface is the best fit for a server that has to be "not hard-coded". I have to think about it.
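Something like this, using the python-socketio package (event and field names are just placeholders):

```python
import socketio

sio = socketio.AsyncServer(async_mode="asgi")
app = socketio.ASGIApp(sio)  # serve with e.g. uvicorn

@sio.event
async def audio_chunk(sid, data):
    # Socket.IO lets the client emit one object that mixes metadata and binary
    language = data.get("language", "en")
    chunk: bytes = data["samples"]  # raw PCM bytes
    # ... feed `chunk` into the recognizer bound to this session (sid) ...
    await sio.emit("partial_result", {"transcript": "..."}, to=sid)
```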
Sentence-based (message-based) transcript vs streaming-based transcript
In the long term I tend to agree with you that streaming is "better" to cover all possible application scenarios (dictation). My consideration is that it's maybe not so useful for managing back-and-forth turns (conversations!) on a voice assistant. In theory it's the Dialog Manager (DM) that is in charge of replying to "partial" (streamed) sentences, but it's really hard to build a dialog manager smart enough to reply to the user on unfinished (partial) sentences. So in general I'd go for a message-based approach for conversational systems. BTW, streaming processing could instead be useful for vertical applications where the assistant has to detect and react to the user as soon as possible (safety at work in hazardous contexts, etc.), but please note that in this case the DM probably has to process emotional/tone metadata, and currently we don't have (mainstream) ASR technology that supplies this kind of audio/speech/non-verbal interpretation.
Compressed versus uncompressed audio transfer
In my previous software I did the decompression task server-side, converting an OPUS file to a WAV file using ffmpeg (rough sketch below). Maybe crude, but in the end the ffmpeg process runs pretty fast.
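For example (paths are placeholders; 16 kHz mono is what most of the STT models discussed here expect):

```python
import subprocess

def opus_to_wav(opus_path: str, wav_path: str) -> None:
    """Decode an OPUS file to 16 kHz mono WAV using ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", opus_path, "-ar", "16000", "-ac", "1", wav_path],
        check=True,
    )
```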
Yes for a specific dictation application,
Not necessarily. Using Vosk or DeepSpeech (Coqui) you can load the model(s) into memory ONCE at server startup time. Of course, with this choice clients can at runtime only request a transcript for the subset of languages (or models) that were preloaded. BTW, splitting model creation from the transcription task is the basic key to optimizing run-time transcription latency; see the sketch below.
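With Vosk the split looks roughly like this (model path is a placeholder):

```python
import json
from vosk import Model, KaldiRecognizer

MODEL = Model("model")  # expensive: load once at server startup

def transcribe(pcm_chunks, sample_rate=16000):
    rec = KaldiRecognizer(MODEL, sample_rate)  # cheap: create per request/stream
    for chunk in pcm_chunks:                   # 16-bit mono PCM bytes
        rec.AcceptWaveform(chunk)
    return json.loads(rec.FinalResult())["text"]
```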
Yes, one more note about websockets and the global architecture:
See above: Vosk doesn't block, DeepSpeech blocks!
BTW, on the client request I'd optionally add, if the ASR engine enables the feature, an "adaptive grammars" attribute. See an example using Vosk: https://github.com/solyarisoftware/voskJs/tree/master/examples#transcript-http-server related to the proposed architecture here:
I'd go for a JSON data structure. Hope this is useful.
-
Thanks, giorgio! I'll keep going until they stop me ;)
Standard websockets differentiate between text (UTF-8) and binary data. The Python websockets library, for example, returns a `str` for text frames and `bytes` for binary frames, so JSON messages and raw audio can share one connection.
@fquirin, if you're going to use Vosk for your initial test, I'd like to create a similar STT server for Coqui STT. How do you want to handle the authentication? I had planned to just mirror what Home Assistant does, essentially:
After that, the client should send a start message to configure things; something like `{ "type": "startsession", "siteId": "...", ... }` and then the client starts streaming audio. Responses from the server would have the same format. Thoughts?
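To make that concrete, here's a rough sketch of a client using the Python websockets library (URL, token and field names are only placeholders following the draft above):

```python
import asyncio
import json
import websockets

async def stream_audio(chunks):
    async with websockets.connect("ws://localhost:8080/stt") as ws:
        # 1. Authenticate (Home Assistant style)
        await ws.send(json.dumps({"type": "auth", "access_token": "..."}))
        print(await ws.recv())  # expect an auth-ok style reply

        # 2. Configure the session
        await ws.send(json.dumps({"type": "startsession", "siteId": "kitchen"}))

        # 3. Stream raw 16-bit PCM chunks as binary frames
        for chunk in chunks:
            await ws.send(chunk)

        # 4. Read JSON results from text frames
        async for message in ws:
            result = json.loads(message)
            if result.get("result", {}).get("isFinal"):
                print(result["result"]["transcript"])
                break

# asyncio.run(stream_audio(my_pcm_chunks))
```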
-
Wait @solyarisoftware , are you saying you got faster transcription with the large model on the SAME! machine? 😲 ... I gotta test this as well! 🤔
I was a bit concerned about this as well but so far I haven't seen notable differences and in one test the container was even faster! 🤷♂️ 😅 I will keep an eye on it but containers are what people are asking for ^^ (and it really saves a lot of pain :-p).
That's about correct ... well actually the SEPIA assist server currently also manages the connection to the TTS server but I want to offer a 3rd interface (client, SEPIA, 3rd party API) with a direct connection to a MaryTTS compatible API as well (e.g. Larynx/Open TTS by Michael) just to give users all options. SEPIA was always about total customization so I guess it makes sense.
Why not? It works pretty much identically to the Web Speech API: idle -> connect to ASR server with grammar option -> stream -> transcribe -> final result -> disconnect -> idle. That's about as atomic as it gets I guess :-)
In the case of SEPIA it's basically the client, but in theory it can be the assist or chat server as well with a bit of code refactoring. The SEPIA assist server was initially built to work behind a load balancer, so state management would have been hard to synchronize. Anyway, this STT server is just responsible for speech-to-text. It's only one of the puzzle pieces and it's designed to work independently. Think of the Web Speech API again ;-).
I remember commenting on it somewhere; before I rebuilt the audio lib I assumed it wasn't possible, but it works on Chrome/Chromium (probably Edge) and Safari. I'm not sure if they updated it or if I just did it the right way this time ^^. But even though it works I prefer the manual resampling: it gives you more control over performance and quality, and you can combine it with the PCM 16-bit integer conversion required for WAV encoding. I'm planning to put a demo of the audio lib online but I'm still changing parts all the time. It's pretty cool though, you can chain all the modules and build a complete pipeline of audio recording tools required for speech 😁
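Just to illustrate that conversion step (the lib itself is JavaScript, but the math is the same; NumPy sketch):

```python
import numpy as np

def float32_to_pcm16(samples: np.ndarray) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    clipped = np.clip(samples, -1.0, 1.0)
    return (clipped * 32767).astype("<i2").tobytes()
```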
I was thinking about this as kind of an opt-in feature for users to donate their recordings. You can always dump the stream to a file afterwards. This is what I'm doing in the old server actually :-)
Ah, in this case I mean audio-input-start to audio-input-end :-) "Command" is confusing here, "input" might be the better word.
That's the plan @synesthesiam 🙂
It would be great to have both. The plan was to keep everything the same up to the point where we feed the chunks into the recognizer. I've seen the necessary code for Vosk (
That is pretty similar (almost identical) to the way SEPIA authenticates to its chat server, so I agree we can take it.
Actually we could send this together with the answer to the auth request to save one ping-pong 🤔.
You mean
The weird thing about the Web Speech API results is that they have:
-
I had assumed I'd just build a separate Python server for this. Part of the reasoning is that I plan to add wake word detection to my server as well. When configured, this would first pass the audio chunks into a wake word recognizer, then into STT once a detection has occurred. Without wake word detection configured, it would just be an STT server.
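Roughly what I have in mind for the chunk routing (class and method names here are just placeholders, not an existing API):

```python
class AudioPipeline:
    """Route audio chunks to a wake word detector first, then to STT."""

    def __init__(self, stt_recognizer, wake_word_detector=None):
        self.recognizer = stt_recognizer
        self.detector = wake_word_detector
        # Without a wake word detector this is just a plain STT server
        self.listening = wake_word_detector is None

    def process_chunk(self, chunk: bytes):
        if not self.listening:
            if self.detector.detect(chunk):   # hypothetical detector API
                self.listening = True
            return None
        return self.recognizer.accept_chunk(chunk)  # hypothetical STT API
```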
Let's just stick with the way SEPIA does it then 🙂 Do you have a doc describing that?
Sure, that's fine with me. What do you propose as the format?
Yeah, that seems reasonable to me. For results, though, I propose we have:

```json
{
  "type": "result",
  "result": {
    "isFinal": true,
    "transcript": "..."
  },
  "resultIndex": -1,
  "results": [...]
}
```

where the "result" is the best result and the "results" list is empty unless an n-best list is supported/requested. I don't see the point in making everyone dig through an object tree just to get the transcript.
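A client would then only need something like this to pull out the best transcript (sketch, assuming results arrive as JSON text frames):

```python
import json

def best_transcript(message: str):
    parsed = json.loads(message)
    if parsed.get("type") != "result":
        return None
    best = parsed["result"]  # best hypothesis; "results" holds the optional n-best list
    return best["transcript"] if best.get("isFinal") else None
```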
-
To add my own thoughts: an audio stream is all that is needed from TTS, since audio delivery frameworks such as squeezelite, snapcast and AirPlay already exist, are already maintained and are likely to do a better job than some streaming service built from scratch. You don't need a streaming audio layer in the STT/TTS system itself; you just need the metadata so an audio multiplexer can direct the stream to the right mixer. If you look at the available services, they seem to split into rooms/zones and corresponding channels, and the system shouldn't embed a specific technology that excludes any of them but should instead have a middleware bridge to connect to all. You just need bridgeware that provides templated metadata translation and contains a number of services for delivering an audio stream and mapping the destinations. Zones/rooms are sessions and a channel is a device, which requires mapping services through the conversion of parameters via templated mappings. Compression/codec probably depends on what the service you are feeding expects at the mixer, not on the end delivery, but I guess that should be part of the template and service install. If a voice AI system has its own embedded audio system then that is just another service to be added and templated, even if the source and destination 'moniker' happen to be the same, which is unlikely with 3rd-party services.
-
Hi everybody @synesthesiam, @solyarisoftware, @DanBmh! Everything is in the 'dev' branch right now: https://github.com/SEPIA-Framework/sepia-stt-server/tree/dev
-
@synesthesiam, @DanBmh Thanks to @domcross I've made some progress with Scribosermo as well and managed to build a test system on aarch64 (Raspberry Pi 4) and amd64 🥳. I've collected all my Scribosermo knowledge here: https://github.com/fquirin/scribosermo-stt-setup
@DanBmh There is some confusion on my side regarding the Scribosermo model checkpoints. Can the German QuartzNet actually be used in an open-source project like SEPIA, or do the Nvidia terms-of-use prohibit that? 🤔
-
Great to read 👍 Regarding your question about the model license, I think there should be no problem if you use them in an open-source project, but I'm not completely sure. I didn't fully understand Nvidia's terms-of-use and whether they still apply after training on top of the weights, because you end up with a different model then.
-
The API docs have been updated to describe the general process flow and the 'welcome' event (authentication, model/language selection, mode, grammar, etc.), just in case someone is interested in building a Python or Node.js client ;-) 😁
-
I've updated the STT-Server repository with the BETA version of a Python client library and a demo script that demonstrates STT on the command line 🙂.
-
Hi everybody 🙂
I've been planning to update the SEPIA STT server for a while now but due to technical challenges and time constraints I haven't really managed to make any progress since 2019.
Now it is 2021 and I feel like the circumstances have improved a lot (except maybe the time constraints 😅). On the one hand I've managed to build a new web-audio library that simplifies audio processing in web-based/mobile clients (soon to be released). On the other hand, speech recognition tools like Vosk or Coqui have made it a lot easier to install and run STT engines on different platforms and to update the language models. Besides that, Michael from Rhasspy is doing a great job of piecing together the many open-source voice projects and resources floating around the web into things like the Larynx TTS server (and much more).
To cut a long story short, I'm going to rebuild the STT server more or less from scratch and I want to use the chance to make it as compatible as possible with all the other great projects out there. That's why I'd like to invite you to a little discussion before I start 🙂.
Here are my ideas/requirements:
Open questions:
Rough description of the client-server communication (draft):
I'm building this specifically to work together with the SEPIA web-based cross-platform client but I'd be very happy if the server can become useful for many other open-source projects! 🤖 💬
Any comments, questions and ideas are greatly appreciated! 🙂
cu,
Florian