-
Hi Florian, Thanks for writing up this proposal 🙂 Interoperability between our projects would be great, so I'll do what I can to help. Using websockets and PCM audio as you described sounds good to me. This should allow almost anything to participate as an audio source 👍 Have you considered modeling the API after Home Assistant's WebSocket API? They have an "auth" step very similar to what you described. Their event structure also seems pretty simple, something like:

```json
{
  "id": 5,
  "type": "event",
  "event": {
    "data": {},
    "event_type": "onaudiostart"
  }
}
```
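For reference, the auth phase there is a short exchange (paraphrased from the Home Assistant WebSocket API docs, so double-check the details): the server sends `auth_required`, the client answers with `auth` plus a token, and the server replies with `auth_ok` or `auth_invalid`:

```json
{"type": "auth_required", "ha_version": "2021.5.4"}
{"type": "auth", "access_token": "ABCDEFGH"}
{"type": "auth_ok", "ha_version": "2021.5.4"}
```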
Adding some kind of "session id" into the messages might be worth considering as well.
The Web Speech API result structure seems reasonable to me. Extra properties could always be added for things like:
Training
I think this is where things are going to get complicated. For the non-dictation case, something like Kaldi Active Grammars would be great to be able to use from a client. The author of Jaco Assistant (Daniel) has also proposed a Markdown format for sharing skills between assistants (riddle example). This includes:
Using rhasspy-nlu, I can transform these Markdown files into n-gram language models or directly into finite-state transducers for Kaldi. Since most of our STT engines use n-gram models or FSTs, we might consider this as a basic training format.
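As a rough illustration of that pipeline (based on the rhasspy-nlu README; the sentence template below is just an example):

```python
# Sketch: parse Rhasspy-style sentence templates and build an intent graph
# that can later be turned into an n-gram LM or FST for the STT engine.
import rhasspynlu

TEMPLATES = """
[GetTime]
what time is it
tell me the time
"""

intents = rhasspynlu.parse_ini(TEMPLATES)     # parse the sentence templates
graph = rhasspynlu.intents_to_graph(intents)  # build a finite-state intent graph

# Sanity check: recognize a sentence directly against the graph
print(rhasspynlu.recognize("what time is it", graph))
```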
-
Hi Michael, thanks for stopping by 😃
I have not, but I will definitely take a look. Compatibility with HA is probably not the worst idea ^^.
This can be done for sure, but my concern was more about the underlying STT engine's capability to transcribe multiple streams at the same time ... if the machine has enough resources anyway ^^. I think Nickolay from Vosk said that this is no problem, so I guess I'm just going to try and see :-)
👍
I would love to see that. I think this needs language models that are trained in a specific way, right? As far as I remember there was only one model on the Vosk page that mentions it specifically.
This is probably something I could adapt for SEPIA. There is a similar format in SEPIA that's used for the Teach-UI, and all the sentences trained with it can be loaded from the server via a REST endpoint. I actually implemented this for the purpose of language model training 🙂.
I later refined this to JSON objects stored in the database with a more complex structure, but the idea remains the same and I always wanted to improve it further. Well, the training part is certainly a very interesting topic but maybe a bit out of scope for version 1 of the STT server 😅.
-
Hi Florian, this is an interesting idea. Recently I updated Jaco's STT framework Scribosermo with new models. The network now performs better than the DeepSpeech network I used before, while being just as fast. The repository provides pretrained models in English, German, French and Spanish, which are exported to tflite format for easy usage. Today I also added support for streaming audio input. Greetings,
-
Great @DanBmh, it certainly sounds very interesting 😎
-
Hi Florian, everyone!
My experiments with offline ASRs
My simplistic/temporary comparisons:
Temporary conclusion: I vote vosk :)
BTW, chapeau to Michael for his dissemination work and for the Rhasspy project, and chapeau to your work as well, Florian!!
Websockets?
I tend to agree that websockets are a possible winner. Nevertheless, considering that your clients are, if I understood correctly, mobile phone web apps or native apps, I want to share a possible warning about using raw websockets. Question: are your clients every possible device? I'm asking because, in my modest experience, every time I used raw websockets I had a lot of communication issues; for example, you have to manage re-connections and errors with your own protocol, compatibility, the data structure when mixing binary (audio) with other data types, etc. So in client-server architectures where the clients are remote web apps (in the browser) I recently preferred to use Socket.IO. For example, this is an architecture I built (see slide 21): See a comparison of features here: Another problem I see with raw websockets is that you may want the client to submit an "object", a data structure containing not only the streamed binary audio data but also some attributes, e.g. the language model (or the language), the grammar, an ID, the data format, etc. If so, you have to send this object with a binary serialization protocol on top of websockets, like protocol buffers, etc. BTW, Socket.IO seems to be able to send objects containing binary/blobs, see: BTW, below is a rough pseudocode-style sketch of a server that ingests audio data using Socket.IO. To be honest I'm not sure that a websocket/Socket.IO interface is the best fit for a server that has to be "not hard-coded". I have to think about it.
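Something like this, using the python-socketio package (event and field names are just placeholders):

```python
import socketio

sio = socketio.AsyncServer(async_mode="asgi")
app = socketio.ASGIApp(sio)  # serve with e.g. uvicorn

@sio.event
async def audio_chunk(sid, data):
    # Socket.IO lets the client emit one object that mixes metadata and binary
    language = data.get("language", "en")
    chunk: bytes = data["samples"]  # raw PCM bytes
    # ... feed `chunk` into the recognizer bound to this session (sid) ...
    await sio.emit("partial_result", {"transcript": "..."}, to=sid)
```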
Sentence-based (message-based) transcript vs streaming-based transcript
In the long term I tend to agree with you that streaming is "better" to cover all possible application scenarios (dictation). My consideration is that it's maybe not so useful for managing back-and-forth turns (conversations!) on a voice assistant. In theory it's the Dialog Manager (DM) that is in charge of replying to "partial" (streamed) sentences, but it's really hard to build a dialog manager smart enough to reply to the user on unfinished (partial) sentences. So in general I'd go for a message-based approach for conversational systems. BTW, streaming processing could instead be useful for vertical applications where the assistant has to detect and react to the user as soon as possible (safety at work in hazardous contexts, etc.), but please note that in this case the DM probably has to process emotional/tone metadata, and currently we don't have (mainstream) ASR technology that supplies this kind of audio/speech/non-verbal interpretation.
Compressed versus uncompressed audio transfer
In my previous software I did the decompression task server-side, converting an OPUS file to a WAV file using ffmpeg (rough sketch below). Maybe crude, but in the end the ffmpeg process runs pretty fast.
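For example (paths are placeholders; 16 kHz mono is what most of the STT models discussed here expect):

```python
import subprocess

def opus_to_wav(opus_path: str, wav_path: str) -> None:
    """Decode an OPUS file to 16 kHz mono WAV using ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", opus_path, "-ar", "16000", "-ac", "1", wav_path],
        check=True,
    )
```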
Yes for a specific dictation application,
Not necessarily. Using Vosk or DeepSpeech (Coqui) you can load the model(s) into memory ONCE at server startup time. Of course, with this choice clients can at runtime only request a transcript for the subset of languages (or models) that were preloaded. BTW, splitting model creation from the transcription task is the basic key to optimizing run-time transcription latency; see the sketch below.
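With Vosk the split looks roughly like this (model path is a placeholder):

```python
import json
from vosk import Model, KaldiRecognizer

MODEL = Model("model")  # expensive: load once at server startup

def transcribe(pcm_chunks, sample_rate=16000):
    rec = KaldiRecognizer(MODEL, sample_rate)  # cheap: create per request/stream
    for chunk in pcm_chunks:                   # 16-bit mono PCM bytes
        rec.AcceptWaveform(chunk)
    return json.loads(rec.FinalResult())["text"]
```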
Yes, one more note about websockets and the global architecture:
See above: Vosk doesn't block, DeepSpeech blocks!
BTW, on the client request I'd optionally add, if the ASR engine enables the feature, an "adaptive grammars" attribute. See an example using Vosk: https://github.com/solyarisoftware/voskJs/tree/master/examples#transcript-http-server related to the proposed architecture here:
I'd go for a JSON data structure. Hope this is useful.
-
Thanks, giorgio! I'll keep going until they stop me ;)
Standard websockets differentiate between text (UTF-8) and binary data. The Python websockets library, for example, returns a `str` for text frames and `bytes` for binary frames, so JSON messages and raw audio can share one connection.
@fquirin, if you're going to use Vosk for your initial test, I'd like to create a similar STT server for Coqui STT. How do you want to handle the authentication? I had planned to just mirror what Home Assistant does, essentially:
After that, the client should send a start message to configure things; something like `{ "type": "startsession", "siteId": "...", ... }` and then the client starts streaming audio. Responses from the server would have the same format. Thoughts?
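To make that concrete, here's a rough sketch of a client using the Python websockets library (URL, token and field names are only placeholders following the draft above):

```python
import asyncio
import json
import websockets

async def stream_audio(chunks):
    async with websockets.connect("ws://localhost:8080/stt") as ws:
        # 1. Authenticate (Home Assistant style)
        await ws.send(json.dumps({"type": "auth", "access_token": "..."}))
        print(await ws.recv())  # expect an auth-ok style reply

        # 2. Configure the session
        await ws.send(json.dumps({"type": "startsession", "siteId": "kitchen"}))

        # 3. Stream raw 16-bit PCM chunks as binary frames
        for chunk in chunks:
            await ws.send(chunk)

        # 4. Read JSON results from text frames
        async for message in ws:
            result = json.loads(message)
            if result.get("result", {}).get("isFinal"):
                print(result["result"]["transcript"])
                break

# asyncio.run(stream_audio(my_pcm_chunks))
```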
-
Wait @solyarisoftware , are you saying you got faster transcription with the large model on the SAME! machine? 😲 ... I gotta test this as well! 🤔
I was a bit concerned about this as well but so far I haven't seen notable differences and in one test the container was even faster! 🤷♂️ 😅 I will keep an eye on it but containers are what people are asking for ^^ (and it really saves a lot of pain :-p).
That's about correct ... well actually the SEPIA assist server currently also manages the connection to the TTS server but I want to offer a 3rd interface (client, SEPIA, 3rd party API) with a direct connection to a MaryTTS compatible API as well (e.g. Larynx/Open TTS by Michael) just to give users all options. SEPIA was always about total customization so I guess it makes sense.
Why not? It works pretty much identically to the Web Speech API: idle -> connect to ASR server with grammar option -> stream -> transcribe -> final result -> disconnect -> idle. That's about as atomic as it gets I guess :-)
In the case of SEPIA it's basically the client, but in theory it can be the assist or chat server as well with a bit of code refactoring. The SEPIA assist server was initially built to work behind a load balancer, so state management would have been hard to synchronize. Anyway, this STT server is just responsible for speech-to-text. It's only one of the puzzle pieces and it's designed to work independently. Think of the Web Speech API again ;-).
I remember commenting on it somewhere; before I rebuilt the audio lib I assumed it wasn't possible, but it works on Chrome/Chromium (probably Edge) and Safari. I'm not sure if they updated it or if I just did it the right way this time ^^. But even though it works I prefer the manual resampling: it gives you more control over performance and quality, and you can combine it with the PCM 16-bit integer conversion required for WAV encoding. I'm planning to put a demo of the audio lib online but I'm still changing parts all the time. It's pretty cool though, you can chain all the modules and build a complete pipeline of audio recording tools required for speech 😁
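Just to illustrate that conversion step (the lib itself is JavaScript, but the math is the same; NumPy sketch):

```python
import numpy as np

def float32_to_pcm16(samples: np.ndarray) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    clipped = np.clip(samples, -1.0, 1.0)
    return (clipped * 32767).astype("<i2").tobytes()
```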
I was thinking about this as kind of an opt-in feature for users to donate their recordings. You can always dump the stream to a file afterwards. This is what I'm doing in the old server actually :-)
Ah, in this case I mean audio-input-start to audio-input-end :-) "Command" is confusing here, "input" might be the better word.
That's the plan @synesthesiam 🙂
It would be great to have both. The plan was to keep everything the same up to the point where we feed the chunks into the recognizer. I've seen the necessary code for Vosk (
That is pretty similar (almost identical) to the way SEPIA authenticates to its chat server, so I agree we can take it.
Actually we could send this together with the answer to the auth request to save one ping-pong 🤔.
You mean
The weird thing about the Web Speech API results is that they have:
-
I had assumed I'd just build a separate Python server for this. Part of the reasoning is that I plan to add wake word detection to my server as well. When configured, this would first pass the audio chunks into a wake word recognizer, then into STT once a detection has occurred. Without wake word detection configured, it would just be an STT server.
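Roughly what I have in mind for the chunk routing (class and method names here are just placeholders, not an existing API):

```python
class AudioPipeline:
    """Route audio chunks to a wake word detector first, then to STT."""

    def __init__(self, stt_recognizer, wake_word_detector=None):
        self.recognizer = stt_recognizer
        self.detector = wake_word_detector
        # Without a wake word detector this is just a plain STT server
        self.listening = wake_word_detector is None

    def process_chunk(self, chunk: bytes):
        if not self.listening:
            if self.detector.detect(chunk):   # hypothetical detector API
                self.listening = True
            return None
        return self.recognizer.accept_chunk(chunk)  # hypothetical STT API
```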
Let's just stick with the way SEPIA does it then 🙂 Do you have a doc describing that?
Sure, that's fine with me. What do you propose as the format?
Yeah, that seems reasonable to me. For results, though, I propose we have:

```json
{
  "type": "result",
  "result": {
    "isFinal": true,
    "transcript": "..."
  },
  "resultIndex": -1,
  "results": [...]
}
```

where the "result" is the best result and the "results" list is empty unless an n-best list is supported/requested. I don't see the point in making everyone dig through an object tree just to get the transcript.
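A client would then only need something like this to pull out the best transcript (sketch, assuming results arrive as JSON text frames):

```python
import json

def best_transcript(message: str):
    parsed = json.loads(message)
    if parsed.get("type") != "result":
        return None
    best = parsed["result"]  # best hypothesis; "results" holds the optional n-best list
    return best["transcript"] if best.get("isFinal") else None
```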
-
To add my own thoughts: an audio stream is all that is needed from TTS, since audio delivery frameworks such as squeezelite, snapcast and AirPlay already exist, are already maintained and are likely to do a better job than some streaming service built from scratch. You don't need a streaming audio layer in the STT/TTS system itself; you just need the metadata so an audio multiplexer can direct the stream to the right mixer. If you look at the available services, they seem to split into rooms/zones and corresponding channels, and the system shouldn't embed a specific technology that excludes any of them but should instead have a middleware bridge to connect to all. You just need bridgeware that provides templated metadata translation and contains a number of services for delivering an audio stream and mapping the destinations. Zones/rooms are sessions and a channel is a device, which requires mapping services through the conversion of parameters via templated mappings. Compression/codec probably depends on what the service you are feeding expects at the mixer, not on the end delivery, but I guess that should be part of the template and service install. If a voice AI system has its own embedded audio system then that is just another service to be added and templated, even if the source and destination 'moniker' happen to be the same, which is unlikely with 3rd-party services.
-
Hi everybody @synesthesiam, @solyarisoftware, @DanBmh! Everything is in the 'dev' branch right now: https://github.com/SEPIA-Framework/sepia-stt-server/tree/dev
-
@synesthesiam, @DanBmh Thanks to @domcross I've made some progress with Scribosermo as well and managed to build a test system on aarch64 (Raspberry Pi 4) and amd64 🥳. I've collected all my Scribosermo knowledge here: https://github.com/fquirin/scribosermo-stt-setup
@DanBmh There is some confusion on my side regarding the Scribosermo model checkpoints. Can the German QuartzNet actually be used in an open-source project like SEPIA, or do the Nvidia terms-of-use prohibit that? 🤔
-
Great to read 👍 Regarding your question about the model license, I think there should be no problem if you use them in an open-source project, but I'm not completely sure. I didn't fully understand Nvidia's terms-of-use and whether they still apply after training on top of the weights, because you end up with a different model then.
-
The API docs have been updated to describe the general process flow and the 'welcome' event (authentication, model/language selection, mode, grammar, etc.), just in case someone is interested in building a Python or Node.js client ;-) 😁
-
I've updated the STT-Server repository with the BETA version of a Python client library and a demo script that demonstrates STT on the command line 🙂.
-
Hi everybody 🙂
I've been planning to update the SEPIA STT server for a while now but due to technical challenges and time constraints I haven't really managed to make any progress since 2019.
Now it is 2021 and I feel like the circumstances have improved a lot (except maybe the time constraints 😅). On the one hand I've managed to build a new web-audio library that simplifies audio processing in web-based/mobile clients (soon to be released). On the other hand, speech recognition tools like Vosk or Coqui have made it a lot easier to install and run STT engines on different platforms and to update the language models. Besides that, Michael from Rhasspy is doing a great job of piecing together the many open-source voice projects and resources floating around the web into things like the Larynx TTS server (and much more).
To cut a long story short, I'm going to rebuild the STT server more or less from scratch and I want to use the chance to make it as compatible as possible with all the other great projects out there. That's why I'd like to invite you to a little discussion before I start 🙂.
Here are my ideas/requirements:
Open questions:
Rough description of the client-server communication (draft):
I'm building this specifically to work together with the SEPIA web-based cross-platform client but I'd be very happy if the server can become useful for many other open-source projects! 🤖 💬
Any comments, questions and ideas are greatly appreciated! 🙂
cu,
Florian