Using an ESP32-Box as the audio source #145

obones · 2021-12-20T13:35:01Z

Hello,

I'm running a private openHAB instance (on an x86_64 server), which works great for automating the relatively few devices I have here.
While I'd love to control it by voice, I definitely don't want anyone listening in on my home, which is why I'm looking at offline options, and SEPIA came up in a few discussions on the openHAB forum.
As it happens, I have an ESP32-Box on its way in the mail. It is said to be capable of offline wake word processing but seems to be limited to English when it comes to commands. While end users could get used to an English wake word, they need to speak French for the command sentences.
I was thus wondering whether there is a way to use SEPIA to do the heavy lifting, based on audio samples sent by the ESP32-Box once it has recognized its wake word.
I understand that there is a (fair) bit of programming to be done on the ESP32-Box to talk to SEPIA, but I'm not sure what I should be targeting in SEPIA.
I looked at the documentation, but apart from general principles and very nice diagrams, I could not find anything along the lines of "audio samples are to be POSTed or PUT to this URL". Could you tell me which page I have missed?

Thanks for your help.

fquirin · 2021-12-20T21:37:37Z

Maintainer

Hi,

technically I think what you have to do is implement the WebSocket interface of the SEPIA STT server. It accepts audio buffer chunks and will return transcribed text. This text could be sent to the SEPIA Assist-Server endpoint for further actions. Basically the ESP32 would be a micro-client for SEPIA.
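
To illustrate the "audio buffer chunks" idea, here is a minimal host-side sketch in Python that slices a raw PCM buffer into fixed-duration chunks, the way a client might before sending each one as a WebSocket message. The function name `pcm_chunks`, the 16 kHz / 16-bit mono format and the 100 ms chunk length are assumptions made for illustration only; the audio format and chunk size the STT server actually expects should be taken from its API documentation.

```python
from typing import Iterator


def pcm_chunks(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100,
               bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield successive fixed-duration slices of a mono PCM buffer.

    All defaults are illustrative assumptions, not SEPIA requirements.
    The last chunk may be shorter than the others.
    """
    chunk_bytes = (sample_rate * chunk_ms // 1000) * bytes_per_sample
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


# One second of 16-bit silence at 16 kHz is 32000 bytes.
one_second = bytes(16000 * 2)
chunks = list(pcm_chunks(one_second))
print(len(chunks), len(chunks[0]))  # prints "10 3200"
```

On the ESP32 itself the same slicing would more likely happen naturally, one I2S read buffer at a time, rather than on a complete recording.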

I've been thinking about similar things to build something like a Fire-TV remote for SEPIA.

Here is a bit of documentation about the STT Server API.
What programming language can you use on the ESP32-Box?

2 replies

obones Dec 20, 2021
Author

OK, at least there is a foundation to work with, that's nice to hear.
The ESP32-Box is programmed using ESP-IDF, a C/C++ framework.
It seems to have a WebSocket client already available: https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/protocols/esp_websocket_client.html
I have not used that WebSocket client yet, but I am quite used to the HTTP client from the same framework, so it should not be much of a problem.

If I read you correctly, the process would then be the following:

1. ESP32-Box establishes the WebSocket connection to the SEPIA STT server
2. ESP32-Box listens for the wake word and identifies it with its model (aka WakeNet)
3. ESP32-Box records audio data (by chunks, most likely) and sends it over the WebSocket
4. ESP32-Box signals the end of the audio to the STT server
5. STT server replies with the finalized interpretation as text
6. ESP32-Box sends that text to the SEPIA Assist-Server endpoint
7. SEPIA Assist-Server calls whatever is needed in openHAB to perform the required action

Bearing in mind implementation details and hiccups, this sounds quite reasonable to me.
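
The steps above can be sketched as a small state machine. This is a host-side Python model for reasoning about the flow, not ESP-IDF code; every state and event name below is hypothetical, and nothing here reflects the actual SEPIA message format.

```python
from enum import Enum, auto


class State(Enum):
    CONNECTING = auto()        # opening the WebSocket to the STT server
    WAIT_WAKE_WORD = auto()    # local wake word detection
    STREAMING = auto()         # sending audio chunks
    WAIT_TRANSCRIPT = auto()   # end of audio signalled, waiting for final text
    SEND_TO_ASSIST = auto()    # forwarding the text to the assist endpoint


class Event(Enum):
    CONNECTED = auto()
    WAKE_WORD = auto()
    END_OF_SPEECH = auto()
    FINAL_TEXT = auto()
    ASSIST_DONE = auto()
    DISCONNECTED = auto()


# Allowed transitions; any other (state, event) pair is ignored.
TRANSITIONS = {
    (State.CONNECTING, Event.CONNECTED): State.WAIT_WAKE_WORD,
    (State.WAIT_WAKE_WORD, Event.WAKE_WORD): State.STREAMING,
    (State.STREAMING, Event.END_OF_SPEECH): State.WAIT_TRANSCRIPT,
    (State.WAIT_TRANSCRIPT, Event.FINAL_TEXT): State.SEND_TO_ASSIST,
    (State.SEND_TO_ASSIST, Event.ASSIST_DONE): State.WAIT_WAKE_WORD,
}


def next_state(state: State, event: Event) -> State:
    """Return the state that follows `state` when `event` occurs."""
    if event is Event.DISCONNECTED:
        return State.CONNECTING  # always fall back to reconnecting
    return TRANSITIONS.get((state, event), state)


def run(events, state: State = State.CONNECTING) -> State:
    """Feed a sequence of events through the machine and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state
```

After one full round trip the client is back to listening for the wake word, and a dropped connection at any point sends it back to `CONNECTING`.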

fquirin Dec 20, 2021
Maintainer

Correct, that pretty much sums it up :-)

If it makes things easier, one could think about sending the audio data to some remote client first and handling the rest from there. That might be useful for resampling etc. The SEPIA client has another interface via CLEXI, which is used for example for remote login and event broadcasting; maybe it could be extended 🤔.

Here is some more info about the general SEPIA-Home APIs (assist, teach and chat endpoints):
https://github.com/SEPIA-Framework/sepia-docs/tree/master/API

StuartIanNaylor · 2022-02-23T16:56:01Z

The ESP32-S3-Box was a limited-run demonstrator, and I'm not sure it will be on offer again.
I think it was aimed at manufacturers, as a way of saying "hey, the new S3 is capable of this": Espressif put a collection of running libs on an S3 to showcase what it's capable of.

WebSockets are great because they are bidirectional and it is extremely easy to tell the two payload types, binary and text, apart.
You can make a simple deduction, binary = audio and text = command, so one connection works well for both the command protocol and the audio stream.
Espressif also provides an AMR-WB encoder, since sending or storing uncompressed PCM has been superseded since the days of iPods.
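
As a rough illustration of the "binary = audio, text = command" deduction, here is a small Python sketch that routes an incoming frame by its type. It assumes the WebSocket library in use hands binary frames over as `bytes` and text frames as `str`; the function name `route_frame` and the two list-based sinks are hypothetical stand-ins for a real audio pipeline and command handler.

```python
def route_frame(frame, audio_sink: list, command_sink: list) -> str:
    """Dispatch one WebSocket payload by type and report where it went.

    Binary payloads are treated as audio, text payloads as commands.
    """
    if isinstance(frame, (bytes, bytearray)):
        audio_sink.append(bytes(frame))
        return "audio"
    if isinstance(frame, str):
        command_sink.append(frame)
        return "command"
    raise TypeError(f"unsupported frame type: {type(frame).__name__}")


audio, commands = [], []
route_frame(b"\x00\x01\x02\x03", audio, commands)   # goes to the audio sink
route_frame('{"cmd": "start"}', audio, commands)    # goes to the command sink
```

The JSON string above is only a placeholder; the thread does not define a command format.
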
The AEC on the ESP32-S3-BOX isn't really AEC: given the length of the frame and tail, it's more like a telephony line canceller that stops double talk. Combined with the (also simple) BSS algorithm, the overall results are good, although many of the libs are binary blobs.
It is also very hardware dependent: a reference signal is taken after the DAC and fed to a third ADC channel, so that it is in sync at capture and then goes through the pipeline.
It is probably possible, on an RTOS with fixed timing, to count the cycles, create a circular buffer and use the I2S DAC data instead, but if you have a 4-channel ADC with spare inputs, creating a loopback is very simple code-wise.
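
One way to picture the software alternative to a hardware loopback is a fixed delay line: playback samples are buffered and handed back a known number of samples later, so they line up with what the microphones capture. The Python sketch below only models that idea; the class name `ReferenceDelayLine` and the notion of a single constant delay are assumptions, and measuring the real playback-to-capture latency on the hardware is outside its scope.

```python
from collections import deque


class ReferenceDelayLine:
    """Delay playback samples by a fixed number of samples.

    Hypothetical helper: `delay_samples` stands for the (assumed constant)
    latency between writing a sample to the DAC and hearing it at the mics.
    """

    def __init__(self, delay_samples: int) -> None:
        if delay_samples < 0:
            raise ValueError("delay_samples must be non-negative")
        # Pre-fill with silence so the first outputs are zeros.
        self._buf = deque([0] * delay_samples)

    def push(self, playback_sample: int) -> int:
        """Store the newest playback sample and return the delayed one."""
        self._buf.append(playback_sample)
        return self._buf.popleft()


line = ReferenceDelayLine(delay_samples=2)
delayed = [line.push(s) for s in (10, 20, 30, 40)]
print(delayed)  # prints "[0, 0, 10, 20]"
```

The delayed samples would then play the role of the reference channel that the third ADC input provides on the original board.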

I am not a fan of any KWS that doesn't expose the softmax probability score of the last keyword detection. That score is a hugely useful piece of metadata: in an array of KWS devices, the highest score indicates the best channel to use.
The score should be the first text command string sent to an ASR queue. The first broadcast should trigger a short delay to check for further broadcasts in the same room zone, and the input stream is then switched to the device with the highest probability score, as you can pretty much count on that being the clearest recording.
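
The selection described above could look roughly like this in Python: detections are grouped by room zone, only those arriving within a short window after the first one in that zone are considered, and the highest score wins. The `Detection` fields, the `pick_best` name and the 200 ms default window are all illustrative assumptions rather than part of any existing KWS or SEPIA API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Detection:
    device_id: str    # which KWS device heard the keyword
    zone: str         # room zone the device belongs to
    score: float      # softmax probability of the keyword detection
    timestamp: float  # arrival time in seconds


def pick_best(detections: Iterable[Detection],
              window_s: float = 0.2) -> Dict[str, Detection]:
    """Return, per zone, the highest-scoring detection inside the window.

    The window opens with the first detection seen in each zone; later
    broadcasts that arrive after it has closed are ignored.
    """
    first_seen: Dict[str, float] = {}
    best: Dict[str, Detection] = {}
    for det in sorted(detections, key=lambda d: d.timestamp):
        start = first_seen.setdefault(det.zone, det.timestamp)
        if det.timestamp - start > window_s:
            continue  # arrived too late to take part in this selection
        if det.zone not in best or det.score > best[det.zone].score:
            best[det.zone] = det
    return best
```

For example, if two kitchen devices report scores of 0.71 and 0.93 within 50 ms of each other, the second one is chosen as the input stream, while a third report a full second later is ignored.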

It's very likely the ESP32-S3 will follow a cost curve similar to the ESP32 and earlier products. It is still new, and even at its current premium price the chip itself is only approx. $2.50.
The demo boards that Espressif makes are generally overkill, bloated in both features and BoM, and I just have my fingers crossed that one of the Chinese cloners will start producing a simple ESP32-S3-AUDIO with a 4-channel ADC and a DAC.

If you look, though, there is a new product, https://github.com/espressif/esp-box/blob/master/docs/hardware_overview/esp32_s3_box_lite/hardware_overview_for_lite.md, which has a 2-channel ADC but is still a 2-mic version. Hopefully they have changed the design and added a ring buffer so that mic capture and output latency are matched, as it no longer has the loopback reference channel: there isn't a third channel to loop back to.

So yeah, interesting things are happening, and the new ADC might even be cheaper than the $1+ one they had before. It's likely we could get I2S mics and need no ADC at all, but I'm still hoping the cloners will eventually do a simple ESP32-S3-AUDIO, as with economies of scale network KWS could be as cheap as $5-10.
The ESP32-S3 is a near-perfect, cost-effective device for network KWS, and rather than being trapped with Espressif's models, it now has an updated, optimised TensorFlow Micro lib.
I am not really a fan of tiny-screen voice AI and see the screen as a pointless add-on unless you size up to something you don't have to hover over to read, as that kills the whole point of hands-free voice AI. That is why base units are either screen-free or come with more useful screens of 8" and above: you are supposed to be able to view them from a distance while hands-free.
But hey, the original ESP32-S3-Box was only $50 and I guess the Lite can only get cheaper. As you can guess, I couldn't give a damn about a screen: for setup we have phones and Bluetooth for that one-off need.

StuartIanNaylor · 2022-02-24T15:48:55Z

PS: I don't know where to get it, but they are also doing an esp32-s3-box-lite-board, which is one hell of a product name. It will be really interesting to see how much it costs.
https://zhuanlan.zhihu.com/p/470403993 is the only source I can find; there are no other details, and you will have to translate it from Chinese.
