rgoupil/speech-net-cs

speech-net-cs

A .NET server that starts speech recognition using a grammar provided by the client.

It was created for a Unity project that required offline speech recognition, and the UnityLabs Speech-to-Text solution was a real pain to set up. I also needed to generate the grammar from alternatives, and the .NET Choices and GrammarBuilder classes looked like a good fit. The .NET System.Speech assembly is available from .NET 3.0 onward, whereas Unity only uses .NET 2.0 assemblies (maybe 2.5, but that is still not good enough for me). So I created a separate binary, spawned by the Unity project, that communicates over TCP with a MonoBehaviour running in the project.

By design, the Server waits up to 10 seconds for the handshake after accepting a Client. After that time, the Server exits. The Server also exits if the connection with the Client is lost. As the Server is spawned by the Unity project, I wanted to make sure that the Server process won't become a zombie if the game crashes. If the Server crashes, the Client continuously tries to spawn it again.
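The handshake timeout described above could be sketched as follows. This is a minimal illustration and not the project's actual server loop: the class name `HandshakeGuard` and the method `TryReadHandshake` are hypothetical, and the sketch only reads the first chunk of data rather than parsing a real HandshakeRequest.

```csharp
using System;
using System.IO;
using System.Net.Sockets;

// Hypothetical sketch, not the project's actual code.
public static class HandshakeGuard
{
    // Waits up to `timeout` for the first bytes sent by an accepted Client.
    // Returns false when the Client stays silent, disconnects, or the
    // connection is lost; the caller is then expected to exit the process.
    public static bool TryReadHandshake(TcpClient client, TimeSpan timeout, out byte[] handshake)
    {
        handshake = null;
        client.ReceiveTimeout = (int)timeout.TotalMilliseconds;
        var buffer = new byte[4096];
        try
        {
            // Only the first chunk is read; a real handshake needs message framing.
            int read = client.GetStream().Read(buffer, 0, buffer.Length);
            if (read == 0)
                return false; // the Client closed the connection
            handshake = new byte[read];
            Array.Copy(buffer, handshake, read);
            return true;
        }
        catch (IOException)
        {
            return false; // timed out or connection lost
        }
    }
}
```

A caller might accept a Client with `TcpListener.AcceptTcpClient()`, call `TryReadHandshake` with `TimeSpan.FromSeconds(10)`, and call `Environment.Exit` when it returns false, so that the process never outlives a crashed game.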

The Server only sends the semantic sentences, which lets the receiver easily understand the result of the speech when multiple phrases mean the same thing. For instance, "I am good", "I am fine", "I am OK" and simply "good" could all be understood as "user_ok", making the final result easier to use.
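To illustrate the idea of several spoken phrases sharing one semantic value, here is a minimal sketch that uses the System.Speech classes directly. It is not how the Server builds its grammar (that goes through the Phrase and Sentence classes described later); the class name `SemanticGrammarExample` is hypothetical, and the code assumes the System.Speech assembly is available (Windows only).

```csharp
using System;
using System.Speech.Recognition;

// Hypothetical sketch, not the project's actual code.
public static class SemanticGrammarExample
{
    // Builds a grammar in which four spoken phrases share the semantic value "user_ok".
    public static Grammar BuildUserOkGrammar()
    {
        var alternatives = new Choices();
        foreach (var phrase in new[] { "I am good", "I am fine", "I am OK", "good" })
        {
            // SemanticResultValue tags the spoken phrase with the value reported on recognition.
            alternatives.Add(new GrammarBuilder(new SemanticResultValue(phrase, "user_ok")));
        }
        return new Grammar(new GrammarBuilder(alternatives));
    }
}
```

Once such a grammar is loaded into a `SpeechRecognitionEngine`, the `SpeechRecognized` handler can read the semantic value from `e.Result.Semantics` instead of comparing `e.Result.Text` against every possible wording.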

Please feel free to send a PR if you happen to improve this.

How to set up

Interfaces

Unity

Copy the files MarvinConsumer.cs and MarvinStarter.cs from Interfaces/Unity/ into your Unity project's assets. MarvinStarter is a component that spawns the Server process; in the component, you must specify the path to the binary, relative to the project's asset data folder. MarvinConsumer is a component that connects to the Server, sends the grammar and receives the semantic value when speech is recognized.

Both components need to be present on an entity in the scene. Putting more than one of each will result in the random destruction of the extra components until no more than one of each remains.

Other

If you happen to create an interface that we are missing, consider sending a PR so that it can be merged here.

Marvin Interface

I strongly suggest adding the MarvinInterface project to your Visual Studio Unity project.
If needed, replace the path in the post-build event of the MarvinInterface project to copy the binary to the correct place.

It contains the classes Configuration (explicit enough), Phrase (described later), Utils (only used for serialization/deserialization helpers so far) and HandshakeRequest (sent by the Client to the Server, contains the grammar).

Marvin Server

I strongly suggest adding the MarvinServer project to your Visual Studio Unity project.
If needed, replace the path in the post-build event of the MarvinServer project to copy the binary to the correct place. Add a reference to the MarvinInterface project.

If you followed these steps correctly, you should now be able to hop into Unity, start the scene with the two components MarvinStarter and MarvinConsumer, and test the voice recognition. Since the initial project is a tactical shooter, the default grammar is loaded with a set of rules made of subjects and orders. To make sure everything is set up correctly, try saying various orders such as:

  • Red drop a lightstick
  • Everyone follow me
  • Gold toss a bang
  • Blue breach and clear
  • open bang and wait

Grammar syntax

Phrases

A phrase is a group of words that is too short, or missing parts, to make a valid sentence. The Phrase class lets you declare a phrase with optional words and wildcards, and assign it a semantic value. The Phrase constructor takes the semantic value, followed by the phrase:

// Three phrases that share the same semantic value
Phrase open = new Phrase("cmd_open", "open");
Phrase start = new Phrase("cmd_open", "start");
Phrase execute = new Phrase("cmd_open", "execute");

The syntax for the phrase is:

  • word: written with alphanumeric characters or underscores; words are separated by spaces and are required in the speech => jump there
  • optional word: a word preceded by a question mark with no space in between, not required in the speech => ?jump
  • anything: written as an ellipsis, matches any speech but can lead to ambiguity if used carelessly => ...

A more complex phrase could be:

Phrase phrase = new Phrase("cmd_lightstick", "deploy ... ?light stick ... ?and");

Resulting in the following permutations:

  • deploy ... light stick ... and
  • deploy ... light stick ...
  • deploy ... stick ... and
  • deploy ... stick ...
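The expansion above can be reproduced with a short sketch. It is only an illustration of the syntax rules, not the project's actual Phrase implementation; the class `PhraseExpander` is hypothetical, and it assumes tokens are separated by single spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not the project's actual Phrase implementation.
public static class PhraseExpander
{
    // Expands a phrase pattern into every word sequence it can produce.
    // A token starting with '?' is optional; any other token (including "...") is kept as-is.
    public static List<string> Expand(string pattern)
    {
        var results = new List<List<string>> { new List<string>() };

        foreach (var token in pattern.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            bool optional = token.Length > 1 && token[0] == '?';
            string word = optional ? token.Substring(1) : token;

            var next = new List<List<string>>();
            foreach (var prefix in results)
            {
                // Branch where the word is spoken.
                next.Add(new List<string>(prefix) { word });
                // Branch where an optional word is skipped.
                if (optional)
                    next.Add(new List<string>(prefix));
            }
            results = next;
        }

        return results.Select(words => string.Join(" ", words)).ToList();
    }
}
```

Calling `PhraseExpander.Expand("deploy ... ?light stick ... ?and")` returns the four permutations listed above, in the same order. Note that the question mark only applies to the word it is attached to, which is why "stick" appears in every permutation.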

Sentences

While a sentence is normally a well-defined arrangement of words, the definition used in this project is a bit looser. Here, a sentence is an arrangement of semantic values (from already defined phrases) that can also contain plain values, optional values and wildcards. The Sentence constructor takes a sentence pattern, which is later combined with the declared Phrases to generate the full set of recognizable sentences.

Sentence sentence = new Sentence("... cmd_open");

The syntax for the sentence is:

  • value: written with alphanumeric characters or underscores, required in the speech => jump
  • values: multiple values surrounded by parentheses and separated by pipes => (jump|run|cmd_open)
  • optional value: a value preceded by a question mark with no space in between, not required in the speech => ?jump
  • optional values: values preceded by a question mark, not required in the speech => ?(jump|run|cmd_open)
  • anything: written as an ellipsis, matches any speech but can lead to ambiguity if used carelessly => ...

A more complex sentence could be:

Sentence sentence = new Sentence("... ?(subject_all|subject_this) (cmd_open|cmd_lightstick) cmd_follow");

Resulting in the following permutations, where each word could be pointing to one or multiple phrases (resulting in even more sentences handed to the Speech Engine):

  • ... subject_all cmd_open cmd_follow
  • ... subject_all cmd_lightstick cmd_follow
  • ... subject_this cmd_open cmd_follow
  • ... subject_this cmd_lightstick cmd_follow
  • ... cmd_open cmd_follow
  • ... cmd_lightstick cmd_follow
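The same expansion can be sketched in code. This is an illustration of the syntax only, not the project's actual Sentence implementation; the class `SentenceExpander` is hypothetical, it assumes tokens are separated by single spaces with no spaces inside parentheses, and it stops at the semantic level (it does not expand the Phrases behind each value).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not the project's actual Sentence implementation.
public static class SentenceExpander
{
    // Returns the alternatives one token can produce.
    // A null entry stands for "skipped" and is only produced by optional tokens.
    private static List<string> Alternatives(string token)
    {
        bool optional = token.Length > 1 && token[0] == '?';
        string body = optional ? token.Substring(1) : token;

        // "(a|b|c)" becomes a, b, c; a plain value (or "...") stays a single alternative.
        List<string> alternatives = body.StartsWith("(") && body.EndsWith(")")
            ? body.Substring(1, body.Length - 2).Split('|').ToList()
            : new List<string> { body };

        if (optional)
            alternatives.Add(null);
        return alternatives;
    }

    // Expands a sentence pattern into every sequence of semantic values it allows.
    public static List<string> Expand(string pattern)
    {
        var results = new List<List<string>> { new List<string>() };

        foreach (var token in pattern.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            var next = new List<List<string>>();
            foreach (var prefix in results)
            {
                foreach (var alternative in Alternatives(token))
                {
                    var sentence = new List<string>(prefix);
                    if (alternative != null)
                        sentence.Add(alternative);
                    next.Add(sentence);
                }
            }
            results = next;
        }

        return results.Select(values => string.Join(" ", values)).ToList();
    }
}
```

Calling `SentenceExpander.Expand("... ?(subject_all|subject_this) (cmd_open|cmd_lightstick) cmd_follow")` returns the six semantic sentences listed above, in the same order.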

Calculating the number of permutations

While the server keeps track of the total for you, it is good to have an idea of the impact each element can have on the final number of sentences generated.

Total number of phrases created by one Phrase:

Given a Phrase p made of n words, where the function W(i) returns the following value based on the type of the i-th word:

  • Word: 1
  • Optional: 2
  • Anything: 1

The total number of phrases $P(p)$ created by one Phrase is:

$$P(p) = \prod_{i=1}^{n} W(i)$$

Total number of sentences created by one Sentence:

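Given a Sentence s made of m elements, where the function E(i) returns the following value based on the type of the i-th element:

  • Value - not Phrase: 1
  • Value - Phrase: $P(p)$, the total number of phrases created by the Phrase p it points to
  • Values: with e elements $v_1, \dots, v_e$ => $\sum_{j=1}^{e} E(v_j)$
  • Optional: $E(v) + 1$, where v is the optional value
  • Optionals: $\sum_{j=1}^{e} E(v_j) + 1$
  • Anything: 1

The total number of sentences $S(s)$ created by one Sentence is:

$$S(s) = \prod_{i=1}^{m} E(i)$$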
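As a worked check against the two examples given earlier, the counts can be computed by hand. This assumes that each total is the product of the per-element factors, and that an optional element contributes one extra "skipped" alternative; both assumptions are inferred from the permutation lists above rather than taken from the server code. The sentence count is taken at the semantic level, that is, counting each value as 1 before expanding the Phrases behind it.

```latex
\begin{align*}
% Phrase "deploy ... ?light stick ... ?and": six tokens, two of them optional
P &= 1 \cdot 1 \cdot 2 \cdot 1 \cdot 1 \cdot 2 = 4 \\
% Sentence "... ?(subject_all|subject_this) (cmd_open|cmd_lightstick) cmd_follow"
S &= 1 \cdot (1 + 1 + 1) \cdot (1 + 1) \cdot 1 = 6
\end{align*}
```

Both results match the four phrase permutations and the six semantic sentences listed in the Grammar syntax section.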

To summarize, be careful when using Optionals made of Phrases with a large number of permutations, as these end up generating the most sentences.

To see the number of semantic sentences and the actual number of sentences sent to the Speech Engine, you can examine the values of splittedSentences.Count and choichesCount in the file MarvinServer:SpeechToText.cs, at the end of the function private void UpdateGrammar().

TODO:

  • MarvinConsumer to use external grammar in XML or JSON for faster iterations and the possibility for the final user to modify it (modding <3)
  • Adding "Group" to the Sentence syntax or as a Class to allow grouping several Phrases into one to make the grammar sentence declaration shorter and easier to read
  • Send both the understood text and the semantic text on speech recognized
  • Allow reading the speech caught by the Anything in the Phrase and Sentence syntax - may be useful when the user has a free choice such as a number or a color.
    Maybe with a separate syntax such as new Phrase("prompt_number", "...{number}"), used like this: new Sentence("set width ?to prompt_number measure_unit").
  • Add repeater to the Sentence and Phrase syntax such as the Regex one new Sentence("(cmd_open|cmd_follow){2}") to repeat the group 2 times and new Sentence("(cmd_open|cmd_follow){1,3}") to repeat between 1 to 3 times for instance (for 0 times as minimum, just make it optional)
  • Unity Server process spawner to observe a cooldown or a max retry count, followed by an event to inform external components (e.g. show error message to user that something is wrong with the Marvin Server)
  • Better connection integrity detection. A simple ping should already help a lot
  • Add an abstract layer between the Server and the speech recognition to allow adding other speech libraries
  • Evaluate alternate offline speech engine such as CMUSphinx
