Added support for lip-sync in English

met4citizen · Jan 9, 2024 · 0bb704a
1 parent 1736fe0
Showing 5 changed files with 661 additions and 150 deletions.
diff --git a/README.md b/README.md
@@ -58,7 +58,7 @@ the JSON Web Token needed to use that proxy (See Appendix B).
 // Create the talking head avatar
 const nodeAvatar = document.getElementById('avatar');
 head = new TalkingHead( nodeAvatar, {
-  ttsEndpoint: "./gtts/",
+  ttsEndpoint: "/gtts/",
   jwtGet: jwtGet
 });
 ```
@@ -77,6 +77,7 @@ Option | Description
 `ttsVolume` | Google text-to-speech volume gain (in dB) in the range [-96.0, 16.0]. Default is `0`.
 `ttsTrimStart` | Trim the viseme sequence start relative to the beginning of the audio (shift in milliseconds). Default is `0`.
 `ttsTrimEnd`| Trim the viseme sequence end relative to the end of the audio (shift in milliseconds). Default is `300`.
+`lipsyncLang`| Lip-sync language. Currently 'en' and 'fi' are supported. Default is `fi`.
 `pcmSampleRate` | PCM (signed 16bit little endian) sample rate used in `speakAudio` in Hz. Default is `22050`.
 `modelPixelRatio` | Sets the device's pixel ratio. Default is `1`.
 `modelFPS` | Frames per second. Default is `30`.
@@ -117,9 +118,9 @@ The following table lists some of the key methods. See the source code for the r
 
 Method | Description
 --- | ---
-`showAvatar(avatar, [onprogress=null])` | Load and show the specified avatar. The `avatar` object must include the `url` for GLB file. Optional properties are `body` for either male `M` or female `F` body form, `ttsLang`, `ttsVoice`, `ttsRate`, `ttsPitch`, `ttsVolume`, `avatarMood` and `avatarMute`.
+`showAvatar(avatar, [onprogress=null])` | Load and show the specified avatar. The `avatar` object must include the `url` for GLB file. Optional properties are `body` for either male `M` or female `F` body form, `lipsyncLang`, `ttsLang`, `ttsVoice`, `ttsRate`, `ttsPitch`, `ttsVolume`, `avatarMood` and `avatarMute`.
 `setView(view, [opt])` | Set view. Supported views are `"full"`, `"upper"`  and `"head"`. Options `opt` can be used to set `cameraDistance`, `cameraX`, `cameraY`, `cameraRotateX`, `cameraRotateY`.
-`speakText(text, [opt={}], [onsubtitles=null], [excludes=[]])` | Add the `text` string to the speech queue. The text can contain face emojis. Options `opt` can be used to set text-specific `ttsLang`, `ttsVoice`, `ttsRate`, `ttsPitch`, `ttsVolume`, `avatarMood`, `avatarMute`. Optional callback function `onsubtitles` is called whenever a new subtitle is to be written with the parameter of the added string. The optional `excludes` is an array of [start,end] indices to be excluded from audio but to be included in the subtitles.
+`speakText(text, [opt={}], [onsubtitles=null], [excludes=[]])` | Add the `text` string to the speech queue. The text can contain face emojis. Options `opt` can be used to set text-specific `lipsyncLang`, `ttsLang`, `ttsVoice`, `ttsRate`, `ttsPitch`, `ttsVolume`, `avatarMood`, `avatarMute`. Optional callback function `onsubtitles` is called whenever a new subtitle is to be written with the parameter of the added string. The optional `excludes` is an array of [start,end] indices to be excluded from audio but to be included in the subtitles.
 `speakAudio(audio, [onsubtitles=null])` | Add the `audio` object to the speech queue. This method was added to support external TTS services such as ElevenLabs WebSocket API. The audio object contains ArrayBuffer chunks in `audio` array, characters in `chars` array, starting times for each character in milliseconds in `ts` array, and durations for each character in milliseconds in `ds` array. As of now, the only supported format is PCM signed 16bit little endian.
 `speakMarker(onmarker)` | Add a marker to the speech queue. The callback function `onmarker` is called when the queue processes the event.
 `lookAt(x,y,t)` | Make the avatar's head turn to look at the screen position (`x`,`y`) for `t` milliseconds.
@@ -252,6 +253,15 @@ the Web Speech API events for synchronization, but the results were not good.
 Note that the ElevenLabs WebSocket API returns the word-to-audio
 alignment information, which is great for this purpose.
 
+**Any future plans for the project?**
+
+This is a small side-project for me, so I don't have any big plans for it.
+That said, there are some companies that are currently developing
+text-to-avatar and text-to-animation features. If and when they get released
+as APIs, I will probably take a look at them and see if they can be integrated
+into the project in some way.
+
+
 ---
 
 ### See also
@@ -301,7 +311,7 @@ RewriteEngine On
 RewriteMap jwtverify "prg:/etc/httpd/jwtverify" apache:apache
 ```
 
-4. Make a forward proxy for each service in which you add the required API key and protect the proxy with the JWT token verifier. Below is an example config for OpenAI API proxy using Apache 2.4 web server. Google TTS proxy would follow the same pattern passing the request to `https://eu-texttospeech.googleapis.com/v1/text:synthesize` (in EU).
+4. Make a proxy for each service in which you add the required API key and protect the proxy with the JWT token verifier. Below is an example config for OpenAI API proxy using Apache 2.4 web server. Google TTS proxy would follow the same pattern passing the request to `https://eu-texttospeech.googleapis.com/v1/text:synthesize` (in EU).
 
 ```apacheconf
 # OpenAI API

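Note on the `index.html` change in this commit: the ElevenLabs handler no longer passes per-character alignment (`chars`, `charStartTimesMs`, `charDurationsMs`) straight through, but groups it into `words`, `times` and `durations` before calling `speakAudio`. The sketch below pulls that loop out into a standalone function so its behavior is easier to check. The name `charsToWords` and the sample data are hypothetical and not part of the commit; the logic is meant to mirror the committed loop, including the detail that the duration of a separating space is counted toward the word before it.

```javascript
// Hypothetical helper, not part of the commit: groups per-character
// alignment data into words, mirroring the loop added to index.html.
function charsToWords(alignment) {
  const out = { words: [], times: [], durations: [] };
  let word = '';
  let time = 0;
  let duration = 0;
  for (let i = 0; i < alignment.chars.length; i++) {
    // The start time of a word is the start time of its first character
    if (word.length === 0) time = alignment.charStartTimesMs[i];
    duration += alignment.charDurationsMs[i];
    if (alignment.chars[i] === ' ') {
      // A space closes the current word; its duration is already included
      out.words.push(word);
      out.times.push(time);
      out.durations.push(duration);
      word = '';
      duration = 0;
    } else {
      word += alignment.chars[i];
    }
  }
  // Flush the last word if the input did not end with a space
  if (word.length) {
    out.words.push(word);
    out.times.push(time);
    out.durations.push(duration);
  }
  return out;
}

// Made-up sample: "Hi you", each character lasting 100 ms
const sample = {
  chars: ['H', 'i', ' ', 'y', 'o', 'u'],
  charStartTimesMs: [0, 100, 200, 300, 400, 500],
  charDurationsMs: [100, 100, 100, 100, 100, 100]
};
console.log(charsToWords(sample));
```

As in the committed loop, consecutive spaces would produce empty-string entries in `words`; a caller that cares could filter those out before passing the result on.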
diff --git a/index.html b/index.html
@@ -441,15 +441,15 @@
 }
 
 // i18n
-// NOTE: Default UI language is Finnish
+// Default UI language is English
 
 function i18nWord(w,l) {
-  l = l || cfg('theme-lang') || 'fi';
+  l = l || cfg('theme-lang') || 'en';
   return (( i18n[l] && i18n[l][w] ) ? i18n[l][w] : w);
 }
 
 function i18nTranslate(l) {
-  l = l || cfg('theme-lang') || 'fi';
+  l = l || cfg('theme-lang') || 'en';
 
   // Text
   d3.selectAll("[data-i18n-text]").nodes().forEach( n => {
@@ -572,19 +572,37 @@
 
       // Speak audio
       if ( (r.isFinal || r.normalizedAlignment) && elevenOutputMsg ) {
-        head.speakAudio( elevenOutputMsg, node ? addText.bind(null,node) : null );
+        head.speakAudio( elevenOutputMsg, { lipsyncLang: cfg('voice-lipsync-lang') }, node ? addText.bind(null,node) : null );
         elevenOutputMsg = null;
       }
 
       if ( !r.isFinal ) {
         // New part
         if ( r.normalizedAlignment ) {
-          elevenOutputMsg = {
-            audio: [],
-            chars: r.normalizedAlignment.chars,
-            ts: r.normalizedAlignment.charStartTimesMs,
-            ds: r.normalizedAlignment.charDurationsMs
-          };
+          elevenOutputMsg = { audio: [], words: [], times: [], durations: [] };
+
+          // Parse chars to words
+          let word = '';
+          let time = 0;
+          let duration = 0;
+          for( let i=0; i<r.normalizedAlignment.chars.length; i++ ) {
+            if ( word.length === 0 ) time = r.normalizedAlignment.charStartTimesMs[i];
+            duration += r.normalizedAlignment.charDurationsMs[i];
+            if ( r.normalizedAlignment.chars[i] === ' ' ) {
+              elevenOutputMsg.words.push(word);
+              elevenOutputMsg.times.push(time);
+              elevenOutputMsg.durations.push(duration);
+              word = '';
+              duration = 0;
+            } else {
+              word += r.normalizedAlignment.chars[i];
+            }
+          }
+          if ( word.length ) {
+            elevenOutputMsg.words.push(word);
+            elevenOutputMsg.times.push(time);
+            elevenOutputMsg.durations.push(duration);
+          }
         }
 
         // Add audio content to message
@@ -635,23 +653,23 @@
   session: 0,
   sessions : [
     {
-      name: "Nimetön 1",
-      theme: { lang: 'fi', brightness:"dark", ratio:"wide", layout:"port" },
+      name: "Nimetön",
+      theme: { lang: 'en', brightness:"dark", ratio:"wide", layout:"port" },
       view: { image: 'NONE' },
       avatar: {},
       camera: { frame: 'full' },
       ai: {},
-      voice: { background: "NONE", type: "google", google:{ id: "fi-FI-Standard-A"} }
+      voice: { background: "NONE", type: "google", google:{ id: "en-GB-Standard-A"}, lipsync:{ lang: 'en' } }
     },
     {
       name: "Nimetön 2",
 
-      theme: { lang: 'fi', brightness: "dark", ratio: "wide", layout: "land" },
+      theme: { lang: 'en', brightness: "dark", ratio: "wide", layout: "land" },
       view: { image: 'NONE' },
       avatar: {},
       camera: { frame: 'upper' },
       ai: {},
-      voice: { background: "NONE", type: "google", google:{ id: "fi-FI-Standard-A"} }
+      voice: { background: "NONE", type: "google", google:{ id: "en-GB-Standard-A"}, lipsync:{ lang: 'en' } }
     },
   ]
 };
@@ -1031,6 +1049,7 @@
                   await elevenSpeak( tts.trimStart() + " ", node );
                 } else {
                   await head.speakText( tts.trimStart() + " ", {
+                    lipsyncLang: cfg('voice-lipsync-lang'),
                     ttsVoice: cfg('voice-google-id'),
                     ttsRate: cfg('voice-google-rate'),
                     ttsPitch: cfg('voice-google-pitch')
@@ -1063,6 +1082,7 @@
                     await elevenSpeak( s + " ", node );
                   } else {
                     await head.speakText( s + " ", {
+                      lipsyncLang: cfg('voice-lipsync-lang'),
                       ttsVoice: cfg('voice-google-id'),
                       ttsRate: cfg('voice-google-rate'),
                       ttsPitch: cfg('voice-google-pitch')
@@ -1275,6 +1295,7 @@
               await elevenSpeak( s + " ", node );
             } else {
               await head.speakText( s + " ", {
+                lipsyncLang: cfg('voice-lipsync-lang'),
                 ttsVoice: cfg('voice-google-id'),
                 ttsRate: cfg('voice-google-rate'),
                 ttsPitch: cfg('voice-google-pitch')
@@ -1367,6 +1388,7 @@
             elevenSpeak( "" );
           } else {
             head.speakText( text, {
+              lipsyncLang: cfg('voice-lipsync-lang'),
               ttsVoice: cfg('voice-google-id'),
               ttsRate: cfg('voice-google-rate'),
               ttsPitch: cfg('voice-google-pitch')
@@ -1950,6 +1972,7 @@
         elevenSpeak( "" );
       } else {
         head.speakText( text, {
+          lipsyncLang: cfg('voice-lipsync-lang'),
           ttsVoice: cfg('voice-google-id'),
           ttsRate: cfg('voice-google-rate'),
           ttsPitch: cfg('voice-google-pitch')
@@ -2394,6 +2417,12 @@
     e.classed('selected',true);
   });
 
+  d3.selectAll("[data-voice-lipsync-lang]").on('click.command', function(ev) {
+    const e = d3.select(this);
+    d3.selectAll("[data-voice-lipsync-lang]").classed('selected',false);
+    e.classed('selected',true);
+  });
+
   d3.select("[data-voice-mixerbg]").on('input.command change.command keyup.command', function(ev) {
     let gain = parseFloat( d3.select("[data-voice-mixerbg]").property("value") );
     head.setMixerGain( null, gain );
@@ -2912,28 +2941,6 @@
             </div>
           </div>
 
-          <div class="vbar"></div>
-
-          <div class="row">
-            <div class="text label" data-i18n-text="Theme"></div>
-            <div class="column">
-              <div id="languages" class="rowWrap">
-              </div>
-              <div class="rowWrap">
-                <div class="command selected" data-i18n-text="Dark" data-item="theme-brightness" data-type="option" data-theme-brightness="dark"></div>
-                <div class="command" data-i18n-text="Light" data-item="theme-brightness" data-type="option" data-theme-brightness="light"></div>
-              </div>
-              <div class="rowWrap">
-                <div class="command selected" data-i18n-text="theme-wide" data-item="theme-ratio" data-type="option" data-theme-ratio="wide"></div>
-                <div class="command" data-i18n-text="theme-43" data-item="theme-ratio" data-type="option" data-theme-ratio="normal"></div>
-              </div>
-              <div class="rowWrap">
-                <div class="command selected" data-i18n-text="theme-landscape" data-item="theme-layout" data-type="option" data-theme-layout="land"></div>
-                <div class="command" data-i18n-text="theme-portrait" data-item="theme-layout" data-type="option" data-theme-layout="port"></div>
-              </div>
-            </div>
-          </div>
-
         </div>
 
         <div id="right-avatar" class="page noselect nodrag hidden">
@@ -3245,6 +3252,15 @@
               </div>
             </div>
           </div>
+          <div class="row">
+            <div class="text label" data-i18n-text="Lip-sync"></div>
+            <div class="column">
+              <div class="rowWrap">
+                <div class="command selected" data-i18n-text="en" data-item="voice-lipsync-lang" data-type="option" data-voice-lipsync-lang="en"></div>
+                <div class="command" data-i18n-text="fi" data-item="voice-lipsync-lang" data-type="option" data-voice-lipsync-lang="fi"></div>
+              </div>
+            </div>
+          </div>
           <div class="row">
             <div class="text label" data-i18n-text="voice-test"></div>
             <div id="playtest" class="command">
@@ -3360,6 +3376,26 @@
             </div>
           </div>
 
+          <div class="row">
+            <div class="text label" data-i18n-text="Theme"></div>
+            <div class="column">
+              <div id="languages" class="rowWrap">
+              </div>
+              <div class="rowWrap">
+                <div class="command selected" data-i18n-text="Dark" data-item="theme-brightness" data-type="option" data-theme-brightness="dark"></div>
+                <div class="command" data-i18n-text="Light" data-item="theme-brightness" data-type="option" data-theme-brightness="light"></div>
+              </div>
+              <div class="rowWrap">
+                <div class="command selected" data-i18n-text="theme-wide" data-item="theme-ratio" data-type="option" data-theme-ratio="wide"></div>
+                <div class="command" data-i18n-text="theme-43" data-item="theme-ratio" data-type="option" data-theme-ratio="normal"></div>
+              </div>
+              <div class="rowWrap">
+                <div class="command selected" data-i18n-text="theme-landscape" data-item="theme-layout" data-type="option" data-theme-layout="land"></div>
+                <div class="command" data-i18n-text="theme-portrait" data-item="theme-layout" data-type="option" data-theme-layout="port"></div>
+              </div>
+            </div>
+          </div>
+
           <div class="vbar"></div>
 
           <div class="row">