Added support for lip-sync in English
met4citizen committed Jan 9, 2024
1 parent 1736fe0 commit 0bb704a
Showing 5 changed files with 661 additions and 150 deletions.
18 changes: 14 additions & 4 deletions README.md
@@ -58,7 +58,7 @@ the JSON Web Token needed to use that proxy (See Appendix B).
// Create the talking head avatar
const nodeAvatar = document.getElementById('avatar');
head = new TalkingHead( nodeAvatar, {
-ttsEndpoint: "./gtts/",
+ttsEndpoint: "/gtts/",
jwtGet: jwtGet
});
```
@@ -77,6 +77,7 @@ Option | Description
`ttsVolume` | Google text-to-speech volume gain (in dB) in the range [-96.0, 16.0]. Default is `0`.
`ttsTrimStart` | Trim the viseme sequence start relative to the beginning of the audio (shift in milliseconds). Default is `0`.
`ttsTrimEnd`| Trim the viseme sequence end relative to the end of the audio (shift in milliseconds). Default is `300`.
+`lipsyncLang`| Lip-sync language. Currently 'en' and 'fi' are supported. Default is `fi`.
`pcmSampleRate` | PCM (signed 16bit little endian) sample rate used in `speakAudio` in Hz. Default is `22050`.
`modelPixelRatio` | Sets the device's pixel ratio. Default is `1`.
`modelFPS` | Frames per second. Default is `30`.
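
The `pcmSampleRate` option above refers to raw PCM audio (signed 16-bit, little endian). As a rough illustration of what that format means, the sketch below decodes such a buffer into floating-point samples and estimates its duration. The helper names `pcm16leToFloat32` and `pcmDurationMs` are hypothetical and not part of the TalkingHead API, and this is not necessarily how the library handles audio internally.

```javascript
// Hypothetical helpers, for illustration only (not part of TalkingHead).

// Decode signed 16-bit little-endian PCM into floats in the range [-1, 1).
function pcm16leToFloat32(buffer) {
  const view = new DataView(buffer);
  const sampleCount = Math.floor(buffer.byteLength / 2); // 2 bytes per sample
  const samples = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // The second argument `true` selects little-endian byte order
    samples[i] = view.getInt16(i * 2, true) / 32768;
  }
  return samples;
}

// Estimate the duration of a mono PCM buffer in milliseconds.
// The default of 22050 Hz matches the documented `pcmSampleRate` default.
function pcmDurationMs(buffer, sampleRate = 22050) {
  const sampleCount = Math.floor(buffer.byteLength / 2);
  return (sampleCount / sampleRate) * 1000;
}
```

With the default sample rate, a mono buffer of 44100 bytes holds 22050 samples, which is one second of audio.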
@@ -117,9 +118,9 @@ The following table lists some of the key methods. See the source code for the r

Method | Description
--- | ---
-`showAvatar(avatar, [onprogress=null])` | Load and show the specified avatar. The `avatar` object must include the `url` for GLB file. Optional properties are `body` for either male `M` or female `F` body form, `ttsLang`, `ttsVoice`, `ttsRate`, `ttsPitch`, `ttsVolume`, `avatarMood` and `avatarMute`.
+`showAvatar(avatar, [onprogress=null])` | Load and show the specified avatar. The `avatar` object must include the `url` for GLB file. Optional properties are `body` for either male `M` or female `F` body form, `visemesLang`, `ttsLang`, `ttsVoice`, `ttsRate`, `ttsPitch`, `ttsVolume`, `avatarMood` and `avatarMute`.
`setView(view, [opt])` | Set view. Supported views are `"full"`, `"upper"` and `"head"`. Options `opt` can be used to set `cameraDistance`, `cameraX`, `cameraY`, `cameraRotateX`, `cameraRotateY`.
-`speakText(text, [opt={}], [onsubtitles=null], [excludes=[]])` | Add the `text` string to the speech queue. The text can contain face emojis. Options `opt` can be used to set text-specific `ttsLang`, `ttsVoice`, `ttsRate`, `ttsPitch`, `ttsVolume`, `avatarMood`, `avatarMute`. Optional callback function `onsubtitles` is called whenever a new subtitle is to be written with the parameter of the added string. The optional `excludes` is an array of [start,end] indices to be excluded from audio but to be included in the subtitles.
+`speakText(text, [opt={}], [onsubtitles=null], [excludes=[]])` | Add the `text` string to the speech queue. The text can contain face emojis. Options `opt` can be used to set text-specific `lipsyncLang`, `ttsLang`, `ttsVoice`, `ttsRate`, `ttsPitch`, `ttsVolume`, `avatarMood`, `avatarMute`. Optional callback function `onsubtitles` is called whenever a new subtitle is to be written with the parameter of the added string. The optional `excludes` is an array of [start,end] indices to be excluded from audio but to be included in the subtitles.
`speakAudio(audio, [onsubtitles=null])` | Add the `audio` object to the speech queue. This method was added to support external TTS services such as ElevenLabs WebSocket API. The audio object contains ArrayBuffer chunks in `audio` array, characters in `chars` array, starting times for each character in milliseconds in `ts` array, and durations for each character in milliseconds in `ds` array. As of now, the only supported format is PCM signed 16bit little endian.
`speakMarker(onmarker)` | Add a marker to the speech queue. The callback function `onmarker` is called when the queue processes the event.
`lookAt(x,y,t)` | Make the avatar's head turn to look at the screen position (`x`,`y`) for `t` milliseconds.
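
For external TTS services that return character-level alignment (as the ElevenLabs WebSocket API does), the characters may need to be grouped into words before the result is passed to `speakAudio`. The sketch below mirrors the grouping loop that this commit adds to `index.html`, pulled out into a standalone function. The name `charsToWords` is hypothetical, and unlike the loop in the commit it skips empty words produced by leading or repeated spaces.

```javascript
// Hypothetical helper, for illustration only (not part of TalkingHead).
// Groups character-level alignment into word-level timing.
// Input fields follow the `normalizedAlignment` object used in index.html.
function charsToWords(alignment) {
  const { chars, charStartTimesMs, charDurationsMs } = alignment;
  const out = { words: [], times: [], durations: [] };
  let word = '';
  let time = 0;
  let duration = 0;

  // Store the current word (if any) and reset the accumulators
  const flush = () => {
    if (word.length) {
      out.words.push(word);
      out.times.push(time);
      out.durations.push(duration);
    }
    word = '';
    duration = 0;
  };

  for (let i = 0; i < chars.length; i++) {
    // A new word starts at the start time of its first character
    if (word.length === 0) time = charStartTimesMs[i];
    // As in the commit, the trailing space counts toward the word's duration
    duration += charDurationsMs[i];
    if (chars[i] === ' ') {
      flush();
    } else {
      word += chars[i];
    }
  }
  flush(); // Last word, if the text does not end with a space
  return out;
}
```

The result can be merged with the audio chunks, for example `{ audio: [], ...charsToWords(r.normalizedAlignment) }`, which gives the `words`/`times`/`durations` shape built in the `index.html` part of this commit.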
@@ -252,6 +253,15 @@ the Web Speech API events for syncronization, but the results were not good.
Note that the ElevenLabs WebSocket API returns the word-to-audio
alignment information, which is great for this purpose.

+**Any future plans for the project?**
+
+This is a small side-project for me, so I don't have any big plans for it.
+That said, there are some companies that are currently developing
+text-to-avatar and text-to-animation features. If and when they get released
+as APIs, I will probably take a look at them and see if they can be integrated
+in some way to the project.
+
+
---

### See also
@@ -301,7 +311,7 @@ RewriteEngine On
RewriteMap jwtverify "prg:/etc/httpd/jwtverify" apache:apache
```

-4. Make a forward proxy for each service in which you add the required API key and protect the proxy with the JWT token verifier. Below is an example config for OpenAI API proxy using Apache 2.4 web server. Google TTS proxy would follow the same pattern passing the request to `https://eu-texttospeech.googleapis.com/v1/text:synthesize` (in EU).
+4. Make a proxy for each service in which you add the required API key and protect the proxy with the JWT token verifier. Below is an example config for OpenAI API proxy using Apache 2.4 web server. Google TTS proxy would follow the same pattern passing the request to `https://eu-texttospeech.googleapis.com/v1/text:synthesize` (in EU).

```apacheconf
# OpenAI API
110 changes: 73 additions & 37 deletions index.html
@@ -441,15 +441,15 @@
}

// i18n
-// NOTE: Default UI language is Finnish
+// Default UI language is English

function i18nWord(w,l) {
-l = l || cfg('theme-lang') || 'fi';
+l = l || cfg('theme-lang') || 'en';
return (( i18n[l] && i18n[l][w] ) ? i18n[l][w] : w);
}

function i18nTranslate(l) {
-l = l || cfg('theme-lang') || 'fi';
+l = l || cfg('theme-lang') || 'en';

// Text
d3.selectAll("[data-i18n-text]").nodes().forEach( n => {
@@ -572,19 +572,37 @@

// Speak audio
if ( (r.isFinal || r.normalizedAlignment) && elevenOutputMsg ) {
-head.speakAudio( elevenOutputMsg, node ? addText.bind(null,node) : null );
+head.speakAudio( elevenOutputMsg, { lipsyncLang: cfg('voice-lipsync-lang') }, node ? addText.bind(null,node) : null );
elevenOutputMsg = null;
}

if ( !r.isFinal ) {
// New part
if ( r.normalizedAlignment ) {
-elevenOutputMsg = {
-audio: [],
-chars: r.normalizedAlignment.chars,
-ts: r.normalizedAlignment.charStartTimesMs,
-ds: r.normalizedAlignment.charDurationsMs
-};
+elevenOutputMsg = { audio: [], words: [], times: [], durations: [] };
+
+// Parse chars to words
+let word = '';
+let time = 0;
+let duration = 0;
+for( let i=0; i<r.normalizedAlignment.chars.length; i++ ) {
+if ( word.length === 0 ) time = r.normalizedAlignment.charStartTimesMs[i];
+duration += r.normalizedAlignment.charDurationsMs[i];
+if ( r.normalizedAlignment.chars[i] === ' ' ) {
+elevenOutputMsg.words.push(word);
+elevenOutputMsg.times.push(time);
+elevenOutputMsg.durations.push(duration);
+word = '';
+duration = 0;
+} else {
+word += r.normalizedAlignment.chars[i];
+}
+}
+if ( word.length ) {
+elevenOutputMsg.words.push(word);
+elevenOutputMsg.times.push(time);
+elevenOutputMsg.durations.push(duration);
+}
}

// Add audio content to message
@@ -635,23 +653,23 @@
session: 0,
sessions : [
{
-name: "Nimetön 1",
-theme: { lang: 'fi', brightness:"dark", ratio:"wide", layout:"port" },
+name: "Nimetön",
+theme: { lang: 'en', brightness:"dark", ratio:"wide", layout:"port" },
view: { image: 'NONE' },
avatar: {},
camera: { frame: 'full' },
ai: {},
-voice: { background: "NONE", type: "google", google:{ id: "fi-FI-Standard-A"} }
+voice: { background: "NONE", type: "google", google:{ id: "en-GB-Standard-A"}, lipsync:{ lang: 'en' } }
},
{
name: "Nimetön 2",

-theme: { lang: 'fi', brightness: "dark", ratio: "wide", layout: "land" },
+theme: { lang: 'en', brightness: "dark", ratio: "wide", layout: "land" },
view: { image: 'NONE' },
avatar: {},
camera: { frame: 'upper' },
ai: {},
-voice: { background: "NONE", type: "google", google:{ id: "fi-FI-Standard-A"} }
+voice: { background: "NONE", type: "google", google:{ id: "en-GB-Standard-A"}, lipsync:{ lang: 'en' } }
},
]
};
@@ -1031,6 +1049,7 @@
await elevenSpeak( tts.trimStart() + " ", node );
} else {
await head.speakText( tts.trimStart() + " ", {
+lipsyncLang: cfg('voice-lipsync-lang'),
ttsVoice: cfg('voice-google-id'),
ttsRate: cfg('voice-google-rate'),
ttsPitch: cfg('voice-google-pitch')
@@ -1063,6 +1082,7 @@
await elevenSpeak( s + " ", node );
} else {
await head.speakText( s + " ", {
+lipsyncLang: cfg('voice-lipsync-lang'),
ttsVoice: cfg('voice-google-id'),
ttsRate: cfg('voice-google-rate'),
ttsPitch: cfg('voice-google-pitch')
@@ -1275,6 +1295,7 @@
await elevenSpeak( s + " ", node );
} else {
await head.speakText( s + " ", {
+lipsyncLang: cfg('voice-lipsync-lang'),
ttsVoice: cfg('voice-google-id'),
ttsRate: cfg('voice-google-rate'),
ttsPitch: cfg('voice-google-pitch')
@@ -1367,6 +1388,7 @@
elevenSpeak( "" );
} else {
head.speakText( text, {
+lipsyncLang: cfg('voice-lipsync-lang'),
ttsVoice: cfg('voice-google-id'),
ttsRate: cfg('voice-google-rate'),
ttsPitch: cfg('voice-google-pitch')
@@ -1950,6 +1972,7 @@
elevenSpeak( "" );
} else {
head.speakText( text, {
+lipsyncLang: cfg('voice-lipsync-lang'),
ttsVoice: cfg('voice-google-id'),
ttsRate: cfg('voice-google-rate'),
ttsPitch: cfg('voice-google-pitch')
@@ -2394,6 +2417,12 @@
e.classed('selected',true);
});

+d3.selectAll("[data-voice-lipsync-lang]").on('click.command', function(ev) {
+const e = d3.select(this);
+d3.selectAll("[data-voice-lipsync-lang]").classed('selected',false);
+e.classed('selected',true);
+});
+
d3.select("[data-voice-mixerbg]").on('input.command change.command keyup.command', function(ev) {
let gain = parseFloat( d3.select("[data-voice-mixerbg]").property("value") );
head.setMixerGain( null, gain );
@@ -2912,28 +2941,6 @@
</div>
</div>

-<div class="vbar"></div>
-
-<div class="row">
-<div class="text label" data-i18n-text="Theme"></div>
-<div class="column">
-<div id="languages" class="rowWrap">
-</div>
-<div class="rowWrap">
-<div class="command selected" data-i18n-text="Dark" data-item="theme-brightness" data-type="option" data-theme-brightness="dark"></div>
-<div class="command" data-i18n-text="Light" data-item="theme-brightness" data-type="option" data-theme-brightness="light"></div>
-</div>
-<div class="rowWrap">
-<div class="command selected" data-i18n-text="theme-wide" data-item="theme-ratio" data-type="option" data-theme-ratio="wide"></div>
-<div class="command" data-i18n-text="theme-43" data-item="theme-ratio" data-type="option" data-theme-ratio="normal"></div>
-</div>
-<div class="rowWrap">
-<div class="command selected" data-i18n-text="theme-landscape" data-item="theme-layout" data-type="option" data-theme-layout="land"></div>
-<div class="command" data-i18n-text="theme-portrait" data-item="theme-layout" data-type="option" data-theme-layout="port"></div>
-</div>
-</div>
-</div>
-
</div>

<div id="right-avatar" class="page noselect nodrag hidden">
@@ -3245,6 +3252,15 @@
</div>
</div>
</div>
+<div class="row">
+<div class="text label" data-i18n-text="Lip-sync"></div>
+<div class="column">
+<div class="rowWrap">
+<div class="command selected" data-i18n-text="en" data-item="voice-lipsync-lang" data-type="option" data-voice-lipsync-lang="en"></div>
+<div class="command" data-i18n-text="fi" data-item="voice-lipsync-lang" data-type="option" data-voice-lipsync-lang="fi"></div>
+</div>
+</div>
+</div>
<div class="row">
<div class="text label" data-i18n-text="voice-test"></div>
<div id="playtest" class="command">
@@ -3360,6 +3376,26 @@
</div>
</div>

+<div class="row">
+<div class="text label" data-i18n-text="Theme"></div>
+<div class="column">
+<div id="languages" class="rowWrap">
+</div>
+<div class="rowWrap">
+<div class="command selected" data-i18n-text="Dark" data-item="theme-brightness" data-type="option" data-theme-brightness="dark"></div>
+<div class="command" data-i18n-text="Light" data-item="theme-brightness" data-type="option" data-theme-brightness="light"></div>
+</div>
+<div class="rowWrap">
+<div class="command selected" data-i18n-text="theme-wide" data-item="theme-ratio" data-type="option" data-theme-ratio="wide"></div>
+<div class="command" data-i18n-text="theme-43" data-item="theme-ratio" data-type="option" data-theme-ratio="normal"></div>
+</div>
+<div class="rowWrap">
+<div class="command selected" data-i18n-text="theme-landscape" data-item="theme-layout" data-type="option" data-theme-layout="land"></div>
+<div class="command" data-i18n-text="theme-portrait" data-item="theme-layout" data-type="option" data-theme-layout="port"></div>
+</div>
+</div>
+</div>
+
<div class="vbar"></div>

<div class="row">