As I understand it, this TTS algorithm works with your audio files without any assigned text. 1. How does it determine the content and language of the audio? 2. Does it work only with the LJ Speech dataset, or with any dataset in the LJ Speech structure?