TTS without Text?

As I understand it, this TTS algorithm works with your own audio files, without any text (transcripts) assigned to them.

1. How does it determine the content and the language of the audio?
2. Does it work only with the LJ Speech dataset, or with any dataset that follows the LJ Speech structure?