FastSpeech2 + DaftExprt-style Reference Encoder, with conditioning on the Encoder, Variance Adapter, and Decoder.
The shape mismatch between the mask and the self-attention tensor is resolved; however, the results are still not good. The cause is an unknown bug, even though the training logs show promising results.
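For reference, one common way a padding mask is made shape-compatible with self-attention scores is sketched below. This is only an illustration, not the code in this repository: NumPy stands in for the real tensors, and the function name `masked_attention_weights`, the `[batch, heads, T, T]` score layout, and the `lengths` argument are all assumptions.

```python
import numpy as np

def masked_attention_weights(scores, lengths):
    """Hypothetical helper: softmax over keys with padded keys masked out.

    scores  : array of shape [batch, heads, T, T] (assumed layout)
    lengths : number of valid (non-padded) positions per batch item, each >= 1
    """
    batch, heads, t_q, t_k = scores.shape
    # Padding mask of shape [batch, T]: True where the position is padding.
    pad = np.arange(t_k)[None, :] >= np.asarray(lengths)[:, None]
    # Add singleton head and query axes so the mask broadcasts
    # against [batch, heads, T, T] instead of raising a shape error.
    key_mask = pad[:, None, None, :]
    assert np.broadcast_shapes(key_mask.shape, scores.shape) == scores.shape
    # Large negative fill so padded keys get (numerically) zero weight.
    masked = np.where(key_mask, -1e9, scores)
    # Numerically stable softmax over the key axis.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Example: batch of 2, 4 heads, 5 frames; the second item has 2 padded frames.
weights = masked_attention_weights(np.zeros((2, 4, 5, 5)), lengths=[5, 3])
```

A check like the `assert` on the broadcast shape can catch a mask/score mismatch early, but it does not by itself show that the mask has the right polarity or is applied along the right axis, either of which could still degrade output quality.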
