add structure and references
KdaiP committed Apr 2, 2024
1 parent c3ab841 commit 327379e
Showing 3 changed files with 18 additions and 1 deletion.
18 changes: 17 additions & 1 deletion README.md
@@ -46,6 +46,20 @@ Feel free to explore and modify settings in `config.py` to modify the hyperparam
| StableTTS | text to mel | Model is currently in training...|
| Vocos | mel to wav | [🤗](https://huggingface.co/KdaiP/StableTTS/blob/main/vocos.pt)|

## Model structure

<div align="center">

<p style="text-align: center;">
<img src="./figures/structure.jpg" height="512"/>
</p>

</div>

- We use the Diffusion Convolution Transformer block from [Hierspeech++](https://github.com/sh-lee-prml/HierSpeechpp), which combines the original [DiT](https://github.com/sh-lee-prml/HierSpeechpp) with the [FFT](https://arxiv.org/pdf/1905.09263.pdf) (Feed-Forward Transformer from FastSpeech) for better prosody.

- In the flow-matching decoder, we add a [FiLM layer](https://arxiv.org/abs/1709.07871) before the DiT block to condition the model on the timestep embedding.
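
As a rough illustration of the FiLM conditioning mentioned above, the following framework-free sketch scales and shifts each feature channel using values predicted from a timestep embedding. It is not the code used in this repository: the names `timestep_embedding`, `linear`, and `FiLM`, the sinusoidal embedding, and the initialization are all assumptions made for the example.

```python
import math
import random


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep t (assumes dim is even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def linear(vec, weight, bias):
    """Dense layer; weight is a list of rows with shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


class FiLM:
    """Feature-wise linear modulation: y = gamma(cond) * x + beta(cond)."""

    def __init__(self, cond_dim, channels, seed=0):
        rng = random.Random(seed)

        def small_matrix():
            return [[rng.gauss(0.0, 0.02) for _ in range(cond_dim)]
                    for _ in range(channels)]

        self.w_gamma = small_matrix()
        self.w_beta = small_matrix()
        # Hypothetical init choice: bias gamma to 1 and beta to 0 so the
        # layer starts close to the identity mapping.
        self.b_gamma = [1.0] * channels
        self.b_beta = [0.0] * channels

    def __call__(self, x, cond):
        """x: list of frames, each a list of `channels` floats; cond: vector."""
        gamma = linear(cond, self.w_gamma, self.b_gamma)
        beta = linear(cond, self.w_beta, self.b_beta)
        # The same per-channel scale and shift is applied to every frame.
        return [[g * v + b for g, v, b in zip(gamma, frame, beta)]
                for frame in x]


# Usage: modulate 3 frames of 4 channels with the embedding of t = 0.3.
film = FiLM(cond_dim=8, channels=4)
frames = [[0.5, -1.0, 2.0, 0.0] for _ in range(3)]
modulated = film(frames, timestep_embedding(0.3, 8))
```

In this sketch the modulated features would then be passed to the DiT block, so every block input already carries the timestep information.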

## References

The development of our models heavily relies on insights and code from various projects. We express our heartfelt thanks to the creators of the following:
@@ -58,7 +72,7 @@ The development of our models heavily relies on insights and code from various p

[Stable Diffusion 3](https://stability.ai/news/stable-diffusion-3): Idea of combining flow-matching and DiT.

[Vits](https://github.com/jaywalnut310/vits): Code style and MAS insights.
[Vits](https://github.com/jaywalnut310/vits): Code style and MAS insights, DistributedBucketSampler.

### Additional References:

@@ -70,6 +84,8 @@ The development of our models heavily relies on insights and code from various p

[gpt-sovits](https://github.com/RVC-Boss/GPT-SoVITS): mel-style encoder for voice cloning

[diffsinger](https://github.com/openvpi/DiffSinger): Chinese three-section phoneme scheme for Chinese G2P

## TODO

- [ ] Release pretrained models.
1 change: 1 addition & 0 deletions datas/sampler.py
@@ -1,5 +1,6 @@
import torch

# reference: https://github.com/jaywalnut310/vits/blob/main/data_utils.py
class DistributedBucketSampler(torch.utils.data.distributed.DistributedSampler):
"""
Maintain similar input lengths in a batch.
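
The idea stated in the docstring, keeping similar input lengths in a batch, can be sketched in a single process as below. This is a simplified illustration and not the sampler in `datas/sampler.py`: the function name `bucket_batches`, its parameters, and the choice to drop samples outside all boundaries are assumptions, and the sharding across distributed ranks that a `DistributedSampler` subclass performs is left out.

```python
import random


def bucket_batches(lengths, boundaries, batch_size, seed=0):
    """Group sample indices into batches drawn from one length bucket each.

    A sample with length L goes to bucket b when
    boundaries[b] < L <= boundaries[b + 1]. Samples that fit no bucket are
    dropped (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    buckets = [[] for _ in range(len(boundaries) - 1)]
    for idx, length in enumerate(lengths):
        for b in range(len(boundaries) - 1):
            if boundaries[b] < length <= boundaries[b + 1]:
                buckets[b].append(idx)
                break

    batches = []
    for bucket in buckets:
        rng.shuffle(bucket)  # shuffle inside the bucket
        for start in range(0, len(bucket), batch_size):
            batches.append(bucket[start:start + batch_size])
    rng.shuffle(batches)  # shuffle the order of batches across buckets
    return batches


# Usage: short and long utterances never share a batch.
lengths = [10, 12, 55, 60, 11, 58, 300]
batches = bucket_batches(lengths, boundaries=[0, 32, 64], batch_size=2)
```

Because every batch comes from a single bucket, padding within a batch is bounded by the bucket width rather than by the longest sample in the dataset.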
Binary file added figures/structure.jpg
