This is the development of a Myanmar Text-to-Speech system based on the well-known end-to-end speech synthesis model, Tacotron. It is part of my B.E. thesis at Yangon Technological University, supervised by Dr. Yuzana Win, who guided me throughout this development.
This work is licensed under the Creative Commons Attribution-NonCommercial-Share Alike 4.0 International (CC BY-NC-SA 4.0) License. View detailed info of the license.
This myanmar-tts project may be used freely for educational purposes, but it is not licensed for commercial use cases.
Base Technology, Expa.Ai (Myanmar) kindly provided a Myanmar text corpus and their excellent tool for creating a speech corpus.
The speech corpus (which I call mmSpeech) was created entirely on my own with the recorder tool mentioned above, and it currently contains over 5,000 recorded <text, audio> pairs. I intend to upload the corpus to a public channel in the future.
- Install Python 3
- Install TensorFlow
- Install the required modules

```
pip install -r requirements.txt
```
- First of all, the corpus should reside in `~/mm-tts`, although this is not a must and can easily be changed with a command-line argument. The expected layout is:

```
mm-tts
└── mmSpeech
    ├── metadata.csv
    └── wavs
```
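The `metadata.csv` above presumably maps each wav file to its transcription. Here is a minimal sketch of parsing such a file, assuming an LJSpeech-style `<wav id>|<transcription>` line format (the delimiter and column order are assumptions; check the actual corpus file):

```python
# Parse one line of an assumed LJSpeech-style metadata.csv, where each
# line is "<wav id>|<transcription>". The "|" delimiter and the column
# order are assumptions, not confirmed by this repository.
def parse_metadata_line(line):
    wav_id, text = line.rstrip("\n").split("|", 1)
    return wav_id, text

# Illustrative, made-up entry:
wav_id, text = parse_metadata_line("mm_0001|မင်္ဂလာပါ\n")
print(wav_id)
```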
- Preprocess the data

```
python3 preprocess.py
```
After it is done, you should see the outputs in `~/mm-tts/training/`.
- Train the model

```
python3 train.py
```
If you want to restore training from a checkpoint:

```
python3 train.py --restore_step Number
```
There are some sentences defined in test.py; you may synthesize them with the trained model to see how good it currently is.

```
python3 test.py --checkpoint /path/to/checkpoint
```
There is a simple app implemented to try out the trained models.

```
python3 app.py --checkpoint /path/to/checkpoint
```
This will start a simple web app listening on port 4000 unless you specify otherwise. Open your browser and go to http://localhost:4000; you should see a simple interface with a text input for the user's text.
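If you would rather script requests against the demo app than use the browser, something like the following could work. Note that the `/synthesize` endpoint path and the `text` query parameter are assumptions for illustration; the real routes live in app.py:

```python
from urllib.parse import urlencode

# Build a request URL for the demo app. The "/synthesize" path and the
# "text" parameter name are assumptions; check app.py for the real routes.
def synthesis_url(text, host="localhost", port=4000):
    return f"http://{host}:{port}/synthesize?{urlencode({'text': text})}"

print(synthesis_url("hello"))
```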
Generated samples are available on SoundCloud.
- Google Colab which gives excellent GPU access was used for training this model.
- On average, each step took about 1.6 seconds; at peak, steps took about 1.2 and sometimes 1.1 seconds.
- For my thesis, I have trained this model for 150,000 steps (took me about a week).
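As a back-of-the-envelope check on the numbers above: at roughly 1.6 seconds per step, 150,000 steps comes to about 67 hours of pure compute, so "about a week" of wall-clock time is plausible once Colab session limits and restarts are factored in:

```python
# Rough training-time estimate from the per-step timings above.
steps = 150_000
sec_per_step = 1.6  # average step time reported above
hours = steps * sec_per_step / 3600
print(f"{hours:.1f} hours of pure compute")  # ~66.7 hours
```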
Below are the loss curves produced by training mmSpeech for 150,000 steps.