With this tool you can create accurate text-audio alignments given audio files and their transcriptions. The alignments can, for example, be used to train text-to-speech models such as FastSpeech. Compared to other forced alignment tools, this repo has the following advantages:
- Multilingual: By design, the DFA is language-agnostic and can align either characters or phonemes.
- Robustness: The alignment extraction is highly tolerant of text errors and silent characters.
- Convenience: Easy installation with no extra dependencies. You can provide your own data in the standard LJSpeech format without special preprocessing (such as applying phonetic dictionaries, non-speech annotations, etc.).
The approach is based on training a simple speech recognition model with CTC loss on mel spectrograms extracted from the wav files.
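For intuition, here is a minimal sketch of that idea in PyTorch. It is illustrative only, not the repo's actual model or training code: a small recurrent network predicts per-frame symbol probabilities from mel frames and is trained with CTC loss against the transcript. The class name `SimpleAligner` and all dimensions are assumptions.

```python
import torch
import torch.nn as nn


class SimpleAligner(nn.Module):
    """Toy CTC-based aligner: predicts per-frame symbol probabilities from mels."""

    def __init__(self, n_mels: int = 80, n_symbols: int = 40, hidden: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_symbols + 1)  # +1 for the CTC blank

    def forward(self, mels):                  # mels: (batch, time, n_mels)
        out, _ = self.rnn(mels)
        return self.proj(out)                 # (batch, time, n_symbols + 1)


model = SimpleAligner()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

mels = torch.randn(2, 500, 80)                # dummy batch of mel spectrograms
targets = torch.randint(1, 41, (2, 120))      # dummy symbol indices (0 = blank)
mel_lens = torch.tensor([500, 430])
text_lens = torch.tensor([120, 95])

log_probs = model(mels).log_softmax(-1).transpose(0, 1)  # (time, batch, classes)
loss = ctc(log_probs, targets, mel_lens, text_lens)
loss.backward()
```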
Runs on Python >= 3.6:
pip install -r requirements.txt
Check out the demo notebook for training and character duration extraction on the LJSpeech dataset, or follow the steps below:
(1) Download the LJSpeech dataset, set paths in config.yaml:
dataset_dir: LJSpeech
metadata_path: LJSpeech/metadata.csv
(2) Preprocess the data and train aligner:
python preprocess.py
python train.py
(3) Extract durations with the latest model checkpoint (60k steps should be sufficient):
python extract_durations.py
By default, durations are written as numpy files to:
output/durations
Each character duration corresponds to one mel time step, which translates to hop_length / sample_rate seconds in the wav file.
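For illustration, a small sketch of this conversion, assuming LJSpeech-style values for hop_length and sample_rate (take the actual values from config.yaml; the file name is only an example):

```python
import numpy as np

# Convert per-character durations (counted in mel frames) into seconds.
# hop_length and sample_rate are assumed values; use the ones from config.yaml.
hop_length = 256
sample_rate = 22050

durations = np.load('output/durations/LJ001-0001.npy')  # one frame count per character
char_seconds = durations * hop_length / sample_rate

print(f'total utterance duration: {char_seconds.sum():.2f} s')
```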
You can monitor the training with
tensorboard --logdir dfa_checkpoints
Just bring your dataset to the LJSpeech format. We recommend cleaning and preprocessing the text in the metafile.csv before running the DFA, e.g. lower-casing, phonemization, etc.
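A sketch of such a cleanup step (the paths and the id|text column layout are assumptions, not part of the repo):

```python
# Illustrative cleanup of an LJSpeech-style metafile: lower-case the text column
# before running the DFA. Paths and column layout are assumptions.
with open('LJSpeech/metadata.csv', encoding='utf-8') as f:
    rows = [line.rstrip('\n').split('|') for line in f if line.strip()]

with open('LJSpeech/metadata_cleaned.csv', 'w', encoding='utf-8') as f:
    for row in rows:
        file_id, text = row[0], row[-1]
        f.write(f'{file_id}|{text.lower()}\n')
```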
You can provide your own mel spectrograms by setting the following in config.yaml:
precomputed_mels: /path/to/mels
Make sure that the mel names match the ids in the metafile, e.g.
00001.mel ---> 00001|First sample text
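A hedged sanity check along these lines (the paths and the .mel extension are assumptions taken from the example above):

```python
from pathlib import Path

# Illustrative check: verify that every id in the metafile has a matching
# precomputed mel file. Paths and the .mel extension are assumptions.
mel_dir = Path('/path/to/mels')
with open('LJSpeech/metadata.csv', encoding='utf-8') as f:
    ids = [line.split('|')[0] for line in f if line.strip()]

missing = [i for i in ids if not (mel_dir / f'{i}.mel').exists()]
print(f'{len(missing)} metafile ids without a matching mel file')
```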