Official repository for Rhythm Modeling for Voice Conversion. Audio samples can be found here. Colab demo can be found here.
Abstract: Voice conversion aims to transform source speech into a different target voice. However, typical voice conversion systems do not account for rhythm, which is an important factor in the perception of speaker identity. To bridge this gap, we introduce Urhythmic - an unsupervised method for rhythm conversion that does not require parallel data or text transcriptions. Using self-supervised representations, we first divide source audio into segments approximating sonorants, obstruents, and silences. Then we model rhythm by estimating speaking rate or the duration distribution of each segment type. Finally, we match the target speaking rate or rhythm by time-stretching the speech segments. Experiments show that Urhythmic outperforms existing unsupervised methods in terms of quality and prosody.
Note: Urhythmic builds on soft speech units from our paper A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion.
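As a rough illustration of the global rhythm-matching idea from the abstract: estimate a speaking rate for the source and target speakers, then time-stretch the source by their ratio. The sketch below is not the repository's implementation; the stretch_units helper and the rate values are hypothetical, and it simply interpolates a sequence of soft speech units along the time axis.

import torch
import torch.nn.functional as F

def stretch_units(units: torch.Tensor, factor: float) -> torch.Tensor:
    # Hypothetical helper: linearly interpolate a (T, D) sequence of soft
    # speech units along the time axis by the given stretch factor.
    x = units.transpose(0, 1).unsqueeze(0)  # (T, D) -> (1, D, T)
    x = F.interpolate(x, scale_factor=factor, mode="linear", align_corners=False)
    return x.squeeze(0).transpose(0, 1)  # (1, D, T') -> (T', D)

# Illustrative speaking rates (e.g., segments per second); Urhythmic estimates
# these from the segmented source and target speech.
source_rate, target_rate = 4.2, 3.5
stretched = stretch_units(torch.randn(100, 256), source_rate / target_rate)

The example below shows how to run voice conversion with the pretrained models: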
import torch, torchaudio
# Load the HubertSoft content encoder (see https://github.com/bshall/hubert/)
hubert = torch.hub.load("bshall/hubert:main", "hubert_soft").cuda()
# Select the source and target speakers
# Pretrained models are available for:
# VCTK: p228, p268, p225, p232, p257, p231.
# and LJSpeech.
source, target = "p231", "p225"
# Load Urhythmic (either urhythmic_fine or urhythmic_global)
urhythmic, encode = torch.hub.load(
    "bshall/urhythmic:main",
    "urhythmic_fine",
    source_speaker=source,
    target_speaker=target,
)
urhythmic.cuda()
# Load the source audio
wav, sr = torchaudio.load("path/to/wav")
assert sr == 16000
wav = wav.unsqueeze(0).cuda()
# Convert to the target speaker
with torch.inference_mode():
    # Extract speech units and log probabilities
    units, log_probs = encode(hubert, wav)
    # Convert to the target speaker
    wav_ = urhythmic(units, log_probs)
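To listen to the result, write the converted waveform back to disk. This is a minimal follow-up sketch assuming wav_ carries a leading batch dimension (e.g., shape (1, 1, T)) and a 16 kHz sample rate; adjust the squeeze if the output shape differs.

# Save the converted audio (assumes a leading batch dimension and 16 kHz output)
torchaudio.save("converted.wav", wav_.squeeze(0).cpu(), 16000)

To convert whole directories of audio from the command line, use the convert.py script: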
usage: convert.py [-h] [--extension EXTENSION]
                  {urhythmic_fine,urhythmic_global} source-speaker target-speaker in-dir out-dir

Convert audio samples using Urhythmic.

positional arguments:
  {urhythmic_fine,urhythmic_global}
                        available models (Urhythmic-Fine or Urhythmic-Global).
  source-speaker        the source speaker: p228, p268, p225, p232, p257, p231, LJSpeech
  target-speaker        the target speaker: p228, p268, p225, p232, p257, p231, LJSpeech
  in-dir                path to the dataset directory.
  out-dir               path to the output directory.

options:
  -h, --help            show this help message and exit
  --extension EXTENSION
                        extension of the audio files (defaults to .wav).
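For example, to convert a directory of recordings from p231 to p225 with the fine-grained model:

python convert.py urhythmic_fine p231 p225 path/to/in-dir path/to/out-dir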
Here we outline the training steps for VCTK. However, it should be straightforward to extend the recipe to other datasets.
- Prepare the Dataset
- Extract Soft Speech Units and Log Probabilities
- Train the Segmenter
- Segmentation and Clustering
- Train the Rhythm Model
- Train or Finetune the Vocoder
To apply Urhythmic to your own data, you can skip step 3 (i.e., the Segmenter doesn't need to be re-trained).
Download and extract VCTK. Split the data into dev, test, and train sets for a given speaker (e.g., p225). The resulting directory should have the following structure:
p225
├── dev
│ ├── wavs
│ │ ├── p225_025.wav
│ │ ├── ...
│ │ ├── p225_045.wav
├── test
│ ├── wavs
│ │ ├── p225_001.wav
│ │ ├── ...
│ │ ├── p225_024.wav
├── train
│ ├── wavs
│ │ ├── p225_046.wav
│ │ ├── ...
│ │ ├── p225_366.wav
Note that for VCTK we take the first 24 parallel utterances as the test set.
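If you prefer to script the split, here is a small helper sketch (not part of the repository) that arranges a speaker's utterances according to the layout above; the split_speaker function and the paths are hypothetical, and the utterance ranges follow the directory tree shown (test: 001-024, dev: 025-045, train: the rest).

import shutil
from pathlib import Path

def split_speaker(src_dir: str, out_dir: str) -> None:
    # Copy a flat directory of p225_*.wav files into dev/test/train subsets.
    src, out = Path(src_dir), Path(out_dir)
    for wav in sorted(src.glob("*.wav")):
        index = int(wav.stem.split("_")[1])  # e.g., p225_025 -> 25
        if index <= 24:
            subset = "test"
        elif index <= 45:
            subset = "dev"
        else:
            subset = "train"
        target = out / subset / "wavs"
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(wav, target / wav.name)

split_speaker("path/to/VCTK/p225", "path/to/p225")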
Next, resample the audio to 16kHz using the resample.py script. The script replaces each file with a 16kHz version, so remember to copy your data if you want to keep the originals.
usage: resample.py [-h] [--sample_rate SAMPLE_RATE] in-dir

Resample an audio dataset.

positional arguments:
  in-dir                path to dataset directory.

options:
  -h, --help            show this help message and exit
  --sample_rate SAMPLE_RATE
                        target sample rate (defaults to 16000).
For example:
python resample.py path/to/p225
Encode the dev, test, and train sets using HuBERT-Soft and the encode.py script:
usage: encode.py [-h] [--extension EXTENSION] in-dir out-dir

Encode an audio dataset into soft speech units and the log probabilities of the associated discrete units.

positional arguments:
  in-dir                path to the dataset directory.
  out-dir               path to the output directory.

options:
  -h, --help            show this help message and exit
  --extension EXTENSION
                        extension of the audio files (defaults to .wav).
For example:
python encode.py path/to/p225/dev/wavs path/to/p225/dev
At this point the directory tree should look as follows:
p225
├── dev
│ ├── wavs
│ ├── soft
│ ├── logprobs
├── test
│ ├── wavs
│ ├── soft
│ ├── logprobs
├── train
│ ├── wavs
│ ├── soft
│ ├── logprobs
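To sanity-check the encoding step, you can inspect one of the extracted files. This sketch assumes encode.py writes one NumPy .npy array per utterance (soft units of shape (T, D) and log probabilities of shape (T, K)); adjust the paths and loading code if the actual output format differs.

import numpy as np

# Assumes .npy output files named after the source utterances.
units = np.load("path/to/p225/dev/soft/p225_025.npy")
log_probs = np.load("path/to/p225/dev/logprobs/p225_025.npy")
print(units.shape, log_probs.shape)  # e.g., (T, D) and (T, K)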
Cluster the discrete speech units and identify the cluster ids corresponding to sonorants, obstruents, and silences using the train_segmenter.py script:
usage: train_segmenter.py [-h] dataset-dir checkpoint-path

Cluster the codebook of discrete speech units and identify the cluster id corresponding to sonorants, obstruents, and silences.

positional arguments:
  dataset-dir      path to the directory of segmented speech.
  checkpoint-path  path to save checkpoint.

options:
  -h, --help       show this help message and exit
For example:
python train_segmenter.py path/to/p225/dev/ path/to/checkpoints/segmenter.pt
Segment the dev, test, and train sets using the segment.py script. Note that this script uses the segmenter checkpoint; you'll need to adapt the script to use your own checkpoint if required.
usage: segment.py [-h] in-dir out-dir

Segment an audio dataset.

positional arguments:
  in-dir      path to the log probability directory.
  out-dir     path to the output directory.

options:
  -h, --help  show this help message and exit
For example:
python segment.py path/to/p225/dev/logprobs path/to/p225/dev/segments/
At this point the directory tree should look as follows:
p225
├── dev
│ ├── wavs
│ ├── soft
│ ├── logprobs
│ ├── segments
├── test
│ ├── wavs
│ ├── soft
│ ├── logprobs
│ ├── segments
├── train
│ ├── wavs
│ ├── soft
│ ├── logprobs
│ ├── segments
Train the fine-grained or global rhythm model using the train_rhythm_model.py script:
usage: train_rhythm_model.py [-h] {fine,global} dataset-dir checkpoint-path

Train the FineGrained or Global rhythm model.

positional arguments:
  {fine,global}    type of rhythm model (fine-grained or global).
  dataset-dir      path to the directory of segmented speech.
  checkpoint-path  path to save checkpoint.

options:
  -h, --help       show this help message and exit
For example:
python train_rhythm_model.py fine path/to/p225/train/segments path/to/checkpoints/rhythm-fine-p225.pt
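Or, to train the global rhythm model instead (the checkpoint filename here is just an example):

python train_rhythm_model.py global path/to/p225/train/segments path/to/checkpoints/rhythm-global-p225.pt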
Train or finetune the HiFiGAN vocoder. We recommend finetuning from the LJSpeech checkpoint.
usage: train_vocoder.py [-h] [--resume RESUME] [--finetune | --no-finetune] dataset-dir checkpoint-dir

Train or finetune the HiFiGAN vocoder.

positional arguments:
  dataset-dir           path to the preprocessed data directory
  checkpoint-dir        path to the checkpoint directory

options:
  -h, --help            show this help message and exit
  --resume RESUME       path to the checkpoint to resume from
  --finetune, --no-finetune
                        whether to finetune
For example, to train from scratch:
python train_vocoder.py /path/to/p225 /path/to/checkpoints
To finetune, download the LJSpeech checkpoint and run:
python train_vocoder.py /path/to/p225 /path/to/checkpoints --resume hifigan-LJSpeech-ceb1368d.pt --finetune
If you found this work helpful, please consider citing our paper.