[paper] [demo samples] [dataset]
We introduce Expresso, a high-quality (48kHz) expressive speech dataset that includes both expressively rendered read speech (8 styles, in mono wav format) and improvised dialogues (26 styles, in stereo wav format). The dataset includes 4 speakers (2 male, 2 female) and totals 40 hours (11h read, 30h improvised). Transcriptions of the read speech are also provided. The task of the Expresso Benchmark is to resynthesize the input audio using a low-bitrate discrete code obtained without supervision from text.
Please go to the dataset repository to access the Expresso dataset.
We train unit-based hifigan vocoders that take as input speech units obtained from a HuBERT model. The vocoder is conditioned on one-hot speaker and style information of the utterance, similar to this work.
We share pre-trained hifigan vocoders trained on the Expresso, LJ and VCTK datasets with HuBERT units as input, conditioned on one-hot speaker and expression information (you don't need to manually download the checkpoints if using textlesslib as shown below).
HuBERT model | Quantizer | Vocoder Data | HifiGAN Vocoder Model |
---|---|---|---|
HuBERT base LS960 | L9 km500 (LS960) | Expresso + LJ + VCTK | generator.pt - config.json - speakers.txt - styles.txt |
HuBERT base LS960 | L9 km2000 (Expresso) | Expresso + LJ + VCTK | generator.pt - config.json - speakers.txt - styles.txt |
HuBERT Mix1 (VP, MLS, CV) | L12 km2000 (Mix1) | Expresso + LJ + VCTK | generator.pt - config.json - speakers.txt - styles.txt |
HuBERT Mix1 (VP, MLS, CV) | L12 km2000 (Expresso) | Expresso + LJ + VCTK | generator.pt - config.json - speakers.txt - styles.txt |
The resynthesis can be obtained from textlesslib as follows:
Please note that you'll need a reasonably recent version of fairseq (e.g. fairseq/tree/4db264940f) in order to load the HuBERT Mix1 checkpoint.
import torchaudio
from textless.data.speech_encoder import SpeechEncoder
from textless.vocoders.hifigan.vocoder import CodeHiFiGANVocoder
# Available models
EXPRESSO_MODELS = [
("hubert-base-ls960-layer-9", "kmeans", 500),
("hubert-base-ls960-layer-9", "kmeans-expresso", 2000),
("mhubert-base-vp_mls_cv_8lang", "kmeans", 2000),
("mhubert-base-vp_mls_cv_8lang", "kmeans-expresso", 2000),
]
# Try one model
dense_model, quantizer_model, vocab = EXPRESSO_MODELS[3]
# Load speech encoder and vocoder
encoder = SpeechEncoder.by_name(
    dense_model_name=dense_model,
    quantizer_model_name=quantizer_model,
    vocab_size=vocab,
    deduplicate=False,  # False if the vocoder doesn't support duration prediction
).cuda()
vocoder = CodeHiFiGANVocoder.by_name(
    dense_model_name=dense_model,
    quantizer_model_name=quantizer_model,
    vocab_size=vocab,
    speaker_meta=True,
    style_meta=True,
).cuda()
speakers = vocoder.speakers # ['ex01', 'ex02', 'ex03', 'ex04', 'lj', 'vctk_p225', ...]
styles = vocoder.styles # ['read-default', 'read-happy', 'read-sad', 'read-whisper', ...]
# Load the audio
input_file = "path/to/audio.wav"
waveform, sample_rate = torchaudio.load(input_file)
# Convert it to (duplicated) units
encoded = encoder(waveform.cuda())
units = encoded["units"] # torch.Tensor([17, 17, 17, 17, 296, 296,...])
# Convert units back to audio
audio = vocoder(
units,
speaker_id=speakers.index('ex01'),
style_id=styles.index('read-default'),
) # torch.Tensor([-9.9573e-04, -1.7003e-04, -6.8756e-05,...])
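To listen to or score the output, you can write the resynthesized waveform to disk. A minimal sketch; the sample rate below is an assumption, so check the vocoder's config.json (sampling_rate) for the actual value:
# Save the resynthesized audio to a wav file
output_sample_rate = 16_000  # assumption; check the vocoder's config.json
torchaudio.save(
    "resynthesized.wav",
    audio.detach().cpu().unsqueeze(0),  # torchaudio expects (channels, time)
    output_sample_rate,
)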
You can use the speech-resynthesis library to train a speaker- and style-conditioned hifigan model; more information can be found in this repo.
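A rough sketch of how training is typically launched in that repository (the flag names follow its HiFi-GAN-style train.py and the paths are illustrative; check the repo's README for the exact invocation):
cd speech-resynthesis
# --checkpoint_path and --config follow the HiFi-GAN-style training script in
# that repo; the config file name here is illustrative.
python train.py \
    --checkpoint_path checkpoints/expresso_hubert_hifigan \
    --config configs/your_expresso_config.json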
We transcribe the synthesized speech using a pre-trained Automatic Speech Recognition (ASR) model (wav2vec 2.0 pre-trained on Libri-Light 60kh and fine-tuned on LibriSpeech 960h; get the wav2vec_vox_960h_pl.pt checkpoint here) and compute the Word Error Rate (WER) between the transcribed text and the true transcription.
You'll need to prepare a manifest file along with a transcription file to perform the ASR.
The manifest file can be obtained with wav2vec's manifest script here, and the transcription is expected to be in letter format.
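For example (paths are illustrative; --valid-percent 0 keeps all files in a single split, and the script names its output train.tsv, so rename it afterwards):
python $FAIRSEQ_ROOT/examples/wav2vec/wav2vec_manifest.py \
    /root/to/synthesized/expresso/audio \
    --dest $DATA_DIR --ext wav --valid-percent 0
mv $DATA_DIR/train.tsv $DATA_DIR/expresso.tsv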
Here is an example of the manifest file expresso.tsv
/root/to/synthesized/expresso/audio
ex04_default_00340_gen.wav 30381
ex04_default_00341_gen.wav 29970
...
and the transcription file expresso.ltr
C A N | W E | G O | T H E R E | P L E A S E
C O M P U T E | F O U R | S I X T E E N
...
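A minimal sketch for producing this letter format from plain-text transcriptions (assumes one already-normalized transcription per line, in the same order as the manifest; the input file name is illustrative):
# Letters are separated by spaces and words by " | "
with open("expresso.txt") as fin, open("expresso.ltr", "w") as fout:
    for line in fin:
        words = line.strip().upper().split()
        fout.write(" | ".join(" ".join(word) for word in words) + "\n")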
You'll need to download the wav2vec dictionary dict.ltr.txt and put it into the data directory:
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt -P $DATA_DIR
The data directory should now contain the following files:
ls $DATA_DIR
> dict.ltr.txt expresso.ltr expresso.tsv
The ASR is performed using fairseq's speech recognition example. Here is the example command:
SUBSET_NAME="expresso"
cd $FAIRSEQ_ROOT
python examples/speech_recognition/infer.py \
${DATA_DIR} --task audio_finetuning \
--nbest 1 \
--path path/to/wav2vec_vox_960h_pl.pt \
--w2l-decoder viterbi \
--criterion ctc \
--labels ltr \
--max-tokens 4000000 \
--post-process letter \
--gen-subset ${SUBSET_NAME} \
--results-path ${OUTPUT_DIR}
Please note that you'll need a reasonably recent version of fairseq (e.g. fairseq/tree/4db264940f) in order to run the ASR script. You'll also need to install the flashlight python bindings; follow the instructions on this page. Flashlight v0.3.2 must be used to install the bindings:
git clone --branch v0.3.2 https://github.com/flashlight/flashlight
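Once the ASR hypotheses are written to ${OUTPUT_DIR}, you can score them against the references. A minimal WER sketch over parallel lists of reference and hypothesis strings (loading those lists from the fairseq output files is left to your setup):
def edit_distance(ref_words, hyp_words):
    # Levenshtein distance over word sequences
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]

# refs and hyps are assumed to be parallel lists of transcription strings
refs = ["CAN WE GO THERE PLEASE"]
hyps = ["CAN WE GO THERE PLEASE"]
errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
words = sum(len(r.split()) for r in refs)
print(f"WER: {100.0 * errors / words:.2f}%")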
We fine-tune the wav2vec2-base model on an audio classification task using the transformers library, with the expressive styles in the Expresso dataset as labels.
You can download the model checkpoint and config here:
Model Checkpoint | Config | Preprocessor Config |
---|---|---|
pytorch_model.bin | config.json | preprocessor_config.json |
Use the classify_audio.py script to perform prediction and optionally compute accuracy if a label file is given. Here is an example command:
MANIFEST_FILE=$DATA_DIR/expresso.tsv
LABEL_FILE=$DATA_DIR/expresso.labels # each line contains the true label of the corresponding audio in the manifest file
PREDICTION_FILE=$DATA_DIR/expresso.predictions
python classify_audio.py \
--model_ckpt $CKPT_DIR \
--from_tsv $MANIFEST_FILE \
--label_file $LABEL_FILE \
--output_file $PREDICTION_FILE
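Alternatively, you can classify a single file directly with the transformers API. A minimal sketch (the checkpoint directory and audio path are placeholders; the checkpoint directory should contain the files from the table above):
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

ckpt_dir = "/path/to/checkpoint"  # contains pytorch_model.bin, config.json, preprocessor_config.json
extractor = AutoFeatureExtractor.from_pretrained(ckpt_dir)
model = AutoModelForAudioClassification.from_pretrained(ckpt_dir).eval()

waveform, sr = torchaudio.load("path/to/audio.wav")
# Resample to the rate expected by the feature extractor (16kHz for wav2vec2-base)
waveform = torchaudio.functional.resample(waveform, sr, extractor.sampling_rate)
inputs = extractor(
    waveform.squeeze().numpy(),
    sampling_rate=extractor.sampling_rate,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])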
We'll use the F0 evaluation script from FAIRSEQ S^2, which reports the F0 Frame Error (FFE).
For F0 evals, you need to prepare a generated_data.tsv file containing "ref" and "syn" fields with the ground-truth audio and generated audio as follows:
"ref" "syn"
/path/to/ex04_default_00340_gt.wav /path/to/ex04_default_00340_gen.wav
/path/to/ex04_default_00341_gt.wav /path/to/ex04_default_00341_gen.wav
...
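A minimal sketch for building this file from a directory of ground-truth and generated wavs (the directory layout and _gt/_gen naming are assumptions; adapt the pairing to your setup):
import csv
from pathlib import Path

gt_dir = Path("/path/to/ground_truth")   # assumed: files named *_gt.wav
gen_dir = Path("/path/to/generated")     # assumed: matching files named *_gen.wav

with open("generated_data.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["ref", "syn"])
    for ref in sorted(gt_dir.glob("*_gt.wav")):
        syn = gen_dir / ref.name.replace("_gt.wav", "_gen.wav")
        writer.writerow([str(ref), str(syn)])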
Then run the following command
cd $FAIRSEQ_ROOT
python examples/speech_synthesis/evaluation/eval_f0.py \
generated_data.tsv --ffe
Please consider citing our work if you find it useful in your research:
@misc{nguyen2023expresso,
title={EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis},
author={Tu Anh Nguyen and Wei-Ning Hsu and Antony D'Avirro and Bowen Shi and Itai Gat and Maryam Fazel-Zarani and Tal Remez and Jade Copet and Gabriel Synnaeve and Michael Hassid and Felix Kreuk and Yossi Adi and Emmanuel Dupoux},
year={2023},
eprint={2308.05725},
archivePrefix={arXiv},
primaryClass={cs.CL}
}