Update: Accepted at EMNLP 2023. The paper is available here.
End-to-end Speech Translation is hindered by a lack of available data resources. While most existing datasets are based on documents, a sentence-level version is available, which is, however, single and static, potentially impeding the usefulness of the data. We propose a new data augmentation strategy, SegAugment, to address this issue by generating multiple alternative sentence-level versions of a dataset. Our method utilizes an Audio Segmentation system, which re-segments the speech of each document with different length constraints, after which we obtain the target text via alignment methods. Experiments demonstrate consistent gains across eight language pairs in MuST-C, with an average increase of 2.5 BLEU points, and up to 5 BLEU points in low-resource scenarios in mTEDx. Furthermore, when combined with a strong system, SegAugment obtains state-of-the-art results in MuST-C. Finally, we show that the proposed method can also successfully augment sentence-level datasets, and that it enables Speech Translation models to close the gap between manual and automatic segmentation at inference time.

Here you can download the data generated by SegAugment for MuST-C, mTEDx and CoVoST.
The format is similar to the one found in MuST-C and mTEDx:
- .src: A text file with the transcription for each example
- .tgt: A text file with the translation for each example
- .yaml: A yaml file with the offset, duration and corresponding audio file for each example
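As an illustration, here is how one example might look once a set is extracted. The file names and the exact yaml entry below are hypothetical; only the fields offset, duration and wav are implied by the format described above:

head -n 1 train.src    # e.g. so here is the first speech segment
head -n 1 train.tgt    # e.g. hier ist also das erste Sprachsegment
head -n 1 train.yaml   # e.g. - {duration: 8.4, offset: 17.2, wav: ted_1096.wav}

The i-th lines of the .src and .tgt files correspond to the i-th entry of the .yaml file.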
MuST-C v1.0
En-De | short | medium | long | extra-long |
---|---|---|---|---|
En-Es | short | medium | long | extra-long |
En-Fr | short | medium | long | extra-long |
En-It | short | medium | long | extra-long |
En-Nl | short | medium | long | extra-long |
En-Pt | short | medium | long | extra-long |
En-Ro | short | medium | long | extra-long |
En-Ru | short | medium | long | extra-long |
MuST-C v2.0
En-De | short | medium | long | extra-long |
---|---|---|---|---|
mTEDx
Es-En | short | medium | long | extra-long |
---|---|---|---|---|
Es-Fr | short | medium | long | extra-long |
Pt-En | short | medium | long | extra-long |
Es-Es | short | medium | long | extra-long |
Pt-Pt | short | medium | long | extra-long |
CoVoST
En-De | short | medium |
---|---|---|
To use the data for Speech Translation, you will also have to download the original audio files of each dataset.
Set the environment variables:
export SEGAUGMENT_ROOT=... # the path to this repo
export OUTPUT_ROOT=... # the path to save the outputs of SegAugment, including synthetic data, alignments, and models
export FAIRSEQ_ROOT=... # the path to our fairseq fork
export SHAS_ROOT=... # the path to the SHAS repo
export SHAS_CKPTS=... # the path to the pre-trained SHAS classifier checkpoints
export MUSTCv2_ROOT=... # the path to save MuST-C v2.0
export MUSTCv1_ROOT=... # the path to save MuST-C v1.0
export MTEDX_ROOT=... # the path to save mTEDx
Clone this repository to $SEGAUGMENT_ROOT:
git clone https://github.com/mt-upc/SegAugment.git ${SEGAUGMENT_ROOT}
Create a conda environment using the environment.yml file and activate it:
conda env create -f ${SEGAUGMENT_ROOT}/environment.yml && \
conda activate seg_augment
Install our fork of fairseq:
git clone -b SegAugment https://github.com/mt-upc/fairseq-internal.git ${FAIRSEQ_ROOT}
pip install --editable ${FAIRSEQ_ROOT}
Clone the SHAS repository to $SHAS_ROOT:
git clone -b experimental https://github.com/mt-upc/SHAS.git ${SHAS_ROOT}
Create a second conda environment for SHAS (no need to activate it for now):
conda env create -f ${SHAS_ROOT}/environment.yml
Download the English and Multilingual pre-trained SHAS classifiers and save them at $SHAS_CKPTS:
English | Multilingual |
---|---|
For our main experiments we used MuST-C and mTEDx. Follow the instructions here to download and prepare the original data.
Download MuST-C v2.0 En-De to $MUSTCv2_ROOT and the v1.0 En-X data to $MUSTCv1_ROOT:
The dataset is available here. Press the button "click here to download the corpus", and select versions V1 and V2 accordingly.
To prepare the data for training, run the following processing scripts. (We are also using the ASR data from v2.0 for pre-training.)
python ${FAIRSEQ_ROOT}/examples/speech_to_text/prep_mustc_data.py \
--data-root ${MUSTCv2_ROOT} --task asr --vocab-type unigram --vocab-size 5000
for root in $MUSTCv2_ROOT $MUSTCv1_ROOT; do
python ${FAIRSEQ_ROOT}/examples/speech_to_text/prep_mustc_data.py \
--data-root $root --task st --vocab-type unigram --vocab-size 8000
done
Download the mTEDx Es-En, Pt-En and Es-Fr ST data, plus the Es and Pt ASR data, to $MTEDX_ROOT, and run the processing scripts to prepare them:
mkdir -p ${MTEDX_ROOT}/log_dir
for lang_pair in {es-en,pt-en,es-fr,es-es,pt-pt}; do
wget -O - -o ${MTEDX_ROOT}/log_dir/${lang_pair} https://www.openslr.org/resources/100/mtedx_${lang_pair}.tgz | tar -xz -C ${MTEDX_ROOT}
done
python ${FAIRSEQ_ROOT}/examples/speech_to_text/prep_mtedx_data.py \
--data-root ${MTEDX_ROOT} --task asr --vocab-type unigram --vocab-size 5000 --lang-pairs es-es,pt-pt
python ${FAIRSEQ_ROOT}/examples/speech_to_text/prep_mtedx_data.py \
--data-root ${MTEDX_ROOT} --task st --vocab-type unigram --vocab-size 8000 --lang-pairs es-en,pt-en
python ${FAIRSEQ_ROOT}/examples/speech_to_text/prep_mtedx_data.py \
--data-root ${MTEDX_ROOT} --task st --vocab-type unigram --vocab-size 1000 --lang-pairs es-fr
Set up some useful parameters:
dataset_root=... # the path to the dataset you want to augment
src_lang=... # the source language id (eg. "en")
tgt_lang=... # the target language id (eg. "de")
min=... # the minimum segment length in seconds
max=... # the maximum segment length in seconds
shas_ckpt=... # the path to the pre-trained SHAS classifier ckpt (English/Multilingual)
shas_alg=... # the type of segmentation algorithm (use "pdac" in general, and "pstrm" for max > 20)
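For instance, a possible setup for MuST-C v2.0 En-De (the length values and checkpoint file name below are purely illustrative; use the range you actually want to generate and the checkpoint you downloaded):

dataset_root=$MUSTCv2_ROOT
src_lang=en
tgt_lang=de
min=1                                    # illustrative length range (seconds)
max=10
shas_ckpt=$SHAS_CKPTS/<english_ckpt>.pt  # the English SHAS checkpoint you downloaded
shas_alg=pdac                            # use "pstrm" instead if max > 20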
The following script will execute all steps of SegAugment in sequence and create the synthetic data for a given dataset.
bash ${SEGAUGMENT_ROOT}/src/seg_augment.sh \
$dataset_root $src_lang $tgt_lang $min $max $shas_ckpt $shas_alg
However, since most of the steps can run in parallel, executing them sequentially is not very efficient. It is advisable to use the above command only after you have already completed one round of augmentation with some $min-$max, since intermediate results will be cached.
The following steps can be run in parallel:
Get an alternative segmentation for each document in the training set with SHAS.
conda activate shas
dataset_name=...   # an identifier for the dataset, used in the output paths
lang_pair=${src_lang}-${tgt_lang}
ell=${min}-${max}
split=train
synthetic_data_dir=${OUTPUT_ROOT}/synthetic_data/${dataset_name}/${lang_pair}/${ell}/${split}
python $SHAS_ROOT/src/supervised_hybrid/segment.py \
-wav ${dataset_root}/${lang_pair}/data/${split}/wav \
-ckpt $shas_ckpt \
-max $max \
-min $min \
-alg $shas_alg \
-cache ${OUTPUT_ROOT}/shas_probabilities/${dataset_name}/${src_lang} \
-yaml $synthetic_data_dir/new.yaml
conda activate seg_augment
Get the word segments for each document in the training set with CTC-based forced-alignment.
forced_alignment_dir=${OUTPUT_ROOT}/forced_alignment/${dataset_name}/${src_lang}
python ${SEGAUGMENT_ROOT}/src/audio_alignment/get_word_segments.py \
-lang $src_lang \
-wav ${dataset_root}/${lang_pair}/data/${split}/wav \
-txt ${dataset_root}/${lang_pair}/data/${split}/txt/${split}.${src_lang} \
-yaml ${dataset_root}/${lang_pair}/data/${split}/txt/${split}.yaml \
-out $forced_alignment_dir
Learn the text alignment in the training set with an MT model.
bash ${SEGAUGMENT_ROOT}/src/text_alignment/get_alignment_model.sh \
$dataset_root $src_lang $tgt_lang $min $max $shas_ckpt
When all three steps are completed, get the synthetic transcriptions and translations:
python ${SEGAUGMENT_ROOT}/src/audio_alignment/get_source_text.py \
-new_yaml $synthetic_data_dir/new.yaml -align $forced_alignment_dir -lang $src_lang
bash ${SEGAUGMENT_ROOT}/src/text_alignment/get_target_text.sh \
$dataset_root $src_lang $tgt_lang $min $max
- The output is stored at $OUTPUT_ROOT/synthetic_data/<dataset_name>/<lang_pair>/<ell>/train and is the same as the files available for download in the section above.
- The process can be repeated for different $min-$max values. Several intermediate steps are cached, so that another augmentation is faster.
- Segmentation and audio alignment do not have to be repeated for the same dataset with a different target language.
- The above scripts work for any dataset that has the same file structure as MuST-C and mTEDx, i.e. $DATASET_ROOT/<lang_pair>/data/<split>/txt/<split>.{src,tgt,yaml}. Modifications would be required for other structures.
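As a quick sanity check of a finished augmentation, you can verify that the generated files stay aligned. The file extensions follow the format described at the top; the yaml check assumes one flow-style entry per line, as in MuST-C, so adjust it if your files differ:

out_dir=${OUTPUT_ROOT}/synthetic_data/${dataset_name}/${lang_pair}/${ell}/train
wc -l ${out_dir}/*.src ${out_dir}/*.tgt   # both should report the same number of lines
grep -c "offset" ${out_dir}/*.yaml        # should match the counts above (one entry per example)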
To use the synthetic data for training ST models, you need to run a processing script that creates a tsv file, similar to the one for the original data. The process is much faster when more than 8 CPU cores are available.
bash ${SEGAUGMENT_ROOT}/src/utils/prep_synthetic_tsv.sh \
-data $dataset_root -src $src_lang -tgt $tgt_lang -ell ${min}-${max}
Example for MuST-C v2.0 En-De.
ASR pre-training on the original data:
bash $SEGAUGMENT_ROOT/src/experiments/mustc/train_asr_original.sh
ST training with the original and synthetic data (short, medium, long, xlong):
bash $SEGAUGMENT_ROOT/src/experiments/mustc/train_st_synthetic-all4.sh en-de
Example for mTEDx Es-En. (Use the "xs" model for Es-Fr)
For the low-resource pairs of mTEDx we found that ASR pre-training with the synthetic data was very beneficial:
bash $SEGAUGMENT_ROOT/src/experiments/mtedx/train_asr_synthetic-all4.sh es-es s
ST training with the original and synthetic data (short, medium, long, xlong):
bash $SEGAUGMENT_ROOT/src/experiments/mtedx/train_st_synthetic-all4.sh es-en s
@inproceedings{tsiamas-etal-2023-segaugment,
title = {{SegAugment: Maximizing the Utility of Speech Translation Data with Segmentation-based Augmentations}},
author = "Tsiamas, Ioannis and
Fonollosa, Jos{\'e} and
Costa-juss{\`a}, Marta",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.574",
doi = "10.18653/v1/2023.findings-emnlp.574",
pages = "8569--8588",
}