This work presents STAR, the first end-to-end speech-to-audio generation framework, designed to improve efficiency and mitigate the error propagation inherent in cascaded systems. It:
- Recognizes the potential of the speech-to-audio (STA) generation task and designs STAR, the first end-to-end (E2E) system for it;
- Validates the feasibility of E2E STA via representation-learning experiments, showing that the semantics of spoken sound events can be extracted directly from speech;
- Achieves effective speech-to-audio modality alignment through a bridge-network mapping mechanism and a two-stage training strategy;
- Reduces speech processing latency from 156 ms to 36 ms (a ≈76.9% reduction) while surpassing the generation performance of cascaded systems.
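To illustrate the bridge-network mapping idea described above, here is a minimal, hypothetical PyTorch sketch: a small set of learnable queries cross-attends to frame-level speech features and produces a fixed-length sequence of conditioning embeddings for the audio generator. All class names, dimensions, and hyperparameters here are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class BridgeNetwork(nn.Module):
    """Hypothetical Q-Former-style bridge: speech frames -> fixed-length conditioning."""
    def __init__(self, speech_dim=1024, hidden_dim=768, num_queries=32, num_heads=8):
        super().__init__()
        # Learnable query vectors that summarize the variable-length speech sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        # Project speech-encoder features (e.g. WavLM outputs) into the query space.
        self.proj = nn.Linear(speech_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim * 4), nn.GELU(),
            nn.Linear(hidden_dim * 4, hidden_dim),
        )

    def forward(self, speech_feats):
        # speech_feats: (batch, frames, speech_dim)
        kv = self.proj(speech_feats)
        q = self.queries.unsqueeze(0).expand(speech_feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)  # queries cross-attend to speech frames
        return out + self.ffn(out)     # (batch, num_queries, hidden_dim)

feats = torch.randn(2, 100, 1024)      # e.g. a batch of WavLM feature sequences
cond = BridgeNetwork()(feats)
print(cond.shape)                      # torch.Size([2, 32, 768])
```

The fixed number of queries is what decouples the generator from the speech-frame rate; the actual query count, feature dimensions, and attention depth used by STAR may differ.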
Please refer to UniFlow-Audio for environment setup, as well as to WavLM, fairseq, and DAC, which are used for speech feature extraction.
Generate the corresponding speech from the captions in AudioCaps, then extract features with the different speech encoders (DAC, HuBERT, WavLM):
```
git clone https://github.com/zeyuxie29/STAR
python src/data_preparation/vits/vits_inference.py
python src/data_preparation/data_preparation/speech_encoder/hubert_extract_feature.py
```

Pre-train the Bridge Network using sound event labels from AudioSet:

```
python src/bridge_network/qformer_predictions.py
```

Train end-to-end speech-to-audio generation using speech-audio data:

```
sh src/sta_generation/bash_scripts_star/train_star_fm.sh
sh src/sta_generation/bash_scripts_star/infer_multi_gpu.sh
```
```
python src/sta_generation/evaluation/star.py --gen_audio_dir {generated_audio_folder}
```

Our code draws on WavLM, fairseq, DAC, SECap, and HEAR. We appreciate the authors' open-sourcing of their code.