This work presents STAR, the first end-to-end speech-to-audio generation framework, designed to improve efficiency and mitigate the error propagation inherent in cascaded systems. It:
- Recognizes the potential of the speech-to-audio (STA) generation task and designs STAR, the first end-to-end (E2E) system for it;
- Validates the feasibility of E2E STA via representation-learning experiments, showing that the semantics of spoken sound events can be extracted directly from speech;
- Achieves effective speech-to-audio modality alignment through a bridge-network mapping mechanism and a two-stage training strategy;
- Reduces speech processing latency from 156 ms to 36 ms (a ≈76.9% reduction) while surpassing the generation performance of cascaded systems.
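To illustrate the bridge-network mapping idea described above, here is a minimal, hypothetical PyTorch sketch: a small set of learnable queries cross-attends to frame-level speech features and produces a fixed-length sequence of conditioning embeddings for the audio generator. All class names, dimensions, and hyperparameters here are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class BridgeNetwork(nn.Module):
    """Hypothetical Q-Former-style bridge: speech frames -> fixed-length conditioning."""
    def __init__(self, speech_dim=1024, hidden_dim=768, num_queries=32, num_heads=8):
        super().__init__()
        # Learnable query vectors that summarize the variable-length speech sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        # Project speech-encoder features (e.g. WavLM outputs) into the query space.
        self.proj = nn.Linear(speech_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim * 4), nn.GELU(),
            nn.Linear(hidden_dim * 4, hidden_dim),
        )

    def forward(self, speech_feats):
        # speech_feats: (batch, frames, speech_dim)
        kv = self.proj(speech_feats)
        q = self.queries.unsqueeze(0).expand(speech_feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)  # queries cross-attend to speech frames
        return out + self.ffn(out)     # (batch, num_queries, hidden_dim)

feats = torch.randn(2, 100, 1024)      # e.g. a batch of WavLM feature sequences
cond = BridgeNetwork()(feats)
print(cond.shape)                      # torch.Size([2, 32, 768])
```

The fixed number of queries is what decouples the generator from the speech-frame rate; the actual query count, feature dimensions, and attention depth used by STAR may differ.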
Please refer to UniFlow-Audio for environment setup, as well as to WavLM, fairseq, and DAC, which are used for speech feature extraction.
Generate the corresponding speech from the captions in AudioCaps, then extract features with the different speech encoders (DAC, HuBERT, WavLM):
```
git clone https://github.com/zeyuxie29/STAR
python src/data_preparation/vits/vits_inference.py
python src/data_preparation/data_preparation/speech_encoder/hubert_extract_feature.py
```

Pre-train the Bridge Network using sound event labels from AudioSet:

```
python src/bridge_network/qformer_predictions.py
```

Train end-to-end speech-to-audio generation using speech-audio data:

```
sh src/sta_generation/bash_scripts_star/train_star_fm.sh
sh src/sta_generation/bash_scripts_star/infer_multi_gpu.sh
```
```
python src/sta_generation/evaluation/star.py --gen_audio_dir {generated_audio_folder}
```

Our code draws on WavLM, fairseq, DAC, SECap, and HEAR. We appreciate the authors' open-sourcing of their code.