🌠 STAR: Speech-to-Audio Generation via Representation Learning

This work presents STAR, the first end-to-end speech-to-audio generation framework, designed to enhance efficiency and address error propagation inherent in cascaded systems. It:

Recognizes the potential of the speech-to-audio (STA) generation task and introduces STAR, the first end-to-end (E2E) system for it;
Validates the feasibility of E2E STA through representation learning experiments, showing that the semantics of spoken sound events can be extracted directly from speech;
Achieves effective speech-to-audio modal alignment through a bridge network mapping mechanism and a two-stage training strategy;
Reduces speech processing latency from 156 ms to 36 ms (≈ 76.9% reduction) while surpassing the generation performance of cascaded systems.
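
The quoted reduction follows directly from the two latency figures. A quick check of the arithmetic is sketched below; the variable and function names are illustrative and do not come from the repository.

```python
# Latency figures quoted above, in milliseconds.
baseline_ms = 156.0
star_ms = 36.0


def relative_reduction(before: float, after: float) -> float:
    """Return the fractional reduction going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


reduction = relative_reduction(baseline_ms, star_ms)
speedup = baseline_ms / star_ms
print(f"reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
# prints "reduction: 76.9%, speedup: 4.33x"
```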

✂️ Data Preparation

Generate speech from the captions in AudioCaps, then extract features with a speech encoder (DAC, HuBERT, or WavLM):

git clone https://github.com/AnonymousGit0/STAR
cd STAR
python src/data_preparation/vits/vits_inference.py
python src/data_preparation/data_preparation/speech_encoder/hubert_extract_feature.py
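
The script paths above are relative to the repository root, so running them from another directory fails with a file-not-found error. A small pre-flight check can catch this early. The sketch below assumes only the two paths listed above; `missing_scripts` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Script paths exactly as listed above, relative to the repository root.
DATA_PREP_SCRIPTS = [
    "src/data_preparation/vits/vits_inference.py",
    "src/data_preparation/data_preparation/speech_encoder/hubert_extract_feature.py",
]


def missing_scripts(repo_root, scripts=DATA_PREP_SCRIPTS):
    """Return the scripts that do not exist under `repo_root` (hypothetical helper)."""
    root = Path(repo_root)
    return [s for s in scripts if not (root / s).is_file()]


if __name__ == "__main__":
    missing = missing_scripts(".")
    if missing:
        print("Not found; are you in the STAR checkout?")
        for path in missing:
            print("  " + path)
    else:
        print("All data-preparation scripts found.")
```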

💡 Stage1: Bridge Network

Pre-train the Bridge Network using sound event labels from AudioSet:

python src/bridge_network/qformer_predictions.py
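
The script name suggests a Q-Former-style bridge, but this README does not describe its architecture. As an assumption-laden illustration of the general idea only, the sketch below shows a fixed set of query vectors cross-attending over a variable-length sequence of speech features, which yields a fixed-length output regardless of utterance length. It omits learned projections, multiple heads, and feed-forward layers, uses random values in place of trained parameters, and is not the repository's implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frames):
    """Single-head cross-attention: each query pools over all speech frames.

    queries: list of K vectors (learnable in a real model, fixed here).
    frames:  list of T speech-feature vectors; T may vary per utterance.
    Returns K vectors, so the output length does not depend on T.
    """
    dim = len(queries[0])
    pooled = []
    for q in queries:
        # Scaled dot-product score between this query and every frame.
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(dim) for f in frames]
        weights = softmax(scores)
        # Weighted average of the frames, one coordinate at a time.
        pooled.append([sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)])
    return pooled


rng = random.Random(0)
dim, num_queries = 8, 4  # illustrative sizes, not the repository's settings
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
short_utt = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(10)]
long_utt = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]

# Both utterances map to the same number of output vectors.
print(len(cross_attend(queries, short_utt)), len(cross_attend(queries, long_utt)))
```

The fixed output length is what makes this kind of module convenient as a bridge: the downstream generator can be conditioned on the same number of vectors for every input.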

🌱 Stage2: STA Generation

Train the end-to-end speech-to-audio generation model on paired speech-audio data, then run inference and evaluate the generated audio:

sh src/sta_generation/bash_scripts_star/train_star_fm.sh
sh src/sta_generation/bash_scripts_star/infer_multi_gpu.sh
python src/sta_generation/evaluation/star.py --gen_audio_dir {generated_audio_folder}
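
The three commands above run in sequence, and the last one needs `{generated_audio_folder}` replaced with a real path. A minimal sketch of a driver that fills in the placeholder and stops at the first failing step is given below. The helper names and the example folder `outputs/generated` are hypothetical; the folder must match wherever the inference script actually writes its output, which this README does not state.

```python
import shlex
import subprocess

# Commands exactly as listed above; {generated_audio_folder} is filled in at run time.
STAGE2_COMMANDS = [
    "sh src/sta_generation/bash_scripts_star/train_star_fm.sh",
    "sh src/sta_generation/bash_scripts_star/infer_multi_gpu.sh",
    "python src/sta_generation/evaluation/star.py --gen_audio_dir {generated_audio_folder}",
]


def build_stage2_commands(generated_audio_folder):
    """Return each command as an argv list with the placeholder substituted."""
    quoted = shlex.quote(generated_audio_folder)  # keeps paths with spaces in one argument
    return [shlex.split(cmd.format(generated_audio_folder=quoted)) for cmd in STAGE2_COMMANDS]


def run_stage2(generated_audio_folder, dry_run=True):
    """Run the steps in order, stopping at the first failure (hypothetical helper)."""
    for argv in build_stage2_commands(generated_audio_folder):
        print("$ " + shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # raises CalledProcessError on non-zero exit


# Dry run: only prints the commands. "outputs/generated" is a placeholder path.
run_stage2("outputs/generated", dry_run=True)
```

Setting `dry_run=False` executes the steps from the repository root.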

Acknowledgement

Our code draws on WavLM, fairseq, DAC, SECap, and HEAR. We thank the authors for open-sourcing their code.
