zeyuxie29/STAR

🌠 STAR: Speech-to-Audio Generation via Representation Learning

This work presents STAR, the first end-to-end (E2E) speech-to-audio (STA) generation framework, designed to improve efficiency and avoid the error propagation inherent in cascaded systems. In this work, we:

  • Recognize the potential of the speech-to-audio generation task and design the first E2E system, STAR;
  • Validate the feasibility of E2E STA via representation learning experiments, showing that spoken sound event semantics can be extracted directly from speech;
  • Achieve effective speech-to-audio modal alignment through a bridge network mapping mechanism and a two-stage training strategy;
  • Significantly reduce speech processing latency from 156 ms to 36 ms (≈ 76.9% reduction), while surpassing the generation performance of cascaded systems.
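
As a quick check of the latency figure quoted above, the relative reduction can be recomputed from the two numbers in the text. This is only a worked calculation; the function and variable names are illustrative and do not come from the repository.

```python
def relative_reduction(before_ms: float, after_ms: float) -> float:
    """Return the percentage reduction going from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms * 100.0


cascaded_ms = 156.0  # speech processing latency of the cascaded baseline (from the text)
star_ms = 36.0       # speech processing latency of STAR (from the text)

reduction = relative_reduction(cascaded_ms, star_ms)
speedup = cascaded_ms / star_ms

print(f"reduction: {reduction:.1f}%")  # reduction: 76.9%
print(f"speedup:   {speedup:.2f}x")    # speedup:   4.33x
```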

✂️ Environment and Data Preparation

Environment Setup

Please refer to UniFlow-Audio for environment setup, and to WavLM, fairseq, and DAC, which are used for speech feature extraction.

Data Preparation

Generate speech from the captions in AudioCaps, then extract features with one of the speech encoders (DAC, HuBERT, WavLM):

git clone https://github.com/zeyuxie29/STAR
cd STAR
python src/data_preparation/vits/vits_inference.py
python src/data_preparation/data_preparation/speech_encoder/hubert_extract_feature.py

💡 Stage1: Bridge Network

Pre-train the bridge network using sound event labels from AudioSet:

python src/bridge_network/qformer_predictions.py
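
The script name `qformer_predictions.py` suggests a Q-Former-style bridge, in which a small set of learnable queries cross-attends to variable-length speech features and produces a fixed-size output. The sketch below illustrates only that idea in plain Python. It assumes a single attention head, no learned projections, and random values in place of trained queries and real encoder features. It is not the repository's implementation, which would normally use a deep learning framework and be supervised with the AudioSet sound event labels.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head cross-attention without learned projections.

    queries:  num_queries x dim (learnable parameters in a real model)
    features: num_frames x dim (speech encoder outputs)
    Returns (outputs, weights) with shapes
    num_queries x dim and num_queries x num_frames.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every frame.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in features]
        w = softmax(scores)
        # Weighted sum of frame features, one value per dimension.
        out = [sum(wj * f[d] for wj, f in zip(w, features)) for d in range(dim)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


rng = random.Random(0)
dim, num_queries, num_frames = 8, 4, 50  # illustrative sizes, not taken from the repo
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_frames)]

outputs, weights = cross_attend(queries, features)
print(len(outputs), len(outputs[0]))  # 4 8
```

The output has one vector per query regardless of how many speech frames go in, which is what lets a bridge of this kind hand a fixed-length condition to the downstream generator.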

🌱 Stage2: STA Generation

Train the end-to-end speech-to-audio generation model on paired speech-audio data, then run inference and evaluation:

sh src/sta_generation/bash_scripts_star/train_star_fm.sh
sh src/sta_generation/bash_scripts_star/infer_multi_gpu.sh
python src/sta_generation/evaluation/star.py --gen_audio_dir {generated_audio_folder}
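
Before pointing `--gen_audio_dir` at a folder, it can help to confirm that inference actually wrote audio there. The helper below is a hypothetical sanity check, not part of the repository. It assumes the generated files are PCM-encoded `.wav` files sitting directly in the folder; Python's standard `wave` module cannot read floating-point WAV files, so adjust it if the inference script writes a different format.

```python
import wave
from pathlib import Path


def summarize_wavs(folder):
    """Map each .wav file name in folder to its duration in seconds."""
    durations = {}
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            durations[path.name] = wav.getnframes() / wav.getframerate()
    return durations


def check_generated_dir(folder):
    """Raise if the folder has no .wav files; otherwise return the summary."""
    summary = summarize_wavs(folder)
    if not summary:
        raise FileNotFoundError(f"no .wav files found in {folder}")
    return summary
```

Calling `check_generated_dir("path/to/generated_audio")` returns a dictionary of file names and durations, or raises `FileNotFoundError` if the folder is empty.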

Acknowledgement

Our code draws on WavLM, fairseq, DAC, SECap, and HEAR. We thank the authors for open-sourcing their code.
