WikiChao/ZeroSep

ZeroSep: Separate Anything in Audio with Zero Training

Project page | Paper

ZeroSep is a training-free audio source separation framework that repurposes pre-trained text-guided diffusion models for zero-shot separation.
No fine-tuning, no task-specific data—just latent inversion + text-conditioned denoising to isolate any sound you describe.

Demo video: ZeroSep separates speech from a mix with a simple text prompt.


🚀 Features

  • Zero-shot separation: separate without any additional training
  • Open-set: isolate arbitrary sounds via natural‐language prompts
  • Model‐agnostic: works with AudioLDM, AudioLDM2, Tango, or any text-guided diffusion backbone
  • Flexible inversion: choose DDIM or DDPM
  • Built-in Gradio demo for quick interactive use
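
To make the "latent inversion + text-conditioned denoising" idea concrete, here is a toy sketch of deterministic DDIM inversion followed by denoising. It is not ZeroSep's implementation: `toy_eps`, `PROMPT_NOISE`, the linear `ALPHA_BAR` schedule, and the idea of inverting under a mixture prompt and denoising under a target prompt are all assumptions made for illustration, and the "model" is a constant per prompt rather than a neural network.

```python
import math

T = 50  # number of diffusion steps (matches the documented --steps default)
# Cumulative signal level: 1.0 at t=0 (clean latent), about 0.1 at t=T (noisy).
ALPHA_BAR = [1.0 - 0.9 * t / T for t in range(T + 1)]

# Hypothetical stand-in for a text-conditioned noise predictor. A real
# backbone is a network eps(x_t, t, prompt); this toy ignores x_t and t.
PROMPT_NOISE = {"mixture": 0.3, "speech": -0.2}


def toy_eps(x, t, prompt):
    return [PROMPT_NOISE[prompt]] * len(x)


def ddim_move(x, eps, a_from, a_to):
    """Deterministic DDIM update from signal level a_from to a_to."""
    out = []
    for xi, ei in zip(x, eps):
        x0 = (xi - math.sqrt(1.0 - a_from) * ei) / math.sqrt(a_from)
        out.append(math.sqrt(a_to) * x0 + math.sqrt(1.0 - a_to) * ei)
    return out


def invert(x0, prompt):
    """Walk a clean latent up to the noisy end of the schedule."""
    x = list(x0)
    for t in range(T):
        x = ddim_move(x, toy_eps(x, t, prompt), ALPHA_BAR[t], ALPHA_BAR[t + 1])
    return x


def denoise(x_noisy, prompt):
    """Walk a noisy latent back down to t=0 under the given prompt."""
    x = list(x_noisy)
    for t in range(T, 0, -1):
        x = ddim_move(x, toy_eps(x, t, prompt), ALPHA_BAR[t], ALPHA_BAR[t - 1])
    return x


latent = [0.5, -1.0, 2.0]
noisy = invert(latent, "mixture")
same = denoise(noisy, "mixture")   # same prompt: reconstructs the input
other = denoise(noisy, "speech")   # different prompt: output changes
print(same)
print(other)
```

In this toy, denoising with the prompt used for inversion returns the original latent up to floating-point error, while a different prompt steers the result elsewhere. A real backbone would replace `toy_eps`, and the documented `ddpm` mode would add stochastic noise that this deterministic sketch omits.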

📦 Installation

  1. Clone this repo

    git clone https://github.com/WikiChao/ZeroSep.git
    cd ZeroSep
    
  2. (Optional) Create & activate a Conda environment

    conda create -n zerosep python=3.9
    conda activate zerosep
  3. Install dependencies

    pip install -r requirements.txt
  4. (If using private Hugging Face models)

    huggingface-cli login

  5. (If using Tango) Download the Tango model from https://github.com/declare-lab/tango and place the files in the code folder


🛠️ Usage

1. Gradio Web App

Launch the interactive demo:

cd code
python demo.py

Then open http://localhost:7860 or the public link in your browser. Upload an audio/video file, select your model & inversion strategy, enter a prompt (e.g. “dog bark”), and click Run.

2. Command-Line Interface

Separate a single audio file with one command:

cd code
python separate.py --input examples/BMayJId0X1s_120.wav --target "man speech"

Complete Example with All Parameters

python separate.py --input examples/BMayJId0X1s_120.wav \
                   --target "man speech" \
                   --source "man talking with background music" \
                   --model "cvssp/audioldm-s-full-v2" \
                   --mode "ddpm" \
                   --steps 50 \
                   --tstart 50 \
                   --seed 42 \
                   --target_guidance 1.0 \
                   --source_guidance 1.0 \
                   --output_dir results \
                   --output_name "extracted_speech"
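
To run the same separation over several files, the command above can be assembled from Python. This is a minimal sketch that uses only the flags documented in this README; `build_command` and `separate_all` are hypothetical helper names, not part of ZeroSep, and the script is assumed to be launched from the code folder as in the commands above.

```python
import shlex
import subprocess
import sys


def build_command(input_path, target, source="", output_dir="results"):
    """Assemble a separate.py invocation from documented flags only."""
    cmd = [sys.executable, "separate.py",
           "--input", str(input_path),
           "--target", target]
    if source:  # --source is optional and defaults to an empty string
        cmd += ["--source", source]
    cmd += ["--output_dir", output_dir]
    return cmd


def separate_all(input_paths, target, code_dir="code", **kwargs):
    """Run separate.py once per input file; raises if any run fails."""
    for path in input_paths:
        subprocess.run(build_command(path, target, **kwargs),
                       cwd=code_dir, check=True)


cmd = build_command("examples/BMayJId0X1s_120.wav", "man speech")
print(shlex.join(cmd))  # shows the command without running it
```

Calling `separate_all(["examples/BMayJId0X1s_120.wav"], "man speech")` from the repository root would then run the separation for each listed file, with input paths given relative to the code folder.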

Parameter Reference

Parameter         | Short | Description          | Default
--input           | -i    | Input audio file     | Required
--target          | -t    | Sound to extract     | Required
--source          | -s    | Source description   | ""
--model           | -m    | Diffusion model      | "cvssp/audioldm-s-full-v2"
--mode            |       | Separation algorithm | "ddpm"
--steps           |       | Diffusion steps      | 50
--tstart          |       | Start timestep       | Same as --steps
--target_guidance |       | Target CFG scale     | 1.0
--source_guidance |       | Source CFG scale     | 1.0
--output_dir      | -o    | Output directory     | "results"
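
The defaults in the table can be read as an argparse configuration. The sketch below is a hypothetical mirror of the documented flags, not the actual contents of separate.py; in particular, resolving `--tstart` to the value of `--steps` after parsing is an assumption about how "Same as steps" might be implemented.

```python
import argparse


def build_parser():
    # Hypothetical parser that mirrors the parameter table above.
    p = argparse.ArgumentParser(description="Sketch of the documented flags")
    p.add_argument("-i", "--input", required=True, help="Input audio file")
    p.add_argument("-t", "--target", required=True, help="Sound to extract")
    p.add_argument("-s", "--source", default="", help="Source description")
    p.add_argument("-m", "--model", default="cvssp/audioldm-s-full-v2")
    p.add_argument("--mode", default="ddpm", help="Separation algorithm")
    p.add_argument("--steps", type=int, default=50, help="Diffusion steps")
    p.add_argument("--tstart", type=int, default=None, help="Start timestep")
    p.add_argument("--target_guidance", type=float, default=1.0)
    p.add_argument("--source_guidance", type=float, default=1.0)
    p.add_argument("-o", "--output_dir", default="results")
    return p


def parse(argv):
    args = build_parser().parse_args(argv)
    if args.tstart is None:  # assumed meaning of "Same as steps"
        args.tstart = args.steps
    return args


args = parse(["-i", "mix.wav", "-t", "dog bark", "--steps", "30"])
print(args.mode, args.steps, args.tstart)
```

Here `mix.wav` is a placeholder file name. Because `--tstart` is left unset, it follows the overridden `--steps` value rather than the default of 50.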

🎵 Examples

We've included several sample video/audio files in the examples folder to help you get started.


📖 Citation

If you use ZeroSep, please cite our paper:

@article{huang2025zerosep,
  title={ZeroSep: Separate Anything in Audio with Zero Training},
  author={Huang, Chao and Ma, Yuesheng and Huang, Junxuan and Liang, Susan and Tang, Yunlong and Bi, Jing and Liu, Wenqiang and Mesgarani, Nima and Xu, Chenliang},
  journal={arXiv preprint arXiv:2505.23625},
  year={2025}
}

🙏 Acknowledgments

This work would not have been possible without the contributions of several outstanding projects:

  • AudioLDM & AudioLDM2 (Liu et al.) for providing the foundational diffusion model architectures
  • Tango (Ghosal et al.) for their audio generation framework and model support
  • Gradio team for their excellent interactive UI framework enabling our demo
  • AudioEditingCode by Manor et al. - Our implementation builds substantially upon their codebase. We sincerely appreciate their work and encourage supporting their repository.

Licensing Information

This project is licensed under the MIT License – see LICENSE for details.

About

[NeurIPS 2025] Separate Anything in Audio with Zero Training
