ZeroSep is a training-free audio source separation framework that repurposes pre-trained text-guided diffusion models for zero-shot separation.
No fine-tuning, no task-specific data—just latent inversion + text-conditioned denoising to isolate any sound you describe.
- Zero-shot separation: separate without any additional training
- Open-set: isolate arbitrary sounds via natural-language prompts
- Model-agnostic: works with AudioLDM, AudioLDM2, Tango, or any text-guided diffusion backbone
- Flexible inversion: choose DDIM or DDPM
- Built-in Gradio demo for quick interactive use
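To make the "latent inversion + text-conditioned denoising" idea concrete, here is a small, self-contained toy sketch of the deterministic DDIM round-trip that the inversion relies on. Everything in it (the noise schedule, the `eps_model` stand-in, the random latent) is made up for illustration only; it is not ZeroSep's actual code, where the noise predictor is the pre-trained text-conditioned U-Net and the denoising pass is guided by your prompt describing the target sound.

```python
# Toy illustration of DDIM inversion + denoising (NOT the ZeroSep implementation).
import torch

torch.manual_seed(0)
T = 50                                            # number of diffusion steps
alphas_cumprod = torch.linspace(0.999, 0.02, T)   # toy cumulative-alpha noise schedule

def eps_model(x, t):
    """Stand-in for the text-conditioned U-Net noise predictor."""
    return 0.1 * x

def ddim_step(x, a_from, a_to, eps):
    """Deterministic DDIM transition between two noise levels (eta = 0)."""
    x0_pred = (x - torch.sqrt(1.0 - a_from) * eps) / torch.sqrt(a_from)
    return torch.sqrt(a_to) * x0_pred + torch.sqrt(1.0 - a_to) * eps

x0 = torch.randn(4)                               # pretend VAE latent of the input mixture
x = x0.clone()
for t in range(T - 1):                            # inversion: clean latent -> noisy latent
    x = ddim_step(x, alphas_cumprod[t], alphas_cumprod[t + 1], eps_model(x, t))
for t in reversed(range(T - 1)):                  # denoising: noisy latent -> clean latent
    x = ddim_step(x, alphas_cumprod[t + 1], alphas_cumprod[t], eps_model(x, t))

print("round-trip error:", (x - x0).abs().max().item())  # small, nonzero approximation error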
To install ZeroSep:

1. Clone this repo

   ```bash
   git clone https://github.com/WikiChao/ZeroSep.git
   cd ZeroSep
   ```

2. (Optional) Create & activate a Conda environment

   ```bash
   conda create -n zerosep python=3.9
   conda activate zerosep
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. (If using private Hugging Face models)

   ```bash
   huggingface-cli login
   ```

5. (If using Tango) download the Tango models from https://github.com/declare-lab/tango and put them into the code/ folder
Launch the interactive demo:

```bash
cd code
python demo.py
```

Then open http://localhost:7860 or the public link in your browser. Upload an audio/video file, select your model & inversion strategy, enter a prompt (e.g. “dog bark”), and click Run.
Separate a single audio file with one command:

```bash
cd code
python separate.py --input examples/BMayJId0X1s_120.wav --target "man speech"
```

Or with full control over every option:

```bash
python separate.py --input examples/BMayJId0X1s_120.wav \
    --target "man speech" \
    --source "man talking with background music" \
    --model "cvssp/audioldm-s-full-v2" \
    --mode "ddpm" \
    --steps 50 \
    --tstart 50 \
    --seed 42 \
    --target_guidance 1.0 \
    --source_guidance 1.0 \
    --output_dir results \
    --output_name "extracted_speech"
```

| Parameter | Short | Description | Default |
|---|---|---|---|
| --input | -i | Input audio file | Required |
| --target | -t | Sound to extract | Required |
| --source | -s | Source description | "" |
| --model | -m | Diffusion model | "cvssp/audioldm-s-full-v2" |
| --mode | | Separation algorithm | "ddpm" |
| --steps | | Diffusion steps | 50 |
| --tstart | | Start timestep | Same as steps |
| --target_guidance | | Target CFG scale | 1.0 |
| --source_guidance | | Source CFG scale | 1.0 |
| --output_dir | -o | Output directory | "results" |
We've included several sample video/audio files in the examples folder to help you get started.
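If you want to run the CLI over all of the bundled examples at once, a minimal batch script might look like the sketch below. It only uses the flags documented above; the target prompt and output names are placeholders you would change for your own data.

```python
# Batch-separation sketch: invokes separate.py once per bundled example file.
# Run it from the code/ directory; the prompt below is just a placeholder.
import subprocess
from pathlib import Path

TARGET_PROMPT = "man speech"  # the sound you want to extract from every file

for wav in sorted(Path("examples").glob("*.wav")):
    subprocess.run(
        [
            "python", "separate.py",
            "--input", str(wav),
            "--target", TARGET_PROMPT,
            "--model", "cvssp/audioldm-s-full-v2",
            "--mode", "ddpm",
            "--steps", "50",
            "--output_dir", "results",
            "--output_name", f"{wav.stem}_separated",
        ],
        check=True,
    )
```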
If you use ZeroSep, please cite our paper:
```bibtex
@article{huang2025zerosep,
  title={ZeroSep: Separate Anything in Audio with Zero Training},
  author={Huang, Chao and Ma, Yuesheng and Huang, Junxuan and Liang, Susan and Tang, Yunlong and Bi, Jing and Liu, Wenqiang and Mesgarani, Nima and Xu, Chenliang},
  journal={arXiv preprint arXiv:2505.23625},
  year={2025}
}
```

This work would not have been possible without the contributions of several outstanding projects:
- AudioLDM & AudioLDM2 (Liu et al.) for providing the foundational diffusion model architectures
- Tango (Ghosal et al.) for their audio generation framework and model support
- Gradio team for their excellent interactive UI framework enabling our demo
- AudioEditingCode (Manor et al.), whose codebase our implementation builds substantially upon. We sincerely appreciate their work and encourage supporting their repository.
- Code adapted from AudioEditingCode, including inversion and forward processes, is used under their MIT license.
- AudioLDM and AudioLDM2 models are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Any use of these model weights is subject to the same license terms.
- All other original code in this repository is released under the MIT license.
This project is licensed under the MIT License – see LICENSE for details.