ZeroSep is a training-free audio source separation framework that repurposes pre-trained text-guided diffusion models for zero-shot separation.
No fine-tuning, no task-specific data—just latent inversion + text-conditioned denoising to isolate any sound you describe.
- Zero-shot separation: separate without any additional training
- Open-set: isolate arbitrary sounds via natural-language prompts
- Model-agnostic: works with AudioLDM, AudioLDM2, Tango, or any text-guided diffusion backbone
- Flexible inversion: choose DDIM or DDPM
- Built-in Gradio demo for quick interactive use
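To make the "latent inversion + text-conditioned denoising" idea concrete, here is a small, self-contained toy sketch of the deterministic DDIM round-trip that the inversion relies on. Everything in it (the noise schedule, the `eps_model` stand-in, the random latent) is made up for illustration only; it is not ZeroSep's actual code, where the noise predictor is the pre-trained text-conditioned U-Net and the denoising pass is guided by your prompt describing the target sound.

```python
# Toy illustration of DDIM inversion + denoising (NOT the ZeroSep implementation).
import torch

torch.manual_seed(0)
T = 50                                            # number of diffusion steps
alphas_cumprod = torch.linspace(0.999, 0.02, T)   # toy cumulative-alpha noise schedule

def eps_model(x, t):
    """Stand-in for the text-conditioned U-Net noise predictor."""
    return 0.1 * x

def ddim_step(x, a_from, a_to, eps):
    """Deterministic DDIM transition between two noise levels (eta = 0)."""
    x0_pred = (x - torch.sqrt(1.0 - a_from) * eps) / torch.sqrt(a_from)
    return torch.sqrt(a_to) * x0_pred + torch.sqrt(1.0 - a_to) * eps

x0 = torch.randn(4)                               # pretend VAE latent of the input mixture
x = x0.clone()
for t in range(T - 1):                            # inversion: clean latent -> noisy latent
    x = ddim_step(x, alphas_cumprod[t], alphas_cumprod[t + 1], eps_model(x, t))
for t in reversed(range(T - 1)):                  # denoising: noisy latent -> clean latent
    x = ddim_step(x, alphas_cumprod[t + 1], alphas_cumprod[t], eps_model(x, t))

print("round-trip error:", (x - x0).abs().max().item())  # small, nonzero approximation error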
To install ZeroSep:

1. Clone this repo

   ```bash
   git clone https://github.com/WikiChao/ZeroSep.git
   cd ZeroSep
   ```

2. (Optional) Create & activate a Conda environment

   ```bash
   conda create -n zerosep python=3.9
   conda activate zerosep
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. (If using private Hugging Face models)

   ```bash
   huggingface-cli login
   ```

5. (If using Tango) download the Tango models from https://github.com/declare-lab/tango and put them into the code/ folder
Launch the interactive demo:

```bash
cd code
python demo.py
```

Then open http://localhost:7860 or the public link in your browser. Upload an audio/video file, select your model & inversion strategy, enter a prompt (e.g. “dog bark”), and click Run.
Separate a single audio file with one command:

```bash
cd code
python separate.py --input examples/BMayJId0X1s_120.wav --target "man speech"
```

Or with full control over every option:

```bash
python separate.py --input examples/BMayJId0X1s_120.wav \
    --target "man speech" \
    --source "man talking with background music" \
    --model "cvssp/audioldm-s-full-v2" \
    --mode "ddpm" \
    --steps 50 \
    --tstart 50 \
    --seed 42 \
    --target_guidance 1.0 \
    --source_guidance 1.0 \
    --output_dir results \
    --output_name "extracted_speech"
```

| Parameter | Short | Description | Default |
|---|---|---|---|
| --input | -i | Input audio file | Required |
| --target | -t | Sound to extract | Required |
| --source | -s | Source description | "" |
| --model | -m | Diffusion model | "cvssp/audioldm-s-full-v2" |
| --mode | | Separation algorithm | "ddpm" |
| --steps | | Diffusion steps | 50 |
| --tstart | | Start timestep | Same as steps |
| --target_guidance | | Target CFG scale | 1.0 |
| --source_guidance | | Source CFG scale | 1.0 |
| --output_dir | -o | Output directory | "results" |
We've included several sample video/audio files in the examples folder to help you get started.
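If you want to run the CLI over all of the bundled examples at once, a minimal batch script might look like the sketch below. It only uses the flags documented above; the target prompt and output names are placeholders you would change for your own data.

```python
# Batch-separation sketch: invokes separate.py once per bundled example file.
# Run it from the code/ directory; the prompt below is just a placeholder.
import subprocess
from pathlib import Path

TARGET_PROMPT = "man speech"  # the sound you want to extract from every file

for wav in sorted(Path("examples").glob("*.wav")):
    subprocess.run(
        [
            "python", "separate.py",
            "--input", str(wav),
            "--target", TARGET_PROMPT,
            "--model", "cvssp/audioldm-s-full-v2",
            "--mode", "ddpm",
            "--steps", "50",
            "--output_dir", "results",
            "--output_name", f"{wav.stem}_separated",
        ],
        check=True,
    )
```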
If you use ZeroSep, please cite our paper:
```bibtex
@article{huang2025zerosep,
  title={ZeroSep: Separate Anything in Audio with Zero Training},
  author={Huang, Chao and Ma, Yuesheng and Huang, Junxuan and Liang, Susan and Tang, Yunlong and Bi, Jing and Liu, Wenqiang and Mesgarani, Nima and Xu, Chenliang},
  journal={arXiv preprint arXiv:2505.23625},
  year={2025}
}
```

This work would not have been possible without the contributions of several outstanding projects:
- AudioLDM & AudioLDM2 (Liu et al.) for providing the foundational diffusion model architectures
- Tango (Ghosal et al.) for their audio generation framework and model support
- Gradio team for their excellent interactive UI framework enabling our demo
- AudioEditingCode (Manor et al.), whose codebase our implementation builds substantially upon. We sincerely appreciate their work and encourage supporting their repository.
- Code adapted from AudioEditingCode, including inversion and forward processes, is used under their MIT license.
- AudioLDM and AudioLDM2 models are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Any use of these model weights is subject to the same license terms.
- All other original code in this repository is released under the MIT license.
This project is licensed under the MIT License – see LICENSE for details.