Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models
1Nanyang Technological University, Singapore
2Institute for Infocomm Research, A*STAR, Singapore
3The Hong Kong Polytechnic University, Hong Kong
* Equal contribution
This repository contains the implementation of StreamVoiceAnon, a real-time voice anonymization / voice conversion model.
Figure: (a) Training, (b) Inference
git clone https://github.com/Plachtaa/StreamVoiceAnon.git
cd StreamVoiceAnon
pip install -r requirements.txt
If running on Windows, also install:
pip install triton-windows==3.2.0.post13
This is required to run inference with a real-time factor (RTF) below 1.0.
Full macOS support is still under construction.
hf download Plachta/StreamVoiceAnon --local-dir pretrained_checkpoints/
Below is an example command to launch single-node, multi-GPU training with the Emilia dataset streamed from Hugging Face:
accelerate launch trainers/arvc_trainer.py --config_path configs/config_firefly_arvcasr_8192_delay0_8.yaml --mixed-precision bf16
To customize the model config or training datasets, we encourage users to read the config files and training code.
Offline inference
# --delay (required): delay in number of frames
python evaluations/infer_arvc.py \
--src_path <path_to_audio> \
--ref_path <path_to_audio> \
--out_dir <path_to_output_directory> \
--delay 2 \
--compile
Simulated online inference
# --delay (required): delay in number of frames
# --decode_chunk_frames: number of frames the encoder and vocoder process at a time
python evaluations/infer_arvc.py \
--src_path <path_to_audio> \
--ref_path <path_to_audio> \
--out_dir <path_to_output_directory> \
--delay 2 \
--compile \
--simulate_streaming \
--decode_chunk_frames 1
This simulates chunk-by-chunk online inference with the specified chunk size. The source audio (src_path) has no length limit here. The reference audio (ref_path) is truncated to a maximum length if it is longer than that limit.
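The chunk-by-chunk behaviour described above can be pictured with a small sketch. This is not the repository's code: `iter_chunks`, `truncate_reference`, `simulate_streaming`, `max_ref_frames`, and the `convert_chunk` callable are hypothetical names chosen for illustration, and frames are represented as plain integers.

```python
from typing import Callable, Iterator, List, Sequence


def iter_chunks(frames: Sequence[int], chunk_frames: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most `chunk_frames` frames."""
    if chunk_frames < 1:
        raise ValueError("chunk_frames must be at least 1")
    for start in range(0, len(frames), chunk_frames):
        yield list(frames[start:start + chunk_frames])


def truncate_reference(ref_frames: Sequence[int], max_frames: int) -> List[int]:
    """Keep only the first `max_frames` frames of the reference (assumed policy)."""
    return list(ref_frames[:max_frames])


def simulate_streaming(
    src_frames: Sequence[int],
    ref_frames: Sequence[int],
    chunk_frames: int,
    max_ref_frames: int,
    convert_chunk: Callable[[List[int], List[int]], List[int]],
) -> List[int]:
    """Process the source one chunk at a time against a truncated reference."""
    ref = truncate_reference(ref_frames, max_ref_frames)
    output: List[int] = []
    # The source may be arbitrarily long; only one chunk is handled per step.
    for chunk in iter_chunks(src_frames, chunk_frames):
        output.extend(convert_chunk(chunk, ref))
    return output
```

Here `convert_chunk` stands in for whatever the model does with one chunk; the sketch only shows that the source is consumed incrementally while the reference is prepared once.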
Use the --alpha flag to control the noise mixing ratio on speaker embeddings. A value of 1.0 means no noise (pure voice conversion), while lower values blend more noise into the speaker representation for stronger anonymization.
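As a rough illustration of what such a mixing ratio could mean, the sketch below linearly interpolates between a speaker embedding and Gaussian noise. The interpolation formula, the noise distribution, and the function name `mix_speaker_embedding` are assumptions made for this example; the repository may implement the mixing differently.

```python
import random
from typing import List, Optional, Sequence


def mix_speaker_embedding(
    embedding: Sequence[float],
    alpha: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Blend an embedding with noise: alpha * embedding + (1 - alpha) * noise.

    Assumption: standard Gaussian noise and linear blending. With alpha = 1.0
    the noise term vanishes and the embedding is returned unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = rng or random.Random()
    return [
        alpha * value + (1.0 - alpha) * rng.gauss(0.0, 1.0)
        for value in embedding
    ]
```

In the repository itself, the ratio is set on the command line with --alpha: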
python evaluations/infer_arvc.py \
--src_path <path_to_source_audio> \
--ref_path <path_to_reference_audio> \
--out_dir <path_to_output_directory> \
--delay 2 \
--alpha 0.8 \
--compile
Provide multiple --ref_path entries to derive a combined speaker representation from several reference utterances. Using multiple references further improves privacy protection: the output is harder to trace back to a real speaker, and the source's original speaker characteristics are distorted more strongly. You can optionally crop each reference to a specific duration (in seconds) with --ref_crop_lengths.
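One simple way to picture a combined speaker representation is to crop each reference and average the per-reference embeddings. The sketch below does exactly that; averaging is an assumption for illustration only, and `crop_reference` and `combine_embeddings` are hypothetical helpers, not functions from the repository.

```python
from typing import List, Sequence


def crop_reference(
    samples: Sequence[float], sample_rate: int, crop_seconds: float
) -> List[float]:
    """Keep the first `crop_seconds` seconds of a reference waveform."""
    num_samples = int(round(sample_rate * crop_seconds))
    return list(samples[:num_samples])


def combine_embeddings(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average several speaker embeddings element-wise (assumed combination rule)."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    count = len(embeddings)
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]
```

With the repository's script, the references and crop lengths are passed as follows: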
python evaluations/infer_arvc.py \
--src_path <path_to_source_audio> \
--ref_path <path_to_ref1> <path_to_ref2> <path_to_ref3> \
--ref_crop_lengths 5.0 3.0 4.0 \
--out_dir <path_to_output_directory> \
--delay 2 \
--compile
Multiple references can be combined with --alpha:
python evaluations/infer_arvc.py \
--src_path <path_to_source_audio> \
--ref_path <path_to_ref1> <path_to_ref2> \
--ref_crop_lengths 5.0 5.0 \
--out_dir <path_to_output_directory> \
--delay 2 \
--alpha 0.7 \
--compile
Real-time inference
python evaluations/real-time-gui.py
This UI behaves the same as simulated online inference. It uses --compile by default, so please make sure Triton is installed (as described above) before using it.
- Release privacy protection code
- Release metrics for voice conversion & speaker anonymization
- Release training code (for VC model)
- Full MacOS support
- More to be added
If you find our repository valuable for your work, please consider giving a star to this repo and citing our paper:
@misc{kuzmin2026streamvoiceanonenhancingutilityrealtime,
title={Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models},
author={Nikita Kuzmin and Songting Liu and Kong Aik Lee and Eng Siong Chng},
year={2026},
eprint={2601.13948},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2601.13948},
}
- Co-author: https://github.com/paniquex
- Computation resources: https://www.nscc.sg/
- Real-time GUI: https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI
- Speaker representations (1 of 2): https://huggingface.co/funasr/campplus
- Speaker representations (2 of 2): https://github.com/SparkAudio/Spark-TTS
- Speech acoustic codec: https://huggingface.co/fishaudio/fish-speech-1.5
- Idea: https://arxiv.org/html/2401.11053v1
- VoicePrivacyChallenge: https://www.voiceprivacychallenge.org/

