Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models

Nikita Kuzmin¹,²*, Songting Liu¹*, Kong Aik Lee³, Eng Siong Chng¹
¹Nanyang Technological University, Singapore
²Institute for Infocomm Research, A*STAR, Singapore
³The Hong Kong Polytechnic University, Hong Kong
* Equal contribution

ICASSP 2026

Architecture

This repository contains the implementation of StreamVoiceAnon, a real-time voice anonymization / voice conversion model.

Figure: model architecture during (a) training and (b) inference.

Installation

git clone https://github.com/Plachtaa/StreamVoiceAnon.git
cd StreamVoiceAnon
pip install -r requirements.txt

On Windows, also install triton-windows:

pip install triton-windows==3.2.0.post13

This is required to run inference with a real-time factor (RTF) below 1.0.
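For readers unfamiliar with the term, RTF is conventionally the time spent processing divided by the duration of the audio processed, so a value below 1.0 means the model keeps up with real time. The helper below is a minimal sketch for measuring it around any callable; the names `real_time_factor` and `measure_rtf` are illustrative and are not part of this repository.

```python
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio that was processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(process: Callable[[], object], audio_seconds: float) -> float:
    """Time a callable that processes `audio_seconds` of audio and return its RTF."""
    start = time.perf_counter()
    process()  # hypothetical stand-in for a full inference call
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)
```

An RTF of 0.5 means one second of audio takes half a second to process; anything above 1.0 cannot run live without falling behind.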

Full macOS support is still in progress.

Download Pretrained Models

hf download Plachta/StreamVoiceAnon --local-dir pretrained_checkpoints/

Training

Below is an example command to launch single-node multi-GPU training with the streaming Emilia dataset from HuggingFace:

accelerate launch trainers/arvc_trainer.py --config_path configs/config_firefly_arvcasr_8192_delay0_8.yaml --mixed-precision bf16

To customize the model config or training datasets, see the config files and the training code.

Inference

Offline inference

# --delay sets the delay in number of frames (required)
python evaluations/infer_arvc.py \
    --src_path <path_to_source_audio> \
    --ref_path <path_to_reference_audio> \
    --out_dir <path_to_output_directory> \
    --delay 2 \
    --compile

Simulated online inference

# --delay sets the delay in number of frames (required)
# --decode_chunk_frames sets how many frames the encoder and vocoder process at a time
python evaluations/infer_arvc.py \
    --src_path <path_to_source_audio> \
    --ref_path <path_to_reference_audio> \
    --out_dir <path_to_output_directory> \
    --delay 2 \
    --compile \
    --simulate_streaming \
    --decode_chunk_frames 1

This simulates chunk-by-chunk online inference with the specified chunk size. src_path (source audio) has no length limit here. ref_path (reference audio) is truncated to a maximum length if it exceeds that limit.
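The general idea of chunk-by-chunk processing can be sketched as follows. This is only an illustration of the pattern, not the code in evaluations/infer_arvc.py: `iter_chunks`, `simulate_streaming`, and `process_chunk` are hypothetical names, and the real script also keeps model state between chunks, which is omitted here.

```python
from typing import Callable, Iterator, List, Sequence


def iter_chunks(frames: Sequence, chunk_frames: int) -> Iterator[Sequence]:
    """Yield consecutive chunks of at most `chunk_frames` frames."""
    if chunk_frames < 1:
        raise ValueError("chunk_frames must be at least 1")
    for start in range(0, len(frames), chunk_frames):
        yield frames[start:start + chunk_frames]


def simulate_streaming(
    frames: Sequence,
    chunk_frames: int,
    process_chunk: Callable[[Sequence], Sequence],
) -> List:
    """Feed the input one chunk at a time and concatenate the outputs."""
    output: List = []
    for chunk in iter_chunks(frames, chunk_frames):
        # Each call only sees the current chunk, as it would in a live stream.
        output.extend(process_chunk(chunk))
    return output
```

With a chunk size of 1, each call receives a single frame, which mirrors `--decode_chunk_frames 1` in the command above.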

Anonymization with noise mixing

Use the --alpha flag to control the noise mixing ratio on speaker embeddings. A value of 1.0 means no noise (pure voice conversion), while lower values blend more noise into the speaker representation for stronger anonymization.

python evaluations/infer_arvc.py \
    --src_path <path_to_source_audio> \
    --ref_path <path_to_reference_audio> \
    --out_dir <path_to_output_directory> \
    --delay 2 \
    --alpha 0.8 \
    --compile
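One way to picture the effect of --alpha is as a linear blend between the speaker embedding and random noise. The sketch below assumes exactly that, with Gaussian noise; the repository may use a different noise distribution or mixing rule, and `mix_with_noise` is a hypothetical name used only for illustration.

```python
import random
from typing import List, Optional, Sequence


def mix_with_noise(
    embedding: Sequence[float],
    alpha: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Blend an embedding with noise: alpha * embedding + (1 - alpha) * noise.

    Assumption: linear interpolation with standard Gaussian noise.
    alpha = 1.0 returns the embedding unchanged (pure voice conversion).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if rng is None:
        rng = random.Random()
    noise = [rng.gauss(0.0, 1.0) for _ in embedding]
    return [alpha * e + (1.0 - alpha) * n for e, n in zip(embedding, noise)]
```

Under this assumption, lowering alpha moves the result further from the reference speaker and closer to pure noise.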

Multiple reference audios

Provide multiple --ref_path entries to derive a combined speaker representation from several reference utterances. Using multiple references further improves privacy protection: the output is harder to trace back to any real speaker, and the source's original speaker characteristics are distorted more strongly. You can optionally crop each reference to a specific duration (in seconds) with --ref_crop_lengths.

python evaluations/infer_arvc.py \
    --src_path <path_to_source_audio> \
    --ref_path <path_to_ref1> <path_to_ref2> <path_to_ref3> \
    --ref_crop_lengths 5.0 3.0 4.0 \
    --out_dir <path_to_output_directory> \
    --delay 2 \
    --compile
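Cropping a reference to a duration in seconds amounts to keeping a fixed number of samples. The sketch below assumes the crop is taken from the start of each reference and that one crop length is given per reference; both are assumptions about --ref_crop_lengths, and `crop_reference` and `crop_references` are hypothetical helpers, not functions from this repository.

```python
from typing import List, Optional, Sequence


def crop_reference(
    samples: Sequence[float],
    sample_rate: int,
    crop_seconds: Optional[float] = None,
) -> List[float]:
    """Keep the first `crop_seconds` of audio (assumed: crop from the start)."""
    if crop_seconds is None:
        return list(samples)
    if crop_seconds <= 0:
        raise ValueError("crop_seconds must be positive")
    return list(samples[: int(round(crop_seconds * sample_rate))])


def crop_references(
    refs: Sequence[Sequence[float]],
    sample_rate: int,
    crop_lengths: Optional[Sequence[Optional[float]]] = None,
) -> List[List[float]]:
    """Crop each reference with its own length (assumed: one length per reference)."""
    if crop_lengths is None:
        crop_lengths = [None] * len(refs)
    if len(crop_lengths) != len(refs):
        raise ValueError("need exactly one crop length per reference")
    return [crop_reference(r, sample_rate, c) for r, c in zip(refs, crop_lengths)]
```

A reference shorter than its crop length is returned whole, since slicing past the end is harmless.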

Combined: multiple references with noise anonymization

python evaluations/infer_arvc.py \
    --src_path <path_to_source_audio> \
    --ref_path <path_to_ref1> <path_to_ref2> \
    --ref_crop_lengths 5.0 5.0 \
    --out_dir <path_to_output_directory> \
    --delay 2 \
    --alpha 0.7 \
    --compile

Real-time inference

python evaluations/real-time-gui.py

The GUI behaves the same way as simulated online inference. It enables --compile by default, so make sure triton is installed (see Installation) before using it.

TODO

Release privacy protection code
Release metrics for voice conversion & speaker anonymization
Release training code (for VC model)
Full MacOS support
More to be added

Citation

If you find our repository valuable for your work, please consider giving a star to this repo and citing our paper:

@misc{kuzmin2026streamvoiceanonenhancingutilityrealtime,
      title={Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models}, 
      author={Nikita Kuzmin and Songting Liu and Kong Aik Lee and Eng Siong Chng},
      year={2026},
      eprint={2601.13948},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2601.13948}, 
}

Acknowledgements

Co-author: https://github.com/paniquex
Computation resources: https://www.nscc.sg/
Real-time GUI: https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI
Speaker representations (1 of 2): https://huggingface.co/funasr/campplus
Speaker representations (2 of 2): https://github.com/SparkAudio/Spark-TTS
Speech acoustic codec: https://huggingface.co/fishaudio/fish-speech-1.5
Idea: https://arxiv.org/html/2401.11053v1
VoicePrivacyChallenge: https://www.voiceprivacychallenge.org/
