Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models

ICASSP 2026 Demo Paper Models

Nikita Kuzmin1,2*, Songting Liu1*, Kong Aik Lee3, Eng Siong Chng1
1Nanyang Technological University, Singapore
2Institute for Infocomm Research, A*STAR, Singapore
3The Hong Kong Polytechnic University, Hong Kong
* Equal contribution

Architecture

This repository contains the implementation of StreamVoiceAnon, a real-time voice anonymization / voice conversion model.

[Figure: model architecture. (a) Training. (b) Inference.]

Installation

git clone https://github.com/Plachtaa/StreamVoiceAnon.git
cd StreamVoiceAnon
pip install -r requirements.txt

On Windows, also install Triton:

pip install triton-windows==3.2.0.post13

Triton is required to run inference with a real-time factor (RTF) below 1.0.
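
For reference, the real-time factor is commonly defined as processing time divided by the duration of the audio processed, so a value below 1.0 means inference keeps up with real time. The sketch below shows one way to measure it; `process` is a hypothetical stand-in for any inference call and is not part of this repository.

```python
import time


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(process, audio, sample_rate: int) -> float:
    """Time a hypothetical `process(audio)` call and return its RTF.

    `audio` is any sequence of samples; its duration is len(audio) / sample_rate.
    """
    start = time.perf_counter()
    process(audio)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(audio) / sample_rate)
```

For example, processing 1 second of audio in 0.5 seconds gives an RTF of 0.5.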

Full macOS support is still in progress.

Download Pretrained Models

hf download Plachta/StreamVoiceAnon --local-dir pretrained_checkpoints/

Training

The following example command launches single-node, multi-GPU training on the Emilia dataset, streamed from Hugging Face:

accelerate launch trainers/arvc_trainer.py --config_path configs/config_firefly_arvcasr_8192_delay0_8.yaml --mixed-precision bf16

To customize the model configuration or the training datasets, refer to the config files and the training code.

Inference

Offline inference

# --delay is required: the delay in number of frames.
python evaluations/infer_arvc.py \
    --src_path <path_to_source_audio> \
    --ref_path <path_to_reference_audio> \
    --out_dir <path_to_output_directory> \
    --delay 2 \
    --compile

Simulated online inference

# --delay is required: the delay in number of frames.
# --decode_chunk_frames: number of frames the encoder and vocoder process at a time.
python evaluations/infer_arvc.py \
    --src_path <path_to_source_audio> \
    --ref_path <path_to_reference_audio> \
    --out_dir <path_to_output_directory> \
    --delay 2 \
    --compile \
    --simulate_streaming \
    --decode_chunk_frames 1

This simulates chunk-by-chunk online inference with the specified chunk size. The source audio (src_path) has no length limit here. The reference audio (ref_path) is truncated if it exceeds a maximum length.
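
The chunk-by-chunk idea can be sketched in a few lines. This is only an illustration of the control flow, not the repository's implementation: `iter_chunks`, `simulate_streaming`, and `process_chunk` are hypothetical names, and the real pipeline also keeps model state between chunks.

```python
from typing import Callable, Iterator, List, Sequence


def iter_chunks(frames: Sequence, chunk_frames: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `chunk_frames` frames."""
    if chunk_frames < 1:
        raise ValueError("chunk_frames must be at least 1")
    for start in range(0, len(frames), chunk_frames):
        yield frames[start:start + chunk_frames]


def simulate_streaming(
    frames: Sequence,
    chunk_frames: int,
    process_chunk: Callable[[Sequence], List],
) -> List:
    """Feed the input chunk by chunk, as an online system would receive it.

    `process_chunk` is a hypothetical per-chunk function; its outputs are
    concatenated in arrival order.
    """
    outputs: List = []
    for chunk in iter_chunks(frames, chunk_frames):
        outputs.extend(process_chunk(chunk))
    return outputs
```

Because the input is consumed incrementally, its total length does not need to be known in advance, which is consistent with the source audio having no length limit.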

Anonymization with noise mixing

Use the --alpha flag to control the noise mixing ratio on speaker embeddings. A value of 1.0 means no noise (pure voice conversion), while lower values blend more noise into the speaker representation for stronger anonymization.

python evaluations/infer_arvc.py \
    --src_path <path_to_source_audio> \
    --ref_path <path_to_reference_audio> \
    --out_dir <path_to_output_directory> \
    --delay 2 \
    --alpha 0.8 \
    --compile
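
To illustrate the role of --alpha, here is a minimal sketch that assumes the mixing is a linear interpolation between the speaker embedding and Gaussian noise. That formula, the noise distribution, and the function name `mix_with_noise` are assumptions made for illustration; the repository's actual mixing may differ.

```python
import random
from typing import List, Optional


def mix_with_noise(
    embedding: List[float],
    alpha: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Blend a speaker embedding with noise (assumed linear interpolation).

    alpha = 1.0 returns the embedding unchanged (no noise);
    lower values give the noise more weight.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if rng is None:
        rng = random.Random()
    # Assumption: standard Gaussian noise, one value per embedding dimension.
    noise = [rng.gauss(0.0, 1.0) for _ in embedding]
    return [alpha * e + (1.0 - alpha) * n for e, n in zip(embedding, noise)]
```

Under this assumption, --alpha 0.8 would keep 80% of the reference speaker embedding and replace the remaining 20% with noise.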

Multiple reference audios

Provide multiple --ref_path entries to derive a combined speaker representation from several reference utterances. Using multiple references further improves privacy protection: the output is harder to trace back to a real speaker, and the source speaker's original characteristics are distorted more strongly. You can optionally crop each reference to a specific duration (in seconds) with --ref_crop_lengths.

python evaluations/infer_arvc.py \
    --src_path <path_to_source_audio> \
    --ref_path <path_to_ref1> <path_to_ref2> <path_to_ref3> \
    --ref_crop_lengths 5.0 3.0 4.0 \
    --out_dir <path_to_output_directory> \
    --delay 2 \
    --compile
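
The example above suggests that crop lengths pair with references by position. The sketch below illustrates that pairing under two assumptions that the text does not confirm: that the crop keeps the start of each reference, and that the number of crop lengths must match the number of references. `crop_references` is a hypothetical helper, not a function from this repository.

```python
from typing import List, Optional, Sequence


def crop_references(
    waveforms: Sequence[Sequence[float]],
    sample_rate: int,
    crop_seconds: Optional[Sequence[float]] = None,
) -> List[List[float]]:
    """Crop each reference waveform to its own duration in seconds.

    Crop lengths are matched to waveforms by position. If no crop lengths
    are given, the waveforms are returned uncropped.
    """
    if crop_seconds is None:
        return [list(w) for w in waveforms]
    if len(crop_seconds) != len(waveforms):
        raise ValueError("need exactly one crop length per reference")
    cropped = []
    for wav, seconds in zip(waveforms, crop_seconds):
        num_samples = int(round(seconds * sample_rate))
        # Assumption: keep the first `num_samples` samples of the reference.
        cropped.append(list(wav[:num_samples]))
    return cropped
```

A reference shorter than its requested crop length is returned whole, since slicing past the end of a sequence is safe in Python.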

Combined: multiple references with noise anonymization

python evaluations/infer_arvc.py \
    --src_path <path_to_source_audio> \
    --ref_path <path_to_ref1> <path_to_ref2> \
    --ref_crop_lengths 5.0 5.0 \
    --out_dir <path_to_output_directory> \
    --delay 2 \
    --alpha 0.7 \
    --compile

Real-time inference

python evaluations/real-time-gui.py

The GUI behaves the same way as simulated online inference. It enables --compile by default, so make sure Triton is installed (see Installation) before launching it.

TODO

  • Release privacy protection code
  • Release metrics for voice conversion & speaker anonymization
  • Release training code (for VC model)
  • Full macOS support
  • More to be added

Citation

If you find our repository valuable for your work, please consider giving a star to this repo and citing our paper:

@misc{kuzmin2026streamvoiceanonenhancingutilityrealtime,
      title={Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models}, 
      author={Nikita Kuzmin and Songting Liu and Kong Aik Lee and Eng Siong Chng},
      year={2026},
      eprint={2601.13948},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2601.13948}, 
}

Acknowledgements
