A novel neural voice conversion system that leverages vowel-specific transformations and incremental memory learning for few-shot speaker adaptation.
VowelVC implements the core concepts of Incremental Conditional GMM-based (ICGMM) Voice Conversion using modern deep learning architectures (VAE, GAN, diffusion). The system centers on vowel-based transformations and supports adaptation from minimal data.
- Vowel-Based Transformation: Uses vowel detection and transformation as the core conversion mechanism
- Few-Shot Adaptation: Single-utterance speaker adaptation through incremental learning
- Multiple Generators: Supports both GAN- and diffusion-based generation
- Memory-Augmented Learning: Dynamic vowel memory bank with attention mechanisms (see the sketch after this list)
- State-of-the-Art Performance: Outperforms the AutoVC and AdaIN-VC baselines on both LSD and MOS (see the results below)
- Real-Time Inference: Efficient processing suitable for interactive applications
- Comprehensive Evaluation: Rigorous experimental validation with multiple metrics
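The vowel memory bank referenced above is the core of the memory-augmented design: a learnable slot per vowel class, queried with attention over frame-level content features. The following is a minimal sketch of that idea, not the implementation in `core/`; the class name, dot-product attention, and slot count are illustrative (the feature sizes mirror `vowel_dim: 32` and `latent_dim: 64` from the configuration).

```python
import torch
import torch.nn as nn

class VowelMemoryBank(nn.Module):
    """Minimal sketch: learnable per-vowel memory slots queried with attention."""

    def __init__(self, n_vowels: int = 10, vowel_dim: int = 32, query_dim: int = 64):
        super().__init__()
        # One learnable memory slot per vowel class (sizes are illustrative).
        self.memory = nn.Parameter(torch.randn(n_vowels, vowel_dim))
        self.query_proj = nn.Linear(query_dim, vowel_dim)

    def forward(self, content: torch.Tensor) -> torch.Tensor:
        # content: (batch, time, query_dim) frame-level content features
        queries = self.query_proj(content)       # (B, T, vowel_dim)
        scores = queries @ self.memory.t()       # (B, T, n_vowels)
        weights = torch.softmax(scores, dim=-1)  # attention over vowel slots
        return weights @ self.memory             # (B, T, vowel_dim)

# Usage sketch: retrieve vowel features to condition the generator.
# bank = VowelMemoryBank(); vowel_feats = bank(torch.randn(2, 120, 64))
```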
Objective results, log spectral distance (lower is better):

| Conversion Type | Baseline LSD (dB) | VowelVC LSD (dB) | Improvement |
|---|---|---|---|
| Male to Male | 14.77 | 7.96 | 46.1% |
| Male to Female | 15.25 | 7.52 | 50.7% |
| Female to Male | 15.18 | 8.02 | 47.2% |
| Female to Female | 14.50 | 8.06 | 44.5% |
Subjective results, mean opinion score (higher is better):

| Conversion Type | Baseline MOS | VowelVC MOS | Improvement |
|---|---|---|---|
| Male to Male | 3.36 | 3.84 | 14.3% |
| Male to Female | 3.30 | 3.83 | 15.9% |
| Female to Male | 3.30 | 3.91 | 18.7% |
| Female to Female | 3.26 | 3.94 | 21.0% |
- Clone the repository:

```bash
git clone https://github.com/yourusername/VowelVC.git
cd VowelVC
```

- Create a virtual environment:

```bash
python -m venv vowelvc_env
source vowelvc_env/bin/activate  # On Windows: vowelvc_env\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Install additional audio processing libraries:

```bash
pip install librosa soundfile
```

- Optional: install SpeechBrain for enhanced speaker embeddings:

```bash
pip install speechbrain
```

Download the CMU ARCTIC dataset:

```bash
python scripts/download_data.py --dataset arctic --output data/raw
```

Extract mel spectrograms and speaker embeddings:

```bash
python extract_features.py \
    --input_dir data/raw \
    --mel_out data/processed/mel \
    --spk_out data/processed/spk_embed
```
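For reference, this is a minimal sketch of the kind of feature the extraction step produces, assuming 80-bin log-mel spectrograms (matching `mel_dim: 80` in the configuration) and a 16 kHz sample rate; the actual STFT settings and output naming in `extract_features.py` may differ.

```python
import librosa
import numpy as np

# Assumed parameters for illustration; extract_features.py may use others.
wav, sr = librosa.load("data/raw/speaker1/audio1.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # shape (80, frames)
np.save("data/processed/mel/speaker1_audio1.npy", log_mel)
```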
Train VowelVC:

```bash
# Basic training with default configuration
python scripts/train_vowelvc.py

# Training with three-stage approach (recommended)
python scripts/train_vowelvc.py --three-stages

# Training with custom configuration
python scripts/train_vowelvc.py --config configs/vowelvc_custom.yaml
```

Train the baseline model for comparison:

```bash
python scripts/train_baseline.py --config configs/baseline.yaml
```

Convert a single utterance:

```bash
python scripts/inference.py \
    --source path/to/source/audio.wav \
    --target-speaker path/to/target/speaker/audio.wav \
    --output path/to/converted/output.wav \
    --model experiments/vowelvc/final_model.pth \
    --model-type vowelvc
```

Convert a directory of utterances in batch:

```bash
python scripts/batch_inference.py \
    --source_dir data/test/source \
    --target_dir data/test/targets \
    --output_dir results/converted \
    --model experiments/vowelvc/final_model.pth
```

Convert with single-utterance speaker adaptation:

```bash
python scripts/inference.py \
    --source source_audio.wav \
    --target-speaker target_speaker.wav \
    --output converted_output.wav \
    --model experiments/vowelvc/final_model.pth \
    --model-type vowelvc \
    --adapt-speaker  # Enable single-utterance adaptation
```

Compare VowelVC against the baseline:

```bash
python scripts/evaluate.py \
    --baseline-model experiments/baseline/final_model.pth \
    --vowelvc-model experiments/vowelvc/final_model.pth \
    --output-dir evaluation/results \
    --n-samples 100
```

Reproduce the paper-style results:

```bash
python scripts/evaluate.py \
    --vowelvc-model experiments/vowelvc/final_model.pth \
    --output-dir evaluation/paper_results \
    --paper-style
```

Project structure:

```text
VowelVC/
├── core/ # Core components
│ ├── models/ # Model architectures
│ │ ├── base.py # Base model classes
│ │ ├── encoders.py # Content and speaker encoders
│ │ ├── generators.py # GAN and diffusion generators
│ │ └── vocoders.py # Vocoder implementations
│ └── incremental.py # Incremental learning module
├── models/ # Main model implementations
│ ├── baseline_vc.py # Baseline VAE-GAN model
│ ├── vowel_vc.py # VowelVC implementation
│ └── configs/ # Model configurations
├── training/ # Training components
│ ├── trainer.py # Base trainer class
│ ├── vowel_trainer.py # VowelVC-specific trainer
│ ├── datasets.py # Dataset implementations
│ └── losses.py # Loss functions
├── evaluation/ # Evaluation framework
│ ├── metrics.py # Objective metrics
│ ├── subjective.py # Subjective evaluation
│ └── paper_comparison.py # Paper-style evaluation
├── scripts/ # Utility scripts
│ ├── train_vowelvc.py # Training script
│ ├── inference.py # Inference script
│ └── evaluate.py # Evaluation script
├── configs/ # Configuration files
├── extract_features.py # Feature extraction
└── requirements.txt         # Dependencies
```
Edit `configs/vowel_vc.yaml` to customize model parameters:

```yaml
model:
  name: "VowelVC"
  mel_dim: 80
  speaker_dim: 192
  latent_dim: 64
  vowel_dim: 32
  generator_type: "gan"  # or "diffusion"

training:
  epochs: 100
  batch_size: 16
  learning_rate: 0.0002

data:
  mel_dir: "data/processed/mel"
  speaker_embed_dir: "data/processed/spk_embed"
```

VowelVC employs a three-stage training approach (a parameter-selection sketch follows this list):
- Stage 1: General model training on diverse speaker data
- Stage 2: Vowel-specific adaptation and memory optimization
- Stage 3: Incremental learning capability development
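As a rough illustration of how such a schedule can be wired up, the sketch below selects which parameters to optimize in each stage. This is an assumption about the design, not the actual logic in `scripts/train_vowelvc.py`; the name filters (`"vowel"`, `"adapter"`) are hypothetical.

```python
import torch

def trainable_parameters(model: torch.nn.Module, stage: int):
    """Pick the parameter subset to optimize for a given training stage.

    Stage 1 trains everything; stages 2 and 3 freeze the rest of the
    network and update only vowel-memory or adapter modules (the name
    filters here are illustrative, not the repository's module names).
    """
    if stage == 1:
        return list(model.parameters())
    keyword = "vowel" if stage == 2 else "adapter"
    selected = []
    for name, param in model.named_parameters():
        param.requires_grad = keyword in name
        if param.requires_grad:
            selected.append(param)
    return selected

# Rebuild the optimizer at each stage boundary, e.g.:
# optimizer = torch.optim.Adam(trainable_parameters(model, stage=2), lr=2e-4)
```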
- Prepare your audio data in the following structure:

```text
data/raw/
├── speaker1/
│   ├── audio1.wav
│   ├── audio2.wav
│   └── ...
├── speaker2/
│   └── ...
```

- Extract features for your dataset:

```bash
python extract_features.py --input_dir data/raw/your_dataset
```

- Update the configuration files to point to your data paths, for example as sketched below.
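One way to do this programmatically, assuming the config is plain YAML (via PyYAML) and using hypothetical `your_dataset` paths, is to write out a custom config and pass it with `--config`:

```python
import yaml

# Load the default config and point the data section at your dataset.
with open("configs/vowel_vc.yaml") as f:
    config = yaml.safe_load(f)

config["data"]["mel_dir"] = "data/processed/your_dataset/mel"
config["data"]["speaker_embed_dir"] = "data/processed/your_dataset/spk_embed"

# Save as a custom config, usable with:
#   python scripts/train_vowelvc.py --config configs/vowelvc_custom.yaml
with open("configs/vowelvc_custom.yaml", "w") as f:
    yaml.safe_dump(config, f)
```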
For speaker adaptation with minimal data:
```python
import torch

from models.vowel_vc import VowelVC

# Load the pre-trained model
model = VowelVC.from_pretrained("experiments/vowelvc/final_model.pth")

# Adapt to a new speaker from a single utterance
model.adapt_single_utterance(source_mel, target_speaker_embedding)

# Perform the conversion
converted_audio = model.convert_voice(source_mel, target_speaker_embedding)
```

For higher-quality synthesis, decode the converted mel spectrogram with the HiFi-GAN vocoder:

```python
from core.models.vocoders import HiFiGANVocoder

# Use HiFi-GAN for higher-quality waveform synthesis
vocoder = HiFiGANVocoder()
audio = vocoder(converted_mel_spectrogram)
```

The evaluation framework provides comprehensive metrics:
- Objective Metrics (a minimal sketch of LSD and speaker similarity follows this list):
  - Log Spectral Distance (LSD)
  - Mel Cepstral Distortion (MCD)
  - Speaker Similarity (cosine similarity)
- Subjective Metrics:
  - Mean Opinion Score (MOS)
  - Preference Tests (Quality and Similarity)
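The sketch below shows one common formulation of two of these metrics; the repository's `evaluation/metrics.py` may differ in framing, alignment, and normalization details.

```python
import numpy as np

def log_spectral_distance(ref_spec, est_spec, eps=1e-10):
    """Frame-averaged log spectral distance in dB between two aligned
    magnitude spectrograms of shape (frames, freq_bins)."""
    diff = 10.0 * (np.log10(ref_spec + eps) - np.log10(est_spec + eps))
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))

def speaker_similarity(embed_a, embed_b, eps=1e-10):
    """Cosine similarity between two speaker embedding vectors."""
    num = np.dot(embed_a, embed_b)
    den = np.linalg.norm(embed_a) * np.linalg.norm(embed_b) + eps
    return float(num / den)
```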
Comparison with representative neural voice conversion methods:

| Method | LSD (dB) | MOS | Year |
|---|---|---|---|
| AutoVC | 10-15 | 3.2-3.5 | 2019 |
| AdaIN-VC | 8-12 | 3.5-3.8 | 2021 |
| VowelVC | 7.5-8.1 | 3.8-3.9 | 2025 |
Component contributions to overall performance:
- Vowel Memory Bank: 15-20% LSD improvement
- Incremental Adapter: 40% few-shot performance gain
- Enhanced Content Encoder: 8-12% MOS improvement
Acknowledgments:
- Original ICGMM approach by Jannati and Sayadiyan (2015)
- CMU ARCTIC dataset for evaluation
- SpeechBrain team for speaker embedding models
- PyTorch team for the deep learning framework
For questions and support:
- Issues: GitHub Issues
- Email: parham.soltany@gmail.com
- Research Group: Mr Jannati
Release highlights:
- Initial release with VowelVC implementation
- Comprehensive evaluation framework
- Paper-quality experimental results
- Single utterance adaptation capability