A novel neural voice conversion system that leverages vowel-specific transformations and incremental memory learning for few-shot speaker adaptation.
VowelVC implements the core concepts of Incremental Conditional GMM-based (ICGMM) Voice Conversion using modern deep learning architectures (VAE, GAN, diffusion). The system centers on vowel-based transformations and supports adaptation from minimal data.
- Vowel-Based Transformation: Uses vowel detection and transformation as the core conversion mechanism
- Few-Shot Adaptation: Single-utterance speaker adaptation through incremental learning
- Multiple Generators: Supports both GAN- and diffusion-based generation
- Memory-Augmented Learning: Dynamic vowel memory bank with attention mechanisms (see the sketch after this list)
- State-of-the-Art Performance: Outperforms the AutoVC and AdaIN-VC baselines on both LSD and MOS (see the results below)
- Real-Time Inference: Efficient processing suitable for interactive applications
- Comprehensive Evaluation: Rigorous experimental validation with multiple metrics
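The vowel memory bank referenced above is the core of the memory-augmented design: a learnable slot per vowel class, queried with attention over frame-level content features. The following is a minimal sketch of that idea, not the implementation in `core/`; the class name, dot-product attention, and slot count are illustrative (the feature sizes mirror `vowel_dim: 32` and `latent_dim: 64` from the configuration).

```python
import torch
import torch.nn as nn

class VowelMemoryBank(nn.Module):
    """Minimal sketch: learnable per-vowel memory slots queried with attention."""

    def __init__(self, n_vowels: int = 10, vowel_dim: int = 32, query_dim: int = 64):
        super().__init__()
        # One learnable memory slot per vowel class (sizes are illustrative).
        self.memory = nn.Parameter(torch.randn(n_vowels, vowel_dim))
        self.query_proj = nn.Linear(query_dim, vowel_dim)

    def forward(self, content: torch.Tensor) -> torch.Tensor:
        # content: (batch, time, query_dim) frame-level content features
        queries = self.query_proj(content)       # (B, T, vowel_dim)
        scores = queries @ self.memory.t()       # (B, T, n_vowels)
        weights = torch.softmax(scores, dim=-1)  # attention over vowel slots
        return weights @ self.memory             # (B, T, vowel_dim)

# Usage sketch: retrieve vowel features to condition the generator.
# bank = VowelMemoryBank(); vowel_feats = bank(torch.randn(2, 120, 64))
```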
Objective results, log spectral distance (lower is better):

| Conversion Type | Baseline LSD (dB) | VowelVC LSD (dB) | Improvement |
|---|---|---|---|
| Male to Male | 14.77 | 7.96 | 46.1% |
| Male to Female | 15.25 | 7.52 | 50.7% |
| Female to Male | 15.18 | 8.02 | 47.2% |
| Female to Female | 14.50 | 8.06 | 44.5% |
Subjective results, mean opinion score (higher is better):

| Conversion Type | Baseline MOS | VowelVC MOS | Improvement |
|---|---|---|---|
| Male to Male | 3.36 | 3.84 | 14.3% |
| Male to Female | 3.30 | 3.83 | 15.9% |
| Female to Male | 3.30 | 3.91 | 18.7% |
| Female to Female | 3.26 | 3.94 | 21.0% |
- Clone the repository:

```bash
git clone https://github.com/yourusername/VowelVC.git
cd VowelVC
```

- Create a virtual environment:

```bash
python -m venv vowelvc_env
source vowelvc_env/bin/activate  # On Windows: vowelvc_env\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Install additional audio processing libraries:

```bash
pip install librosa soundfile
```

- Optional: install SpeechBrain for enhanced speaker embeddings:

```bash
pip install speechbrain
```

Download the CMU ARCTIC dataset:

```bash
python scripts/download_data.py --dataset arctic --output data/raw
```

Extract mel spectrograms and speaker embeddings:

```bash
python extract_features.py \
    --input_dir data/raw \
    --mel_out data/processed/mel \
    --spk_out data/processed/spk_embed
```
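For reference, this is a minimal sketch of the kind of feature the extraction step produces, assuming 80-bin log-mel spectrograms (matching `mel_dim: 80` in the configuration) and a 16 kHz sample rate; the actual STFT settings and output naming in `extract_features.py` may differ.

```python
import librosa
import numpy as np

# Assumed parameters for illustration; extract_features.py may use others.
wav, sr = librosa.load("data/raw/speaker1/audio1.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # shape (80, frames)
np.save("data/processed/mel/speaker1_audio1.npy", log_mel)
```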
Train VowelVC:

```bash
# Basic training with default configuration
python scripts/train_vowelvc.py

# Training with three-stage approach (recommended)
python scripts/train_vowelvc.py --three-stages

# Training with custom configuration
python scripts/train_vowelvc.py --config configs/vowelvc_custom.yaml
```

Train the baseline model for comparison:

```bash
python scripts/train_baseline.py --config configs/baseline.yaml
```

Convert a single utterance:

```bash
python scripts/inference.py \
    --source path/to/source/audio.wav \
    --target-speaker path/to/target/speaker/audio.wav \
    --output path/to/converted/output.wav \
    --model experiments/vowelvc/final_model.pth \
    --model-type vowelvc
```

Convert a directory of utterances in batch:

```bash
python scripts/batch_inference.py \
    --source_dir data/test/source \
    --target_dir data/test/targets \
    --output_dir results/converted \
    --model experiments/vowelvc/final_model.pth
```

Convert with single-utterance speaker adaptation:

```bash
python scripts/inference.py \
    --source source_audio.wav \
    --target-speaker target_speaker.wav \
    --output converted_output.wav \
    --model experiments/vowelvc/final_model.pth \
    --model-type vowelvc \
    --adapt-speaker  # Enable single-utterance adaptation
```

Compare VowelVC against the baseline:

```bash
python scripts/evaluate.py \
    --baseline-model experiments/baseline/final_model.pth \
    --vowelvc-model experiments/vowelvc/final_model.pth \
    --output-dir evaluation/results \
    --n-samples 100
```

Reproduce the paper-style results:

```bash
python scripts/evaluate.py \
    --vowelvc-model experiments/vowelvc/final_model.pth \
    --output-dir evaluation/paper_results \
    --paper-style
```

Project structure:

```text
VowelVC/
├── core/ # Core components
│ ├── models/ # Model architectures
│ │ ├── base.py # Base model classes
│ │ ├── encoders.py # Content and speaker encoders
│ │ ├── generators.py # GAN and diffusion generators
│ │ └── vocoders.py # Vocoder implementations
│ └── incremental.py # Incremental learning module
├── models/ # Main model implementations
│ ├── baseline_vc.py # Baseline VAE-GAN model
│ ├── vowel_vc.py # VowelVC implementation
│ └── configs/ # Model configurations
├── training/ # Training components
│ ├── trainer.py # Base trainer class
│ ├── vowel_trainer.py # VowelVC-specific trainer
│ ├── datasets.py # Dataset implementations
│ └── losses.py # Loss functions
├── evaluation/ # Evaluation framework
│ ├── metrics.py # Objective metrics
│ ├── subjective.py # Subjective evaluation
│ └── paper_comparison.py # Paper-style evaluation
├── scripts/ # Utility scripts
│ ├── train_vowelvc.py # Training script
│ ├── inference.py # Inference script
│ └── evaluate.py # Evaluation script
├── configs/ # Configuration files
├── extract_features.py # Feature extraction
└── requirements.txt         # Dependencies
```
Edit `configs/vowel_vc.yaml` to customize model parameters:

```yaml
model:
  name: "VowelVC"
  mel_dim: 80
  speaker_dim: 192
  latent_dim: 64
  vowel_dim: 32
  generator_type: "gan"  # or "diffusion"

training:
  epochs: 100
  batch_size: 16
  learning_rate: 0.0002

data:
  mel_dir: "data/processed/mel"
  speaker_embed_dir: "data/processed/spk_embed"
```

VowelVC employs a three-stage training approach (a parameter-selection sketch follows this list):
- Stage 1: General model training on diverse speaker data
- Stage 2: Vowel-specific adaptation and memory optimization
- Stage 3: Incremental learning capability development
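As a rough illustration of how such a schedule can be wired up, the sketch below selects which parameters to optimize in each stage. This is an assumption about the design, not the actual logic in `scripts/train_vowelvc.py`; the name filters (`"vowel"`, `"adapter"`) are hypothetical.

```python
import torch

def trainable_parameters(model: torch.nn.Module, stage: int):
    """Pick the parameter subset to optimize for a given training stage.

    Stage 1 trains everything; stages 2 and 3 freeze the rest of the
    network and update only vowel-memory or adapter modules (the name
    filters here are illustrative, not the repository's module names).
    """
    if stage == 1:
        return list(model.parameters())
    keyword = "vowel" if stage == 2 else "adapter"
    selected = []
    for name, param in model.named_parameters():
        param.requires_grad = keyword in name
        if param.requires_grad:
            selected.append(param)
    return selected

# Rebuild the optimizer at each stage boundary, e.g.:
# optimizer = torch.optim.Adam(trainable_parameters(model, stage=2), lr=2e-4)
```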
- Prepare your audio data in the following structure:

```text
data/raw/
├── speaker1/
│   ├── audio1.wav
│   ├── audio2.wav
│   └── ...
├── speaker2/
│   └── ...
```

- Extract features for your dataset:

```bash
python extract_features.py --input_dir data/raw/your_dataset
```

- Update the configuration files to point to your data paths, for example as sketched below.
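One way to do this programmatically, assuming the config is plain YAML (via PyYAML) and using hypothetical `your_dataset` paths, is to write out a custom config and pass it with `--config`:

```python
import yaml

# Load the default config and point the data section at your dataset.
with open("configs/vowel_vc.yaml") as f:
    config = yaml.safe_load(f)

config["data"]["mel_dir"] = "data/processed/your_dataset/mel"
config["data"]["speaker_embed_dir"] = "data/processed/your_dataset/spk_embed"

# Save as a custom config, usable with:
#   python scripts/train_vowelvc.py --config configs/vowelvc_custom.yaml
with open("configs/vowelvc_custom.yaml", "w") as f:
    yaml.safe_dump(config, f)
```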
For speaker adaptation with minimal data:
```python
import torch

from models.vowel_vc import VowelVC

# Load the pre-trained model
model = VowelVC.from_pretrained("experiments/vowelvc/final_model.pth")

# Adapt to a new speaker from a single utterance
model.adapt_single_utterance(source_mel, target_speaker_embedding)

# Perform the conversion
converted_audio = model.convert_voice(source_mel, target_speaker_embedding)
```

For higher-quality synthesis, decode the converted mel spectrogram with the HiFi-GAN vocoder:

```python
from core.models.vocoders import HiFiGANVocoder

# Use HiFi-GAN for higher-quality waveform synthesis
vocoder = HiFiGANVocoder()
audio = vocoder(converted_mel_spectrogram)
```

The evaluation framework provides comprehensive metrics:
- Objective Metrics (a minimal sketch of LSD and speaker similarity follows this list):
  - Log Spectral Distance (LSD)
  - Mel Cepstral Distortion (MCD)
  - Speaker Similarity (cosine similarity)
- Subjective Metrics:
  - Mean Opinion Score (MOS)
  - Preference Tests (Quality and Similarity)
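The sketch below shows one common formulation of two of these metrics; the repository's `evaluation/metrics.py` may differ in framing, alignment, and normalization details.

```python
import numpy as np

def log_spectral_distance(ref_spec, est_spec, eps=1e-10):
    """Frame-averaged log spectral distance in dB between two aligned
    magnitude spectrograms of shape (frames, freq_bins)."""
    diff = 10.0 * (np.log10(ref_spec + eps) - np.log10(est_spec + eps))
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))

def speaker_similarity(embed_a, embed_b, eps=1e-10):
    """Cosine similarity between two speaker embedding vectors."""
    num = np.dot(embed_a, embed_b)
    den = np.linalg.norm(embed_a) * np.linalg.norm(embed_b) + eps
    return float(num / den)
```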
Comparison with representative neural voice conversion methods:

| Method | LSD (dB) | MOS | Year |
|---|---|---|---|
| AutoVC | 10-15 | 3.2-3.5 | 2019 |
| AdaIN-VC | 8-12 | 3.5-3.8 | 2021 |
| VowelVC | 7.5-8.1 | 3.8-3.9 | 2025 |
Component contributions to overall performance:
- Vowel Memory Bank: 15-20% LSD improvement
- Incremental Adapter: 40% few-shot performance gain
- Enhanced Content Encoder: 8-12% MOS improvement
Acknowledgments:
- Original ICGMM approach by Jannati and Sayadiyan (2015)
- CMU ARCTIC dataset for evaluation
- SpeechBrain team for speaker embedding models
- PyTorch team for the deep learning framework
For questions and support:
- Issues: GitHub Issues
- Email: parham.soltany@gmail.com
- Research Group: Mr Jannati
Release highlights:
- Initial release with VowelVC implementation
- Comprehensive evaluation framework
- Paper-quality experimental results
- Single utterance adaptation capability