VowelVC: Vowel-based Voice Conversion with Few-shot Incremental Learning

A novel neural voice conversion system that leverages vowel-specific transformations and incremental memory learning for few-shot speaker adaptation.

Overview

VowelVC implements the core concepts of Incremental Conditional GMM-based (ICGMM) Voice Conversion using modern deep learning architectures (VAE, GAN, diffusion). The system centers on vowel-based transformations and supports speaker adaptation from minimal data.

Key Features

  • Vowel-Based Transformation: Vowel detection and vowel-specific transformation form the core conversion mechanism
  • Few-Shot Adaptation: Adapts to a new speaker from a single utterance via incremental learning
  • Multiple Generators: Supports both GAN- and diffusion-based generation
  • Memory-Augmented Learning: Dynamic vowel memory bank with attention mechanisms (see the sketch after this list)
  • State-of-the-Art Performance: Outperforms contemporary neural voice conversion methods on LSD and MOS (see Experimental Results below)
  • Real-Time Inference: Efficient enough for interactive applications
  • Comprehensive Evaluation: Objective (LSD, MCD, speaker similarity) and subjective (MOS, preference tests) metrics
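
To make the memory-augmented idea concrete, here is a minimal sketch of what a vowel memory bank with attention could look like. This is an illustration, not the repository's actual module (which lives in core/incremental.py); the class name, slot count, and update rule are assumptions, with the 32-dimensional slots chosen to match vowel_dim in the configuration.

import torch
import torch.nn as nn

class VowelMemoryBank(nn.Module):
    """Hypothetical sketch: one learnable slot per vowel class, read with
    scaled dot-product attention and written with an incremental blend."""

    def __init__(self, num_vowels: int = 10, dim: int = 32):
        super().__init__()
        # One dim-dimensional prototype per vowel class.
        self.memory = nn.Parameter(torch.randn(num_vowels, dim))
        self.scale = dim ** -0.5

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, frames, dim) frame-level vowel features.
        attn = torch.softmax(query @ self.memory.t() * self.scale, dim=-1)
        # Read out a convex combination of the stored vowel prototypes.
        return attn @ self.memory

    @torch.no_grad()
    def write(self, vowel_idx: int, feature: torch.Tensor, rate: float = 0.1):
        # Incremental update: blend a new observation into one slot; this is
        # one way single-utterance adaptation could be realized.
        self.memory[vowel_idx].lerp_(feature, rate)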

Performance Results

Objective Metrics (1-3 Training Sentences)

| Conversion Type  | Baseline LSD (dB) | VowelVC LSD (dB) | Improvement |
|------------------|-------------------|------------------|-------------|
| Male to Male     | 14.77             | 7.96             | 46.1%       |
| Male to Female   | 15.25             | 7.52             | 50.7%       |
| Female to Male   | 15.18             | 8.02             | 47.2%       |
| Female to Female | 14.50             | 8.06             | 44.5%       |

Subjective Metrics (MOS Scores)

| Conversion Type  | Baseline MOS | VowelVC MOS | Improvement |
|------------------|--------------|-------------|-------------|
| Male to Male     | 3.36         | 3.84        | 14.3%       |
| Male to Female   | 3.30         | 3.83        | 15.9%       |
| Female to Male   | 3.30         | 3.91        | 18.7%       |
| Female to Female | 3.26         | 3.94        | 21.0%       |

Installation

Setup Instructions

  1. Clone the repository:
git clone https://github.com/parhamsoltani/VowelVC.git
cd VowelVC
  2. Create a virtual environment:
python -m venv vowelvc_env
source vowelvc_env/bin/activate  # On Windows: vowelvc_env\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Install additional audio processing libraries:
pip install librosa soundfile
  5. Optional: install SpeechBrain for enhanced speaker embeddings:
pip install speechbrain

Quick Start

1. Data Preparation

Download Sample Data

# Download CMU ARCTIC dataset
python scripts/download_data.py --dataset arctic --output data/raw

Extract Features

# Extract mel spectrograms and speaker embeddings
python extract_features.py \
    --input_dir data/raw \
    --mel_out data/processed/mel \
    --spk_out data/processed/spk_embed
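
For reference, the extraction presumably computes log-mel spectrograms along these lines. This is a hedged sketch using librosa; n_fft, hop_length, and the sample rate are assumptions (n_mels=80 matches mel_dim in the config), so check extract_features.py for the actual parameters.

import librosa
import numpy as np

# Assumed analysis parameters; extract_features.py may use different ones.
wav, sr = librosa.load("data/raw/speaker1/audio1.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))  # shape: (80, frames)
np.save("data/processed/mel/speaker1_audio1.npy", log_mel.astype(np.float32))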

2. Training

Train VowelVC Model

# Basic training with default configuration
python scripts/train_vowelvc.py

# Training with three-stage approach (recommended)
python scripts/train_vowelvc.py --three-stages

# Training with custom configuration
python scripts/train_vowelvc.py --config configs/vowelvc_custom.yaml

Train Baseline Model (for comparison)

python scripts/train_baseline.py --config configs/baseline.yaml

3. Inference

Convert Single Audio File

python scripts/inference.py \
    --source path/to/source/audio.wav \
    --target-speaker path/to/target/speaker/audio.wav \
    --output path/to/converted/output.wav \
    --model experiments/vowelvc/final_model.pth \
    --model-type vowelvc

Batch Conversion

python scripts/batch_inference.py \
    --source_dir data/test/source \
    --target_dir data/test/targets \
    --output_dir results/converted \
    --model experiments/vowelvc/final_model.pth

Speaker Adaptation (Few-Shot Learning)

python scripts/inference.py \
    --source source_audio.wav \
    --target-speaker target_speaker.wav \
    --output converted_output.wav \
    --model experiments/vowelvc/final_model.pth \
    --model-type vowelvc \
    --adapt-speaker  # Enable single utterance adaptation
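
Single-utterance adaptation requires a speaker embedding of the target utterance. The README lists SpeechBrain as an optional dependency and the config uses speaker_dim: 192, which matches the ECAPA-TDNN embedding size, so a plausible (assumed, not confirmed) way to obtain that embedding is:

import torchaudio
from speechbrain.pretrained import EncoderClassifier

# ECAPA-TDNN yields 192-dim embeddings, matching speaker_dim in the config.
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb")
signal, sr = torchaudio.load("target_speaker.wav")
embedding = classifier.encode_batch(signal)  # shape: (1, 1, 192)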

4. Evaluation

Comprehensive Evaluation

python scripts/evaluate.py \
    --baseline-model experiments/baseline/final_model.pth \
    --vowelvc-model experiments/vowelvc/final_model.pth \
    --output-dir evaluation/results \
    --n-samples 100

Generate Paper-Style Results

python scripts/evaluate.py \
    --vowelvc-model experiments/vowelvc/final_model.pth \
    --output-dir evaluation/paper_results \
    --paper-style

Project Structure

VowelVC/
├── core/                          # Core components
│   ├── models/                    # Model architectures
│   │   ├── base.py               # Base model classes
│   │   ├── encoders.py           # Content and speaker encoders
│   │   ├── generators.py         # GAN and diffusion generators
│   │   └── vocoders.py           # Vocoder implementations
│   └── incremental.py            # Incremental learning module
├── models/                        # Main model implementations
│   ├── baseline_vc.py            # Baseline VAE-GAN model
│   ├── vowel_vc.py               # VowelVC implementation
│   └── configs/                  # Model configurations
├── training/                      # Training components
│   ├── trainer.py                # Base trainer class
│   ├── vowel_trainer.py          # VowelVC-specific trainer
│   ├── datasets.py               # Dataset implementations
│   └── losses.py                 # Loss functions
├── evaluation/                    # Evaluation framework
│   ├── metrics.py                # Objective metrics
│   ├── subjective.py             # Subjective evaluation
│   └── paper_comparison.py       # Paper-style evaluation
├── scripts/                       # Utility scripts
│   ├── train_vowelvc.py          # Training script
│   ├── inference.py              # Inference script
│   └── evaluate.py               # Evaluation script
├── configs/                       # Configuration files
├── extract_features.py           # Feature extraction
└── requirements.txt              # Dependencies

Configuration

Model Configuration

Edit configs/vowel_vc.yaml to customize model parameters:

model:
  name: "VowelVC"
  mel_dim: 80
  speaker_dim: 192
  latent_dim: 64
  vowel_dim: 32
  generator_type: "gan"  # or "diffusion"

training:
  epochs: 100
  batch_size: 16
  learning_rate: 0.0002
  
data:
  mel_dir: "data/processed/mel"
  speaker_embed_dir: "data/processed/spk_embed"
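
The training scripts presumably parse this file with PyYAML; a minimal loading sketch, with key names taken from the example above:

import yaml

with open("configs/vowel_vc.yaml") as f:
    cfg = yaml.safe_load(f)

generator_type = cfg["model"]["generator_type"]   # "gan" or "diffusion"
learning_rate = cfg["training"]["learning_rate"]  # 0.0002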

Training Stages

VowelVC employs a three-stage training approach:

  1. Stage 1: General model training on diverse speaker data
  2. Stage 2: Vowel-specific adaptation and memory optimization
  3. Stage 3: Incremental learning capability development
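
A schematic of how these three stages could be sequenced is sketched below. This is hypothetical: the train_step callable, loader names, epoch split, and frozen submodule are all assumptions, and scripts/train_vowelvc.py --three-stages is the supported entry point.

def train_three_stages(model, train_step, loaders, epochs=(60, 30, 10)):
    # Stage 1: general training on diverse multi-speaker data.
    for _ in range(epochs[0]):
        for batch in loaders["multi_speaker"]:
            train_step(model, batch)

    # Stage 2: vowel-specific adaptation: freeze the content encoder
    # (attribute name assumed) and optimize the vowel memory bank.
    for p in model.content_encoder.parameters():
        p.requires_grad = False
    for _ in range(epochs[1]):
        for batch in loaders["vowel"]:
            train_step(model, batch)

    # Stage 3: develop incremental-learning behavior by replaying
    # single-utterance adaptation episodes.
    for _ in range(epochs[2]):
        for episode in loaders["few_shot"]:
            train_step(model, episode)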

Advanced Usage

Custom Dataset Integration

  1. Prepare your audio data in the following structure:
data/raw/
├── speaker1/
│   ├── audio1.wav
│   ├── audio2.wav
│   └── ...
├── speaker2/
│   └── ...
  2. Extract features for your dataset:
python extract_features.py --input_dir data/raw/your_dataset
  3. Update the configuration files to point to your data paths.
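
On the loading side, here is a minimal sketch of how the extracted features might be consumed as a PyTorch dataset. The real implementation is training/datasets.py; the pairing-by-filename convention below is an assumption.

from pathlib import Path
import numpy as np
import torch
from torch.utils.data import Dataset

class MelSpeakerDataset(Dataset):
    """Illustrative loader pairing mel spectrograms with speaker embeddings."""

    def __init__(self, mel_dir="data/processed/mel",
                 spk_dir="data/processed/spk_embed"):
        self.mel_paths = sorted(Path(mel_dir).glob("*.npy"))
        self.spk_dir = Path(spk_dir)

    def __len__(self):
        return len(self.mel_paths)

    def __getitem__(self, idx):
        mel_path = self.mel_paths[idx]
        mel = torch.from_numpy(np.load(mel_path))
        # Assumes a speaker-embedding file with the same name as the mel file.
        spk = torch.from_numpy(np.load(self.spk_dir / mel_path.name))
        return mel, spk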

Model Adaptation

For speaker adaptation with minimal data:

from models.vowel_vc import VowelVC
import numpy as np
import torch

# Load the pre-trained model
model = VowelVC.from_pretrained("experiments/vowelvc/final_model.pth")

# Load precomputed features (paths are examples; see extract_features.py)
source_mel = torch.from_numpy(np.load("data/processed/mel/source.npy"))
target_speaker_embedding = torch.from_numpy(
    np.load("data/processed/spk_embed/target.npy"))

# Adapt to the new speaker from a single utterance
model.adapt_single_utterance(source_mel, target_speaker_embedding)

# Perform the conversion
converted_audio = model.convert_voice(source_mel, target_speaker_embedding)

Integration with Other Vocoders

from core.models.vocoders import HiFiGANVocoder

# Use HiFi-GAN for higher quality synthesis
vocoder = HiFiGANVocoder()
audio = vocoder(converted_mel_spectrogram)
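
Since soundfile is already among the dependencies, the synthesized waveform can then be written to disk. The 22050 Hz sample rate below is an assumption and should match the rate the vocoder was trained at.

import soundfile as sf

# `audio` is the waveform tensor produced by the vocoder above.
sf.write("results/converted.wav", audio.squeeze().cpu().numpy(), 22050)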

Evaluation Metrics

The evaluation framework provides comprehensive metrics:

  • Objective Metrics:
    • Log Spectral Distance (LSD), computed as sketched below
    • Mel Cepstral Distortion (MCD)
    • Speaker Similarity (cosine similarity)
  • Subjective Metrics:
    • Mean Opinion Score (MOS)
    • Preference Tests (Quality and Similarity)
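
For reference, the LSD figures reported above follow the standard definition: the RMS difference between log-magnitude spectra per frame, averaged over frames. A minimal NumPy sketch (evaluation/metrics.py may differ in details such as the floor value):

import numpy as np

def log_spectral_distance(spec_ref, spec_conv, eps=1e-8):
    """LSD in dB between two magnitude spectrograms of shape (freq, frames)."""
    log_ref = 20.0 * np.log10(np.abs(spec_ref) + eps)
    log_conv = 20.0 * np.log10(np.abs(spec_conv) + eps)
    # RMS over frequency bins, then mean over frames.
    return np.mean(np.sqrt(np.mean((log_ref - log_conv) ** 2, axis=0)))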

Experimental Results

Comparison with State-of-the-Art

| Method   | LSD (dB) | MOS     | Year |
|----------|----------|---------|------|
| AutoVC   | 10-15    | 3.2-3.5 | 2019 |
| AdaIN-VC | 8-12     | 3.5-3.8 | 2021 |
| VowelVC  | 7.5-8.1  | 3.8-3.9 | 2025 |

Ablation Studies

Component contributions to overall performance:

  • Vowel Memory Bank: 15-20% LSD improvement
  • Incremental Adapter: 40% few-shot performance gain
  • Enhanced Content Encoder: 8-12% MOS improvement

Acknowledgments

  • Original ICGMM approach by Jannati and Sayadiyan (2015)
  • CMU ARCTIC dataset for evaluation
  • SpeechBrain team for speaker embedding models
  • PyTorch team for the deep learning framework

Contact

For questions and support, please open an issue on the repository.

Changelog

Version 1.0.0

  • Initial release with VowelVC implementation
  • Comprehensive evaluation framework
  • Paper-quality experimental results
  • Single utterance adaptation capability
