The Dis-Vector project enhances voice conversion and synthesis through disentangled embeddings, allowing for high-quality, zero-shot voice cloning across multiple languages. This model leverages separate encoders for content, pitch, rhythm, and timbre, enabling precise control over synthesized voice characteristics.

Dis-Vector: Disentangled Voice Embeddings for Conversion and Synthesis 🎤✨

Welcome to the Dis-Vector project! This repository contains the implementation and evaluation of our advanced voice conversion and synthesis model that utilizes disentangled embeddings to accurately capture and transfer speaker characteristics across languages.

📚 Table of Contents

  1. Overview
  2. Dis-Vector Model Details
  3. Datasets
  4. Evaluation
  5. Results
  6. MOS Score Analysis
  7. Conclusion

📝 Overview

The Dis-Vector model represents a significant advancement in voice conversion and synthesis by employing disentangled embeddings to precisely capture and transfer speaker characteristics. Its architecture features separate encoders for content, pitch, rhythm, and timbre, enhancing both the accuracy and flexibility of voice cloning.

🚀 Features

  • Disentangled Embeddings: Separate encoders for content, pitch, rhythm, and timbre.
  • Zero-Shot Capabilities: Effective voice cloning and conversion across different languages.
  • High-Quality Synthesis: Enhanced accuracy and flexibility in voice cloning.

📊 Sample Output Demo

Explore the sample outputs showcasing the capabilities of the Dis-Vector model. You can listen to synthesized audio samples generated by the model, which highlight its ability to replicate and transform speaker characteristics across various languages and voices.

Sample outputs: https://github.com/NN-Project-1/dis-Vector-Embedding/tree/main/output

🛠️ Dis-Vector Model Details

The Dis-Vector model consists of several key components that work together to achieve effective voice conversion and synthesis:

  • Architecture: The model employs a multi-encoder architecture, with dedicated encoders for each feature type:

    • Content Encoder: Captures linguistic content and phonetic characteristics.
    • Pitch Encoder: Extracts pitch-related features to ensure accurate pitch reproduction.
    • Rhythm Encoder: Analyzes rhythmic patterns and timing to preserve the original speech flow.
    • Timbre Encoder: Captures unique vocal qualities of the speaker, allowing for more natural-sounding outputs.
  • Disentangled Embeddings: The model produces a 512-dimensional embedding vector, organized as follows:

    • 256 elements for content features
    • 128 elements for pitch features
    • 64 elements for rhythm features
    • 64 elements for timbre features
  • Zero-Shot Capability: The Dis-Vector model demonstrates remarkable zero-shot performance, enabling voice cloning and conversion across different languages without needing extensive training data for each target voice.

  • Feature Transfer: The model facilitates the transfer of individual features from the source voice to the target voice, allowing for customizable voice synthesis while retaining the original speech's essence.

  • Evaluation Metrics: Performance is assessed using various metrics, including Pitch Error Rate (PER), Rhythm Error Rate (RER), Timbre Error Rate (TER), and Content Preservation Rate (CPR), ensuring a comprehensive evaluation of the synthesized speech quality.
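To make the embedding layout and feature transfer concrete, here is a minimal sketch in Python/NumPy. The slice boundaries follow the sizes listed above (256 content + 128 pitch + 64 rhythm + 64 timbre = 512), but the actual ordering of the slices within the vector is an assumption, and the function names are illustrative, not taken from the repository.

```python
import numpy as np

# Assumed slice layout for the 512-d Dis-Vector embedding
# (sizes from the README; the ordering is hypothetical):
SLICES = {
    "content": slice(0, 256),
    "pitch":   slice(256, 384),
    "rhythm":  slice(384, 448),
    "timbre":  slice(448, 512),
}

def split_embedding(vec):
    """Split a 512-d embedding into its four feature sub-vectors."""
    assert vec.shape == (512,)
    return {name: vec[s] for name, s in SLICES.items()}

def transfer_feature(source, target, feature):
    """Copy one feature slice (e.g. 'pitch') from the source embedding
    into a copy of the target embedding, leaving the rest untouched."""
    result = target.copy()
    result[SLICES[feature]] = source[SLICES[feature]]
    return result

# Example: give the target speaker the source speaker's pitch
src = np.random.randn(512)
tgt = np.random.randn(512)
hybrid = transfer_feature(src, tgt, "pitch")
```

Because each feature lives in its own slice, swapping one feature cannot perturb the others, which is the practical payoff of the disentangled design.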

📂 Datasets

LIMMITS Dataset

  • Features recordings from speakers of various Indian languages (English, Hindi, Kannada, Telugu, Bengali).
  • Approximately 1 hour of speech data with around 400 utterances per language.

VCTK Dataset

  • Includes recordings from multiple speakers with different accents and regional variations.
  • Utilizes data from 6 male and 2 female speakers, each providing approximately 1 hour of speech.

📊 Evaluation

Quantitative analysis measures the performance of the Dis-Vector model using distance metrics and statistical measures.

1. Test Setup

  • Pitch Testing: Evaluates pitch variations using Pitch Error Rate (PER).
  • Rhythm Testing: Assesses rhythmic patterns with Rhythm Error Rate (RER).
  • Timbre Testing: Analyzes vocal qualities using Timbre Error Rate (TER).
  • Content Testing: Ensures content accuracy using Content Preservation Rate (CPR).
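The repository does not spell out the exact formula behind these error rates. As one plausible formulation, a Pitch Error Rate could be computed as the mean relative deviation between the reference and synthesized F0 contours over voiced frames; the sketch below is an assumption for illustration, not the project's actual metric.

```python
import numpy as np

def pitch_error_rate(f0_ref, f0_syn, eps=1e-8):
    """Hypothetical PER: mean relative F0 deviation over frames
    that are voiced (F0 > 0) in both contours."""
    voiced = (f0_ref > 0) & (f0_syn > 0)
    if not np.any(voiced):
        return 0.0
    return float(np.mean(np.abs(f0_ref[voiced] - f0_syn[voiced])
                         / (f0_ref[voiced] + eps)))
```

Analogous frame-level comparisons could be defined for rhythm (timing deviations) and content (phoneme-level agreement), but those definitions are likewise not specified in the repository.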

2. Distance Measurement

  • Cosine Similarity: Evaluates feature transfer and voice synthesis.

3. Ground Truth vs. TTS Output Similarity

  • Similarity scores for pitch, rhythm, timbre, and content help measure synthesis accuracy.
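A per-feature cosine similarity between a ground-truth embedding and the embedding of the TTS output can be sketched as follows. The slice layout is the same assumed ordering as above (256 content, 128 pitch, 64 rhythm, 64 timbre); the function names are illustrative.

```python
import numpy as np

# Assumed slice layout for the 512-d embedding (ordering hypothetical):
SLICES = {
    "content": slice(0, 256),
    "pitch":   slice(256, 384),
    "rhythm":  slice(384, 448),
    "timbre":  slice(448, 512),
}

def cosine_similarity(a, b):
    """Standard cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def per_feature_similarity(gt_emb, tts_emb):
    """Similarity score for each feature slice, comparing the
    ground-truth embedding against the TTS output's embedding."""
    return {name: cosine_similarity(gt_emb[s], tts_emb[s])
            for name, s in SLICES.items()}
```

Scoring each slice separately shows *which* attribute (pitch, rhythm, timbre, or content) the synthesis reproduced well, rather than collapsing everything into one number.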

📈 Results

The results of our evaluation showcase the efficacy of the Dis-Vector model compared to traditional models.

MOS Score for Monolingual Voice Conversion

| Source Language (Gender) | Target Language (Gender) | MOS Score |
|--------------------------|---------------------------|-----------|
| English (Male)           | English (Female)          | 3.8       |
| Hindi (Female)           | Hindi (Male)              | 3.7       |

MOS Score for Zero-Shot Cross-Lingual Voice Cloning

| Source Language (Gender) | Target Language (Gender) | MOS Score |
|--------------------------|---------------------------|-----------|
| English (Male)           | Hindi (Female)            | 3.9       |
| Hindi (Female)           | Telugu (Male)             | 3.7       |

Comparison of Dis-Vector with D-Vector

| Source Lang. | Target Lang. (Gender) | LIMMITS Baseline MOS | Dis-Vector MOS |
|--------------|------------------------|----------------------|----------------|
| English      | English (Female)       | 3.5                  | 3.9            |
| Hindi        | Hindi (Female)         | 3.4                  | 3.7            |

Zero-Shot Cross-Lingual Cloning

| Source Lang. | Target Lang. (Gender) | LIMMITS Baseline MOS | Dis-Vector MOS |
|--------------|------------------------|----------------------|----------------|
| English      | Hindi (Female)         | 3.3                  | 3.8            |
| Hindi        | English (Female)       | 3.2                  | 3.6            |

Comparison with SpeechSplit2

| Language (Gender) | SpeechSplit2 MOS Score | Dis-Vector MOS Score |
|-------------------|------------------------|----------------------|
| English (Male)    | 3.4                    | 3.8                  |
| English (Female)  | 3.5                    | 3.9                  |

🏁 Conclusion

The Dis-Vector model's zero-shot capabilities enable effective voice cloning and conversion across different languages, setting a new benchmark for high-quality, customizable voice synthesis. The results of our experiments, including detailed embeddings and synthesis outputs, are available in the accompanying Git repository.

For more details, please refer to the documentation in this repository! Happy experimenting! 🚀
