GitHub - NN-Project-1/dis-Vector-Embedding: The Dis-Vector project enhances voice conversion and synthesis through disentangled embeddings, allowing for high-quality, zero-shot voice cloning across multiple languages. This model leverages separate encoders for content, pitch, rhythm, and timbre, enabling precise control over synthesized voice characteristics.

DIS-Vector - An Effective Low-Resource, Zero-Shot Approach for Controllable, End-to-End Voice Conversion and Cloning 🎤✨

Welcome to the DIS-Vector project! This repository presents an advanced low-resource, zero-shot voice conversion and cloning model that leverages disentangled embeddings, clustering techniques, and language-based similarity matching to achieve highly natural and controllable voice synthesis.

The DIS-Vector model introduces a novel approach to voice conversion by disentangling four speech components (content, pitch, rhythm, and timbre) into separate embedding spaces, enabling fine-grained control over voice synthesis. Unlike traditional voice conversion models, DIS-Vector is capable of zero-shot voice cloning: it can synthesize voices from unseen speakers and languages without requiring large-scale speaker-specific training data.

We have Approach 1, which is the base version of DIS-Vector. The details are provided here.

📝 Overview

The Dis-Vector model represents a significant advancement in voice conversion and synthesis by employing disentangled embeddings and clustering methodologies to precisely capture and transfer speaker characteristics. It introduces a novel language-based similarity approach and K-Means clustering for efficient speaker retrieval and closest language matching during inference.

🚀 Features

Disentangled Embeddings: Separate encoders for content, pitch, rhythm, and timbre.
Zero-Shot Capabilities: Effective voice cloning and conversion across different languages.
High-Quality Synthesis: Enhanced accuracy and flexibility in voice cloning.
K-Means Clustering: Optimized speaker embedding retrieval for inference.
Language-Based Similarity Matching: Determines the closest match from the embedding database to improve synthesis quality.

📊 Sample Output Demo

Explore our live demo here showcasing the capabilities of the Dis-Vector model! This interactive demo allows you to experience the voice conversion and synthesis features in real-time. Users can listen to the synthesized audio samples generated by the model, highlighting its ability to accurately replicate and transform speaker characteristics across various languages and voices.

🛠️ Dis-Vector Model Details

The Dis-Vector model consists of several key components that work together to achieve effective voice conversion and synthesis:

Architecture: The model employs a multi-encoder architecture, with dedicated encoders for each feature type:
- Content Encoder: Captures linguistic content and phonetic characteristics.
- Pitch Encoder: Extracts pitch-related features to ensure accurate pitch reproduction.
- Rhythm Encoder: Analyzes rhythmic patterns and timing to preserve the original speech flow.
- Timbre Encoder: Captures unique vocal qualities of the speaker, allowing for more natural-sounding outputs.
Disentangled Embeddings: The model produces a 512-dimensional embedding vector, organized as follows:
- 256 elements for content features
- 128 elements for pitch features
- 64 elements for rhythm features
- 64 elements for timbre features
Zero-Shot Capability: The Dis-Vector model demonstrates remarkable zero-shot performance, enabling voice cloning and conversion across different languages without needing extensive training data for each target voice.
Feature Transfer: The model facilitates the transfer of individual features from the source voice to the target voice, allowing for customizable voice synthesis while retaining the original speech's essence.
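
The embedding layout and the feature-transfer idea above can be illustrated with a minimal sketch. It assumes the four blocks are stored contiguously in the order content, pitch, rhythm, timbre (the ordering is an assumption; only the sizes come from the description above), and the helper names `split_embedding` and `transfer_features` are hypothetical, not part of this repository.

```python
# Slice boundaries follow the 256/128/64/64 split described above.
# ASSUMPTION: the blocks are contiguous and appear in this order.
LAYOUT = {
    "content": slice(0, 256),
    "pitch": slice(256, 384),
    "rhythm": slice(384, 448),
    "timbre": slice(448, 512),
}


def split_embedding(vec):
    """Return a dict of per-component sub-vectors from a 512-element vector."""
    if len(vec) != 512:
        raise ValueError(f"expected 512 elements, got {len(vec)}")
    return {name: list(vec[s]) for name, s in LAYOUT.items()}


def transfer_features(source, target, components):
    """Copy the named components from `source` into a copy of `target`."""
    out = list(target)
    for name in components:
        out[LAYOUT[name]] = source[LAYOUT[name]]
    return out


# Toy vectors: the source is all zeros, the target is all ones.
source = [0.0] * 512
target = [1.0] * 512

# Take pitch and rhythm from the source, keep content and timbre of the target.
mixed = transfer_features(source, target, ["pitch", "rhythm"])
parts = split_embedding(mixed)
print({name: (len(v), v[0]) for name, v in parts.items()})
```

Plain Python lists keep the sketch dependency-free; with tensors the same slices would apply along the last dimension.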

VITS-TTS Integration

Integrating VITS with DIS-Vector enhances its capabilities by leveraging disentangled embeddings of speech components (content, pitch, rhythm, and timbre). DIS-Vector provides fine-grained control over these components, enabling high-quality voice conversion and zero-shot voice cloning. This integration empowers VITS to generate speech in new voices, adapting to different speakers and languages without the need for speaker-specific training data, offering more flexibility and realism in synthetic speech generation.

Speech Component Representation

A speech signal s(t) is decomposed into four distinct components:

C(t) (Content): Represents linguistic information.
P(t) (Pitch): Corresponds to the fundamental frequency F_0.
R(t) (Rhythm): Captures duration and timing patterns.
T(t) (Timbre): Defines speaker identity characteristics.

Types of Loss Functions

The following loss functions are utilized in the model:

Mean Squared Error (MSE) Loss:
The MSE Loss is used for minimizing the difference between the predicted and actual values for continuous speech components such as pitch and timbre. It is applied as an overall reconstruction loss to ensure that the model accurately reconstructs these continuous components.
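
As a rough illustration of the reconstruction term, the sketch below computes MSE over a short sequence of frame-level pitch values. The numbers are made up for the example, and the function is a generic MSE, not the repository's training code.

```python
def mse_loss(predicted, actual):
    """Mean of squared element-wise differences between two sequences."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical frame-level F0 values in Hz (illustrative only).
predicted_f0 = [220.0, 225.0, 230.0]
actual_f0 = [220.0, 223.0, 234.0]

loss = mse_loss(predicted_f0, actual_f0)
print(loss)  # (0 + 4 + 16) / 3
```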

Kullback-Leibler (KL) Divergence Loss:
Measures the difference between two probability distributions, often used for speaker similarity matching and ensuring that the embeddings align with the desired distributions.
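
For reference, a minimal sketch of KL divergence between two discrete distributions is shown below. The description above does not state which distributions the model compares, so the inputs here are purely illustrative, and the `eps` guard is an assumption added to avoid division by zero.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing; q_i is floored at `eps`.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0
    )


uniform = [0.5, 0.5]
skewed = [0.9, 0.1]

print(kl_divergence(uniform, uniform))  # identical distributions give 0
print(kl_divergence(uniform, skewed))   # positive when they differ
```

Note that KL divergence is not symmetric: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.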

Disentanglement Loss:
Ensures that the learned embeddings for each speech component (content, pitch, rhythm, timbre) remain distinct and non-interfering, contributing to the overall performance of the model.

To optimize the separation of speech components into distinct embedding spaces, the total loss function combines one term per component, where:

L_content ensures linguistic consistency.
L_pitch preserves fundamental frequency information.
L_rhythm maintains speech timing.
L_timbre preserves speaker identity characteristics.
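
One way the four terms could be combined is a weighted sum. The sketch below assumes that form; the weights, the default of 1.0 for each term, and the example loss values are all hypothetical and are not taken from this repository.

```python
COMPONENTS = ("content", "pitch", "rhythm", "timbre")


def total_loss(losses, weights=None):
    """Weighted sum of the per-component losses.

    ASSUMPTION: a weighted sum; with no weights given, every term counts equally.
    """
    if weights is None:
        weights = {name: 1.0 for name in COMPONENTS}
    return sum(weights[name] * losses[name] for name in COMPONENTS)


# Illustrative per-component loss values.
example = {"content": 0.40, "pitch": 0.10, "rhythm": 0.05, "timbre": 0.20}

unweighted = total_loss(example)
weighted = total_loss(
    example, {"content": 1.0, "pitch": 2.0, "rhythm": 2.0, "timbre": 0.5}
)
print(unweighted, weighted)
```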

📂 Database

LIMMITS Dataset

Features recordings from speakers of various Indian languages (English, Hindi, Kannada, Telugu, Bengali).
Approximately 1 hour of speech data with around 400 utterances per language.

VCTK Dataset

Includes recordings from multiple speakers with different accents and regional variations.
Utilizes data from 6 male and 2 female speakers, each providing approximately 1 hour of speech.

📊 Evaluation

Quantitative analysis measures the performance of the Dis-Vector model using distance metrics and statistical measures.

1. Test Setup

Pitch Testing: Evaluates pitch variations using Pitch Error Rate (PER).
Rhythm Testing: Assesses rhythmic patterns with Rhythm Error Rate (RER).
Timbre Testing: Analyzes vocal qualities using Timbre Error Rate (TER).
Content Testing: Ensures content accuracy using Content Preservation Rate (CPR).

2. Distance Measurement

Cosine Similarity: Evaluates feature transfer and voice synthesis.

3. Ground Truth vs. TTS Output Similarity

Similarity scores for pitch, rhythm, timbre, and content help measure synthesis accuracy.
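
A per-component similarity score can be sketched by slicing two embeddings and comparing each block with cosine similarity. The block order within the 512-element vector is an assumption, and `component_similarities` is a hypothetical helper name.

```python
import math

# ASSUMPTION: contiguous blocks in this order, sized 256/128/64/64.
LAYOUT = {
    "content": slice(0, 256),
    "pitch": slice(256, 384),
    "rhythm": slice(384, 448),
    "timbre": slice(448, 512),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def component_similarities(reference, synthesized):
    """Cosine similarity of each component block between two embeddings."""
    return {
        name: cosine_similarity(reference[s], synthesized[s])
        for name, s in LAYOUT.items()
    }


# Toy example: identical embeddings except for an inverted timbre block.
reference = [1.0] * 512
synthesized = list(reference)
synthesized[448:512] = [-1.0] * 64

scores = component_similarities(reference, synthesized)
print(scores)
```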

🏗️ Clustering & Language Matching

K-Means Clustering for Speaker Embeddings

Dis-Vector utilizes a language-annotated speaker embedding database, where each speaker is mapped to a distinct feature representation based on their timbre and prosody characteristics. To enable efficient cross-speaker and cross-language voice conversion, we apply K-Means clustering on these high-dimensional embeddings. This clustering process helps to:

Group speakers based on intrinsic vocal attributes such as pitch, intonation, and articulation patterns.
Enable zero-shot voice conversion by leveraging cluster-based matching, even for unseen speakers.
Assign cluster centroids as representative embeddings, allowing the system to select the closest match for synthesis.
Improve generalization and adaptation by ensuring robust speaker variation capture while maintaining speaker identity.

By organizing the embedding space into well-defined clusters, Dis-Vector ensures a more structured and interpretable representation of speaker embeddings, enhancing the quality and accuracy of voice conversion.
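
The clustering step can be sketched with a small, dependency-free K-Means (Lloyd's algorithm). This is an illustration on toy 2-D points, not the repository's implementation: the deterministic farthest-point initialization is a simplification chosen for reproducibility, and in practice a library implementation would normally be used on the real speaker embeddings.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Cluster `points` into `k` groups; return (centroids, labels)."""
    # Farthest-point initialization: start from the first point, then keep
    # adding the point that is farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centroids)
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels


# Toy "speaker embeddings": two well-separated groups.
points = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0], [4.9, 5.0],
]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

The returned centroids play the role of the representative embeddings mentioned above.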

Language-Based Similarity Matching

During inference, the model selects the most suitable speaker embedding by computing cosine similarity between the target speaker’s embedding and the pre-clustered speaker embeddings in the database. This method prioritizes selecting a linguistically similar speaker, leading to:

Better prosody preservation, as speakers from the same linguistic background share similar pitch and rhythm structures.
Accurate voice adaptation, ensuring that even when a target speaker’s language is unseen during training, the system can infer the best match.
Efficient feature transfer, allowing for natural-sounding synthesis without distorting speaker identity.

The language-based similarity approach refines the voice conversion process by focusing on both speaker similarity and linguistic consistency, ensuring the most natural and high-quality voice generation.
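
A minimal sketch of similarity-based retrieval from a language-annotated database follows. The record fields (`speaker`, `language`, `embedding`), the speaker IDs, and the optional `language` filter with fallback are assumptions made for illustration; the source does not specify how the language preference is applied.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(target, database, language=None):
    """Return (similarity, entry) for the most similar database entry.

    If `language` is given, only entries with that annotation are considered;
    if none exist, the search falls back to the whole database.
    """
    candidates = [
        e for e in database if language is None or e["language"] == language
    ]
    if not candidates:
        candidates = database
    scored = ((cosine_similarity(target, e["embedding"]), e) for e in candidates)
    return max(scored, key=lambda pair: pair[0])


# Hypothetical database with toy 3-dimensional embeddings.
database = [
    {"speaker": "spk_hi_01", "language": "Hindi", "embedding": [0.9, 0.1, 0.0]},
    {"speaker": "spk_te_01", "language": "Telugu", "embedding": [0.1, 0.9, 0.1]},
    {"speaker": "spk_en_01", "language": "English", "embedding": [0.0, 0.2, 0.9]},
]
target = [0.8, 0.3, 0.1]

score, match = best_match(target, database)
te_score, te_match = best_match(target, database, language="Telugu")
fb_score, fb_match = best_match(target, database, language="Bengali")
print(match["speaker"], te_match["speaker"], fb_match["speaker"])
```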

Closest Language Matching During Inference

To further enhance cross-lingual voice adaptation, Dis-Vector integrates a nearest language matching strategy. Given a target speaker's embedding, the system performs the following steps:

Determine the closest linguistic cluster by measuring the embedding distance to pre-computed cluster centroids.
Apply a threshold-based similarity measure to ensure the closest linguistic match is selected.
If a direct match is unavailable, the system chooses a linguistically nearest neighbor based on phonetic and prosody similarities.

This technique ensures:

Minimal loss in speech naturalness by selecting speakers with the most similar phonetic structures.
Improved speaker adaptation, even in cases where the target speaker’s language is underrepresented in the dataset.
Scalability for zero-shot voice conversion, allowing seamless expansion with new speakers and languages.

By leveraging this clustering-based framework, Dis-Vector significantly improves the accuracy and efficiency of voice conversion in multilingual and low-resource language settings, making it a robust solution for global voice synthesis applications.
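
The matching steps above can be sketched as a nearest-centroid lookup with a distance threshold. The centroid values, the threshold, and the use of Euclidean distance are assumptions. The criterion for the phonetic and prosodic fallback is not specified in this README, so the sketch simply returns the nearest centroid and flags whether it counts as a direct match.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_language(target, centroids, threshold):
    """Return (language, distance, is_direct_match) for the nearest centroid.

    `is_direct_match` is True when the distance is within `threshold`;
    otherwise the result should be treated as a nearest-neighbour fallback.
    """
    language, distance = min(
        ((lang, euclidean(target, c)) for lang, c in centroids.items()),
        key=lambda item: item[1],
    )
    return language, distance, distance <= threshold


# Hypothetical per-language cluster centroids in a toy 2-D space.
centroids = {
    "English": [0.0, 0.0],
    "Hindi": [1.0, 0.0],
    "Telugu": [1.0, 1.0],
}

near = closest_language([0.9, 0.1], centroids, threshold=0.5)
far = closest_language([3.0, 3.0], centroids, threshold=0.5)
print(near)  # close to the Hindi centroid, within the threshold
print(far)   # nearest is Telugu, but outside the threshold
```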

📈 Results

The results of our evaluation showcase the efficacy of the Dis-Vector model compared to traditional models.

MOS Score for Monolingual Voice Conversion

| Source Language (Gender) | Target Language (Gender) | MOS Score |
|---|---|---|
| English Male | English Female | 3.8 |
| Hindi Female | Hindi Male | 3.7 |

MOS Score for Zero-Shot Cross-Lingual Voice Cloning

| Source Language (Gender) | Target Language (Gender) | MOS Score |
|---|---|---|
| English Male | Hindi Female | 3.9 |
| Hindi Female | Telugu Male | 3.7 |

Comparison of DIS-Vector with D-Vector

| Source Language | Target Language | MOS (LIMMITS Baseline) | MOS (DIS-Vector) |
|---|---|---|---|
| English | English Female | 3.5 | 3.9 |
| Hindi | Hindi Female | 3.4 | 3.7 |

Comparison with SpeechSplit2

| Language (Gender) | SpeechSplit2 MOS Score | DIS-Vector MOS Score |
|---|---|---|
| English Male | 3.4 | 3.8 |
| English Female | 3.5 | 3.9 |

🏁 Conclusion

The Dis-Vector model's zero-shot capabilities, enhanced by clustering and similarity-based speaker retrieval, enable effective voice cloning and conversion across different languages. It sets a new benchmark for high-quality, customizable voice synthesis.

For more details, refer to our documentation! 🚀
