Welcome to the Dis-Vector project! This repository contains the implementation and evaluation of our advanced voice conversion and synthesis model that utilizes disentangled embeddings to accurately capture and transfer speaker characteristics across languages.
The Dis-Vector model uses disentangled embeddings to capture and transfer speaker characteristics. Its architecture has separate encoders for content, pitch, rhythm, and timbre, which improves both the accuracy and the flexibility of voice cloning.
- Disentangled Embeddings: Separate encoders for content, pitch, rhythm, and timbre.
- Zero-Shot Capabilities: Effective voice cloning and conversion across different languages.
- High-Quality Synthesis: Enhanced accuracy and flexibility in voice cloning.
Explore our live demo here to see what the Dis-Vector model can do. The interactive demo lets you try the voice conversion and synthesis features in real time and listen to audio samples synthesized by the model, which illustrate how it replicates and transforms speaker characteristics across languages and voices.
Embedding: https://github.com/NN-Project-1/dis-Vector-Embedding/tree/main/output
The Dis-Vector model consists of several key components that work together to achieve effective voice conversion and synthesis:
- Architecture: The model employs a multi-encoder architecture, with dedicated encoders for each feature type:
  - Content Encoder: Captures linguistic content and phonetic characteristics.
  - Pitch Encoder: Extracts pitch-related features to ensure accurate pitch reproduction.
  - Rhythm Encoder: Analyzes rhythmic patterns and timing to preserve the original speech flow.
  - Timbre Encoder: Captures unique vocal qualities of the speaker, allowing for more natural-sounding outputs.
- Disentangled Embeddings: The model produces a 512-dimensional embedding vector, organized as follows:
  - 256 elements for content features
  - 128 elements for pitch features
  - 64 elements for rhythm features
  - 64 elements for timbre features
- Zero-Shot Capability: The Dis-Vector model demonstrates remarkable zero-shot performance, enabling voice cloning and conversion across different languages without needing extensive training data for each target voice.
- Feature Transfer: The model facilitates the transfer of individual features from the source voice to the target voice, allowing for customizable voice synthesis while retaining the original speech's essence.
- Evaluation Metrics: Performance is assessed using various metrics, including Pitch Error Rate (PER), Rhythm Error Rate (RER), Timbre Error Rate (TER), and Content Preservation Rate (CPR), ensuring a comprehensive evaluation of the synthesized speech quality.
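To make the embedding layout and the feature-transfer idea above concrete, here is a minimal sketch in plain Python. It assumes the four segments sit contiguously in the 512-dimensional vector in the order listed (content, pitch, rhythm, timbre); that ordering, and the names `SEGMENTS`, `split_embedding`, and `transfer_features`, are illustrative assumptions and not the repository's actual API.

```python
# Assumed layout: segments are contiguous and ordered content, pitch, rhythm,
# timbre. This ordering is an assumption for illustration only.
SEGMENTS = {
    "content": (0, 256),    # 256 elements
    "pitch": (256, 384),    # 128 elements
    "rhythm": (384, 448),   # 64 elements
    "timbre": (448, 512),   # 64 elements
}
EMBEDDING_DIM = 512


def split_embedding(embedding):
    """Return a dict mapping each feature name to its slice of the vector."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} values, got {len(embedding)}")
    return {
        name: list(embedding[start:end])
        for name, (start, end) in SEGMENTS.items()
    }


def transfer_features(donor, recipient, features):
    """Copy `recipient` and overwrite the named segments with `donor`'s values."""
    if len(donor) != EMBEDDING_DIM or len(recipient) != EMBEDDING_DIM:
        raise ValueError(f"both embeddings must have {EMBEDDING_DIM} values")
    mixed = list(recipient)
    for name in features:
        start, end = SEGMENTS[name]
        mixed[start:end] = donor[start:end]
    return mixed


if __name__ == "__main__":
    # Toy vectors stand in for real encoder outputs.
    donor = [1.0] * EMBEDDING_DIM
    recipient = [0.0] * EMBEDDING_DIM
    mixed = transfer_features(donor, recipient, ["timbre"])
    for name, segment in split_embedding(mixed).items():
        print(name, len(segment), segment[0])
```

In this toy run only the timbre segment takes the donor's values, while content, pitch, and rhythm stay as they were in the recipient. A real pipeline would pass the mixed vector on to the synthesis stage.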
- Features recordings from speakers of various Indian languages (English, Hindi, Kannada, Telugu, Bengali).
- Approximately 1 hour of speech data with around 400 utterances per language.
- Includes recordings from multiple speakers with different accents and regional variations.
- Utilizes data from 6 male and 2 female speakers, each providing approximately 1 hour of speech.
Quantitative analysis measures the performance of the Dis-Vector model using distance metrics and statistical measures.
- Pitch Testing: Evaluates pitch variations using Pitch Error Rate (PER).
- Rhythm Testing: Assesses rhythmic patterns with Rhythm Error Rate (RER).
- Timbre Testing: Analyzes vocal qualities using Timbre Error Rate (TER).
- Content Testing: Ensures content accuracy using Content Preservation Rate (CPR).
- Cosine Similarity: Evaluates feature transfer and voice synthesis.
- Similarity scores for pitch, rhythm, timbre, and content help measure synthesis accuracy.
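As a rough illustration of the per-feature similarity scores described above, the sketch below computes cosine similarity separately on each segment of two embeddings. It assumes the segments are stored contiguously in the order content, pitch, rhythm, timbre; the helper names are hypothetical and the repository's own evaluation code may differ.

```python
import math

# Assumed contiguous layout of the 512-dimensional embedding (illustrative).
SEGMENTS = {
    "content": (0, 256),
    "pitch": (256, 384),
    "rhythm": (384, 448),
    "timbre": (448, 512),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def per_feature_similarity(emb_a, emb_b):
    """Return one cosine similarity score per feature segment."""
    return {
        name: cosine_similarity(emb_a[start:end], emb_b[start:end])
        for name, (start, end) in SEGMENTS.items()
    }


if __name__ == "__main__":
    # Toy data: identical embeddings except for an inverted timbre segment.
    emb_a = [1.0] * 512
    emb_b = list(emb_a)
    emb_b[448:512] = [-1.0] * 64
    for name, score in per_feature_similarity(emb_a, emb_b).items():
        print(f"{name}: {score:.3f}")
```

Scoring each segment separately shows which attribute changed: here the content, pitch, and rhythm scores stay close to 1 while the timbre score drops to about -1.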
The results of our evaluation showcase the efficacy of the Dis-Vector model compared to traditional models.

Source Language (Gender) | Target Language (Gender) | MOS Score |
---|---|---|
English Male | English Female | 3.8 |
Hindi Female | Hindi Male | 3.7 |

Source Language (Gender) | Target Language (Gender) | MOS Score |
---|---|---|
English Male | Hindi Female | 3.9 |
Hindi Female | Telugu Male | 3.7 |

Source Lang. | Target Lang. (Gender) | MOS (LIMMITS Baseline) | MOS (Dis-Vector) |
---|---|---|---|
English | English Female | 3.5 | 3.9 |
Hindi | Hindi Female | 3.4 | 3.7 |

Source Lang. | Target Lang. (Gender) | MOS (LIMMITS Baseline) | MOS (Dis-Vector) |
---|---|---|---|
English | Hindi Female | 3.3 | 3.8 |
Hindi | English Female | 3.2 | 3.6 |

Language (Gender) | SpeechSplit2 MOS Score | Dis-Vector MOS Score |
---|---|---|
English Male | 3.4 | 3.8 |
English Female | 3.5 | 3.9 |

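The baseline comparisons in the tables above can be summarized as per-row MOS gains. The short sketch below copies the reported scores and computes the difference for each row plus a simple mean; the helper names are illustrative, and a mean of MOS differences is only a rough summary, not a significance test.

```python
# (baseline MOS, Dis-Vector MOS) pairs copied from the comparison tables above.
LIMMITS_PAIRS = [
    (3.5, 3.9),  # English -> English Female
    (3.4, 3.7),  # Hindi -> Hindi Female
    (3.3, 3.8),  # English -> Hindi Female
    (3.2, 3.6),  # Hindi -> English Female
]
SPEECHSPLIT2_PAIRS = [
    (3.4, 3.8),  # English Male
    (3.5, 3.9),  # English Female
]


def mos_gains(pairs):
    """Return the Dis-Vector minus baseline difference for each row."""
    # Round to two decimals to hide binary floating-point noise.
    return [round(ours - baseline, 2) for baseline, ours in pairs]


def mean_gain(pairs):
    """Return the average MOS gain over all rows, rounded to two decimals."""
    gains = mos_gains(pairs)
    return round(sum(gains) / len(gains), 2)


if __name__ == "__main__":
    print("LIMMITS gains:", mos_gains(LIMMITS_PAIRS), "mean:", mean_gain(LIMMITS_PAIRS))
    print("SpeechSplit2 gains:", mos_gains(SPEECHSPLIT2_PAIRS), "mean:", mean_gain(SPEECHSPLIT2_PAIRS))
```

With the reported numbers, every row shows a gain of between 0.3 and 0.5 MOS points over the corresponding baseline.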
The Dis-Vector model's zero-shot capabilities enable effective voice cloning and conversion across different languages, setting a new benchmark for high-quality, customizable voice synthesis. The results of our experiments, including detailed embeddings and synthesis outputs, are available in the accompanying Git repository.
For more details, please refer to the documentation in this repository! Happy experimenting! 🚀