Author: Mike Gazzaruso
License: GNU/GPL v3
Version: 0.5.2
This project is a video translation pipeline that extracts speech from a video, transcribes it, translates it, and generates voice-cloned speech using AI. The generated speech is then overlaid on the original video, replacing the original voice while preserving background sounds. It utilizes:
- Whisper for speech-to-text transcription
- MBart for machine translation
- Tortoise-TTS for voice cloning and text-to-speech synthesis
- Demucs for separating vocals and background audio
- FFmpeg for audio and video processing
- Cache System for reusing previously generated voice latents and models
Note: Lip-sync support is planned for future releases.
Key features:
- Automated speech-to-text transcription using Whisper
- Machine translation of transcribed text using MBart
- Voice cloning and text-to-speech synthesis with Tortoise-TTS
- Separation of vocals and background music using Demucs
- Caching mechanism to speed up repeated voice cloning processes
- Full video processing pipeline with FFmpeg
- Modular architecture for easy maintenance and extension
- Advanced Synchronization: Aligns translated audio with the original video timing using:
  - Natural pause detection
  - Adaptive speed adjustment
  - Language-specific timing parameters
  - Intelligent segment splitting
- Jupyter Notebook Support: Easy integration with Google Colab via autodub.ipynb
- Comprehensive Synchronization Metrics: Detailed analysis of synchronization quality
- Real-time Process Control: Ability to stop the translation process at any time
- Live Progress Updates: Detailed text display of process phases in real-time
- Enhanced Error Handling: Robust handling of file permissions and language codes
- Broad format support: Works with various video formats including iPhone videos (MOV), MP4, and others
Ensure you have the following installed:
- Python 3.8+
- FFmpeg (available via `apt`, `brew`, or `choco` depending on your OS)
- CUDA (optional but recommended if running on GPU)
Create a virtual environment and install required dependencies:
python -m venv video_translate_env
source video_translate_env/bin/activate # On Windows use: video_translate_env\Scripts\activate
pip install -r requirements.txt
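Before running the pipeline, it can be useful to confirm that FFmpeg is on the PATH and that PyTorch can see a GPU. A minimal check, assuming `torch` is pulled in by `requirements.txt`:

```python
# Quick sanity check for FFmpeg and CUDA availability
import shutil

import torch

# FFmpeg must be reachable on the PATH for audio/video processing
print("ffmpeg found:", shutil.which("ffmpeg") is not None)

# CUDA is optional but strongly recommended for Whisper, Tortoise-TTS and Demucs
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```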
The project now includes a Streamlit-based graphical user interface for easier use:
# Run the Streamlit interface
streamlit run app.py
# Or use the convenience scripts
./run_streamlit.sh # On Linux/Mac
run_streamlit.bat # On Windows
For more details on the Streamlit interface, see README_STREAMLIT.md.
To process a video with voice cloning:
python autodub.py --input path/to/video.mp4 --output path/to/output.mp4 --source-lang it --target-lang en --voice-samples path/to/voice_samples
To customize synchronization behavior directly from the command line:
python autodub.py --input path/to/video.mp4 --output path/to/output.mp4 --source-lang it --target-lang en --max-speed 1.5 --min-speed 0.8 --pause-threshold -30 --min-pause-duration 200
Create a JSON file with your synchronization settings:
{
"max_speed_factor": 1.5,
"min_speed_factor": 0.8,
"pause_threshold": -30,
"min_pause_duration": 200,
"adaptive_timing": true,
"preserve_sentence_breaks": true
}
Then run:
python autodub.py --input path/to/video.mp4 --output path/to/output.mp4 --source-lang it --target-lang en --sync-config sync_settings.json
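For intuition, the sketch below shows how settings like these could drive synchronization: stretches quieter than `pause_threshold` dBFS and longer than `min_pause_duration` milliseconds mark natural break points, and the speed factor needed to fit a translated segment into its original time slot is clamped between `min_speed_factor` and `max_speed_factor`. This is a conceptual illustration using `pydub` (not necessarily a project dependency), not the project's actual implementation:

```python
# Conceptual illustration of pause detection and speed clamping
import json

from pydub import AudioSegment
from pydub.silence import detect_silence

with open("sync_settings.json") as f:
    cfg = json.load(f)

audio = AudioSegment.from_wav("translated_segment.wav")  # hypothetical input file

# Natural pause detection: silences quieter than the dB threshold and
# at least min_pause_duration milliseconds long
pauses = detect_silence(
    audio,
    min_silence_len=cfg["min_pause_duration"],
    silence_thresh=cfg["pause_threshold"],
)
print("pause candidates (ms):", pauses)

# Adaptive speed adjustment: the factor needed to fit the original slot,
# clamped to the configured range
original_duration_ms = 4200  # hypothetical duration of the original segment
required_speed = len(audio) / original_duration_ms
speed = max(cfg["min_speed_factor"], min(cfg["max_speed_factor"], required_speed))
print(f"required speed {required_speed:.2f} -> applied speed {speed:.2f}")
```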
You can also use our Jupyter notebook for easy integration with Google Colab (a minimal example cell is sketched after these steps):
- Upload the autodub.ipynb notebook to Google Colab
- Follow the step-by-step instructions in the notebook
- Upload your video and voice samples
- Configure synchronization settings
- Run the translation process
- Download the translated video
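For reference, a Colab cell that drives the documented CLI on an uploaded video might look like the sketch below; the file paths are placeholders and the actual autodub.ipynb may differ:

```python
# Example Colab cell: invoke the documented CLI on an uploaded video (conceptual sketch)
import subprocess

subprocess.run(
    [
        "python", "autodub.py",
        "--input", "/content/video.mp4",               # uploaded video (placeholder path)
        "--output", "/content/video_translated.mp4",
        "--source-lang", "it",
        "--target-lang", "en",
        "--voice-samples", "/content/voice_samples",   # directory of .wav samples
        "--sync-config", "sync_settings.json",
    ],
    check=True,
)
```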
Command-line options:
- `--input`: Path to the input video file
- `--output`: Path to save the translated video
- `--source-lang`: Source language (e.g., `it` for Italian)
- `--target-lang`: Target language (e.g., `en` for English)
- `--voice-samples`: Directory containing `.wav` files for voice cloning
- `--no-cache`: Disable caching
- `--clear-cache`: Clear all cached data
- `--clear-voice-cache`: Clear only the voice cache
- `--keep-temp`: Keep temporary files after processing
- `--sync-config`: Path to a JSON file with synchronization configuration
- `--max-speed`: Maximum speed factor for audio adjustment
- `--min-speed`: Minimum speed factor for audio adjustment
- `--pause-threshold`: dB threshold for pause detection
- `--min-pause-duration`: Minimum pause duration in milliseconds
- `--no-adaptive-timing`: Disable adaptive timing based on language
- `--no-preserve-breaks`: Do not preserve sentence breaks
Project structure:
video_translator/
├── __init__.py # Package initialization
├── autodub.py # Main entry point
├── video_translator.py # VideoTranslator class
├── speech_recognition.py # Speech recognition module
├── translation.py # Translation module
├── voice_synthesis.py # Voice synthesis module
├── audio_processing.py # Audio processing module
├── sync_evaluation.py # Synchronization evaluation module
├── utils.py # Utility functions
└── autodub.ipynb # Jupyter notebook for Google Colab
The pipeline runs through the following stages (a conceptual sketch follows this list):
- Extract audio: The script extracts the original audio from the video.
- Transcription: Whisper transcribes the speech.
- Translation: MBart translates the text into the target language.
- Voice Cloning & Speech Synthesis: Tortoise-TTS generates new audio with a cloned voice.
- AI Voice Separation: Demucs separates voice and background sounds.
- Merge Translated Audio: The new translated voice is combined with background audio.
- Reintegrate Audio & Video: The final audio is merged with the original video.
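For orientation, the sketch below shows one way these stages could be wired together with the underlying tools. It is a conceptual illustration, not the project's actual modules: the Whisper model size, the mBART-50 checkpoint, the Tortoise preset, and all file paths are assumptions, and timing alignment is omitted.

```python
# Simplified illustration of the dubbing stages (not the project's actual code)
import subprocess

import whisper
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

# 1. Extract the original audio track with FFmpeg
subprocess.run(["ffmpeg", "-y", "-i", "input.mp4", "-vn", "-ar", "22050", "-ac", "1",
                "original_audio.wav"], check=True)

# 2. Transcribe with Whisper (model size is an assumption)
stt = whisper.load_model("medium")
segments = stt.transcribe("original_audio.wav", language="it")["segments"]

# 3. Translate each segment with mBART-50
tok = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
mt = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tok.src_lang = "it_IT"
translated = []
for seg in segments:
    batch = tok(seg["text"], return_tensors="pt")
    out = mt.generate(**batch, forced_bos_token_id=tok.lang_code_to_id["en_XX"])
    translated.append(tok.batch_decode(out, skip_special_tokens=True)[0])

# 4. Synthesize the translated text with a cloned voice (Tortoise-TTS)
tts = TextToSpeech()
voice_samples = [load_audio("voice_samples/sample1.wav", 22050)]  # placeholder sample
clips = [tts.tts_with_preset(text, voice_samples=voice_samples, preset="fast")
         for text in translated]

# 5. Separate vocals from background with Demucs (two-stem mode)
subprocess.run(["demucs", "--two-stems=vocals", "original_audio.wav"], check=True)

# 6./7. Mix the synthesized voice over the background track and mux it back onto
# the original video. Here the concatenated clips and the Demucs background stem
# are assumed to have been written to translated_voice.wav and background.wav.
subprocess.run(["ffmpeg", "-y", "-i", "translated_voice.wav", "-i", "background.wav",
                "-filter_complex", "amix=inputs=2:duration=longest", "mixed.wav"],
               check=True)
subprocess.run(["ffmpeg", "-y", "-i", "input.mp4", "-i", "mixed.wav",
                "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-shortest",
                "output.mp4"], check=True)
```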
This project implements a caching mechanism to speed up repeated processing (a sketch of the idea follows this list):
- Whisper model caching: The speech-to-text model is stored to avoid reloading.
- Voice conditioning latents caching: Tortoise-TTS voice latents are stored to prevent redundant computation.
- Preprocessed voice samples caching: Speeds up voice cloning across multiple runs.
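One common way to implement such a cache is to key the stored latents on a hash of the voice sample files, so the expensive conditioning step only runs when the samples change. A minimal sketch of that idea (the cache location, file layout, and helper names are assumptions, not the project's actual scheme):

```python
# Cache Tortoise-TTS conditioning latents, keyed by a hash of the voice samples
import hashlib
from pathlib import Path

import torch
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

CACHE_DIR = Path(".cache/voice_latents")  # assumed cache location
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def voice_key(sample_dir: str) -> str:
    """Hash the raw bytes of every .wav sample so the key changes with the samples."""
    h = hashlib.sha256()
    for wav in sorted(Path(sample_dir).glob("*.wav")):
        h.update(wav.read_bytes())
    return h.hexdigest()

def get_conditioning_latents(tts: TextToSpeech, sample_dir: str):
    cache_file = CACHE_DIR / f"{voice_key(sample_dir)}.pt"
    if cache_file.exists():
        return torch.load(cache_file)           # reuse previously computed latents
    samples = [load_audio(str(p), 22050) for p in sorted(Path(sample_dir).glob("*.wav"))]
    latents = tts.get_conditioning_latents(samples)
    torch.save(latents, cache_file)             # store for the next run
    return latents
```

On later runs the cached latents can be passed as `conditioning_latents` to `tts_with_preset` instead of recomputing them from the raw samples.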
To clear cached models and latents:
python autodub.py --clear-cache
To clear only the voice cache:
python autodub.py --clear-voice-cache
The modular architecture makes it easy to extend the project:
- Add new languages: Update the language map in `utils.py` (see the sketch after this list)
- Improve voice synthesis: Modify the voice synthesis module
- Change transcription model: Update the speech recognition module
- Add new features: Create new modules and integrate them into the workflow
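For the first item above, adding a language mostly means extending the mapping from the short CLI codes to the codes each backend expects: Whisper accepts ISO-639-1 codes such as `it` directly, while mBART-50 uses codes like `it_IT` and `en_XX`. A sketch of what such a map might look like (the actual structure in `utils.py` may differ):

```python
# Hypothetical language map: CLI code -> mBART-50 language code
# (Whisper accepts the short ISO-639-1 code directly)
LANGUAGE_MAP = {
    "it": "it_IT",   # Italian
    "en": "en_XX",   # English
    "fr": "fr_XX",   # French
    "de": "de_DE",   # German
    "es": "es_XX",   # Spanish
}

def to_mbart_code(lang: str) -> str:
    try:
        return LANGUAGE_MAP[lang]
    except KeyError:
        raise ValueError(f"Unsupported language: {lang}") from None
```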
Planned for future releases:
- Lip Sync Support: The system will soon include precise lip synchronization.
This project is licensed under the GNU General Public License v3.0. You are free to modify and distribute it under the same terms.
- Mike Gazzaruso - Developer & Creator
- Open-source AI models from OpenAI, Hugging Face, and community contributors.
For inquiries or contributions, feel free to open an issue or a pull request on the project repository.