Author: Mike Gazzaruso
License: GNU/GPL v3
Version: 0.5.2
This project is a video translation pipeline that extracts speech from a video, transcribes it, translates it, and generates voice-cloned speech using AI. The generated speech is then overlaid on the original video, replacing the original voice while preserving background sounds. It utilizes:
- Whisper for speech-to-text transcription
- MBart for machine translation
- Tortoise-TTS for voice cloning and text-to-speech synthesis
- Demucs for separating vocals and background audio
- FFmpeg for audio and video processing
- Cache System for reusing previously generated voice latents and models
Note: Lip-sync support is planned for future releases.
Key features:
- Automated speech-to-text transcription using Whisper
- Machine translation of transcribed text using MBart
- Voice cloning and text-to-speech synthesis with Tortoise-TTS
- Separation of vocals and background music using Demucs
- Caching mechanism to speed up repeated voice cloning processes
- Full video processing pipeline with FFmpeg
- Modular architecture for easy maintenance and extension
- Advanced Synchronization: Aligns translated audio with the original video timing using:
  - Natural pause detection
  - Adaptive speed adjustment
  - Language-specific timing parameters
  - Intelligent segment splitting
- Jupyter Notebook Support: Easy integration with Google Colab via autodub.ipynb
- Comprehensive Synchronization Metrics: Detailed analysis of synchronization quality
- Real-time Process Control: Ability to stop the translation process at any time
- Live Progress Updates: Detailed text display of process phases in real-time
- Enhanced Error Handling: Robust handling of file permissions and language codes
- Broad format support: Works with various video formats including iPhone videos (MOV), MP4, and others
Ensure you have the following installed:
- Python 3.8+
- FFmpeg (available via `apt`, `brew`, or `choco` depending on your OS)
- CUDA (optional but recommended if running on GPU)
Create a virtual environment and install required dependencies:
python -m venv video_translate_env
source video_translate_env/bin/activate # On Windows use: video_translate_env\Scripts\activate
pip install -r requirements.txt
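Before running the pipeline, it can be useful to confirm that FFmpeg is on the PATH and that PyTorch can see a GPU. A minimal check, assuming `torch` is pulled in by `requirements.txt`:

```python
# Quick sanity check for FFmpeg and CUDA availability
import shutil

import torch

# FFmpeg must be reachable on the PATH for audio/video processing
print("ffmpeg found:", shutil.which("ffmpeg") is not None)

# CUDA is optional but strongly recommended for Whisper, Tortoise-TTS and Demucs
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```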
The project now includes a Streamlit-based graphical user interface for easier use:
# Run the Streamlit interface
streamlit run app.py
# Or use the convenience scripts
./run_streamlit.sh # On Linux/Mac
run_streamlit.bat # On Windows
For more details on the Streamlit interface, see README_STREAMLIT.md.
To process a video with voice cloning:
python autodub.py --input path/to/video.mp4 --output path/to/output.mp4 --source-lang it --target-lang en --voice-samples path/to/voice_samples
To customize synchronization behavior directly from the command line:
python autodub.py --input path/to/video.mp4 --output path/to/output.mp4 --source-lang it --target-lang en --max-speed 1.5 --min-speed 0.8 --pause-threshold -30 --min-pause-duration 200
Create a JSON file with your synchronization settings:
{
"max_speed_factor": 1.5,
"min_speed_factor": 0.8,
"pause_threshold": -30,
"min_pause_duration": 200,
"adaptive_timing": true,
"preserve_sentence_breaks": true
}
Then run:
python autodub.py --input path/to/video.mp4 --output path/to/output.mp4 --source-lang it --target-lang en --sync-config sync_settings.json
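For intuition, the sketch below shows how settings like these could drive synchronization: stretches quieter than `pause_threshold` dBFS and longer than `min_pause_duration` milliseconds mark natural break points, and the speed factor needed to fit a translated segment into its original time slot is clamped between `min_speed_factor` and `max_speed_factor`. This is a conceptual illustration using `pydub` (not necessarily a project dependency), not the project's actual implementation:

```python
# Conceptual illustration of pause detection and speed clamping
import json

from pydub import AudioSegment
from pydub.silence import detect_silence

with open("sync_settings.json") as f:
    cfg = json.load(f)

audio = AudioSegment.from_wav("translated_segment.wav")  # hypothetical input file

# Natural pause detection: silences quieter than the dB threshold and
# at least min_pause_duration milliseconds long
pauses = detect_silence(
    audio,
    min_silence_len=cfg["min_pause_duration"],
    silence_thresh=cfg["pause_threshold"],
)
print("pause candidates (ms):", pauses)

# Adaptive speed adjustment: the factor needed to fit the original slot,
# clamped to the configured range
original_duration_ms = 4200  # hypothetical duration of the original segment
required_speed = len(audio) / original_duration_ms
speed = max(cfg["min_speed_factor"], min(cfg["max_speed_factor"], required_speed))
print(f"required speed {required_speed:.2f} -> applied speed {speed:.2f}")
```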
You can also use our Jupyter notebook for easy integration with Google Colab (a minimal example cell is sketched after these steps):
- Upload the autodub.ipynb notebook to Google Colab
- Follow the step-by-step instructions in the notebook
- Upload your video and voice samples
- Configure synchronization settings
- Run the translation process
- Download the translated video
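For reference, a Colab cell that drives the documented CLI on an uploaded video might look like the sketch below; the file paths are placeholders and the actual autodub.ipynb may differ:

```python
# Example Colab cell: invoke the documented CLI on an uploaded video (conceptual sketch)
import subprocess

subprocess.run(
    [
        "python", "autodub.py",
        "--input", "/content/video.mp4",               # uploaded video (placeholder path)
        "--output", "/content/video_translated.mp4",
        "--source-lang", "it",
        "--target-lang", "en",
        "--voice-samples", "/content/voice_samples",   # directory of .wav samples
        "--sync-config", "sync_settings.json",
    ],
    check=True,
)
```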
Command-line options:
- `--input`: Path to the input video file
- `--output`: Path to save the translated video
- `--source-lang`: Source language (e.g., `it` for Italian)
- `--target-lang`: Target language (e.g., `en` for English)
- `--voice-samples`: Directory containing `.wav` files for voice cloning
- `--no-cache`: Disable caching
- `--clear-cache`: Clear all cached data
- `--clear-voice-cache`: Clear only the voice cache
- `--keep-temp`: Keep temporary files after processing
- `--sync-config`: Path to a JSON file with synchronization configuration
- `--max-speed`: Maximum speed factor for audio adjustment
- `--min-speed`: Minimum speed factor for audio adjustment
- `--pause-threshold`: dB threshold for pause detection
- `--min-pause-duration`: Minimum pause duration in milliseconds
- `--no-adaptive-timing`: Disable adaptive timing based on language
- `--no-preserve-breaks`: Do not preserve sentence breaks
Project structure:
video_translator/
├── __init__.py # Package initialization
├── autodub.py # Main entry point
├── video_translator.py # VideoTranslator class
├── speech_recognition.py # Speech recognition module
├── translation.py # Translation module
├── voice_synthesis.py # Voice synthesis module
├── audio_processing.py # Audio processing module
├── sync_evaluation.py # Synchronization evaluation module
├── utils.py # Utility functions
└── autodub.ipynb # Jupyter notebook for Google Colab
The pipeline runs through the following stages (a conceptual sketch follows this list):
- Extract audio: The script extracts the original audio from the video.
- Transcription: Whisper transcribes the speech.
- Translation: MBart translates the text into the target language.
- Voice Cloning & Speech Synthesis: Tortoise-TTS generates new audio with a cloned voice.
- AI Voice Separation: Demucs separates voice and background sounds.
- Merge Translated Audio: The new translated voice is combined with background audio.
- Reintegrate Audio & Video: The final audio is merged with the original video.
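For orientation, the sketch below shows one way these stages could be wired together with the underlying tools. It is a conceptual illustration, not the project's actual modules: the Whisper model size, the mBART-50 checkpoint, the Tortoise preset, and all file paths are assumptions, and timing alignment is omitted.

```python
# Simplified illustration of the dubbing stages (not the project's actual code)
import subprocess

import whisper
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

# 1. Extract the original audio track with FFmpeg
subprocess.run(["ffmpeg", "-y", "-i", "input.mp4", "-vn", "-ar", "22050", "-ac", "1",
                "original_audio.wav"], check=True)

# 2. Transcribe with Whisper (model size is an assumption)
stt = whisper.load_model("medium")
segments = stt.transcribe("original_audio.wav", language="it")["segments"]

# 3. Translate each segment with mBART-50
tok = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
mt = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tok.src_lang = "it_IT"
translated = []
for seg in segments:
    batch = tok(seg["text"], return_tensors="pt")
    out = mt.generate(**batch, forced_bos_token_id=tok.lang_code_to_id["en_XX"])
    translated.append(tok.batch_decode(out, skip_special_tokens=True)[0])

# 4. Synthesize the translated text with a cloned voice (Tortoise-TTS)
tts = TextToSpeech()
voice_samples = [load_audio("voice_samples/sample1.wav", 22050)]  # placeholder sample
clips = [tts.tts_with_preset(text, voice_samples=voice_samples, preset="fast")
         for text in translated]

# 5. Separate vocals from background with Demucs (two-stem mode)
subprocess.run(["demucs", "--two-stems=vocals", "original_audio.wav"], check=True)

# 6./7. Mix the synthesized voice over the background track and mux it back onto
# the original video. Here the concatenated clips and the Demucs background stem
# are assumed to have been written to translated_voice.wav and background.wav.
subprocess.run(["ffmpeg", "-y", "-i", "translated_voice.wav", "-i", "background.wav",
                "-filter_complex", "amix=inputs=2:duration=longest", "mixed.wav"],
               check=True)
subprocess.run(["ffmpeg", "-y", "-i", "input.mp4", "-i", "mixed.wav",
                "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-shortest",
                "output.mp4"], check=True)
```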
This project implements a caching mechanism to speed up repeated processing (a sketch of the idea follows this list):
- Whisper model caching: The speech-to-text model is stored to avoid reloading.
- Voice conditioning latents caching: Tortoise-TTS voice latents are stored to prevent redundant computation.
- Preprocessed voice samples caching: Speeds up voice cloning across multiple runs.
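One common way to implement such a cache is to key the stored latents on a hash of the voice sample files, so the expensive conditioning step only runs when the samples change. A minimal sketch of that idea (the cache location, file layout, and helper names are assumptions, not the project's actual scheme):

```python
# Cache Tortoise-TTS conditioning latents, keyed by a hash of the voice samples
import hashlib
from pathlib import Path

import torch
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

CACHE_DIR = Path(".cache/voice_latents")  # assumed cache location
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def voice_key(sample_dir: str) -> str:
    """Hash the raw bytes of every .wav sample so the key changes with the samples."""
    h = hashlib.sha256()
    for wav in sorted(Path(sample_dir).glob("*.wav")):
        h.update(wav.read_bytes())
    return h.hexdigest()

def get_conditioning_latents(tts: TextToSpeech, sample_dir: str):
    cache_file = CACHE_DIR / f"{voice_key(sample_dir)}.pt"
    if cache_file.exists():
        return torch.load(cache_file)           # reuse previously computed latents
    samples = [load_audio(str(p), 22050) for p in sorted(Path(sample_dir).glob("*.wav"))]
    latents = tts.get_conditioning_latents(samples)
    torch.save(latents, cache_file)             # store for the next run
    return latents
```

On later runs the cached latents can be passed as `conditioning_latents` to `tts_with_preset` instead of recomputing them from the raw samples.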
To clear cached models and latents:
python autodub.py --clear-cache
To clear only the voice cache:
python autodub.py --clear-voice-cache
The modular architecture makes it easy to extend the project:
- Add new languages: Update the language map in `utils.py` (see the sketch after this list)
- Improve voice synthesis: Modify the voice synthesis module
- Change transcription model: Update the speech recognition module
- Add new features: Create new modules and integrate them into the workflow
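For the first item above, adding a language mostly means extending the mapping from the short CLI codes to the codes each backend expects: Whisper accepts ISO-639-1 codes such as `it` directly, while mBART-50 uses codes like `it_IT` and `en_XX`. A sketch of what such a map might look like (the actual structure in `utils.py` may differ):

```python
# Hypothetical language map: CLI code -> mBART-50 language code
# (Whisper accepts the short ISO-639-1 code directly)
LANGUAGE_MAP = {
    "it": "it_IT",   # Italian
    "en": "en_XX",   # English
    "fr": "fr_XX",   # French
    "de": "de_DE",   # German
    "es": "es_XX",   # Spanish
}

def to_mbart_code(lang: str) -> str:
    try:
        return LANGUAGE_MAP[lang]
    except KeyError:
        raise ValueError(f"Unsupported language: {lang}") from None
```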
Planned for future releases:
- Lip Sync Support: The system will soon include precise lip synchronization.
This project is licensed under the GNU General Public License v3.0. You are free to modify and distribute it under the same terms.
- Mike Gazzaruso - Developer & Creator
- Open-source AI models from OpenAI, Hugging Face, and community contributors.
For inquiries or contributions, feel free to open an issue or a pull request on the project repository.