VoiceCraftST is a Python API that integrates the advanced text-to-speech capabilities of VoiceCraft with SillyTavern. This API enables the use of VoiceCraft's features within SillyTavern without any need for modifications on the SillyTavern platform.
- 2024.4.27 Implemented sliding window and bumped to latest version of VC
VoiceCraftST is supported on Linux systems with NVIDIA GPUs to avoid complex dependencies required for VoiceCraft model.
- Docker: Install Docker by following the guide on Docker's official website.
- NVIDIA Docker: Required for NVIDIA GPU support. Installation instructions can be found on NVIDIA Docker's GitHub repository.
-
Clone the Repository: Clone VoiceCraftST and initialize its submodule:
git clone https://github.com/Keeo/VoiceCraftST.git cd VoiceCraftST git submodule init git submodule update
-
Deploy Using Docker-Compose: Navigate to the directory containing
docker-compose.yaml
and execute:docker-compose up
This command builds the necessary Docker images and starts the service, making VCST available at port 5000 on your local machine.
Configure SillyTavern to use VoiceCraftST by following these steps:
- Choose TTS Provider: In SillyTavern settings, select 'XTTSv2' as the TTS provider.
- TTS Features:
- Enable TTS: Ensure the 'Enabled' checkbox is checked.
- Auto Generation: Activate automatic speech generation from text.
- Narration and Text Preferences: Configure preferences for narrating user messages, handling special text formats (like quotes or code), and processing text with special characters.
-
Provider Endpoint Configuration: Set up the XTTS endpoint with specific parameters to optimize performance:
http://localhost:5000/{username}/{stop_repetition}/{sample_batch_size}
Customize settings such as
username
,stop_repetition
, andsample_batch_size
based on your requirements.username
: Choose any username you want, it is used as a key for rest of the configuration.stop_repetition
: if the model generate long silence, reduce the stop_repetition to 3, 2 or even 1 (default: 3)sample_batch_size
: if the if there are long silence or unnaturally strecthed words, increase sample_batch_size to 5 or higher. What this will do to the model is that the model will run sample_batch_size examples of the same audio, and pick the one that's the shortest. So if the speech rate of the generated is too fast change it to a smaller number. (default: 4)
-
Parameter Tuning: Adjust TTS parameters like
Temperature
,Top P
andTop K
to fine-tune the speech generation characteristics.- Speed: Does nothing.
- Temperature: Sets the variance in speech generation. (default: 1.0)
- Length Penalty: Does nothing.
- Repetition Penalty: Does nothing.
- Top K: (default: 0)
- Top P: (default: 0.8)
- Stream Chunk Size: Does nothing and streaming has not been implemented.
For better handling of large text blocks, enable the 'Text Splitting' feature to ensure smooth and continuous narration. More at VoiceCraft#39. (default: True)
To add new voices to the system:
- Prepare Voice Sample: Record a mono WAV file at 16,000 Hz sample rate.
- Create Transcript: Accurately transcribe the voice sample.
- Upload Command: Use the following command to upload the new voice sample and its transcript:
Replace placeholders with actual data. After uploading, reload SillyTavern to access the new voice.
make upload SPEAKER_NAME=sample TRANSCRIPT="Your transcript here." FILE_PATH=sample.wav
In current state it requires 24gb GPU VRAM requirements for training, finetuning, and inference? but it is likely to go down. Speed is slower compared to XTTSv2 and with TopP below 0.85 it likes to generate long silences which decrease the speed even further. sample_batch_size
parameter is supposed to help with that by brute-force and generating multiple alternatives and picking shortest.