Clone the two repositories:

```bash
git clone https://github.com/yynil/RWKVTTS
git clone https://github.com/yynil/CosyVoice
```
Add these two directories to the PYTHONPATH:

```bash
export PYTHONPATH=$PYTHONPATH:/home/user/CosyVoice:/home/user/RWKVTTS
```
Create and activate a conda environment, then install the dependencies:

```bash
conda create -n rwkvtts-311 -y python=3.11
conda activate rwkvtts-311
conda install -y -c conda-forge pynini==2.1.6
sudo apt install cuda-12-6
cd RWKVTTS
pip install -r rwkvtts_requirements.txt
```
Download the pretrained model from https://huggingface.co/yueyulin/rwkv-tts-base
Place CosyVoice2-0.5B_RWKV_0.19B in a local directory, say /home/user/CosyVoice2-0.5B_RWKV_0.19B.
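If you prefer to script the download, here is a minimal sketch using the huggingface_hub package (not part of the steps above; depending on the repository layout you may need to point `local_dir` at a parent directory and move the weights into place):

```python
# A sketch of fetching the pretrained weights with huggingface_hub
# (install with `pip install huggingface_hub`); downloading via git
# or the web UI on the model page works just as well.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id='yueyulin/rwkv-tts-base',
    local_dir='/home/user/CosyVoice2-0.5B_RWKV_0.19B',  # example path from above
)
```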
The example code for inference is as follows:
```python
import logging
import sys

import torch
import torchaudio

from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav


def do_tts(tts_text, prompt_audios, prompt_texts, cosyvoice):
    for i, (prompt_audio_file, prompt_text) in enumerate(zip(prompt_audios, prompt_texts)):
        logging.info(f'Processing {prompt_text}')
        prompt_speech_16k = load_wav(prompt_audio_file, 16000)
        with torch.no_grad():
            if prompt_text is not None:
                # Zero-shot mode: continue the audio tokens conditioned on both
                # the prompt audio and its transcript.
                for j, k in enumerate(cosyvoice.inference_zero_shot(tts_text, prompt_text, prompt_speech_16k, stream=False, speed=1)):
                    torchaudio.save('zero_{}_{}.wav'.format(i, j), k['tts_speech'], cosyvoice.sample_rate)
            else:
                # Cross-lingual mode: only the voice's flow and texture are cloned.
                for j, k in enumerate(cosyvoice.inference_cross_lingual(tts_text, prompt_speech_16k, stream=False, speed=1)):
                    torchaudio.save('zero_{}_{}.wav'.format(i, j), k['tts_speech'], cosyvoice.sample_rate)
        logging.info(f'Finished processing {prompt_text}')


if __name__ == '__main__':
    # Usage: python <script> <model_path> [device] [is_flow_only]
    # e.g. model_path = '/home/user/CosyVoice2-0.5B_RWKV_0.19B', device = 'cuda:0'
    print(sys.argv)
    model_path = sys.argv[1]
    device = sys.argv[2] if len(sys.argv) > 2 else 'cuda:0'
    is_flow_only = sys.argv[3] == 'True' if len(sys.argv) > 3 else False
    print(f'is_flow_only: {is_flow_only}')
    cosyvoice = CosyVoice2(model_path, device=device, fp16=False, load_jit=False)
    prompt_audios = [
        '/home/yueyulin/github/RWKVTTS/zero_shot_prompt.wav',
        '/home/yueyulin/github/RWKVTTS/mine.wav',
        '/home/yueyulin/github/RWKVTTS/new.wav',
        '/home/yueyulin/github/RWKVTTS/Trump.wav',
    ]
    if not is_flow_only:
        prompt_texts = [
            '希望你以后做的比我还好呦。',
            '少年强则中国强。',
            '我随便说一句话,我喊开始录就开始录。',
            'numbers of Latino, African American, Asian American and native American voters.'
        ]
    else:
        # No transcripts: clone the voice only (cross-lingual mode).
        prompt_texts = [None, None, None, None]
    do_tts('Make America great again!', prompt_audios, prompt_texts, cosyvoice)
```
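Run the script with the model path and, optionally, a device and the flow-only flag, e.g. `python test_tts.py /home/user/CosyVoice2-0.5B_RWKV_0.19B cuda:0 True` (the script name here is just a placeholder for wherever you saved the code; passing `True` as the third argument selects the flow-only, no-transcript mode).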
If you pass prompt_texts as None, the engine clones only the voice's flow and texture, which is well suited to cross-lingual voice cloning. If you pass the correct prompt texts, the engine tries to continue the audio tokens following the prompt audio you provided; this works well for continuing the prompt audio in the same language, but can sound odd when you mix languages.
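For instance, here is a minimal sketch of the cross-lingual mode in isolation (it reuses `cosyvoice`, `load_wav`, and one of the prompt files from the script above; the English TTS text is just an illustration):

```python
# Clone a Chinese speaker's voice and have it speak English: no prompt
# text is given, so only the timbre is transferred.
prompt_speech_16k = load_wav('/home/yueyulin/github/RWKVTTS/zero_shot_prompt.wav', 16000)
for j, k in enumerate(cosyvoice.inference_cross_lingual('The weather is lovely today.', prompt_speech_16k, stream=False, speed=1)):
    torchaudio.save(f'cross_lingual_{j}.wav', k['tts_speech'], cosyvoice.sample_rate)
```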
The full test script is included in the RWKVTTS repository. Change the paths above to match your system.
You can also use your own prompt audio and text. Since the LLM module's job is to continue your audio tokens, make sure the prompt audio is clean and complete and its transcript is accurate; otherwise, the results may be poor.
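If you record your own prompt, a small preparation sketch like the following can help get it into the 16 kHz mono form that `load_wav` expects (this assumes torchaudio; `my_prompt.wav` is a placeholder filename):

```python
import torchaudio
import torchaudio.functional as F

# Load the recording, downmix to mono, and resample to 16 kHz.
waveform, sr = torchaudio.load('my_prompt.wav')
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)  # stereo -> mono
if sr != 16000:
    waveform = F.resample(waveform, orig_freq=sr, new_freq=16000)
torchaudio.save('my_prompt_16k.wav', waveform, 16000)
```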
The following table shows example results from the code above:
| Prompt Audio | Prompt Text | TTS Text | Result |
|---|---|---|---|
| [zero_shot_prompt.wav](https://github.com/yynil/RWKVTTS/raw/main/zero_shot_prompt.wav) | 希望你以后做的比我还好呦。 | 中国在东亚,是世界上最大的国家,也是世界上人口最多的国家。 | [zero_0_0.wav](https://github.com/yynil/RWKVTTS/raw/main/zero_0_0.wav) |
| [mine.wav](https://github.com/yynil/RWKVTTS/raw/main/mine.wav) | 少年强则中国强。 | 中国在东亚,是世界上最大的国家,也是世界上人口最多的国家。 | [zero_1_0.wav](https://github.com/yynil/RWKVTTS/raw/main/zero_1_0.wav) |
| [new.wav](https://github.com/yynil/RWKVTTS/raw/main/new.wav) | 我随便说一句话,我喊开始录就开始录。 | 中国在东亚,是世界上最大的国家,也是世界上人口最多的国家。 | [zero_2_0.wav](https://github.com/yynil/RWKVTTS/raw/main/zero_2_0.wav) |