Skip to content

Latest commit

 

History

History
94 lines (83 loc) · 4.56 KB

Inference.md

File metadata and controls

94 lines (83 loc) · 4.56 KB

Install the code base and the dependencies

git clone https://github.com/yynil/RWKVTTS
git clone https://github.com/yynil/CosyVoice

Add these two directories to the PYTHONPATH

export PYTHONPATH=$PYTHONPATH:/home/user/CosyVoice:/home/user/RWKVTTS

Install the dependencies

conda create -n rwkvtts-311 -y python=3.11
conda activate rwkvtts-311
conda install -y -c conda-forge pynini==2.1.6
sudo apt install cuda-12-6  
cd RWKVTTS
pip install -r rwkvtts_requirements.txt

Download the pretrained models from the following links: https://huggingface.co/yueyulin/rwkv-tts-base

Place the CosyVoice2-0.5B_RWKV_0.19B to local directory. Let's say /home/user/CosyVoice2-0.5B_RWKV_0.19B

Add two directories to the PYTHONPATH

The example code for inference is as follows:

def do_tts(tts_text,prompt_texts,cosyvoice):
    import logging
    for i, (prompt_audio_file, prompt_text) in enumerate(zip(prompt_audios, prompt_texts)):
        logging.info(f'Processing {prompt_text}')
        prompt_speech_16k = load_wav(prompt_audio_file, 16000)
        with torch.no_grad():
            if prompt_text is not None:
                for j, k in enumerate(cosyvoice.inference_zero_shot(tts_text,prompt_text, prompt_speech_16k, stream=False,speed=1)):
                    torchaudio.save('zero_{}_{}.wav'.format(i, j), k['tts_speech'], cosyvoice.sample_rate)
            else:
                for j, k in enumerate(cosyvoice.inference_cross_lingual(tts_text, prompt_speech_16k, stream=False,speed=1)):
                    torchaudio.save('zero_{}_{}.wav'.format(i, j), k['tts_speech'], cosyvoice.sample_rate)
        logging.info(f'Finished processing {prompt_text}')
if __name__ == '__main__':
    from cosyvoice.cli.cosyvoice import CosyVoice2
    import torch
    import sys
    # model_path = '/home/yueyulin/models/CosyVoice2-0.5B_RWKV_0.19B/'
    # device = 'cuda:0'
    print(sys.argv)
    model_path = sys.argv[1]
    device = sys.argv[2] if len(sys.argv) > 2 else 'cuda:0'
    is_flow_only = sys.argv[3]=='True' if len(sys.argv) > 3 else False
    print(f'is_flow_only: {is_flow_only}')
    cosyvoice = CosyVoice2(model_path,device=device,fp16=False,load_jit=False)
    
    from cosyvoice.utils.file_utils import load_wav
    import torchaudio
    prompt_audios = [
        '/home/yueyulin/github/RWKVTTS/zero_shot_prompt.wav',
        '/home/yueyulin/github/RWKVTTS/mine.wav',
        '/home/yueyulin/github/RWKVTTS/new.wav',
        '/home/yueyulin/github/RWKVTTS/Trump.wav',
    ]
    
    if not is_flow_only:
        prompt_texts = [
            '希望你以后做的比我还好呦。',
            '少年强则中国强。',
            '我随便说一句话,我喊开始录就开始录。',
            'numbers of Latino, African American, Asian American and native American voters.'
        ]
    else:
        prompt_texts = [
            None,
            None,
            None,
            None
        ]
    do_tts('Make America great again!',prompt_texts,cosyvoice)

If you pass the prompt_texts as None, the engine will only clone the voice flow and texture which is good to clone voice cross lingual. If you pass the correct prompt texts to the engine, the engine will try to continue to finish the audio tokens following the prompt audio you provided. This will be good to continue the audio you provided but it will be weird when you try to mix languages.

The test source code is test code.

Please change the paths to the correct paths in your system.

You can also use your own prompt audio and text. Since the llm module is to finish your audio tokens for you, so please make sure the audio is clean,complete and the text is correct. Otherwise, the result may not be good.

The following table shows the example results of the above code:

Prompt Audio Prompt Text TTS Text Result
https://github.com/yynil/RWKVTTS/raw/main/zero_shot_prompt.wav 希望你以后做的比我还好呦。 中国在东亚,是世界上最大的国家,也是世界上人口最多的国家。 https://github.com/yynil/RWKVTTS/raw/main/zero_0_0.wav
https://github.com/yynil/RWKVTTS/raw/main/mine.wav 少年强则中国强。 中国在东亚,是世界上最大的国家,也是世界上人口最多的国家。 https://github.com/yynil/RWKVTTS/raw/main/zero_1_0.wav
https://github.com/yynil/RWKVTTS/raw/main/new.wav 我随便说一句话,我喊开始录就开始录。 中国在东亚,是世界上最大的国家,也是世界上人口最多的国家。 https://github.com/yynil/RWKVTTS/raw/main/zero_2_0.wav