
Training Polish Language - would changing the tokenizer help? #51

Answered by HobisPL
Kamil-Roszak asked this question in Q&A
I tested it: after changing the tokenizer, it is possible to get correct pronunciation in a given language, but the tokenizer has to be changed for both training and synthesis.

You need to generate your own tokenizer.json for your language, preferably from an ebook whose text contains every letter of the alphabet. DLAS provides a script for generating the tokenizer, DL-Art-School/codes/data/audio/voice_tokenizer.py, but it does not work properly.
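
Before building the tokenizer, it may help to confirm that the source text really does contain every letter. The following is only a rough sketch for Polish, not part of DLAS or tortoise-tts-fast: the function name `missing_letters` and the example file name are made up for illustration, and it uses only the Python standard library.

```python
import unicodedata

# The 32 letters of the Polish alphabet (q, v and x are not part of it).
POLISH_ALPHABET = set("aąbcćdeęfghijklłmnńoóprsśtuwyzźż")


def missing_letters(text: str) -> list:
    """Return the Polish letters that never occur in `text`, sorted."""
    # NFC normalization so that "a" + combining ogonek is counted as "ą".
    normalized = unicodedata.normalize("NFC", text).lower()
    return sorted(POLISH_ALPHABET - set(normalized))


# Hypothetical usage; "ebook.txt" is a placeholder for your own corpus:
#     with open("ebook.txt", encoding="utf-8") as f:
#         print(missing_letters(f.read()))
```

An empty list means every letter is present in the text; any letter that is returned is one the corpus does not contain, so a different or larger text would be a safer starting point.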

For training, you just need to add the following line to EXAMPLE_gpt.yml, in both the train and val sections:
tokenizer_vocab: path/to/tokenizer_json

In the case of tortoise-tts-fast, you need to change the path in the file tortoise/utils/tokenizer.py on line 180 to your o…

Replies: 3 comments 20 replies

Answer selected by 152334H
Category
Q&A
6 participants