Vocabulary size too high
If you see an error like this during preprocessing:
```
RuntimeError: Internal: src/trainer_interface.cc(590) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] Vocabulary size too high (24000). Please set it to a value <= 12638.
```
This indicates that SentencePiece was not able to create a vocabulary of the size specified. The error message does not say whether it is the source or the target vocabulary that is too large, so you may want to set them to different values initially so that you can see which one needs to be reduced. This issue often arises when working with the small corpus of a New Testament or Bible in a low-resource language. When training a parent model with millions of sentences, much larger vocabulary sizes are possible. For many of our experiments we've found that the largest vocabulary size the corpus can support gives a good result.
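If it is not obvious what the limit is for a given corpus, you can probe it before preprocessing with the standalone sentencepiece Python package. The sketch below is only illustrative and not part of the toolkit: src-text.txt and the probe prefix are placeholder names, and hard_vocab_limit=False tells SentencePiece to shrink the vocabulary to whatever the data supports instead of raising the error.

```python
# Illustrative sketch: probe the largest vocabulary SentencePiece can build
# from a corpus. "src-text.txt" and the "probe" prefix are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="src-text.txt",      # plain text, one sentence per line
    model_prefix="probe",      # writes probe.model and probe.vocab
    vocab_size=24000,          # intentionally optimistic
    hard_vocab_limit=False,    # shrink instead of failing
)

sp = spm.SentencePieceProcessor(model_file="probe.model")
print("Achievable vocabulary size:", sp.get_piece_size())
```

The number printed gives a good idea of the value you can safely use for src_vocab_size or trg_vocab_size in the config.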
If this config file:
```yaml
data:
  corpus_pairs:
    - type: train,val,test
      src: src-text
      trg: trg-text
  share_vocab: false
  src_vocab_size: 24000
  trg_vocab_size: 32000
```
caused the error above during preprocessing, then editing it like this:
```yaml
data:
  corpus_pairs:
    - type: train,val,test
      src: src-text
      trg: trg-text
  share_vocab: false
  src_vocab_size: 12638
  trg_vocab_size: 32000
```
should solve the problem.
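The error text shown above contains the usable maximum, so if you are scripting several preprocessing runs you can pull the value out rather than reading it by hand. A trivial illustration, using the error string from the example above:

```python
# Illustrative only: extract the suggested maximum vocab size from the error text.
import re

error_text = "Vocabulary size too high (24000). Please set it to a value <= 12638."

match = re.search(r"value <= (\d+)", error_text)
if match:
    print("Set the vocab size to at most", match.group(1))  # prints 12638
```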