These are mostly borrowed from the predefined aliases in Marian's `--task` option: `transformer-base` and `transformer-big`.

To have enough batch size depending on the GPUs used, my use cases have been (a config sketch follows the list):
- transformer-base on 4x 12GB GPUs: workspace 8000 and optimizer-delay 4.
- transformer-base on 1x 12GB GPU: workspace 8000 and optimizer-delay 8.
- transformer-big on 4x 40GB GPUs: workspace 30000 and optimizer-delay 2.
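A minimal sketch of how these numbers map onto a Marian training config, taking the first case above; `mini-batch-fit` is my assumption here, since it is what lets `workspace` determine the effective batch size:

```yaml
# sketch: transformer-base on 4x 12GB GPUs (first case in the list above)
task: transformer-base   # predefined alias for the base architecture
devices: [0, 1, 2, 3]
workspace: 8000          # MB of GPU memory reserved for batches
mini-batch-fit: true     # assumption: size mini-batches to fill the workspace
optimizer-delay: 4       # accumulate gradients over 4 batches per update
```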
The vocabulary part would be:
```yaml
dim-vocabs:
  - 32000
  - 32000
vocabs:
  - vocab.spm # shared vocab
  - vocab.spm
tied-embeddings-all: true # tie source embeddings with target and output embeddings
```
to share the SentencePiece vocab and embeddings. This makes Marian very easy to use, as it only needs raw text as input and handles all the tokenization itself.
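As a sketch of that raw-text workflow, a hypothetical data section to pair with the vocab options above (file names are placeholders):

```yaml
# hypothetical data section: Marian applies the SentencePiece model itself,
# so the corpora are plain, untokenized text
train-sets:
  - corpus.src
  - corpus.trg
valid-sets:
  - dev.src
  - dev.trg
```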
For languages that don't share a script:
```yaml
dim-vocabs:
  - 32000
  - 32000
vocabs:
  - vocab.src.spm # separate vocabs
  - vocab.trg.spm
tied-embeddings: true # tie only target and output embeddings
```
We might want to enable byte fallback for all the SentencePiece vocabs to mitigate broken outputs when an unusual character comes in:
```yaml
sentence-piece-options: '--byte_fallback'
```
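Put together with the shared-vocab setup, that would look like the sketch below; I'm assuming here that `sentence-piece-options` is forwarded to the SentencePiece trainer when the vocab is built, so it only takes effect at vocab-creation time:

```yaml
# sketch: shared vocab plus byte fallback (assumes pass-through to the spm trainer)
dim-vocabs:
  - 32000
  - 32000
vocabs:
  - vocab.spm
  - vocab.spm
tied-embeddings-all: true
sentence-piece-options: '--byte_fallback'  # unseen characters decompose into byte pieces instead of <unk>
```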