Configure a model
The parameters for training a model are stored in the experiment folder in a file named 'config.yml'. The file uses the YAML format. Related settings are grouped together in sections.
These are the sections of a config file:
data:
eval:
infer:
model:
params:
score:
train:
It is not necessary to specify options for all of these sections for every training run. Only parameters that differ from the default values need to be specified.
A minimal config.yml file looks like this:
data:
  corpus_pairs:
    - type: train,val,test
      src: src-text
      trg: trg-text
  share_vocab: false
  src_vocab_size: 24000
  trg_vocab_size: 32000
This minimal config file provides these instructions to the system: train a model to translate between the src and trg languages. Split the texts into three parts: one for training, one for validation, and one for testing. Use the default sizes for the validation and test sets and all the remaining data for training. Create separate vocab files for the source and target languages. Instruct SentencePiece to create a source vocab of 24000 tokens and a target vocab of 32000 tokens. Use the defaults for all the other settings, including the default model architecture and default early stopping conditions.
More information about how to configure training can be found in the OpenNMT-tf documentation. Another way to learn how to configure training is by examining an effective config file.
The parallel texts available for low-resource languages are translations of Scripture that are aligned by verse reference.
When the aligned Scripture files are used as a corpus pair, it is possible to select parts of the data for training and testing without having to split the text files prior to training.
We have added a corpus_books config option for this purpose. There is also a similar option, test_books, to specify which books to include in the test set.
The example below shows the corpus_pairs section for restricting an experiment to the data in the New Testament. The training, validation, and test sets are all drawn only from that data.
corpus_pairs:
  - type: train,test,val
    corpus_books: NT
    src: src-bible
    trg: trg-bible
    val_size: 250
    test_size: 250
The following example shows how to specify a corpus_pairs section that uses the New Testament, Genesis, and Psalms for the training and validation sets. It also shows how to restrict the test set to verses from the book of Exodus.
corpus_pairs:
  - type: train,val,test
    corpus_books: NT,GEN,PSA
    src: src-bible
    trg: trg-bible
    val_size: 250
    test_books: EXO
    test_size: 250
seed: 111
In this example the book of Exodus is reserved for the test set and the remaining books of the Bible are available for training and validation. The test_books parameter excludes the books listed there from appearing in the training or validation sets. So even though only 250 verses of Exodus are used for the test set, none of the remaining verses are included in either the training or validation sets. Therefore the test_books parameter may be used to restrict the training to a smaller set of data without having to modify the data files.
No error is raised if you specify a test_size larger than the number of verses in the test_books. In that case all of the verses in the test_books will be used as the test set. For example, in the following config the test set is drawn only from Exodus; if Exodus contained fewer than 250 verses, all of them would be used:
model: SILTransformerBase
data:
  corpus_pairs:
    - type: train,val,test
      src: src-bible
      trg: trg-bible
      val_size: 250
      test_books: EXO
      test_size: 250
Alternative syntax for corpus_books and test_books: chapter specifications, book ranges, and subtraction.
In addition to using comma-separated lists to specify the books used for training and testing, it is also possible to specify data at the chapter level, with book ranges, and with subtraction. To do this, use a semicolon-separated list, where each section has one of the following formats:
- A comma-separated list of chapters and chapter ranges for a specific book, e.g. MAT1,2,6-10
- A range of books, e.g. GEN-DEU
- A single book or testament, e.g. MAT, OT
- To subtract some data from the selection, use one of the above types preceded by -, e.g. -MAT1-4, -GEN-LEV. Sections are evaluated in the order that they appear, so make sure the selection being subtracted has already been added to the data set.
Examples:
GEN;EXO;LEV
OT;MAT-ROM;-ACT4-28
NT;-3JN
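Any of these selections can be used directly as the value of corpus_books or test_books. Below is a minimal sketch reusing the src-bible and trg-bible names from the earlier examples:
data:
  corpus_pairs:
    - type: train,val,test
      src: src-bible
      trg: trg-bible
      corpus_books: OT;MAT-ROM;-ACT4-28   # OT plus Matthew through Romans, minus Acts 4-28
      test_books: MAT1-4                   # draw the test set from Matthew 1-4 only
      val_size: 250
      test_size: 250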
There are several ways to use more than one source in your experiment data. If you want to use different sources to get data from different parts of a text, you can define multiple corpus pairs. This is useful when a source has incomplete data, or when you want to use different sources for training vs. evaluation and testing.
data:
  corpus_pairs:
    - type: train,val,test
      src: src-bible1
      trg: trg-bible
      corpus_books: GEN,EXO
      test_books: LEV
    - type: train,val,test
      src: src-bible2
      trg: trg-bible
      corpus_books: NUM,DEU
      test_books: JOS
If you instead want to use multiple sources but want to select data from the same portion of the texts, you can define a mixed-source corpus pair. This will equally and randomly choose verses from each text without overlap.
data:
  corpus_pairs:
    - mapping: mixed_src
      type: train,val,test
      src:
        - src-bible1
        - src-bible2
      trg: trg-bible
      corpus_books: GEN,EXO
      test_books: LEV
Additionally, the many_to_many mapping allows you to map multiple sources to multiple targets.
data:
  corpus_pairs:
    - mapping: many_to_many
      type: train,val,test
      src:
        - src-bible1
        - src-bible2
      trg:
        - trg-bible1
        - trg-bible2
      corpus_books: GEN,EXO
      test_books: LEV
Abbreviations for Old Testament Books
GEN EXO LEV NUM DEU JOS JDG RUT 1SA 2SA 1KI 2KI 1CH 2CH EZR NEH EST JOB PSA PRO
ECC SNG ISA JER LAM EZK DAN HOS JOL AMO OBA JON MIC NAM HAB ZEP HAG ZEC MAL
Abbreviations for New Testament Books
MAT MRK LUK JHN ACT ROM 1CO 2CO GAL EPH PHP COL 1TH 2TH 1TI 2TI TIT PHM HEB JAS 1PE 2PE 1JN 2JN 3JN JUD REV
Abbreviations for Deuterocanonical Books
TOB JDT ESG WIS SIR BAR LJE S3Y SUS BEL 1MA 2MA 3MA 4MA 1ES 2ES MAN PS2 ODA PSS JSA JDB TBS SST DNT BLT
3ES EZA 5EZ 6EZ INT CNC GLO TDX NDX DAG PS3 2BA LBA JUB ENO 1MQ 2MQ 3MQ REP 4BA LAO
The seed parameter is used as a seed for a random number generator. The benefit of setting it explicitly is that the same random selection of validation and test set verses is chosen from the available data on every run. This makes it possible to compare the effect of changing other parameters against an identical test set. If the seed is not set explicitly, the contents of the training, validation, and test sets will vary between one training run and the next.
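For example, the seed can be fixed in the data section, as in the earlier examples:
data:
  seed: 111   # any fixed integer; reusing the same value reproduces the same train/val/test split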
YAML is designed to be easy to read. It is useful to know that there are various ways to specify a list. Inline lists are separated with commas, and square brackets are optional for a simple list. For a list that is too long for a single line, each item can be placed on a separate line, preceded by a hyphen and a space.
These are three ways of indicating the same list:
test_books: GEN,EXO,LEV,NUM,DEU
test_books: [GEN,EXO,LEV,NUM,DEU]
test_books:
  - GEN
  - EXO
  - LEV
  - NUM
  - DEU
The hyphen and space (- ) on the line after the corpus_pairs parameter indicate that these settings are part of a list. In the examples above only one corpus pair is specified. Here is an example of a complete config.yml file, the one we used to train our German to English parent model. There are three corpus pairs, one for each of the training, validation, and test sets.
model: SILTransformerBaseAlignmentEnhanced
data:
  terms:
    dictionary: true
  corpus_pairs:
    - type: train
      src: de-WMT2020+Bibles
      trg: en-WMT2020+Bibles
    - type: val
      src: de-newstest2014_ende
      trg: en-newstest2014_ende
    - type: test
      src: de-newstest2017_ende
      trg: en-newstest2017_ende
  seed: 111
  share_vocab: false
  src_casing: lower
  src_vocab_size: 32000
  trg_casing: preserve
  trg_vocab_size: 32000
params:
  coverage_penalty: 0.1
  word_dropout: 0
train:
  keep_checkpoint_max: 5
  max_step: 1000000
  sample_buffer_size: 10000000
eval:
  steps: 10000
  export_on_best: bleu
  early_stopping: null
  export_format: checkpoint
  max_exports_to_keep: 100
The configuration file is read by both the preprocessing and the training parts of the silnlp pipeline. During preprocessing the source and target files are read, and an error is raised if any of them cannot be found. The SIL_NLP_DATA_PATH environment variable must be set to point to the root folder, and the source and target files must be in one of these two subfolders:
$SIL_NLP_DATA_PATH/MT/corpora
$SIL_NLP_DATA_PATH/MT/scripture
See this page for details about the required folder structure. Once the files have been read, SentencePiece will be run to tokenize the source and target data. If it is unable to create a vocabulary of the specified size for either the source or the target, it will raise a "Vocabulary size too high" error. Once SentencePiece has created the vocabularies (i.e. lists of tokens) to be used, they will be saved in files with these names:
src-onmt.vocab
src-sp.model
src-sp.vocab
trg-onmt.vocab
trg-sp.model
trg-sp.vocab
Then the files required for training and validation will be tokenized using those SentencePiece models and written to the experiment folder. These are named:
train.src.txt
train.trg.txt
val.src.txt
val.trg.txt
For the test set, different files are written because the detokenized version of the target file is required for calculating scores.
test.src.txt
test.trg.detok.txt
The seed is used to allocate the specified number of verses or lines to each of the validation and test set files. The remainder is placed in the training set. For Scripture files, the verse references used for each set are stored in these files:
train.vref.txt
val.vref.txt
test.vref.txt
This facilitates testing the same selection of verses in different experiments, producing test results that are not affected by the random selection of test verses. In that way we can be sure that differences in test results are not due to differences in the choice of verses for the test set.
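Each vref file lists the verse references included in the corresponding set, one per line. As a rough sketch (assuming the usual book chapter:verse reference format), the first few lines of test.vref.txt for a test set drawn from Exodus might look like this:
EXO 1:1
EXO 1:5
EXO 2:3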
The effective config file is created as soon as the training begins. A good way to learn about all the default parameters is to compare a simple config file like the one above to the effective config that it creates. Although there may be more than 100 parameters in the effective config file, they all have default values. Typically we have found very few cases where changing a default value gives better results. The defaults have been the subject of many experiments and are chosen by the OpenNMT project according to the results of the latest research.
The following are some parameters that can be useful to change when running experiments for the purpose of testing during development. This is mostly to reduce training time while still making sure each part of the process is run.
eval:
  eval_steps
  per_device_eval_batch_size
infer:
  infer_batch_size
params:
  warmup_steps
train:
  max_steps
  num_train_epochs
  per_device_train_batch_size
  save_steps
- save_steps determines how often a model checkpoint is saved during training. For example, if you wanted to quickly get a model to inference with, you could set both max_steps and save_steps to 100.
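A sketch of what those quick-run settings might look like in config.yml, using the parameter names from the list above and leaving every other section at its defaults:
train:
  max_steps: 100    # stop training after 100 steps
  save_steps: 100   # save a checkpoint at step 100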