
Configure a model

Isaac Schifferer edited this page Jan 4, 2024 · 55 revisions

How to configure the training of a model.

The parameters for training a model are stored in the experiment folder in a file named 'config.yml'. The file uses the YAML format. Related settings are grouped together in sections.

Sections of a config file.

A config file can contain these sections:

data:
eval:
infer:
model:
params:
score:
train:

It is not necessary to specify all of these sections for every training run; only parameters that differ from their default values need to be specified.

A minimal config.yml file looks like this:

data:
  corpus_pairs:
  - type: train,val,test
    src: src-text
    trg: trg-text
  share_vocab: false
  src_vocab_size: 24000
  trg_vocab_size: 32000

This minimal config file provides these instructions to the system:

  • Train a model to translate between the src and trg languages.
  • Split the texts into three parts: one for training, one for validation, and one for testing. Use the default sizes for the validation and test sets and all of the remaining data for training.
  • Create separate vocab files for the source and target languages. Instruct SentencePiece to create a source vocab of 24,000 tokens and a target vocab of 32,000 tokens.
  • Use the defaults for all other settings, including the default model architecture and the default early stopping conditions.

More information about how to configure training can be found in the OpenNMT-tf documentation. Another way to learn how to configure training is to examine an effective config file.

Selection of books or chapters for training on Scripture data.

The parallel text available for low-resource languages consists of translations of Scripture that are aligned by verse reference.

When the aligned Scripture files are used as a corpus pair, it is possible to select parts of the data for training and testing without having to split the text files prior to training. The corpus_books config option selects which books to use; the similar test_books option specifies which books to include in the test set.

The example below shows a corpus_pairs section that restricts the experiment to data from the New Testament. The training, validation, and test sets are all drawn only from that data.

  corpus_pairs:
  - type: train,test,val
    corpus_books: NT 
    src: src-bible
    trg: trg-bible
    val_size: 250
    test_size: 250

The following example shows how to specify a corpus_pairs section that uses the New Testament, Genesis, and Psalms for the training and validation sets, and restricts the test set to verses from the book of Exodus.

  corpus_pairs:
  - type: train,val,test
    corpus_books: NT,GEN,PSA
    src: src-bible
    trg: trg-bible
    val_size: 250
    test_books: EXO
    test_size: 250
  seed: 111

In this example the book of Exodus is reserved for the test set, and the remaining books of the Bible are available for training and validation. The test_books parameter excludes the listed books from the training and validation sets. So even though only 250 verses of Exodus are used for the test set, none of the remaining Exodus verses are included in either the training or validation sets. The test_books parameter can therefore be used to restrict training to a smaller set of data without having to modify the data files.

No error is raised if you specify a test_size larger than the number of verses in the test_books. In that case, all of the verses in the test_books will be used as the test set. For example, with the following config the test set consists of up to 250 verses drawn from Exodus:

model: SILTransformerBase
data:
  corpus_pairs:
  - type: train,val,test
    src: src-bible
    trg: trg-bible
    val_size: 250
    test_books: EXO
    test_size: 250
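The withholding behaviour described above can be sketched in a few lines. This is a hypothetical illustration, not silnlp's implementation; the function and variable names are invented:

```python
# Hypothetical sketch of how test_books withholds whole books: only
# test_size verses are scored, but the rest of each listed book is still
# kept out of the training and validation pools. Names are illustrative.
def partition(verses, test_books, test_size):
    test_pool = [v for v in verses if v.split()[0] in test_books]
    train_val_pool = [v for v in verses if v.split()[0] not in test_books]
    test = test_pool[:test_size]  # remaining test_pool verses are discarded
    return train_val_pool, test

verses = [f"EXO 1:{i}" for i in range(1, 11)] + [f"GEN 1:{i}" for i in range(1, 11)]
train_val_pool, test = partition(verses, {"EXO"}, test_size=3)
assert len(test) == 3
assert not any(v.startswith("EXO") for v in train_val_pool)
```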

Alternative syntax for corpus_books and test_books: chapter specifications, book ranges, and subtraction.

In addition to using comma-separated lists to specify the books used for training and testing, it is also possible to specify data at the chapter level, with book ranges, and with subtraction. To do this, use a semicolon-separated list, where each section has one of the following formats:

  • A comma-separated list of chapters and chapter ranges for a specific book, e.g. MAT1,2,6-10
  • A range of books, e.g. GEN-DEU
  • A single book or testament, e.g. MAT, OT
  • To subtract some data from the selection, use one of the above types preceded by -, e.g. -MAT1-4, -GEN-LEV. Sections are evaluated in the order that they appear, so make sure the selection being subtracted has already been added to the data set.

Examples:

GEN;EXO;LEV
OT;MAT-ROM;-ACT4-28
NT;-3JN
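The selection rules above can be sketched with a small parser. This is a hypothetical illustration only: silnlp's real parser also handles chapter numbers and chapter ranges, which are omitted here, and the book list is truncated for brevity:

```python
# Simplified sketch of evaluating a semicolon-separated book selection.
# Sections are applied in order; a leading "-" subtracts books.
NT_BOOKS = ["MAT", "MRK", "LUK", "JHN", "ACT", "ROM"]  # truncated for brevity

def expand(section, groups):
    """Expand one section (testament alias, book range, or single book)."""
    if section in groups:                      # e.g. "NT"
        return list(groups[section])
    if "-" in section:                         # e.g. "MAT-LUK"
        order = groups["ALL"]
        start, end = section.split("-")
        return order[order.index(start):order.index(end) + 1]
    return [section]                           # e.g. "GEN"

def select_books(spec, groups):
    selected = []
    for section in spec.split(";"):
        if section.startswith("-"):            # subtraction section
            for book in expand(section[1:], groups):
                if book in selected:
                    selected.remove(book)
        else:
            for book in expand(section, groups):
                if book not in selected:
                    selected.append(book)
    return selected

groups = {"NT": NT_BOOKS, "ALL": NT_BOOKS}
print(select_books("NT;-ACT", groups))  # ['MAT', 'MRK', 'LUK', 'JHN', 'ROM']
```

Because sections are evaluated left to right, "-ACT" only has an effect after "NT" has added the New Testament books to the selection.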

Using Multiple Sources

There are several ways to use more than one source in your experiment data. If you want to use different sources to get data from different parts of a text, you can define multiple corpus pairs. This is useful when a source has incomplete data, or when you want to use different sources for training versus evaluation and testing.

data:
  corpus_pairs:
  - type: train,val,test
    src: src-bible1
    trg: trg-bible
    corpus_books: GEN,EXO
    test_books: LEV
  - type: train,val,test
    src: src-bible2
    trg: trg-bible
    corpus_books: NUM,DEU
    test_books: JOS

If you instead want to use multiple sources but want to select data from the same portion of the texts, you can define a mixed-source corpus pair. This will equally and randomly choose verses from each text without overlap.

data:
  corpus_pairs:
  - mapping: mixed_src
    type: train,val,test
    src:
    - src-bible1
    - src-bible2
    trg: trg-bible
    corpus_books: GEN,EXO
    test_books: LEV
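The mixed-source idea can be sketched as follows, assuming the sources are aligned verse by verse. This is an illustration of the concept only, not silnlp's implementation:

```python
import random

# For each verse position, draw the verse from exactly one randomly chosen
# source, so the sources are sampled roughly equally and without overlap.
def mix_sources(sources, seed=111):
    rng = random.Random(seed)
    n = len(sources[0])
    return [sources[rng.randrange(len(sources))][i] for i in range(n)]

bible1 = [f"bible1 verse {i}" for i in range(6)]
bible2 = [f"bible2 verse {i}" for i in range(6)]
mixed = mix_sources([bible1, bible2])
assert all(mixed[i] in (bible1[i], bible2[i]) for i in range(6))
```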

Additionally, the many_to_many mapping allows you to map multiple sources to multiple targets.

data:
  corpus_pairs:
  - mapping: many_to_many
    type: train,val,test
    src:
    - src-bible1
    - src-bible2
    trg:
    - trg-bible1
    - trg-bible2
    corpus_books: GEN,EXO
    test_books: LEV

A complete list of the possible abbreviations for the books of the Bible recognized by the code.

Abbreviations for Old Testament Books

GEN EXO LEV NUM DEU JOS JDG RUT 1SA 2SA 1KI 2KI 1CH 2CH EZR NEH EST JOB PSA PRO
ECC SNG ISA JER LAM EZK DAN HOS JOL AMO OBA JON MIC NAM HAB ZEP HAG ZEC MAL 

Abbreviations for New Testament Books

MAT MRK LUK JHN ACT ROM 1CO 2CO GAL EPH PHP COL 1TH 2TH 1TI 2TI TIT PHM HEB JAS 1PE 2PE 1JN 2JN 3JN JUD REV 

Abbreviations for Deuterocanonical Books

TOB JDT ESG WIS SIR BAR LJE S3Y SUS BEL 1MA 2MA 3MA 4MA 1ES 2ES MAN PS2 ODA PSS JSA JDB TBS SST DNT BLT 
3ES EZA 5EZ 6EZ INT CNC GLO TDX NDX DAG PS3 2BA LBA JUB ENO 1MQ 2MQ 3MQ REP 4BA LAO 

A note about the seed parameter.

The seed parameter is used as the seed for a random number generator. The benefit of setting it explicitly is that the same random selection of validation and test set verses is chosen from the available data on every run. This makes it possible to compare the effect of changing other parameters against an identical test set. If the seed is not set explicitly, the contents of the training, validation, and test sets will vary between one training run and the next.
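The effect can be illustrated with a toy split function (a sketch under assumed behaviour; silnlp's actual splitting logic may differ):

```python
import random

def split_verses(verse_ids, val_size, test_size, seed):
    """Shuffle deterministically, then carve off the val and test sets."""
    rng = random.Random(seed)
    shuffled = list(verse_ids)
    rng.shuffle(shuffled)
    test = shuffled[:test_size]
    val = shuffled[test_size:test_size + val_size]
    train = shuffled[test_size + val_size:]
    return train, val, test

verses = [f"GEN 1:{i}" for i in range(1, 32)]
run1 = split_verses(verses, val_size=5, test_size=5, seed=111)
run2 = split_verses(verses, val_size=5, test_size=5, seed=111)
assert run1 == run2  # the same seed reproduces the same split
```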

A note about YAML files.

YAML is designed to be easy to read. It is useful to know that there are various ways to specify a list. Inline lists are separated with commas, and square brackets are optional for a simple list. For a list that is too long for a single line, each item can be placed on a separate line preceded by a hyphen and a space.

These are three ways of indicating the same list:

    test_books: GEN,EXO,LEV,NUM,DEU

    test_books: [GEN,EXO,LEV,NUM,DEU]

    test_books:
    - GEN
    - EXO
    - LEV
    - NUM
    - DEU

The hyphen and space (- ) on the line after the corpus_pairs parameter indicate that these settings are part of a list. In the examples above only one corpus pair is specified. Here is an example of a complete config.yml file, the one we used to train our German-to-English parent model. There are three corpus pairs, one for each of the training, validation, and test sets.

model: SILTransformerBaseAlignmentEnhanced
data:
  terms:
    dictionary: true
  corpus_pairs:
  - type: train
    src: de-WMT2020+Bibles
    trg: en-WMT2020+Bibles
  - type: val
    src: de-newstest2014_ende
    trg: en-newstest2014_ende
  - type: test
    src: de-newstest2017_ende
    trg: en-newstest2017_ende
  seed: 111
  share_vocab: false
  src_casing: lower
  src_vocab_size: 32000
  trg_casing: preserve
  trg_vocab_size: 32000
params:
  coverage_penalty: 0.1
  word_dropout: 0
train:
  keep_checkpoint_max: 5
  max_step: 1000000
  sample_buffer_size: 10000000
eval:      
  steps: 10000
  export_on_best: bleu
  early_stopping: null 
  export_format: checkpoint
  max_exports_to_keep: 100

Preprocessing.

The configuration file is read by both the preprocessing and training parts of the silnlp pipeline. During preprocessing the source and target files are read, and an error is raised if any of them cannot be found. The SIL_NLP_DATA_PATH environment variable must be set to point to the root folder, and the source and target files must be in one of these two subfolders:

$SIL_NLP_DATA_PATH/MT/corpora
$SIL_NLP_DATA_PATH/MT/scripture

See this page for details about the required folder structure. Once the files have been read, SentencePiece is run to tokenize the source and target data. If it is unable to create a vocabulary of the specified size for either the source or the target, it will raise a "Vocabulary size too high" error. Once SentencePiece has created the vocabularies (i.e. lists of tokens), they are saved in files with these names:

src-onmt.vocab
src-sp.model
src-sp.vocab
trg-onmt.vocab
trg-sp.model
trg-sp.vocab

Then the files required for training and validation are tokenized using those SentencePiece models, and the results are written to the experiment folder. These are named:

train.src.txt
train.trg.txt
val.src.txt
val.trg.txt

For the test set, different files are written because the detokenized version of the target file is required for calculating scores:

test.src.txt
test.trg.detok.txt

The seed is used to allocate the specified number of verses or lines to each of the validation and test set files; the remainder are placed in the training set. For Scripture files, the verse references used for each set are stored in these files:

train.vref.txt
val.vref.txt
test.vref.txt

This facilitates testing the same selection of verses in different experiments, producing test results that are not affected by the random selection of test verses. In that way we can be sure that differences in test results are not due to differences in the choice of verses for the test set.

The Effective Config file.

The effective config file is created as soon as training begins. A good way to learn about all the default parameters is to compare a simple config file like this one to the effective config that it creates. Although there may be more than 100 parameters in the effective config file, they all have default values. Typically we have found very few cases where we can get better results by changing a default value; the defaults have been the subject of many experiments and are chosen by the OpenNMT project according to the results of the latest research.

Helpful Parameters for Development

The following are some parameters that can be useful to change when running experiments for the purpose of testing during development. This is mostly to reduce training time while still making sure each part of the process is run.

eval: 
  eval_steps
  per_device_eval_batch_size
infer: 
  infer_batch_size
params:
  warmup_steps
train:
  max_steps
  num_train_epochs
  per_device_train_batch_size
  save_steps
  • save_steps determines how often a model checkpoint is saved during training. For example, if you wanted to quickly get a model to inference with, you could set both max_steps and save_steps to 100.
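For example, a quick smoke-test configuration might shrink these values so that training, checkpointing, and evaluation all complete within minutes (the numbers below are illustrative, not recommendations):

```yaml
train:
  max_steps: 100
  save_steps: 100
  per_device_train_batch_size: 8
eval:
  eval_steps: 100
```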