Configure a model

Isaac Schifferer edited this page Feb 20, 2024 · 55 revisions

How to configure the training of a model.

The parameters for training a model are stored in the experiment folder in a file named 'config.yml'. The file uses the YAML format. Related settings are grouped together in sections.

Sections of a config file.

A config file can contain the following top-level sections:

data:
eval:
infer:
model:
params:
train:

It is not necessary to specify options for all of these sections for every training run. Only parameters that differ from their default values need to be specified. See Parameter Definitions for a full list of supported parameters and their definitions.

A minimal config.yml file looks like this:

data:
  corpus_pairs:
  - type: train,val,test
    src: src-text
    trg: trg-text
  share_vocab: false
  src_vocab_size: 24000
  trg_vocab_size: 32000

This minimal config file gives the system the following instructions. Train a model to translate between the src and trg languages. Split the texts into three parts: one for training, one for validation and one for testing. Use the default sizes for the validation and test sets and all of the remaining data for training. Create separate vocab files for the source and target languages. Instruct SentencePiece to create a source vocab of 24000 tokens and a target vocab of 32000 tokens. Use the defaults for all other settings, including the default model architecture and the default early stopping conditions.

More information about how to configure training can be found in the OpenNMT-tf documentation. Another way to learn how to configure training is by examining an effective config file.

Selection of books or chapters for training on Scripture data.

The parallel texts available for low-resource languages are translations of Scripture that are aligned by verse reference.

When the aligned Scripture files are used as a corpus pair, it is possible to select parts of the data for training and testing without having to split the text files prior to training. We have added a corpus_books config option for this purpose. There is also a similar option, test_books, to specify which books to include in the test set.

The example below shows the corpus_pairs section for restricting the entire model to only the data in the New Testament. The training, validation and test sets are all drawn only from that data.

  corpus_pairs:
  - type: train,test,val
    corpus_books: NT 
    src: src-bible
    trg: trg-bible
    val_size: 250
    test_size: 250

The following example shows how to specify a corpus pair that uses the New Testament, Genesis and Psalms for the training and validation sets. It also shows how to restrict the test set to verses from the book of Exodus.

  corpus_pairs:
  - type: train,val,test
    corpus_books: NT,GEN,PSA
    src: src-bible
    trg: trg-bible
    val_size: 250
    test_books: EXO
    test_size: 250
  seed: 111

In this example the book of Exodus is reserved for the test set and the remaining books of the Bible are available for training and validation. The test_books parameter excludes the books listed there from the training and validation sets. So even though only 250 verses of Exodus are used for the test set, none of the remaining verses are included in either the training or validation sets. The test_books parameter may therefore be used to restrict training to a smaller set of data without having to modify the data files.

No error is raised if you specify a test_size larger than the number of verses in the test_books. In that case all of the verses in the test_books will be used as the test set.

The following config reserves Exodus for the test set without setting corpus_books:

model: SILTransformerBase
data:
  corpus_pairs:
  - type: train,val,test
    src: src-bible
    trg: trg-bible
    val_size: 250
    test_books: EXO
    test_size: 250

Alternative syntax for corpus_books and test_books to use chapter specification, book ranges, and subtraction.

In addition to using comma-separated lists to specify the books used for training and testing, it is also possible to specify data at the chapter level, with book ranges, and with subtraction. To do this, use a semicolon-separated list, where each section has one of the following formats:

  • A comma-separated list of chapters and chapter ranges for a specific book, e.g. MAT1,2,6-10
  • A range of books, e.g. GEN-DEU
  • A single book or testament, e.g. MAT, OT
  • To subtract some data from the selection, use one of the above types preceded by -, e.g. -MAT1-4, -GEN-LEV. Sections are evaluated in the order that they appear, so make sure the selection being subtracted has already been added to the data set.

Examples:

GEN;EXO;LEV
OT;MAT-ROM;-ACT4-28
NT;-3JN

Using Multiple Sources

There are several ways to use more than one source in your experiment data. If you want to use different sources to get data from different parts of a text, you can define multiple corpus pairs. This is useful when a source has incomplete data, or when you want to use different sources for training vs evaluation and testing.

data:
  corpus_pairs:
  - type: train,val,test
    src: src-bible1
    trg: trg-bible
    corpus_books: GEN,EXO
    test_books: LEV
  - type: train,val,test
    src: src-bible2
    trg: trg-bible
    corpus_books: NUM,DEU
    test_books: JOS

If you instead want to use multiple sources but want to select data from the same portion of the texts, you can define a mixed-source corpus pair. This will equally and randomly choose verses from each text without overlap.

data:
  corpus_pairs:
  - mapping: mixed_src
    type: train,val,test
    src:
    - src-bible1
    - src-bible2
    trg: trg-bible
    corpus_books: GEN,EXO
    test_books: LEV
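
One way to picture "equally and randomly ... without overlap" is the Python sketch below. It is an assumption about the general idea, not the silnlp algorithm: mix_sources is a hypothetical helper that assigns each verse index to exactly one source, with the sources receiving the same number of verses (give or take one).

```python
import random


def mix_sources(num_verses, sources, seed=111):
    """Hypothetical helper: map each verse index to exactly one source name."""
    rng = random.Random(seed)
    order = list(range(num_verses))
    rng.shuffle(order)  # random choice of which verses go to which source
    assignment = {}
    for position, verse_index in enumerate(order):
        # Round-robin over the shuffled order keeps the shares equal,
        # and each verse index is assigned only once (no overlap).
        assignment[verse_index] = sources[position % len(sources)]
    return assignment
```

For ten verses and the two sources in the config above, each source would supply five verses, and no verse would be drawn from both.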

Additionally, the many_to_many mapping allows you to map multiple sources to multiple targets.

data:
  corpus_pairs:
  - mapping: many_to_many
    type: train,val,test
    src:
    - src-bible1
    - src-bible2
    trg:
    - trg-bible1
    - trg-bible2
    corpus_books: GEN,EXO
    test_books: LEV

A complete list of the possible abbreviations for the books of the Bible recognized by the code.

Abbreviations for Old Testament Books

GEN EXO LEV NUM DEU JOS JDG RUT 1SA 2SA 1KI 2KI 1CH 2CH EZR NEH EST JOB PSA PRO
ECC SNG ISA JER LAM EZK DAN HOS JOL AMO OBA JON MIC NAM HAB ZEP HAG ZEC MAL 

Abbreviations for New Testament Books

MAT MRK LUK JHN ACT ROM 1CO 2CO GAL EPH PHP COL 1TH 2TH 1TI 2TI TIT PHM HEB JAS 1PE 2PE 1JN 2JN 3JN JUD REV 

Abbreviations for Deuterocanonical Books

TOB JDT ESG WIS SIR BAR LJE S3Y SUS BEL 1MA 2MA 3MA 4MA 1ES 2ES MAN PS2 ODA PSS JSA JDB TBS SST DNT BLT 
3ES EZA 5EZ 6EZ INT CNC GLO TDX NDX DAG PS3 2BA LBA JUB ENO 1MQ 2MQ 3MQ REP 4BA LAO 

A note about the seed parameter.

The seed parameter is used as a seed for a random number generator. The benefit of setting this explicitly is that the same random selection of validation and test set verses is chosen from the available data. Setting the seed makes it possible to compare the effect of changing other parameters against an identical test set across training runs. If this is not set explicitly then the contents of the training, validation and test sets will vary between one training run and the next.
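
The effect of a fixed seed can be demonstrated with Python's standard random module. The sketch below is illustrative only: pick_val_and_test is a hypothetical function, and the actual selection procedure in silnlp may differ.

```python
import random


def pick_val_and_test(vrefs, val_size, test_size, seed):
    """Hypothetical helper: shuffle with a seeded generator, then slice."""
    rng = random.Random(seed)
    shuffled = list(vrefs)
    rng.shuffle(shuffled)
    test = shuffled[:test_size]
    val = shuffled[test_size:test_size + val_size]
    train = shuffled[test_size + val_size:]  # everything left over
    return train, val, test
```

Calling the function twice with the same seed returns identical splits, so two experiments that differ only in other parameters are scored on the same test verses.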

A note about YAML files.

YAML is designed to be easy to read. It is useful to know that there are various ways to specify a list in these config files. Inline lists are separated with commas, and square brackets are optional for a simple list. For a list that is too long for a single line, each item can be on a separate line preceded by a hyphen and a space.

These are three ways of indicating the same list:

    test_books: GEN,EXO,LEV,NUM,DEU

    test_books: [GEN,EXO,LEV,NUM,DEU]

    test_books:
    - GEN
    - EXO
    - LEV
    - NUM
    - DEU
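
A YAML parser reads the first form as a single string and the other two as sequences, so whatever loads the config has to accept both. The following is a minimal sketch of such a normalization step; as_list is a hypothetical helper written for illustration and is not taken from silnlp.

```python
def as_list(value):
    """Hypothetical helper: accept 'GEN,EXO' or ['GEN', 'EXO'] and return a list."""
    if isinstance(value, str):
        # Comma-separated string: split and drop surrounding whitespace.
        return [item.strip() for item in value.split(",") if item.strip()]
    # Already a sequence (bracketed or hyphenated form in the YAML file).
    return [str(item) for item in value]
```

All three spellings of test_books shown above would come out of this function as the same five-item list.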

The hyphen and space (- ) on the line after the corpus_pairs parameter indicate that these settings are part of a list. Most of the examples above specify only one corpus pair. Here is an example of a complete config.yml file, the one we used to train our German to English parent model. It has three corpus pairs, one for each of the training, validation and test sets.

model: SILTransformerBaseAlignmentEnhanced
data:
  terms:
    dictionary: true
  corpus_pairs:
  - type: train
    src: de-WMT2020+Bibles
    trg: en-WMT2020+Bibles
  - type: val
    src: de-newstest2014_ende
    trg: en-newstest2014_ende
  - type: test
    src: de-newstest2017_ende
    trg: en-newstest2017_ende
  seed: 111
  share_vocab: false
  src_casing: lower
  src_vocab_size: 32000
  trg_casing: preserve
  trg_vocab_size: 32000
params:
  coverage_penalty: 0.1
  word_dropout: 0
train:
  keep_checkpoint_max: 5
  max_step: 1000000
  sample_buffer_size: 10000000
eval:      
  steps: 10000
  export_on_best: bleu
  early_stopping: null 
  export_format: checkpoint
  max_exports_to_keep: 100

Preprocessing.

The configuration file is read by both the preprocessing and the training parts of the silnlp pipeline. During preprocessing the source and target files are read, and an error is reported if any of them cannot be found. The SIL_NLP_DATA_PATH environment variable must be set to point to the root folder, and the source and target files must be in one of these two subfolders:

$SIL_NLP_DATA_PATH/MT/corpora
$SIL_NLP_DATA_PATH/MT/scripture
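
The lookup described above can be approximated with the following Python sketch. It assumes nothing about silnlp beyond the two folder names: find_corpus_file is a hypothetical helper, the search order is an assumption, and the file name passed in is up to the caller.

```python
import os
from pathlib import Path

SUBFOLDERS = ("corpora", "scripture")


def find_corpus_file(filename, data_root=None):
    """Hypothetical helper: return the first match under <root>/MT/corpora or <root>/MT/scripture."""
    root = Path(data_root if data_root is not None else os.environ["SIL_NLP_DATA_PATH"])
    for subfolder in SUBFOLDERS:
        candidate = root / "MT" / subfolder / filename
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{filename} not found under {root / 'MT'}")
```

If SIL_NLP_DATA_PATH is unset and no data_root is given, the function fails with a KeyError, which mirrors the requirement that the variable must be set.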

See this page for details about the required folder structure. Once the files have been read, SentencePiece is run to tokenize the source and target data. If it is unable to create a vocabulary of the specified size for either of them, it raises a Vocabulary size too high error. Once SentencePiece has created the vocabularies (i.e. the lists of tokens) to be used, they are saved in files with these names:

src-onmt.vocab
src-sp.model
src-sp.vocab
trg-onmt.vocab
trg-sp.model
trg-sp.vocab

Then the files required for training and validation are tokenized using those SentencePiece models and written to the experiment folder. These are named:

train.src.txt
train.trg.txt
val.src.txt
val.trg.txt

For the test set different files are written because the detokenized version of the target file is required for calculating scores.

test.src.txt
test.trg.detok.txt

The seed is used to allocate the specified number of verses or lines to each of the validation and test set files. The remainder is placed in the training set. For Scripture files the verse references used for each set are stored in these files:

train.vref.txt
val.vref.txt
test.vref.txt

This makes it possible to test the same selection of verses in different experiments, producing test results that are not affected by the random selection of test verses. In that way we can be sure that differences in test results are not due to differences in the choice of verses for the test set.
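
A quick way to confirm that two experiments were scored on the same verses is to compare their test.vref.txt files. The sketch below assumes one verse reference per line, which is an assumption about the file layout; read_vrefs and same_test_verses are hypothetical helpers.

```python
from pathlib import Path


def read_vrefs(exp_dir, split="test"):
    """Hypothetical helper: read <exp_dir>/<split>.vref.txt, assuming one reference per line."""
    path = Path(exp_dir) / f"{split}.vref.txt"
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def same_test_verses(exp_a, exp_b):
    """True when both experiment folders list identical test references in the same order."""
    return read_vrefs(exp_a) == read_vrefs(exp_b)
```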

The Effective Config file.

The effective config file is created as soon as training begins. A good way to learn about all the default parameters is to compare a simple config file like this one to the effective config that it creates. Although there may be more than 100 parameters in the effective config file, they all have default values. We have typically found very few areas where we can get better results by changing a default value. The defaults have been the subject of many experiments and are chosen by the OpenNMT project according to the results of the latest research.

Parameter Definitions

Definitions of every configurable experiment parameter and their default values. Information about Hugging Face parameters can be found here. Selected HF parameters are defined below for convenience, and default values are only given if they are explicitly defined in silnlp.

Data

  • add_new_lang_code=True: Add any language codes in language_codes to the tokenizer if they do not already exist.
  • aligner="fast_align": Aligner to use.
  • corpus_pairs:
    • augment=[]: List of data augmentation methods and their arguments to apply to the data. See example below.
      augment:
      - subword:
        - encodings: 2
      
    • corpus_books=[]: Books to be included in the dataset. See Selection of books or chapters for training on Scripture data.
    • disjoint_test=False: Use the same test set across all corpus pairs to ensure no overlap between any train and test sets.
    • disjoint_val=False: Use the same evaluation set across all corpus pairs to ensure no overlap between any train and evaluation sets.
    • lexical=False: Whether data is made up of lexical items rather than sentences.
    • mapping="one_to_one": How to map sources to targets. Options are one_to_one, mixed_src, or many_to_many. See Using Multiple Sources.
    • score_threshold=0.0: If <1, it is the minimum alignment score sentence pairs must have to be included in the training data. If >=1, that number of training sentence pairs with the lowest alignment scores will be filtered out of the training data.
    • size=1.0: Size of training split. If size is a float between 0 and 1, it will be interpreted as a ratio of the total size, otherwise if it is >1 or an integer, it will be interpreted as an absolute size.
    • src: Required argument. List of sources.
    • src_noise=[]: List of noise-adding methods and their arguments to apply to source sentences. See example below.
      src_noise:
      - dropout: .1
      - replacement: [.1, <blank>]
      - permutation: 2
      
    • tags=[]: Tags to prefix to each source sentence.
    • test_books=[]: Books to be included in the test set. See Selection of books or chapters for training on Scripture data.
    • test_size=250: Size of test split. If test_size is a float between 0 and 1, it will be interpreted as a ratio of the total size, otherwise if it is >1 or an integer, it will be interpreted as an absolute size.
    • trg: Required argument. List of targets.
    • type="train,test,val": What the data in the corpus pair will be used for. Possible values are any combination of train, test, and val.
    • use_test_set_from="": Use the set of verses in the given experiment's test set for this experiment.
    • val_size=250: Size of evaluation split. If val_size is a float between 0 and 1, it will be interpreted as a ratio of the total size, otherwise if it is >1 or an integer, it will be interpreted as an absolute size.
  • lang_codes: Mapping of ISO language codes to their NLLB equivalents for each language included in the data. See example below.
    lang_codes:
      en: eng_Latn
      npi: npi_Deva
    
  • mirror=False: Add mirrored data to the dataset, where the source and target are flipped.
  • seed=111: Seed for random verse selection. See A note about the seed parameter.
  • share_vocab=False: Use the same vocab file for the source and target languages.
  • stats_max_size=100000: Maximum number of sentence pairs allowed for a stats file to be generated.
  • terms:
    • categories="PN": Which categories of key terms to include.
    • dictionary=False: Write dictionary with key terms.
    • include_glosses=True: Include glosses of key terms.
    • train=True: Train on key terms data.
  • tokenize=True: Tokenize data.
  • tokenizer:
    • update_src=False: Update the tokenizer for the source language.
    • update_trg=False: Update the tokenizer for the target language.
    • trained_tokens=False: If True, train a new tokenizer on the source and/or target (specified by the update_src and update_trg parameters) to obtain trained tokens tailored to the source and/or target. All of the resulting tokens that are not present in the existing tokenizer are then added to the existing tokenizer. If False, only individual characters that are present in the source and/or target text and not present in the existing tokenizer will be added to the existing tokenizer, rather than trained tokens.
    • src_vocab_size: Only applicable if update_src and trained_tokens are True. This sets the vocab size for the new tokenizer for the source side. There is no default value, so it must be explicitly specified when update_src and trained_tokens are True.
    • trg_vocab_size: Only applicable if update_trg and trained_tokens are True. This sets the vocab size for the new tokenizer for the target side. There is no default value, so it must be explicitly specified when update_trg and trained_tokens are True.
    • share_vocab=False: Only applicable if update_src, update_trg, and trained_tokens are True. Rather than create new tokenizers for the source and target separately, use a single new tokenizer for both the source and target combined with a vocab size of src_vocab_size + trg_vocab_size.
    • init_unk=False: Initialize new token embeddings using the embedding for the unk token rather than using the model's default initialization behavior.
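
The rule shared by size, test_size and val_size in the corpus_pairs list above (a float up to 1 is a ratio, anything else is an absolute count) can be sketched as follows. resolve_size is a hypothetical helper; treating 1.0 as "all of the data", truncating fractional results, and clamping to the amount of available data are assumptions made for this illustration.

```python
def resolve_size(size, total):
    """Hypothetical helper: turn a configured size into a number of sentence pairs."""
    if isinstance(size, float) and 0 < size <= 1:
        # Ratio of the total, e.g. 0.25 of 1000 pairs is 250 pairs.
        return int(total * size)
    # Absolute count (an integer, or any value above 1), never more than is available.
    return min(int(size), total)
```

Under these assumptions, val_size: 250 and val_size: 0.25 select the same number of verses from a 1000-verse corpus, while the integer 1 means a single verse rather than the whole corpus.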

Eval

HF Arguments: eval_accumulation_steps, eval_delay, eval_steps=1000, evaluation_strategy="steps", greater_is_better, include_inputs_for_metrics, load_best_model_at_end=True, metric_for_best_model="bleu", per_device_eval_batch_size=16, predict_with_generate=True

  • eval_steps=1000: Number of update steps between two evaluations if evaluation_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.

Other Arguments:

  • detokenize=True: Detokenize verses before computing metrics during evaluation/testing.
  • early_stopping:
    • min_improvement=0.2: How much the metric_for_best_model metric must improve for training to continue.
    • steps=4: The number of times in a row that an evaluation can improve by less than min_improvement before training is stopped.
  • multi_ref_eval=False: Evaluate outputs against multiple targets.
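
The interaction between min_improvement and steps under early_stopping can be sketched as a small loop. This is a simplified illustration and not the callback that silnlp or Hugging Face actually runs: should_stop is a hypothetical function, a higher score is assumed to be better, and improvement is assumed to be measured against the best score seen so far.

```python
def should_stop(scores, min_improvement=0.2, steps=4):
    """Hypothetical helper: scores are metric values from successive evaluations."""
    best = None
    stalled = 0  # consecutive evaluations without enough improvement
    for score in scores:
        if best is None or score - best >= min_improvement:
            best = score
            stalled = 0
        else:
            stalled += 1
            if stalled >= steps:
                return True
    return False
```

With the defaults, four evaluations in a row that each gain less than 0.2 over the best score end training, while a single sufficiently large gain resets the count.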

Infer

  • infer_batch_size=16: Batch size for inference.
  • num_beams=2: Number of beams for beam search during translation.

Model

model: Required argument. Name of base model to be used. Defined at the top level of the config, i.e. at the same level as data, eval, etc.

Params

HF Arguments: adafactor, adam_beta1, adam_beta2, adam_epsilon, full_determinism, generation_max_length, generation_num_beams, label_smoothing_factor=0.2, learning_rate, lr_scheduler_type, max_grad_norm, optim="adamw_torch", warmup_ratio, warmup_steps=4000, weight_decay

Other Arguments:

  • activation_dropout=0.0: Dropout rate for activation layers.
  • attention_dropout=0.1: Dropout rate for attention layers.
  • dropout=0.1: Dropout rate for all other layers.

Train

HF Arguments: gradient_accumulation_steps=4, gradient_checkpointing=True, group_by_length=True, log_level="info", logging_dir, logging_first_step, logging_nan_inf_filter, logging_steps, logging_strategy, max_steps=100000, num_train_epochs, output_dir=str(exp_dir / "run"), per_device_train_batch_size=16, save_on_each_node, save_steps=1000, save_strategy="steps", save_total_limit=2

  • gradient_accumulation_steps=4: Number of update steps to accumulate the gradients for before performing a backward/update pass.
  • gradient_checkpointing=True: Use gradient checkpointing to save memory at the expense of slower backward pass.
  • logging_steps=500: Number of update steps between two logs if logging_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
  • max_steps=100000: The total number of training steps to perform. For a finite dataset, training is reiterated through the dataset (if all data is exhausted) until max_steps is reached. Overrides num_train_epochs. Set to -1 to instead use num_train_epochs.
  • num_train_epochs=3.0: Total number of training epochs to perform. If the value is not an integer, the fractional part is the fraction of the last epoch that is run before training stops.
  • per_device_train_batch_size=16: The batch size per GPU core/CPU for training.
  • save_steps=1000: Number of update steps between two checkpoint saves if save_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
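
As a rough illustration of how these training values combine, the Python sketch below computes the number of examples that contribute to one optimizer update and resolves a fractional step setting. Both functions are hypothetical helpers, num_devices is a parameter introduced for the example, and rounding the ratio up is an assumption.

```python
import math


def effective_batch_size(per_device_train_batch_size=16,
                         gradient_accumulation_steps=4,
                         num_devices=1):
    """Examples that contribute to a single optimizer update."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


def resolve_steps(value, max_steps):
    """Interpret logging_steps / save_steps style values: below 1 is a ratio of max_steps."""
    if value < 1:
        return math.ceil(max_steps * value)  # rounding up is an assumption
    return int(value)
```

With the defaults listed above and a single device, each update is based on 16 × 4 = 64 examples, and save_steps=1000 with max_steps=100000 produces 100 save points, of which save_total_limit=2 are kept.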

Other Arguments:

  • delete_checkpoint_optimizer_state=True: Delete optimizer state from every saved checkpoint after training.
  • delete_checkpoint_tokenizer=True: Delete tokenizer from every saved checkpoint after training.
  • lora_config: Optional configuration for LoRA. See Common LoRA Parameters in PEFT.
    • alpha=32: Value for lora_alpha. "The alpha parameter for Lora scaling."
    • dropout=0.1: Value for lora_dropout. "The dropout probability for Lora layers."
    • modules_to_save: "List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint." Default value depends on the model being trained, but it normally only includes "embed_tokens".
    • r=4: "Lora attention dimension."
    • target_modules: "The names of the modules to apply Lora to." Default value depends on the model being trained, but it normally includes all linear layers.
  • max_source_length=200: Maximum length of a source segment. Segments longer than this value are truncated.
  • max_target_length=200: Maximum length of a target segment. Segments longer than this value are truncated.
  • use_lora=False: Train model using LoRA through the peft library. See here for more information.

Helpful Parameters for Development

The following are some parameters that can be useful to change when running experiments for the purpose of testing during development. This is mostly to reduce training time while still making sure each part of the process is run.

eval: 
  eval_steps
  per_device_eval_batch_size
infer: 
  infer_batch_size
params:
  warmup_steps
train:
  max_steps
  num_train_epochs
  per_device_train_batch_size
  save_steps
  • save_steps determines how often a model checkpoint is saved during training. For example, if you want to quickly get a model to run inference with, you could set both max_steps and save_steps to 100.

How to configure translation requests for a model.

Using the --translate option when running an experiment allows drafts to be created immediately following the training of a model. The configuration for each translation request must be specified in translate_config.yml in the experiment folder. The behavior of this process is identical to using the translate.py script, so the possible arguments for a configuration match the command line options of the script (with the exception of the memory_growth, eager_execution, clearml_queue, and debug options). The format of translate_config.yml is a list of dictionaries, where each dictionary represents a translation request. See the example below, as well as the translate.py usage documentation for descriptions of the arguments.

translate:
- books: 1JN
- src_project: NASB
  trg_project: NNRV
  books: 1JN1-2;2JN
  • In this example, the first request will translate 1 John from the experiment's source project to the target language. The second request will translate the specified chapters in the NASB to the target language, filling in incomplete books with text from the NNRV.