
Training Configuration Files

Katharina Stein edited this page Feb 3, 2023 · 1 revision

Run Training

python training.py --config [path_to_config_file]

where the config file is a .json file specifying the training configuration, following the format of the files in the training_configs folder.
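The following sketch illustrates how such a config file can be read and split into its two sections; the function name `load_config` is illustrative and not taken from the actual training.py:

```python
import json

def load_config(path):
    """Read a JSON training config and return its two top-level sections,
    "gen_args" and "train_args" (see the example file below)."""
    with open(path) as f:
        config = json.load(f)
    return config["gen_args"], config["train_args"]
```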

Example configuration file:

{
  "gen_args": {
    "model_name_or_path": "t5-base",
    "tokenizer_name_or_path": "t5-base",
    "corpus_dir": "./data/ara1_amrs",
    "train_path": "train",
    "valid_path": "val",
    "max_in_len": 0,
    "max_out_len": 1024,
    "context_len": 1,
    "linearization": "penman",
    "sep_token": "<GRAPH>",
    "dropout_rate": 0.1
  },
  "train_args": {
    "output_dir": "./models/train_t5_ara1_amr/t5_ara1_amr",
    "do_train": true,
    "do_eval": true,
    "predict_with_generate": true,
    "generation_max_length": 1024,
    "evaluation_strategy": "steps",
    "eval_steps": 46,
    "overwrite_output_dir": false,
    "num_train_epochs": 200,
    "save_strategy": "steps",
    "save_steps": 46,
    "save_total_limit": 2,
    "per_device_train_batch_size": 6,
    "per_device_eval_batch_size": 24,
    "gradient_accumulation_steps": 4,
    "learning_rate": 1e-4,
    "seed": 42,
    "log_level": "info",
    "logging_strategy": "steps",
    "logging_steps": 46,
    "remove_unused_columns": false,
    "no_cuda": false,
    "load_best_model_at_end": true,
    "metric_for_best_model": "eval_bleu"
  }
}

The key-value pairs under "gen_args" specify the following information:

  • "model_name_or_path": path to the folder of a local model, or the name of a Hugging Face model of type T5ForConditionalGeneration
  • "tokenizer_name_or_path": path to the folder containing a trained tokenizer, or the name of a Hugging Face tokenizer of type T5Tokenizer; optional, if not provided "model_name_or_path" will also be used for loading the tokenizer
  • "corpus_dir": path to corpus directory, relative to training.py
  • "train_path": path to the file with the complete training data or to a folder with several files for training; path is relative to "corpus_dir"
  • "valid_path": path to the file with the complete validation data or to a folder with several files for validation; path is relative to "corpus_dir"
  • "max_in_len": maximum input length; tokenizer will truncate longer input sequences
  • "max_out_len": maximum output length; tokenizer will truncate longer target sequences
  • "context_len": number of previous sentences of the same document to prepend to the current input graph
  • "linearization": type of linearization to use for the AMR graph
  • "sep_token": the special token that should be added between the current graph and the previous context; will be added as special token to the vocab of the tokenizer
  • "dropout_rate": the dropout rate to use; optional, defaults to 0.1
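How "context_len" and "sep_token" interact can be sketched as follows; `build_input` is an illustrative helper, not the actual function in training.py:

```python
def build_input(graph, previous_sents, context_len, sep_token="<GRAPH>"):
    """Prepend up to `context_len` previous sentences of the same
    document to the linearized graph, separated by the sep token."""
    if context_len > 0 and previous_sents:
        context = " ".join(previous_sents[-context_len:])
        return f"{context} {sep_token} {graph}"
    return graph
```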

"train_path" / "valid_path"
If "train_path" / "valid_path" is a directory, then each file in the directory is treated as one document if context_len > 0. If "train_path" / "valid_path" is a file, then that file is treated as one single document if context_len > 0.
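A minimal sketch of this file-vs-directory behavior (the helper name `document_paths` is hypothetical):

```python
import os

def document_paths(data_path):
    """Return one path per document: each file in a directory counts as
    one document; a single file is one document on its own."""
    if os.path.isdir(data_path):
        return sorted(
            os.path.join(data_path, name) for name in os.listdir(data_path)
        )
    return [data_path]
```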

train_args
The "train_args" dictionary will be converted into a Seq2SeqTrainingArguments object and passed to the transformers Seq2SeqTrainer. See the Seq2SeqTrainingArguments documentation for information about possible parameters and default values.

linearization
Currently implemented are two options:

  • 'penman': does not make any changes to the input format, i.e. the graph keeps the same PENMAN string representation as in the input files (without the metadata)
  • 'penman_wo_alignments': removes the node-to-token alignments from the AMR string (i.e. removes all '~e.X' occurrences, where X is the aligned token ID)
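Removing the alignment markers can be done with a small regular expression; this is an illustrative sketch that also handles comma-separated multi-token alignments like '~e.5,7':

```python
import re

def remove_alignments(amr_string):
    """Strip node-to-token alignment markers such as '~e.5' or '~e.5,7'
    from a PENMAN-serialized AMR graph."""
    return re.sub(r"~e\.\d+(?:,\d+)*", "", amr_string)
```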

"max_in_len"/"max_out_len"
If the input / output sequences should not be limited or truncated, set the corresponding value to 0. Sequences that would get truncated are removed from the training data set instead, i.e. if an input or output sequence exceeds the maximum length, that input/output pair is removed from the data set. Note: not restricting the output length during generation will negatively impact the generated texts. If "do_eval" is set to true, "max_out_len" should therefore be != 0.
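The filtering behavior can be sketched as follows; in the actual training script the Hugging Face tokenizer determines the lengths, while here any callable returning a token list (e.g. `str.split`) stands in, and the function name is hypothetical:

```python
def filter_by_length(pairs, tokenize, max_in_len, max_out_len):
    """Drop any (input, target) pair whose tokenized length exceeds the
    configured maximum; a limit of 0 disables the check for that side."""
    kept = []
    for source, target in pairs:
        if max_in_len and len(tokenize(source)) > max_in_len:
            continue
        if max_out_len and len(tokenize(target)) > max_out_len:
            continue
        kept.append((source, target))
    return kept
```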

Important: do not change "remove_unused_columns" to true, or the data preparation functions will no longer work (see here for more information)
