Recent years have seen the development of statistical approaches employing sequence-to-sequence neural models for Programming-by-Example (PbE) systems that capture user intent expressed through input-output examples adhering to pre-defined specifications. Although these approaches require less domain-specific knowledge and no hand-engineered rules, they suffer from three main issues: (i) Program Aliasing: multiple semantically correct programs are penalized during training against the reference ground-truth program; (ii) Objective Mismatch: during training, model performance is evaluated against the reference ground-truth program, while during inference, the transformed output string is evaluated for correctness; and (iii) Exposure Bias: during training, the decoder is fed ground-truth tokens, while during inference it is fed tokens from the learned model distribution. To address these issues, we propose to use policy gradients in a Reinforcement Learning (RL) framework, initialized with cross-entropy loss so that the model starts with a meaningful policy instead of a random one. We discuss the problem formulation and the RL network architecture, and show significant improvements in generalisability both for unseen function sequences and for unseen arguments in string manipulation tasks.
The following is a guide to using the Reinforcement Learning framework to improve Deep Learning based statistical models for program generation.
[SyntheticD]: (Will add) [RealD]: (Will add)
Stored at [SyntheticD]: It contains synthetically generated datasets to train PbE systems for string transformation. The dataset is generated as per the procedure described in the paper, i.e. for each transformation program we sample 10 input-output example pairs, of which m are used for generating the transformation program and the remaining 10-m are used to evaluate it (a short sketch of this split is given after the folder hierarchy below). Here is the folder hierarchy and its interpretation:
- `train_500k.json`: Contains 500k samples (program, input examples and corresponding outputs) for model training.
- `val_50k.json`: Contains 50k samples for model evaluation.
- `Val_50k_curated`: Contains splits of `val_50k.json` where we capture the overlap of the transformation programs seen by the model during training:
  - `onlyFunctionNames`: Overlap/Partial/Non-overlap splits of `val_50k.json` where we match only the function names of the transformation programs with those from `train_500k.json`.
  - `nonModified`: Overlap/Partial/Non-overlap splits of `val_50k.json` where we match both the function names and their arguments of the transformation programs with those from `train_500k.json`.
  - `Modified`: Ignore.
- `Test_Curated`: A subset of `Val_50k_curated` which contains only three splits, used to evaluate the model according to the following criteria:
  - transformation programs seen by the model during training: `nonModified_overlap.json`
  - function sequences seen but arguments unseen: `onlyFunctionNames_overlap.json`
  - function sequences unseen: `onlyFunctionNames_nonOverlap.json`
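For illustration, here is a minimal sketch of how one synthetic sample could be split into m observed pairs and 10-m held-out pairs. The `input`/`output` field names follow the test sample shown later in this guide; the assumption that `train_500k.json` is a flat JSON list of samples is ours, so adjust if the layout differs.

```python
import json

def split_sample(sample, m=4):
    """Split one sample into m observed pairs (used to generate the program)
    and the remaining 10 - m held-out pairs (used to evaluate it)."""
    pairs = list(zip(sample["input"], sample["output"]))
    return pairs[:m], pairs[m:]

# Assumes train_500k.json is a JSON list of samples; adjust if the layout differs.
with open("./data/train_500k.json") as f:
    samples = json.load(f)

observed, held_out = split_sample(samples[0], m=4)
print(len(observed), "observed pairs;", len(held_out), "held-out pairs")
```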
Stored at [RealD]: The flash-fill dataset contains 335 real-life string manipulation tasks, covering both logical and non-logical tasks. We shortlist 142 non-logical tasks and list them in the file `no_logic_tasks.json`, which contains the name and ID of each selected task. Additionally, we separately store 77 non-logical tasks which have at least 10 input-output example pairs in the file `no_logic_tasks_10_samples_each.json`. Here is the folder hierarchy and its interpretation:
- `FlashFill_orig_CSV`: Folder containing a CSV file for each flash-fill task, where each file has input-output example pairs.
- `FlashFill_orig_JSON`: Folder containing JSON files in which every flash-fill task, its input samples and output samples are stored ready-to-use.
  - `flashFill_raw_no_empty_op.json`: Tasks do not contain input-output example pairs where the output is an empty string.
  - `flashFill_raw_with_empty_op.json`: Tasks contain input-output example pairs where the expected output is an empty string.
- `FlashFill_Curated_CSV`: Folder containing a CSV file for each flash-fill task specified in `no_logic_tasks.json`.
- `FlashFill_Curated_JSON`: Folder containing JSON files in which only the non-logical tasks, their input samples and output samples are stored ready-to-use.
  - `flashFill_raw_no_empty_op.json`: Contains non-logical tasks; each task excludes input-output example pairs where the output is an empty string.
  - `flashFill_10_samples_per_task.json`: Also contains non-logical tasks, but each task now has a fixed number of input-output example pairs. The task list is specified in `no_logic_tasks_10_samples_each.json`.
Here, we provide instructions to train the model using both XE Loss and Reinforce (RL) Loss.
Cross Entropy Loss (XE)
- Install the dependencies.
  $ pip install -r requirements.txt
- Download the training-validation dataset from [here](Will add) into the `/data` folder and indicate the relative paths in the config file (See Config).
{
"train_data_path": "./data/train_500k.json ",
"validation_data_path": "./data/val_50k.json",
}
- Indicate the model parameters in your configuration file `neural_architecture.jsonnet` (See [config.json](Will add)).
Reinforce Loss (RL)
- Train the model with XE Loss for a few epochs.
{
"training_signal": "supervised"
}
- Monitor the custom metrics and extract the trained weights of the model with the best generalization accuracy (let's call it `best.th`).
- Place the model under the alias `best.th` in the root directory (`./`) containing the model code.
- Indicate the `training_signal` as `rl` and train the model with Reinforce Loss for a few epochs (See [config.json](Will add)).
{
"training_signal": "rl"
}
And now here is how you train the model using AllenNLP:
$ allennlp train root_dir/neural_architecture.jsonnet -s root_dir/checkpoint --include-package root_dir/model
Instructions: There should be no pre-existing `checkpoint` folder present at `root_dir/checkpoint`. In case you want to resume model training from a previously saved checkpoint in your serialisation directory, use the `-r` flag in the command.
Tips: RL-based training is memory intensive, since it stores `batch_size*nb_rollouts*max_decoding_steps` gradients before back-propagation. Decreasing `nb_rollouts` too much is not recommended, because a higher value (64-100) is required for stable RL training. Also, `max_decoding_steps` should roughly match the average program length (i.e. number of functions plus their arguments) we allow the model to decode. Thus, the only parameter we can really play with is `batch_size`.
Therefore, by default, we perform gradient accumulation for `rl` and `beam-rl` training (`accumulate_gradients` in `custom_trainer.py` is set to `True`). The idea is, instead of doing the forward pass for the complete batch at once, you do it in steps, accumulate the gradients for each step and then back-propagate. The parameter we can play with is `rl_inner_batch`, which indicates the inner batch size and should perfectly divide `batch_size`. Note that `rl_inner_batch` balances the trade-off between time consumed and memory consumed: if it is high, training time will decrease but memory consumption will increase, and vice-versa when it is too low. A minimal sketch of this pattern is shown below.
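The following is a minimal, PyTorch-style sketch of the gradient-accumulation idea described above. It is illustrative only; the actual logic lives in `custom_trainer.py`, and `rl_loss_fn` and the function name below are placeholders rather than the repository's API.

```python
def rl_training_step(model, optimizer, rl_loss_fn, batch, rl_inner_batch):
    """Sketch of gradient accumulation: forward/backward on small inner
    batches, then a single optimizer step for the full batch."""
    batch_size = len(batch)
    assert batch_size % rl_inner_batch == 0, "rl_inner_batch must divide batch_size"
    num_chunks = batch_size // rl_inner_batch
    optimizer.zero_grad()
    for start in range(0, batch_size, rl_inner_batch):
        chunk = batch[start:start + rl_inner_batch]
        loss = rl_loss_fn(model, chunk)       # forward pass on the inner batch only
        (loss / num_chunks).backward()        # gradients accumulate across chunks
    optimizer.step()                          # one parameter update for the whole batch
```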
During testing, this is not the case. We use beam search to decode the top-k programs, where k is set to 100 (see parameter `val_beam_size` in `program_decoder_rl.py`), and we allow a maximum of 50 decoding steps (see parameter `val_max_steps` in `program_decoder_rl.py`). `max_decoding_steps` during train and test time need not be the same; for the latter, it is manually set to the aforementioned values in `program_decoder_rl.py`.
In the config file, you can tune the following parameters (a combined example is sketched after this list):
- `lazy`: With `false`, the model will pre-load the whole dataset into memory. Use `true` to conserve memory during data loading.
- `training_signal`: Method for training the model. It can be `supervised` for XE Loss or `rl` for Reinforce training.
- `rl_environment`: Environment to generate rewards. It can be `BlackBoxConsistency`, which generates a positive reward when the program is correct on the observed example pairs, or `BlackBoxGeneralization`, which does so when the program is correct on the held-out example pairs.
- `nb_rollouts`: Number of programs to decode per set of observed example pairs. Keep it between 50 and 100.
- `validation_metric`: Metric to monitor while validating the model. Choose between `+acc_consistency`, `+acc_generalisation` and `-loss`.
- `rl_inner_batch`: Size of the inner batch for gradient accumulation. Note that `batch_size` should be divisible by `rl_inner_batch`. This parameter is only used for RL-based training.
- `max_decoding_steps`: Maximum number of time-steps for which we allow the program to be decoded.
- `cuda_device`: To choose between CPU/GPU. For CPU, use -1; for GPU, use 0. Multi-GPU training is currently not supported.
- Optimisation-based parameters (do not set `shuffle` to `true`; the model is sensitive to the order in which training samples appear):

"trainer": {
    "shuffle": false,
    "optimizer": { "type": "adam", "lr": 0.001 },
    "learning_rate_scheduler": { "type": "step", "step_size": 4, "gamma": 0.5 },
    "patience": 2,
    "num_epochs": 2
}
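Putting these together, the snippet below sketches example values for the fields discussed above. It is illustrative only: the values (e.g. `batch_size`, `nb_rollouts`) are placeholders, and the exact nesting of these fields inside `neural_architecture.jsonnet` (for instance under the model or trainer sections) may differ from the flat layout shown here.

```json
{
    "train_data_path": "./data/train_500k.json",
    "validation_data_path": "./data/val_50k.json",
    "lazy": false,
    "training_signal": "rl",
    "rl_environment": "BlackBoxGeneralization",
    "nb_rollouts": 64,
    "batch_size": 16,
    "rl_inner_batch": 4,
    "max_decoding_steps": 50,
    "validation_metric": "+acc_generalisation",
    "cuda_device": 0,
    "trainer": {
        "shuffle": false,
        "optimizer": { "type": "adam", "lr": 0.001 },
        "learning_rate_scheduler": { "type": "step", "step_size": 4, "gamma": 0.5 },
        "patience": 2,
        "num_epochs": 2
    }
}
```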
How does it work? In each forward pass of the decoder (a code sketch of this loop follows the list):
- Collect the observed example pairs and held-out example pairs per sample.
- For each set of observed example pairs:
  - Use beam search to decode the top-k transformation programs.
  - For each predicted program:
    - Decode the program into its executable form.
    - Match the predicted program against the ground-truth program (only if the latter is provided).
    - Parse the predicted program to verify its syntax.
    - If syntactically correct, run the predicted program on the observed example pairs to check its consistency.
    - If consistent, run the predicted program on the held-out example pairs to check generalization.
- Return the computed metrics for the batch.
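Below is a minimal, hedged sketch of that loop in Python. The helper names (`beam_search_decode`, `to_executable`, `parse_ok`, `run_program`) and the `observed`/`held_out` field names are hypothetical stand-ins for the real implementation in `program_decoder_rl.py`.

```python
def forward_pass_metrics(batch, k=100):
    """Hedged sketch of the per-batch loop described above; all helpers
    are hypothetical stand-ins for the real code in program_decoder_rl.py."""
    n_consistent, n_generalized, n_total = 0, 0, 0
    for sample in batch:
        observed, held_out = sample["observed"], sample["held_out"]   # assumed field names
        sample_consistent, sample_generalized = False, False
        for program in beam_search_decode(observed, beam_size=k):     # top-k candidates
            executable = to_executable(program)                       # decode into executable form
            # (optionally match `executable` against the ground-truth program, if provided)
            if not parse_ok(executable):                              # verify syntax
                continue
            if all(run_program(executable, inp) == out for inp, out in observed):
                sample_consistent = True                              # consistent on observed pairs
                if all(run_program(executable, inp) == out for inp, out in held_out):
                    sample_generalized = True                         # generalizes to held-out pairs
        n_consistent += int(sample_consistent)
        n_generalized += int(sample_generalized)
        n_total += 1
    return {"acc_consistency": n_consistent / n_total,
            "acc_generalisation": n_generalized / n_total}
```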
We provide some trained models that can be used right out of the box.
Loss | m | Environment |
---|---|---|
[XE](Will add) | 1 | NA |
[XE](Will add) | 4 | NA |
[RL](Will add) | 1 | Generalisation |
[RL](Will add) | 4 | Consistency |
[RL](Will add) | 4 | Generalisation |
- Download the best performing model ([RL](Will add)).
- Declare the following paths in `evaluate.py` to load the downloaded trained model.
serialization_dir = "path_to_model_dir"
params_file = "path_to_model_dir/config.json"
- To generate the top-k programs with their ranks and scores, run `display_prog()` with `file_path` pointing to a JSON file containing only one test sample. Note that the sample should have a minimum of `m` input-output string pairs. For example (see the test sample below), [RL](Will add) will use `m = 4` examples to generate the transformation program and `total_pairs - m = 1` will be used to evaluate it.
{
    "input": [
        "2001..11..6",
        "1846..1..31",
        "1712..11..2",
        "2160..8..2",
        "2072..9..26"
    ],
    "output": [
        "2001",
        "1846",
        "1712",
        "2160",
        "2072"
    ]
}
Run the following command on the above test sample (if present in `sample_test.json`):
$ display_prog(file_path='./data/sample_test.json')
- To evaluate the model on the synthetic test dataset, declare the path in `evaluation_data_path` and run `evaluate_model_and_save()`.
- To evaluate the model on the flash-fill benchmark, declare the path to the JSON containing the flash-fill tasks in `flash_fill_eval_path`, declare the path to save the results in JSON format in `flash_fill_save_path`, and run `evaluate_model_on_flash_fill()`. A combined usage sketch follows this list.
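For convenience, here is a hedged sketch of how these pieces might be wired together inside `evaluate.py`. The path values are placeholders, and the exact signatures of `display_prog`, `evaluate_model_and_save` and `evaluate_model_on_flash_fill` should be checked against the code.

```python
# Hedged usage sketch for evaluate.py; paths are placeholders and the exact
# function signatures may differ from what is shown here.
serialization_dir = "path_to_model_dir"              # downloaded trained model
params_file = "path_to_model_dir/config.json"

evaluation_data_path = "./data/Test_Curated/onlyFunctionNames_nonOverlap.json"        # placeholder path
flash_fill_eval_path = "./data/FlashFill_Curated_JSON/flashFill_10_samples_per_task.json"  # placeholder path
flash_fill_save_path = "./results/flash_fill_results.json"                             # where results are written

display_prog(file_path='./data/sample_test.json')    # inspect top-k programs for one test sample
evaluate_model_and_save()                            # evaluate on the synthetic test dataset
evaluate_model_on_flash_fill()                       # evaluate on the flash-fill benchmark
```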