This repo contains the code to reproduce the results reported in the paper An Empirical Investigation of Beam-Aware Training in Supertagging, to appear in EMNLP Findings 2020. This work explores how different choices for the meta-algorithm of Negrinho et al. (2018), which appeared in NeurIPS 2018, affect performance on a sequence labelling task (namely, supertagging on CCGBank). The goal was to identify when beam-aware training algorithms soundly beat non-beam-aware methods (e.g., the default approach of training with maximum likelihood and decoding with beam search). We found several conditions under which this is the case, e.g., a simulated online setting where the model does not have access to the complete sentence when tagging and therefore must manage uncertainty about its predictions effectively. These are the cases where we observe the largest performance gaps relative to models that are not trained in a beam-aware manner and are therefore bound to make unrecoverable mistakes because of their greediness. A model trained in a beam-aware manner learns to use the beam to manage uncertainty about future predictions until additional information arrives to resolve it.
First, create a Conda environment to work on the project:
conda create --name beam_learn python=2.7
conda activate beam_learn
python -m pip install dynet==2.1
conda install psutil matplotlib paramiko
main.py is the main file containing the implementations of the algorithms.
main.py is run with a JSON configuration file.
See below for the command to run a specific configuration file for training (--train flag; see main.py for other options, such as --compute_vanilla_beam_accuracy and --compute_beam_accuracy, which are used to run vanilla beam search on a model trained with maximum likelihood, and to run beam search on a model that has been trained in a beam-aware manner, respectively).
python -u main.py --dynet-mem 4000 --dynet-autobatch 1 --train --config_filepath PATH_TO_CONFIGURATION_FILE
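When several configuration files need to be run, it can help to build the command above programmatically. The following is a minimal sketch, not part of the repo: build_train_command is a hypothetical helper that only reproduces the documented flags, and it prints the commands rather than launching them. It avoids Python 3 only syntax so that it also runs in the Python 2.7 environment created above.

```python
import glob


def build_train_command(config_filepath, dynet_mem=4000):
    """Return the documented training command as an argument list.

    Hypothetical helper: it only mirrors the flags shown in this README.
    """
    return [
        "python", "-u", "main.py",
        "--dynet-mem", str(dynet_mem),
        "--dynet-autobatch", "1",
        "--train",
        "--config_filepath", config_filepath,
    ]


if __name__ == "__main__":
    # Print one training command per config file found in configs/.
    for path in sorted(glob.glob("configs/*.json")):
        print(" ".join(build_train_command(path)))
```

The argument list can be passed to subprocess.call to actually launch a run.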
The training data must first be processed to the format expected by the code.
First, download CCGBank from LDC (this requires access to LDC corpora, for which your university might have a subscription).
After downloading the files, uncompress them into the folder data/ccgbank_1_1.
Once they are in place, main_preprocessing.py can be run to generate the data files needed by the training code (i.e., data/supertagging/train.jsonl, data/supertagging/dev.jsonl, and data/supertagging/test.jsonl).
Because licensing restrictions prevent us from sharing the supertagging data, see here for CONLL-2003 processed into this format as an example of what the resulting files should look like.
While the code was developed for supertagging, it should be easy to adapt for any sequence labelling task where the input and output sequences have the same length.
The easiest way of accomplishing this is to process the data into the JSON Lines format (jsonl) that is used for the supertagging task.
We have included data processing scripts for CONLL-2000, CONLL-2003, and PTB in dev/main_preprocessing.py.
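As a rough illustration of writing a new sequence labelling dataset as JSON Lines, consider the sketch below. The field names "tokens" and "tags" are placeholders chosen for this example, not the names the code expects; check the processed CONLL-2003 example or main_preprocessing.py for the actual schema before using it.

```python
import json


def write_jsonl(examples, filepath):
    """Write (tokens, tags) pairs to a JSON Lines file, one example per line.

    The keys "tokens" and "tags" are hypothetical placeholders.
    """
    with open(filepath, "w") as f:
        for tokens, tags in examples:
            # Input and output sequences must have the same length.
            if len(tokens) != len(tags):
                raise ValueError("tokens and tags differ in length")
            record = {"tokens": list(tokens), "tags": list(tags)}
            f.write(json.dumps(record) + "\n")


def read_jsonl(filepath):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(filepath) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Each line of the resulting file is an independent JSON object, so the file can be streamed one example at a time.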
After the data is in place, the only step left before running main.py is to generate the JSON configuration files that were used for the experiments in the paper.
These configuration files will live in the configs folder.
The configuration files for the experiments in the paper are derived from a base configuration file, configs/cfgref.json, to reduce the amount of repetition and to make clear what aspects are being tested.
The contents of that file are:
{
"model_type": "vaswani",
"w_emb_dim": 64,
"t_emb_dim": 64,
"pos_emb_dim": 16,
"use_postags": 1,
"bilstm_h_dim": 256,
"lm_h_dim": 256,
"num_epochs": 16,
"step_size_schedule_type": "cosine",
"step_size_start": 0.1,
"step_size_end": 1e-5,
"weight_decay": 0.0,
"use_beam_bilstm": 0,
"use_beam_mlp": 0,
"accumulate_scores": 1,
"update_only_on_cost_increase": 0,
"print_every_num_examples": 8192,
"data_type": "supertagging",
"use_pretrained_embeddings": 0,
"loss_type": "log_neighbors",
"compute_train_acc": 1,
"debug": 0,
"num_debug": 1024,
"optimizer_type": "sgd",
"out_folder": "out/cfgref",
"beam_size": 1,
"traj_type": "continue"
}
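To give a sense of what the step_size_schedule_type, step_size_start, and step_size_end entries describe, here is a sketch of a standard cosine annealing schedule between the two values. This is the textbook formula, not code taken from main.py; whether the repo updates the step size per epoch or per example, and its exact formula, should be checked in the source.

```python
import math


def cosine_step_size(progress, start=0.1, end=1e-5):
    """Standard cosine annealing from start to end.

    progress is the fraction of training completed, in [0, 1].
    Sketch only: the schedule implemented in main.py may differ.
    """
    progress = min(max(progress, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))
```

With num_epochs set to 16, one possible per-epoch use would be cosine_step_size(epoch / 15.0) for epoch in 0..15, which starts at step_size_start and ends at step_size_end.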
This is the only config file that has been provided under version control in the repo.
The other config files are derived from this one through overlays.
These can be generated by running main_experiments.py.
For example, configs/cfg3000.json, which is one of these generated files, is as follows:
{
"loss_type": "log_neighbors",
"data_type": "supertagging",
"_overlays_": [
"configs/cfgref.json"
],
"out_folder": "out/cfg3000",
"beam_size": 1,
"traj_type": "continue",
"model_type": "vaswani"
}
The overlays are specified through the list with key _overlays_, which has a single entry in this case.
Multiple repetitions of the same configuration are achieved by having additional config files that overlay this config and change only the out_folder, i.e., the folder in which the results of running the configuration file will be stored.
For example, configs/cfg_r0_3000.json is the first repetition of configs/cfg3000.json:
{
"out_folder": "out/cfg_r0_3000",
"_overlays_": [
"configs/cfg3000.json"
]
}
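The overlay mechanism can be pictured as a recursive dictionary merge. The sketch below is one plausible reading of it, under two assumptions that are not confirmed by the repo: overlays are applied in list order, and keys set in the overlaying file take precedence over keys inherited from its overlays. load_config is a hypothetical name; the actual loading logic lives in the repo's code.

```python
import json

OVERLAYS_KEY = "_overlays_"


def load_config(filepath):
    """Load a config file and resolve its overlays recursively.

    Assumption: later overlays and the file's own keys override earlier values.
    Overlay paths are opened as written, i.e., relative to the working directory.
    """
    with open(filepath) as f:
        cfg = json.load(f)
    merged = {}
    for overlay_path in cfg.pop(OVERLAYS_KEY, []):
        merged.update(load_config(overlay_path))
    # Keys defined in this file win over anything inherited.
    merged.update(cfg)
    return merged
```

Under these assumptions, loading configs/cfg_r0_3000.json from the repo root would yield every key of configs/cfgref.json, with out_folder set to out/cfg_r0_3000.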
For the results in the paper, we used three repetitions.
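The repetition configs follow a regular pattern, so they are easy to construct. The sketch below builds them in memory for a given base config name; it is an illustration of the naming scheme shown above, not the code of main_experiments.py, and it assumes repetitions are numbered from r0 upwards.

```python
def make_repetition_configs(config_name, num_repetitions=3):
    """Build repetition configs for a name such as "cfg3000".

    Returns a dict mapping each config filepath to its contents.
    Hypothetical helper; main_experiments.py is the authoritative generator.
    """
    if not config_name.startswith("cfg"):
        raise ValueError("expected a name of the form cfgNNNN")
    suffix = config_name[len("cfg"):]
    configs = {}
    for r in range(num_repetitions):
        rep_name = "cfg_r%d_%s" % (r, suffix)
        configs["configs/%s.json" % rep_name] = {
            "out_folder": "out/%s" % rep_name,
            "_overlays_": ["configs/%s.json" % config_name],
        }
    return configs
```

Each entry can then be written to disk with json.dump.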
The results of a training experiment are stored in the out_folder given in the corresponding config, as a JSON file with various metrics for each epoch (e.g., secs_per_epoch, train_acc, and dev_acc).
The relevant files to check are checkpoint.json (regenerated at the end of each epoch) and results.json (created at the end of training).
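A quick way to inspect a run is to load these files and look for the best epoch on the development set. The sketch below assumes the JSON file is a dictionary that maps each metric name (e.g., dev_acc) to a list with one value per epoch; this layout is an assumption, so adjust the indexing if the files written by main.py are organized differently.

```python
import json
import os


def load_metrics(out_folder):
    """Load results.json if present, else fall back to checkpoint.json."""
    for name in ("results.json", "checkpoint.json"):
        path = os.path.join(out_folder, name)
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)
    raise IOError("no results.json or checkpoint.json in %s" % out_folder)


def best_dev_epoch(metrics):
    """Return (epoch_index, dev_acc) for the epoch with the highest dev_acc.

    Assumes metrics["dev_acc"] is a list with one entry per epoch.
    """
    dev_acc = metrics["dev_acc"]
    best = max(range(len(dev_acc)), key=lambda i: dev_acc[i])
    return best, dev_acc[best]
```

For a run that is still in progress, only checkpoint.json exists, so the fallback gives the metrics up to the last completed epoch.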
utils.sh is used to help offload the computation of these configs to a remote server.
In our workflow, we used a SLURM-managed cluster (namely, Bridges).
Using this code for Bridges with your own account or for another SLURM-managed cluster should be a matter of changing the credentials in the file.
These utilities work best with an SSH key, which removes the need to input the password with each connection to the server.
Finally, after having all the results for the configs of the form configs/cfg_r*_*.json, the results reported in the paper can be generated by running main_results.py.
In summary, the steps to replicate the results in the paper are:
- Create the Conda environment with the required packages.
- Download CCGBank from LDC, uncompress it, and place it in data/ccgbank_1_1.
- Process the raw CCGBank data by running main_preprocessing.py, which will create new files in a data/supertagging folder.
- Generate the JSON configuration files by running main_experiments.py, which will create new files in configs.
- Run the desired configuration files as described above, which will place the results in out/$NAME_OF_CONFIG.
- After all the desired experiments are finished, the results of the paper can be computed by running main_results.py, which assumes that the relevant log files are in the out folder.
While both the configs and the results can be generated by running the code as described, the configs (generated by main_experiments.py) and the results (generated by main_results.py after running all the configs with main.py) can also be found for reference here and here, respectively.
If you use this code or build on the results of this paper, please consider citing:
@inproceedings{negrinho2020empirical,
title={An Empirical Investigation of Beam-Aware Training in Supertagging},
author={Negrinho, Renato and Gormley, Matthew and Gordon, Geoffrey},
booktitle={EMNLP Findings},
year={2020}
}
@inproceedings{negrinho2018learning,
title={Learning beam search policies via imitation learning},
author={Negrinho, Renato and Gormley, Matthew and Gordon, Geoffrey},
booktitle={Advances in Neural Information Processing Systems},
year={2018}
}
We gratefully acknowledge support from 3M | M*Modal. This work used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).