Please use the following citation:
@InProceedings{W18-3602,
author = "Puzikov, Yevgeniy and Gurevych, Iryna",
title = "BinLin: A Simple Method of Dependency Tree Linearization",
booktitle = "Proceedings of the First Workshop on Multilingual Surface Realisation",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "13--28",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/W18-3602"
}
Abstract:
Surface Realization Shared Task 2018 is a workshop on generating sentences from lemmatized sets of dependency triples. This paper describes the results of our participation in the challenge. We develop a data-driven pipeline system which first orders the lemmas and then conjugates the words to finish the surface realization process. Our contribution is a novel sequential method of ordering lemmas, which, despite its simplicity, achieves promising results. We demonstrate the effectiveness of the proposed approach, describe its limitations and outline ways to improve it.
Contact person: Yevgeniy Puzikov, puzikov@ukp.informatik.tu-darmstadt.de
https://www.ukp.tu-darmstadt.de/
Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
- Official website: http://taln.upf.edu/pages/msr2018-ws/
- Track: Shallow
- Informal task description: given a lemmatized dependency tree, generate a sentence from it.
- Evaluation protocol:
  - automatic metrics (BLEU, NIST, CIDEr, normalized edit distance)
  - human evaluation by preference judgments
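For intuition about the normalized edit distance metric, the sketch below computes a character-level Levenshtein distance and turns it into a similarity in [0, 1]. It is illustrative only: the function names are hypothetical, and normalizing by the length of the longer string is an assumption. The official evaluation scripts may normalize differently.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def normalized_similarity(hypothesis, reference):
    """1.0 for identical strings, 0.0 when every character has to change.

    Assumption: the distance is divided by the length of the longer string.
    """
    longest = max(len(hypothesis), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(hypothesis, reference) / longest
```

Higher values mean the hypothesis is closer to the reference, which keeps the score oriented the same way as BLEU and NIST.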
The repository has the following structure:
- run_experiment.py: main script to run
- sample configuration files to use with the script above
  - settings for the syntactic ordering component:
    - en_syn-config.yaml (SynMLP)
  - settings for the morphological inflection generation component:
    - en_morph-mlp-config.yaml (MorphMLP)
    - en_morph-rnn-soft-config.yaml (MorphRNNSoft)
    - en_morph-rnn-hard-config.yaml (MorphRNNHard)
- components/: NN components and utility functions
- baselines/: scripts to run baseline models
- 64-bit Linux versions
- Python 3 and dependencies:
  - PyTorch v0.3.1
  - Progressbar2 v3.18.1
  - Matplotlib v2.2.2
  - NLTK v3.3
- The code was developed and tested using an Anaconda environment. Install Anaconda on your machine, create an environment (e.g., 'py3.6') and install the Python 3 dependencies:
  $ conda install -c anaconda -n py3.6 numpy pyyaml mkl mkl-include setuptools cmake cffi typing
  $ conda install -c anaconda -n py3.6 nccl pytorch cudnn cudatoolkit
  $ conda install -c anaconda -n py3.6 progressbar2 nltk
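A quick way to compare an environment against the versions listed above is to read each module's `__version__` attribute. This is a minimal sketch and not part of the repository. The import names (`torch`, `progressbar`, `matplotlib`, `nltk`) and the reliance on `__version__` are assumptions, and a build suffix in the installed version string would be reported as a difference.

```python
import importlib

# Versions listed in this README, keyed by the assumed import name of each package.
PINNED = {
    "torch": "0.3.1",
    "progressbar": "3.18.1",
    "matplotlib": "2.2.2",
    "nltk": "3.3",
}


def installed_version(module_name):
    """Return module.__version__, or None if the module is missing or has no such attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def version_report(pinned=PINNED):
    """Map each module name to an (expected, found) pair."""
    return {name: (expected, installed_version(name))
            for name, expected in pinned.items()}


if __name__ == "__main__":
    for name, (expected, found) in version_report().items():
        status = "ok" if found == expected else "differs"
        print("{}: expected {}, found {} ({})".format(name, expected, found, status))
```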
The repository contains four template configuration files (*.yaml) for training neural models and using them later for prediction.
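Since most failed runs come down to unset paths, it can help to sanity-check a configuration before launching anything. The sketch below is a hypothetical helper, not part of the repository. It only looks at the three fields this README mentions (`stage`, `experiments_dir`, `model_fn`) and assumes they sit at the top level of the YAML file; the real templates contain more fields and may be organized differently.

```python
import os


def load_config(path):
    """Read a YAML configuration file into a dictionary."""
    import yaml  # PyYAML, installed by the conda command above
    with open(path) as f:
        return yaml.safe_load(f)


def config_problems(config, mode):
    """Return a list of human-readable problems; an empty list means nothing obvious is wrong.

    Assumption: 'stage', 'experiments_dir' and 'model_fn' are top-level keys.
    """
    problems = []
    if config.get("stage") not in ("syn", "morph"):
        problems.append("stage should be 'syn' or 'morph'")
    if mode == "train" and not config.get("experiments_dir"):
        problems.append("experiments_dir is not set")
    if mode == "predict":
        model_fn = config.get("model_fn")
        if not model_fn:
            problems.append("model_fn is not set")
        elif not os.path.exists(model_fn):
            problems.append("model_fn does not exist: " + model_fn)
    return problems
```

A typical use would be `config_problems(load_config("en_syn-config.yaml"), "train")`, printing whatever the list contains before starting a run.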
Before running anything:
- Revise the configuration files: set the paths and parameter values!
- Navigate to ./components/data/morph_align/ and run:
  $ make all
  This will compile the source code for the Chinese Restaurant Process string pair aligner (reused from here).
- Run the following command:
  $ python run_experiment.py -m train -c some_config.yaml
- After the experiment, a folder will be created under the directory specified by the experiments_dir field of the some_config.yaml file. This folder should contain the following files:
  - experiment log (train.log)
  - best model weights (weights.epochXX_*, where XX stands for the epoch number and * shows the approximate performance of the model)
  - development set predictions for each training epoch (predictions.epochX)
  - serialized vocabulary used to map inputs to numerical IDs
  - a csv file with scores and train/dev losses for each epoch (scores.csv)
  - configuration dictionary in json format (config.json)
  - pdf files with learning curves
- Stage-wise prediction is done using the following command:
  $ python run_experiment.py -m predict -c some_config.yaml
  Do not forget to specify the model path in the model_fn field of the config file. The predictions made by the loaded model will be stored in /path/to/model_fn.dev.STAGE.predictions. Here STAGE can be either morph or syn, depending on the value of the stage field in the config file.
- Full pipeline prediction is done using the following command:
  $ python run_experiment.py -m pipeline -c syn_config.yaml morph_config.yaml -o output_file
  The predictions made by the pipeline model will be stored as:
  - /path/to/output_file.dev.final.txt (for the dev data specified in the configuration file)
  - /path/to/output_file.test.final.txt (for the test data specified in the configuration file)
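When several experiments are run in a row, the three invocations above can be assembled programmatically. The sketch below builds the argument lists exactly as they appear in the commands shown in this README; `build_command` is a hypothetical helper, not something the repository provides, and it assumes that train and predict take one config while pipeline takes two plus `-o`.

```python
def build_command(mode, configs, output_file=None):
    """Assemble the run_experiment.py command line for one of the three modes."""
    if mode not in ("train", "predict", "pipeline"):
        raise ValueError("unknown mode: " + mode)

    # train/predict take a single config; pipeline takes the syn config, then the morph config.
    expected = 2 if mode == "pipeline" else 1
    if len(configs) != expected:
        raise ValueError("{} mode expects {} config file(s)".format(mode, expected))

    command = ["python", "run_experiment.py", "-m", mode, "-c"] + list(configs)
    if mode == "pipeline":
        if output_file is None:
            raise ValueError("pipeline mode needs an output file")
        command += ["-o", output_file]
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root.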
We implemented three baselines:
- morph_lemma: LEMMA baseline (morphological inflection generation component)
- morph_major: MAJOR baseline (morphological inflection generation component)
- syn_random: RAND baseline (syntactic ordering component)
To run each baseline, the following steps should be performed:
- Make a folder to store the development set files (e.g., /path/to/dev_refs)
- Make a folder to store baseline predictions (e.g., /path/to/dev_hyp)
To make predictions using one of the baselines, run the following command:
$ python BASELINE.py /path/to/dev_refs /path/to/dev_hyp
Here, BASELINE.py stands for one of the following Python scripts:
./baselines/morph_lemma.py
./baselines/morph_major.py
./baselines/syn_random.py
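The idea behind the three baselines can be pictured roughly as follows. This is a sketch of one plausible reading of their names, not the code in the scripts above: the function names are hypothetical, and keying the majority table by a (lemma, features) pair with a fallback to the lemma is an assumption.

```python
import random
from collections import Counter, defaultdict


def lemma_baseline(lemma):
    """LEMMA: output the lemma unchanged as the surface form."""
    return lemma


def train_major(triples):
    """Build a table of the most frequent form per (lemma, features) from training triples."""
    counts = defaultdict(Counter)
    for lemma, features, form in triples:
        counts[(lemma, features)][form] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def major_baseline(table, lemma, features):
    """MAJOR: output the majority form seen in training, falling back to the lemma."""
    return table.get((lemma, features), lemma)


def random_order(lemmas, seed=None):
    """RAND: return the lemmas in a random order, leaving the input untouched."""
    shuffled = list(lemmas)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```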
The official evaluation scripts can be found on the workshop webpage. Shortcut instructions:
- Put all references into the A/ folder and all predictions into the B/ folder.
- Make sure the filenames in both A/ and B/ are the same.
- Run the following command:
  $ python eval_Py3.py /path/to/A /path/to/B
Note: make sure your NLTK package is up-to-date!
If it is not, NIST scores will be wrong.
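Before running the evaluation, the filename requirement can be checked with a few lines of Python. This is a minimal sketch with hypothetical helper names; it only compares the file names in the two folders and does not inspect their contents.

```python
import os


def mismatched_names(ref_names, hyp_names):
    """Return (missing_in_hyp, missing_in_ref) as sorted lists of file names."""
    refs, hyps = set(ref_names), set(hyp_names)
    return sorted(refs - hyps), sorted(hyps - refs)


def compare_folders(ref_dir, hyp_dir):
    """Compare the file names found in the reference and prediction folders."""
    return mismatched_names(os.listdir(ref_dir), os.listdir(hyp_dir))
```

Calling `compare_folders("/path/to/A", "/path/to/B")` should return two empty lists when the folders are ready for evaluation.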