Sense disambiguation of discourse connectives for PDTB-style shallow discourse parsing.
This package provides core functionality for sense disambiguation of explicit and implicit discourse connectives for PDTB-like discourse parsing. It was created for the CoNLL-2016 shared task.
The main package dsenser currently comprises the following classifiers, which can be trained either individually or bundled into ensembles:
- dsenser.major.MajorSenser
- a simplistic classifier which returns conditional probabilities of senses given a connective;
- dsenser.wang.WangSenser
- an optimized reimplementation of Wang et al.'s sense classification system using the LinearSVC classifier;
- dsenser.xgboost.XGBoostSenser
- an optimized reimplementation of Wang et al.'s sense classification system using the XGBoost decision forest classifier;
- dsenser.svd.SVDSenser
- a neural network classifier which uses the singular value decomposition (SVD) of the word embedding matrices of the arguments;
- dsenser.lstm.LSTMSenser
- a neural network classifier which uses an LSTM recurrence with Bayesian dropout (cf. Yarin Gal, 2016).
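The conditional-probability idea behind the simplest classifier in the list above can be illustrated in a few lines. This is a minimal sketch, not the code of dsenser.major.MajorSenser: the class name MajoritySenseSketch, its methods, and the toy training pairs are hypothetical, and the input is assumed to be a list of (connective, sense) pairs.

```python
from collections import Counter, defaultdict


class MajoritySenseSketch:
    """Toy estimator of P(sense | connective); hypothetical, for illustration."""

    def __init__(self):
        # Maps a lower-cased connective to a Counter of observed senses.
        self._counts = defaultdict(Counter)

    def train(self, examples):
        """Count how often each sense occurs with each connective."""
        for connective, sense in examples:
            self._counts[connective.lower()][sense] += 1

    def predict_proba(self, connective):
        """Return relative sense frequencies, or {} for an unseen connective."""
        counts = self._counts.get(connective.lower())
        if not counts:
            return {}
        total = sum(counts.values())
        return {sense: n / total for sense, n in counts.items()}

    def predict(self, connective):
        """Return the most frequent sense, or None for an unseen connective."""
        probs = self.predict_proba(connective)
        return max(probs, key=probs.get) if probs else None


# Made-up training pairs, used only to exercise the sketch.
TOY_EXAMPLES = [
    ("but", "Comparison.Contrast"),
    ("but", "Comparison.Contrast"),
    ("but", "Comparison.Concession"),
    ("because", "Contingency.Cause.Reason"),
]

toy_senser = MajoritySenseSketch()
toy_senser.train(TOY_EXAMPLES)
```

Such a baseline ignores the arguments of the relation entirely, which is why it is mainly useful as a reference point or as one member of an ensemble.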
To install this package, you need to check out this git project with its submodules by running the following commands in order:
# initialize the git project
git clone git@github.com:WladimirSidorenko/DiscourseSenser.git
cd DiscourseSenser
git submodule init
git submodule update
# download the skip-gram word embeddings and store them to `dsenser/data/`
wget http://angcl.ling.uni-potsdam.de/data/GoogleNews-vectors-negative300.bin.gz -O \
dsenser/data/GoogleNews-vectors-negative300.bin.gz
gunzip dsenser/data/GoogleNews-vectors-negative300.bin.gz
# download the pre-trained models and store them to `dsenser/data/models`
wget http://angcl.ling.uni-potsdam.de/data/pdtb.models.tgz
tar -xzf pdtb.models.tgz -C dsenser/data/models
# Beware, since this package is constantly being improved, the
# most recent version might not be fully compatible in terms of
# features with the models we trained for the submission. In
# this case, we recommend you check out our evaluated version
# by running the following command:
# ``git checkout conll-asterisk-evaluation``
# finally, install the package in editable mode (no copying will be
# required in this case)
pip install --user -r requirements.txt -e .
To ease the installation process, we are currently working on creating
a wheel for this package, but are facing some problems due to the
large size of the included word embedding file which requires the
zip64 extension.
After installation, you can import the module in your Python scripts, e.g.:
# the module itself must be imported to access the classifier-type constants
import dsenser
from dsenser import DiscourseSenser
...
senser = DiscourseSenser(None)
senser.train(train_set, dsenser.WANG | dsenser.XGBOOST | dsenser.LSTM,
             path_to_model, dev_set)
Alternatively, you can use the bundled front-end script pdtb_senser to process your input data, e.g.:
pdtb_senser train --type=2 --type=8 path/to/train_dir
pdtb_senser test path/to/input_dir path/to/output_dir
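The `|` operator in the Python call and the repeated `--type` options suggest that classifier types are combined as bit flags. The sketch below shows that general pattern with Python's `enum.IntFlag`. It is an illustration under stated assumptions: the class SenserType, the helpers combine and selected, and above all the numeric values are hypothetical and may not match the constants defined in dsenser, so consult the package source for the real mapping.

```python
from enum import IntFlag
from functools import reduce
from operator import or_


class SenserType(IntFlag):
    """Hypothetical classifier-type flags; the values are assumptions."""

    # Distinct powers of two, so that any subset can be OR-ed into one integer.
    WANG = 2
    XGBOOST = 4
    LSTM = 8


def combine(type_values):
    """OR together integer type values, e.g. from repeated --type options."""
    return reduce(or_, (SenserType(v) for v in type_values), SenserType(0))


def selected(mask):
    """Return the names of the individual flags contained in a combined mask."""
    return [flag.name for flag in SenserType if flag in mask]


# Under the assumed values, types 2 and 8 select two of the three classifiers.
example_mask = combine([2, 8])
example_names = selected(example_mask)
```

Encoding the selection as a single integer keeps the training call compact: one argument carries the whole ensemble configuration, and membership tests reduce to a bitwise AND.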
The data in the specified folders should be in the CoNLL JSON format
and include the files parses.json and relations.json for training, and
parses.json and relations-no-senses.json for the testing mode.
You can also specify a different input relations file whose senses
need to be predicted by using the --rel-file option:
pdtb_senser test --rel-file=REL_FILE INPUT_DIR OUTPUT_DIR
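Before launching a long training or testing run, it can help to verify that a data folder contains the files listed above. The following is a minimal sketch of such a check. The function missing_files and the REQUIRED_FILES table are hypothetical helpers, not part of the package, and the sketch only tests for the presence of the files, not for the validity of their contents.

```python
import os

# File names required in each mode, as listed in the text above.
REQUIRED_FILES = {
    "train": ("parses.json", "relations.json"),
    "test": ("parses.json", "relations-no-senses.json"),
}


def missing_files(data_dir, mode, rel_file=None):
    """Return the paths of required files that do not exist.

    `rel_file`, if given in test mode, replaces the default relations
    file, mirroring the --rel-file option of the front-end script.
    """
    if mode not in REQUIRED_FILES:
        raise ValueError("mode must be 'train' or 'test'")
    paths = [os.path.join(data_dir, name) for name in REQUIRED_FILES[mode]]
    if mode == "test" and rel_file is not None:
        # The relations file is the last entry of the list for each mode.
        paths[-1] = rel_file
    return [path for path in paths if not os.path.isfile(path)]
```

An empty list means that all expected files were found; otherwise the returned paths show exactly what is missing.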
To reproduce our *asterisk results from the CoNLL Shared Task submission, repeat the steps described in the Installation section and additionally run the following checkout command to obtain exactly the version that we used for the evaluation:
git checkout conll-asterisk-evaluation
We gratefully acknowledge the contribution of
- Tatjana Scheffler who extended the original features of Wang et al.