Collaborators: Connor Boyle, Martin Horst, Nikitas Tampakis
This code can be run most conveniently through our Google Colab notebook.
The cached models for D4 will remain available for a short period on the UW Google Drive. To download the full 700+ MB trained models, you can use this link.
You can train a model using the following command (after activating the correct environment):
$ python src/train.py --train-file <TRAIN_FILE> --model-directory <MODEL_DIRECTORY>
replacing <TRAIN_FILE> with the path to the training data and <MODEL_DIRECTORY> with the path to where the model checkpoints will be saved.
The above command will train using the default hyperparameters for our training loop. It will also use a random 10% of the training data in <TRAIN_FILE> as a per-epoch validation dataset.
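For the curious, a 90/10 split like this takes only a few lines with PyTorch's `random_split`. This is a minimal sketch of the idea, not our exact training code; the toy `TensorDataset` stands in for whatever dataset `src/train.py` builds from <TRAIN_FILE>:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy stand-in for the dataset built from <TRAIN_FILE>.
dataset = TensorDataset(torch.randn(100, 8))

# Hold out a random 10% of the training examples for per-epoch validation.
val_size = int(0.1 * len(dataset))
train_size = len(dataset) - val_size
train_set, val_set = random_split(
    dataset,
    [train_size, val_size],
    generator=torch.Generator().manual_seed(42),  # fix the seed for reproducibility
)
print(len(train_set), len(val_set))  # 90 10
```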
You can instead train with an explicit dev set using the following command:
$ python src/train.py --train-file <TRAIN_FILE> --dev-file <DEV_FILE> --model-directory <MODEL_DIRECTORY>
replacing <TRAIN_FILE> with the path to the training data, <DEV_FILE> with the path to the dev dataset file, and <MODEL_DIRECTORY> with the path to where the model checkpoints will be saved.
The above command will train using the default hyperparameters for our training loop. It will also use <DEV_FILE> as a per-epoch validation dataset.
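For example, to train on the Hinglish split with its dev file (the data/ paths below follow the same layout as the example path under the classifier command further down; models/hinglish is an arbitrary output directory of your choosing):

$ python src/train.py --train-file data/Semeval_2020_task9_data/Hinglish/Hinglish_train_14k_split_conll.txt --dev-file data/Semeval_2020_task9_data/Hinglish/Hinglish_dev_3k_split_conll.txt --model-directory models/hinglish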
The following are the maximum tokenized tweet lengths produced by the BERT tokenizer for our train, dev, and test files for Spanglish and Hinglish (a sketch for reproducing these counts follows the list). For example, one tokenized tweet looks like: r, ##t, @, fra, ##lal, ##icio, ##ux, ##xe, t, ##bh, i, have, bad, sides, too, ., when, i, say, bad, it, ', s, ho, ##rri, ##bly, bad, .
- Spanglish_train.conll: 82
- Spanglish_dev.conll: 78
- Spanglish_test_conll_unlabeled.txt: 76
- Hinglish_train_14k_split_conll.txt: 85
- Hinglish_dev_3k_split_conll.txt: 70
- Hinglish_test_unlabeled_conll_updated.txt: 77
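These counts can be reproduced with a short script along the following lines, using Hugging Face transformers. This is a sketch under assumptions: bert-base-multilingual-cased is an assumed checkpoint (substitute whichever BERT variant you actually tokenize with), and reading tweets out of the CoNLL files is left as a stub:

```python
from transformers import BertTokenizer

# Assumed checkpoint; substitute the BERT variant actually used.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

def max_tokenized_length(tweets):
    """Longest wordpiece sequence over an iterable of tweet strings."""
    return max(len(tokenizer.tokenize(tweet)) for tweet in tweets)

# Toy input; in practice, read the tweets out of the CoNLL files listed above.
tweets = ["rt @fralaliciouxxe tbh i have bad sides too. when i say bad it's horribly bad."]
print(max_tokenized_length(tweets))
```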
The classifier can be run from the shell with the following command:
$ python src/classify.py --test-file <TEST_FILE> --model-directory <MODEL_DIRECTORY>/<MODEL_INSTANCE> --output-file <OUTPUT_FILE>
replacing <TEST_FILE> with the path to a testing data file (e.g. data/Semeval_2020_task9_data/Spanglish/Spanglish_test_conll_unlabeled.txt), <OUTPUT_FILE> with the path to an output file (e.g. output.txt), <MODEL_DIRECTORY> with the path to the directory where trained model instances have been saved, and <MODEL_INSTANCE> with any of the (0-indexed) model instances saved before each epoch, or the FINAL model instance saved after the last epoch (NOTE: in our submitted, trained models, only the FINAL model instance has been saved).
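Putting the pieces together, a concrete invocation against the Spanglish test set might look like the following, where models/spanglish is a hypothetical <MODEL_DIRECTORY> and FINAL is the instance saved after the last epoch:

$ python src/classify.py --test-file data/Semeval_2020_task9_data/Spanglish/Spanglish_test_conll_unlabeled.txt --model-directory models/spanglish/FINAL --output-file output.txt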
We load and save our base Python environment using Conda. You can load the environment for the first time by running the following command from the root of the repository:
$ conda env create -f src/environment.yml
You can then activate the base environment with the following command:
$ conda activate 573
To update your current base environment with a new or changed environment.yml file, run the following command:
$ conda env update -f src/environment.yml
On top of the base environment, you will need to install package dependencies from requirements.txt (make sure you have activated the base environment you want to use):
$ pip install -r src/requirements.txt
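In short, a fresh setup from the repository root is these three commands in order:

$ conda env create -f src/environment.yml
$ conda activate 573
$ pip install -r src/requirements.txt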