This project trains a part-of-speech tagger and dependency parser for a
low-resource language such as Tagalog. It uses the TRG and Ugnayan treebanks.
Because each corpus contains only a small number of sentences, the model is
evaluated with 10-fold cross-validation. The split is implemented in
`scripts/kfold.py`. The cross-validation results are shown below.
Treebank | TOKEN_ACC | POS_ACC | MORPH_ACC | TAG_ACC | DEP_UAS | DEP_LAS |
---|---|---|---|---|---|---|
TRG | 1.000 | 0.843 | 0.749 | 0.833 | 80.846* | 0.554 |
Ugnayan | 0.998 | 0.819 | 0.995 | 0.810 | 0.667 | 0.409 |
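
The 10-fold evaluation described above can be sketched as follows. This is a minimal illustration of the splitting technique only, not the contents of `scripts/kfold.py`; the function name `kfold_splits`, its parameters, and the placeholder sentence IDs are assumptions made for this example.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def kfold_splits(
    items: Sequence[T], n_folds: int = 10, seed: int = 0
) -> List[Tuple[List[T], List[T]]]:
    """Return (train, dev) pairs, one per fold (hypothetical helper)."""
    if n_folds < 2 or n_folds > len(items):
        raise ValueError("n_folds must be between 2 and len(items)")

    # Shuffle indices with a fixed seed so the folds are reproducible.
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]

    splits = []
    for held_out in folds:
        held_out_set = set(held_out)
        # Each fold serves as the held-out set exactly once;
        # the remaining folds form the training set.
        train = [items[i] for i in indices if i not in held_out_set]
        dev = [items[i] for i in held_out]
        splits.append((train, dev))
    return splits


# Placeholder sentence IDs standing in for the sentences of a small treebank.
sentences = [f"sent-{i}" for i in range(25)]
splits = kfold_splits(sentences, n_folds=10)
for fold_id, (train, dev) in enumerate(splits):
    print(fold_id, len(train), len(dev))
```

Each sentence lands in the held-out set of exactly one fold, so the scores averaged over all folds cover the whole corpus even though every individual model is evaluated on sentences it was not trained on.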
The `project.yml` defines the data assets required by the project, as well as
the available commands and workflows. For details, see the Weasel documentation.

The following commands are defined by the project. They can be executed using
`weasel run [name]`. Commands are only re-run if their inputs have changed.
Command | Description |
---|---|
`preprocess` | Convert the data to spaCy's format |
`evaluate-kfold` | Evaluate using k-fold cross validation |
`clean` | Remove intermediate files |
The following workflows are defined by the project. They can be executed using
`weasel run [name]` and will run the specified commands in order. Commands are
only re-run if their inputs have changed.
Workflow | Steps |
---|---|
`all` | `preprocess` → `evaluate-kfold` |
The following assets are defined by the project. They can be fetched by running
`weasel assets` in the project directory.
File | Source | Description |
---|---|---|
`assets/tl_trg-ud-test.conllu` | URL | Treebank data for UD_Tagalog-TRG |
`assets/tl_ugnayan-ud-test.conllu` | URL | Treebank data for UD_Tagalog-Ugnayan |
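
Both assets are CoNLL-U files, in which sentences are separated by blank lines, comment lines start with `#`, and each token line has ten tab-separated columns. The sketch below shows one way to group such lines into sentences, for example to check how small a corpus is. The helper name `read_conllu_sentences` is hypothetical, and the two-sentence sample is made up for illustration rather than taken from either treebank.

```python
from typing import Iterable, List


def read_conllu_sentences(lines: Iterable[str]) -> List[List[List[str]]]:
    """Group CoNLL-U lines into sentences of token rows (hypothetical helper)."""
    sentences: List[List[List[str]]] = []
    current: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Skip sentence-level comments such as "# text = ...".
            continue
        # Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
        current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample in CoNLL-U layout (not real treebank data).
sample = (
    "# text = Kumain siya.\n"
    "1\tKumain\tkain\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\tsiya\tsiya\tPRON\t_\t_\t1\tnsubj\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
    "# text = Umalis siya.\n"
    "1\tUmalis\talis\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\tsiya\tsiya\tPRON\t_\t_\t1\tnsubj\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
)

parsed = read_conllu_sentences(sample.splitlines())
print(len(parsed), "sentences")
print([token[1] for token in parsed[0]])
```

To apply this to a downloaded asset, the same function can be given an open file handle, for instance `open("assets/tl_trg-ud-test.conllu", encoding="utf-8")`.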