Deep Active Learning for Morphophonological Processing
Building a system for morphological processing is a challenging task in morphologically complex languages like Arabic. Although there are some deep learning based models that achieve successful results, these models rely on a large amount of annotated data. Building such datasets, specially for some of the lower-resource Arabic dialects, is very difficult, time-consuming, and expensive. In addition, some parts of the annotated data do not contain useful information for training machine learning models. Active learning strategies allow the learner algorithm to select the most informative samples for annotation. There has been little research that focuses on applying active learning for morphological inflection and morphophonological processing. In this paper, we have proposed a deep active learning method for this task. Our experiments on Egyptian Arabic show that with only about 30% of annotated data, we achieve the same results as does the state-of-the-art model on the whole dataset.
If you find this work useful to you, please cite:
Seyed Morteza Mirbostani, Yasaman Boreshban, Salam Khalifa, SeyedAbolghasem Mirroshandel, and Owen Rambow. 2023. Deep Active Learning for Morphophonological Processing. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 793–803, Toronto, Canada. Association for Computational Linguistics.
Deep Active Learning for Morphophonological Processing (Mirbostani et al., ACL 2023)
@inproceedings{mirbostani-etal-2023-deep,
title = "Deep Active Learning for Morphophonological Processing",
author = "Mirbostani, Seyed Morteza and
Boreshban, Yasaman and
Khalifa, Salam and
Mirroshandel, SeyedAbolghasem and
Rambow, Owen",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-short.69",
doi = "10.18653/v1/2023.acl-short.69",
pages = "793--803",
abstract = "Building a system for morphological processing is a challenging task in morphologically complex languages like Arabic. Although there are some deep learning based models that achieve successful results, these models rely on a large amount of annotated data. Building such datasets, specially for some of the lower-resource Arabic dialects, is very difficult, time-consuming, and expensive. In addition, some parts of the annotated data do not contain useful information for training machine learning models. Active learning strategies allow the learner algorithm to select the most informative samples for annotation. There has been little research that focuses on applying active learning for morphological inflection and morphophonological processing. In this paper, we have proposed a deep active learning method for this task. Our experiments on Egyptian Arabic show that with only about 30{\%} of annotated data, we achieve the same results as does the state-of-the-art model on the whole dataset.",
}
Clone the following repository to your local system:
$ git clone https://github.com/mirbostani/ArabicMorph
$ cd ArabicMorph
Create an environment based on the provided environment.yml
file to install the dependencies:
$ conda env create -f environment.yml
$ conda activate arabicmorph
Copy the required *.tsv
files from data/arabicmorph/Egy_Data/ECAL/baseline_data
to data/arabicmorph/active
.
$ cd data/arabicmorph/active
$ tree -L 1
.
├── 0_train_1000.tsv
├── 0_train_13171.tsv
├── 0_tune_1000.tsv
├── 0_tune_13171.tsv
├── dev_oov.tsv
├── dev.tsv
├── test_oov.tsv
└── test.tsv
0 directories, 8 files
Use the following bash script to train a transformer model with active learning.
$ cd arabicmorph
$ bash example/transformer/arabicmorph_al.sh
The content of the bash script is as follows:
python src/train_al.py \
--prefix "am0" \
--start_cycle 0 \
--num_train_samples 900 \
--num_tune_samples 500 \
--num_cycle_samples 250 \
--data_dir "data/arabicmorph/active" \
--checkpoint_dir "checkpoints/transformer" \
--dataset_files \
data/arabicmorph/active/0_tune_13171.tsv \
data/arabicmorph/active/0_train_13171.tsv \
--train_reference_files \
data/arabicmorph/active/0_train_1000.tsv
For more information, use --help
option.
$ python src/train_al.py --help
usage: train_al.py [-h] [--data_dir DATA_DIR] [--checkpoint_dir CHECKPOINT_DIR] [--dataset_files [DATASET_FILES ...]] [--train_reference_files [TRAIN_REFERENCE_FILES ...]] [--start_cycle START_CYCLE] [--num_train_samples NUM_TRAIN_SAMPLES] [--num_tune_samples NUM_TUNE_SAMPLES]
[--num_cycle_samples NUM_CYCLE_SAMPLES] --prefix PREFIX
optional arguments:
-h, --help show this help message and exit
--data_dir DATA_DIR Path to the database directory
--checkpoint_dir CHECKPOINT_DIR
Path to save trained models
--dataset_files [DATASET_FILES ...]
Dataset samples in one or multiple files
--train_reference_files [TRAIN_REFERENCE_FILES ...]
Use these files as references to extract initial training samples from dataset files
--start_cycle START_CYCLE
Start active learning train cycle
--num_train_samples NUM_TRAIN_SAMPLES
Number of samples randomly selected from merged train files for initial training
--num_tune_samples NUM_TUNE_SAMPLES
Number of samples for tuning models
--num_cycle_samples NUM_CYCLE_SAMPLES
Number of samples selected by active learning approache to be added to the previous training set
--prefix PREFIX Prefix used for the name of the files generated during the active learning procedure
Please refer to the paper for more information regarding the dataset generated from the Egyptian Colloquial Arabic Lexicon (ECAL).
The baseline model extended for the experiments in this work is based on the character-level transducer model presented in shijie-wu/neural-transducer.
MIT