Improving Question Answering Performance Using Knowledge Distillation and Active Learning
Contemporary question answering (QA) systems, including Transformer-based architectures, suffer from increasing computational and model complexity, which renders them inefficient for real-world applications with limited resources. Furthermore, training or even fine-tuning such models requires a vast amount of labeled data, which is often not available for the task at hand. In this manuscript, we conduct a comprehensive analysis of the mentioned challenges and introduce suitable countermeasures. We propose a novel knowledge distillation (KD) approach to reduce the parameter and model complexity of a pre-trained bidirectional encoder representations from transformers (BERT) system and utilize multiple active learning (AL) strategies for an immense reduction in annotation efforts. We show the efficacy of our approach by comparing it with four state-of-the-art (SOTA) Transformer-based systems, namely KroneckerBERT, EfficientBERT, TinyBERT, and DistilBERT. Specifically, we outperform KroneckerBERT_21 and EfficientBERT_TINY by 4.5 and 0.4 percentage points in EM, despite having 75.0% and 86.2% fewer parameters, respectively. Additionally, our approach achieves performance comparable to that of 6-layer TinyBERT and DistilBERT while using only 2% of their total trainable parameters. Moreover, by integrating our AL approaches into the BERT framework, we show that SOTA results on the QA datasets can be achieved with only 40% of the training data. Overall, these results demonstrate the effectiveness of our approaches in achieving SOTA performance while drastically reducing the number of parameters and the labeling effort.
@article{BORESHBAN2023106137,
title = {Improving question answering performance using knowledge distillation and active learning},
journal = {Engineering Applications of Artificial Intelligence},
volume = {123},
pages = {106137},
year = {2023},
issn = {0952-1976},
doi = {https://doi.org/10.1016/j.engappai.2023.106137},
url = {https://www.sciencedirect.com/science/article/pii/S0952197623003214},
author = {Yasaman Boreshban and Seyed Morteza Mirbostani and Gholamreza Ghassem-Sani and Seyed Abolghasem Mirroshandel and Shahin Amiriparian},
keywords = {Natural language processing, Question answering, Deep learning, Knowledge distillation, Active learning, Performance},
abstract = {Contemporary question answering (QA) systems, including Transformer-based architectures, suffer from increasing computational and model complexity which render them inefficient for real-world applications with limited resources. Furthermore, training or even fine-tuning such models requires a vast amount of labeled data which is often not available for the task at hand. In this manuscript, we conduct a comprehensive analysis of the mentioned challenges and introduce suitable countermeasures. We propose a novel knowledge distillation (KD) approach to reduce the parameter and model complexity of a pre-trained bidirectional encoder representations from transformer (BERT) system and utilize multiple active learning (AL) strategies for immense reduction in annotation efforts. We show the efficacy of our approach by comparing it with four state-of-the-art (SOTA) Transformers-based systems, namely KroneckerBERT, EfficientBERT, TinyBERT, and DistilBERT. Specifically, we outperform KroneckerBERT21 and EfficientBERTTINY by 4.5 and 0.4 percentage points in EM, despite having 75.0% and 86.2% fewer parameters, respectively. Additionally, our approach achieves comparable performance to 6-layer TinyBERT and DistilBERT while using only 2% of their total trainable parameters. Besides, by the integration of our AL approaches into the BERT framework, we show that SOTA results on the QA datasets can be achieved when we only use 40% of the training data. Overall, all results demonstrate the effectiveness of our approaches in achieving SOTA performance, while extremely reducing the number of parameters and labeling efforts. Finally, we make our code publicly available at https://github.com/mirbostani/QA-KD-AL.}
}
- Python 3.8.3
- PyTorch 1.6.0
- Spacy 2.3.2
- NumPy 1.19.5
- Transformers 4.6.1
- QANet (Student)
- QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension [arXiv: 1804.09541v1]
- The model implementation is based on BangLiu/QANet-PyTorch and andy840314/QANet-pytorch-.
- BERT (Teacher)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [arXiv: 1810.04805]
- HuggingFace Transformers is used for the model implementation.
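During distillation, the student is trained to match the teacher's span predictions. As a minimal sketch (the checkpoint name is the one used in the training commands below; the repository's own loading code may differ), the BERT teacher can be loaded and queried through HuggingFace Transformers as follows:

import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Teacher checkpoint used in the training commands below.
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
teacher = AutoModelForQuestionAnswering.from_pretrained(model_name)
teacher.eval()

question = "What is the student model?"
context = "The student model distilled from BERT is QANet."
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = teacher(**inputs)

# Span logits over the input tokens; these are the kind of soft targets
# typically used when distilling a QA model into a smaller student.
start_logits, end_logits = outputs.start_logits, outputs.end_logits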
Use download.sh to download and extract the required datasets automatically.
- GloVe
- SQuAD v1.1
- Adversarial SQuAD
- sample1k-HCVerifyAll (AddSent)
- sample1k-HCVerifySample (AddOneSent)
Any BERT-based model (e.g., from the HuggingFace model hub) can be used as the teacher.
$ python main.py \
--train true \
--epochs 30 \
--use_cuda true \
--use_kd true \
--student "qanet" \
--batch_size 14 \
--teacher "bert" \
--teacher_model_or_path "bert-large-uncased-whole-word-masking-finetuned-squad" \
--teacher_tokenizer_or_path "bert-large-uncased-whole-word-masking-finetuned-squad" \
--teacher_batch_size 32 \
--temperature 10 \
--alpha 0.7 \
--interpolation "linear"
The active learning datasets based on the least confidence strategy are provided in the ./data/active directory.
$ python main.py \
--train true \
--epochs 30 \
--use_cuda true \
--use_kd false \
--student "qanet" \
--batch_size 14 \
--train_file ./data/active/train_active_lc5_40.json
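The lc in the file name refers to the least confidence strategy mentioned above, and the _40 suffix presumably corresponds to the 40% training subset reported in the abstract. For intuition only (these helpers are hypothetical, not the repository's selection code), a pool-based least confidence round could look like this:

import torch
import torch.nn.functional as F

def span_confidence(start_logits, end_logits):
    # Confidence of the model's best span for each example in the batch;
    # lower values mean the model is less certain about its answer.
    p_start = F.softmax(start_logits, dim=-1).max(dim=-1).values
    p_end = F.softmax(end_logits, dim=-1).max(dim=-1).values
    return p_start * p_end

def select_least_confident(confidences, fraction=0.4):
    # Keep the given fraction of pool examples with the lowest confidence
    # (e.g., 40% of the training questions) for annotation/training.
    scores = torch.as_tensor(confidences)
    k = int(len(scores) * fraction)
    return torch.argsort(scores)[:k]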
Before combining knowledge distillation and active learning to train the student model, you have to fine-tune the teacher model (e.g., BERT-Large) on one of the active learning datasets provided in the ./data/active directory.
$ python main.py \
--train true \
--epochs 30 \
--use_cuda true \
--use_kd true \
--student "qanet" \
--batch_size 14 \
--teacher "bert" \
--teacher_batch_size 32 \
--teacher_model_or_path ./processed/bert-finetuned-active-lc5-40 \
--teacher_tokenizer_or_path ./processed/bert-finetuned-active-lc5-40 \
--temperature 10 \
--alpha 0.7 \
--interpolation "linear" \
--train_file ./data/active/train_active_lc5_40.json
After a successful evaluation, the results will be saved in the ./processed/evaluation directory by default.
$ python main.py \
--evaluate true \
--use_cuda true \
--student "qanet" \
--dev_file ./data/squad/dev-v1.1.json \
--processed_data_dir ./processed/data \
--resume ./processed/checkpoints/model_best.pth.tar
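SQuAD v1.1 results are conventionally reported as exact match (EM) and F1, as computed by the official evaluation script. The following self-contained sketch mirrors those two metrics and assumes you have predicted and gold answer strings at hand; it is provided for reference and is not the repository's evaluation code:

import re
import string
from collections import Counter

def normalize_answer(s):
    # Lowercase, strip punctuation and articles, and collapse whitespace,
    # as in the official SQuAD v1.1 evaluation script.
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, ground_truth):
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))

def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)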