QA-KD-AL

Improving Question Answering Performance Using Knowledge Distillation and Active Learning

Paper

Abstract

Contemporary question answering (QA) systems, including Transformer-based architectures, suffer from increasing computational and model complexity which render them inefficient for real-world applications with limited resources. Furthermore, training or even fine-tuning such models requires a vast amount of labeled data which is often not available for the task at hand. In this manuscript, we conduct a comprehensive analysis of the mentioned challenges and introduce suitable countermeasures. We propose a novel knowledge distillation (KD) approach to reduce the parameter and model complexity of a pre-trained bidirectional encoder representations from transformer (BERT) system and utilize multiple active learning (AL) strategies for immense reduction in annotation efforts. We show the efficacy of our approach by comparing it with four state-of-the-art (SOTA) Transformers-based systems, namely KroneckerBERT, EfficientBERT, TinyBERT, and DistilBERT. Specifically, we outperform KroneckerBERT_21 and EfficientBERT_TINY by 4.5 and 0.4 percentage points in EM, despite having 75.0% and 86.2% fewer parameters, respectively. Additionally, our approach achieves comparable performance to 6-layer TinyBERT and DistilBERT while using only 2% of their total trainable parameters. Besides, by the integration of our AL approaches into the BERT framework, we show that SOTA results on the QA datasets can be achieved when we only use 40% of the training data. Overall, all results demonstrate the effectiveness of our approaches in achieving SOTA performance, while extremely reducing the number of parameters and labeling efforts.

How to Cite

BibTeX

@article{BORESHBAN2023106137,
title = {Improving question answering performance using knowledge distillation and active learning},
journal = {Engineering Applications of Artificial Intelligence},
volume = {123},
pages = {106137},
year = {2023},
issn = {0952-1976},
doi = {https://doi.org/10.1016/j.engappai.2023.106137},
url = {https://www.sciencedirect.com/science/article/pii/S0952197623003214},
author = {Yasaman Boreshban and Seyed Morteza Mirbostani and Gholamreza Ghassem-Sani and Seyed Abolghasem Mirroshandel and Shahin Amiriparian},
keywords = {Natural language processing, Question answering, Deep learning, Knowledge distillation, Active learning, Performance},
abstract = {Contemporary question answering (QA) systems, including Transformer-based architectures, suffer from increasing computational and model complexity which render them inefficient for real-world applications with limited resources. Furthermore, training or even fine-tuning such models requires a vast amount of labeled data which is often not available for the task at hand. In this manuscript, we conduct a comprehensive analysis of the mentioned challenges and introduce suitable countermeasures. We propose a novel knowledge distillation (KD) approach to reduce the parameter and model complexity of a pre-trained bidirectional encoder representations from transformer (BERT) system and utilize multiple active learning (AL) strategies for immense reduction in annotation efforts. We show the efficacy of our approach by comparing it with four state-of-the-art (SOTA) Transformers-based systems, namely KroneckerBERT, EfficientBERT, TinyBERT, and DistilBERT. Specifically, we outperform KroneckerBERT21 and EfficientBERTTINY by 4.5 and 0.4 percentage points in EM, despite having 75.0% and 86.2% fewer parameters, respectively. Additionally, our approach achieves comparable performance to 6-layer TinyBERT and DistilBERT while using only 2% of their total trainable parameters. Besides, by the integration of our AL approaches into the BERT framework, we show that SOTA results on the QA datasets can be achieved when we only use 40% of the training data. Overall, all results demonstrate the effectiveness of our approaches in achieving SOTA performance, while extremely reducing the number of parameters and labeling efforts. Finally, we make our code publicly available at https://github.com/mirbostani/QA-KD-AL.}
}

Requirements

  • Python 3.8.3
  • PyTorch 1.6.0
  • spaCy 2.3.2
  • NumPy 1.19.5
  • Transformers 4.6.1

Supported Models

Datasets

Use download.sh to download and extract the required datasets automatically.

Train the Student Model Using Knowledge Distillation

Any BERT-based model selected from these models can be used as a teacher.

$ python main.py \
    --train true \
    --epochs 30 \
    --use_cuda true \
    --use_kd true \
    --student "qanet" \
    --batch_size 14 \
    --teacher "bert" \
    --teacher_model_or_path "bert-large-uncased-whole-word-masking-finetuned-squad" \
    --teacher_tokenizer_or_path "bert-large-uncased-whole-word-masking-finetuned-squad" \
    --teacher_batch_size 32 \
    --temperature 10 \
    --alpha 0.7 \
    --interpolation "linear"
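
The --temperature and --alpha flags above suggest a temperature-softened distillation loss blended with the usual supervised loss. The following is a minimal, dependency-free sketch of that general idea for a single distribution (for span-based QA it would presumably be applied to both the start and the end position logits). It assumes the common formulation in which alpha weights the soft (teacher) term and 1 - alpha weights the hard (gold label) term; the function names are hypothetical, the exact loss used in main.py may differ, and the --interpolation option is not modeled here.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def kd_loss(student_logits, teacher_logits, gold_index,
            temperature=10.0, alpha=0.7):
    """Blend a soft-target KL term with a hard-target cross-entropy term.

    Assumption: alpha weights the distillation term, (1 - alpha) the
    supervised term. This is a common convention, not necessarily the
    one implemented in this repository.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)

    # KL(teacher || student) on the softened distributions.
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft_loss = (temperature ** 2) * kl

    # Standard cross-entropy against the gold position at temperature 1.
    hard_loss = -log_softmax(student_logits)[gold_index]

    return alpha * soft_loss + (1.0 - alpha) * hard_loss


if __name__ == "__main__":
    student = [0.2, 1.5, -0.3, 0.1]   # hypothetical start-position logits
    teacher = [0.0, 3.0, -1.0, 0.5]
    print(kd_loss(student, teacher, gold_index=1))
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to incorrect positions rather than only from its top prediction.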

Train the Student Model Using Active Learning

The active learning datasets based on the least confidence strategy are provided in ./data/active.

$ python main.py \
    --train true \
    --epochs 30 \
    --use_cuda true \
    --use_kd false \
    --student "qanet" \
    --batch_size 14 \
    --train_file ./data/active/train_active_lc5_40.json
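
For readers unfamiliar with the least confidence strategy mentioned above, the sketch below shows the generic selection rule: score each unlabeled example by one minus the probability of the model's most likely prediction, then send the highest-scoring examples for annotation. The function names and the toy pool are hypothetical; how this repository derives a confidence value for an answer span, and how the provided files in ./data/active were generated, is not described here.

```python
def least_confidence(probabilities):
    """Uncertainty score: 1 minus the probability of the top prediction."""
    return 1.0 - max(probabilities)


def select_for_annotation(pool, budget):
    """Return the ids of the `budget` most uncertain examples.

    `pool` maps an example id to the model's predicted probability
    distribution for that example (a hypothetical data layout).
    """
    ranked = sorted(pool,
                    key=lambda ex_id: least_confidence(pool[ex_id]),
                    reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    pool = {
        "q1": [0.90, 0.05, 0.05],  # confident prediction
        "q2": [0.40, 0.35, 0.25],  # most uncertain
        "q3": [0.60, 0.30, 0.10],
    }
    print(select_for_annotation(pool, budget=2))  # ['q2', 'q3']
```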

Train the Student Model Using Knowledge Distillation and Active Learning

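Before combining knowledge distillation and active learning to train the student model, you must first fine-tune the teacher model (e.g., BERT-Large) on one of the active learning datasets provided in the ./data/active directory. Then pass the path of the fine-tuned teacher to --teacher_model_or_path and --teacher_tokenizer_or_path, as in the command below.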

$ python main.py \
    --train true \
    --epochs 30 \
    --use_cuda true \
    --use_kd true \
    --student "qanet" \
    --batch_size 14 \
    --teacher "bert" \
    --teacher_batch_size 32 \
    --teacher_model_or_path ./processed/bert-finetuned-active-lc5-40 \
    --teacher_tokenizer_or_path ./processed/bert-finetuned-active-lc5-40 \
    --temperature 10 \
    --alpha 0.7 \
    --interpolation "linear" \
    --train_file ./data/active/train_active_lc5_40.json

Evaluate the Student Model

After a successful evaluation, the results will be saved in the ./processed/evaluation directory by default.

$ python main.py \
    --evaluate true \
    --use_cuda true \
    --student "qanet" \
    --dev_file ./data/squad/dev-v1.1.json \
    --processed_data_dir ./processed/data \
    --resume ./processed/checkpoints/model_best.pth.tar
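
The abstract reports results in exact match (EM), and the command above evaluates on the SQuAD v1.1 development set. As a reference for interpreting the saved results, here is a minimal sketch of how EM and token-level F1 are conventionally computed for SQuAD v1.1 (lowercasing, then stripping punctuation, articles, and extra whitespace before comparison). The function names are illustrative, and the repository's own evaluation code may differ in detail.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    """True when the normalized strings are identical."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def f1_score(prediction, ground_truth):
    """Token-level F1 between normalized prediction and ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction, ground_truths):
    """SQuAD lists several reference answers; keep the best score."""
    return max(metric(prediction, gt) for gt in ground_truths)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower!", "eiffel tower"))   # True
    print(f1_score("in Paris France", "Paris"))               # 0.5
```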