An approach to generating questions and answers from reading-comprehension passages, implementing a technique similar to that of Zhou et al. [1], trained with PyTorch.
Project link: https://github.com/gsasikiran/automatic-question-generation
Stanford Question Answering Dataset (SQuAD 2.0)
- The dataset comprises comprehension passages, questions, and answers designed for the question-answering task.
- The following excerpt is taken from the SQuAD 2.0 website.
- Language : Python
- Packages
  - torch : 1.5.0
  - matplotlib : 3.2.1
  - pandas : 1.0.5
  - spacy : 2.2.4
  - numpy : 1.18.5
  - torchtext : 0.3.1
  - nltk : 3.5
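The pinned versions above can be captured in a `requirements.txt` (a sketch based on the list; the repository may organize its dependencies differently):

```text
torch==1.5.0
matplotlib==3.2.1
pandas==1.0.5
spacy==2.2.4
numpy==1.18.5
torchtext==0.3.1
nltk==3.5
```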
For more information on the code and training, see main/README.md.
We implement the following preprocessing steps (a sketch follows the list):
- Case normalization
- Tokenization
- Named entity recognition (NER)
- POS-Tagging
- IOB-Tagging
- Pairing input and output
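A minimal sketch of these steps using spaCy; the function and variable names are illustrative, not the repository's actual API, and the `en_core_web_sm` model is assumed to be installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model, installed separately

def preprocess(sentence):
    """Case-normalize a sentence and extract token-level lexical features."""
    doc = nlp(sentence.lower())           # case normalization
    tokens = [tok.text for tok in doc]    # tokenization
    pos_tags = [tok.tag_ for tok in doc]  # POS tagging
    # IOB-style NER tags: B-/I- prefix plus entity type, "O" for non-entities
    iob_tags = [
        f"{tok.ent_iob_}-{tok.ent_type_}" if tok.ent_type_ else "O"
        for tok in doc
    ]
    return tokens, pos_tags, iob_tags

# Downstream, each preprocessed sentence is paired with its target question
# to form the (input, output) training pairs fed to the model.
tokens, pos, iob = preprocess("Beyonce was born in Houston, Texas.")
```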
We feed the preprocessed input-output pairs through a Seq2Seq architecture with attention. The architecture comprises an encoder and a decoder: the encoder is built on a bi-directional GRU, and the decoder consists of stacked GRU layers. The attention mechanism computes a context vector at every decoder step by taking a softmax over the encoder hidden states most related to the current decoder hidden state. The model architecture is shown below.
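In addition to the figure, the following is a minimal sketch of such an encoder-decoder in PyTorch 1.5. The module names, layer sizes, and dot-product attention scoring are illustrative assumptions and may differ from the repository's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Bi-directional GRU encoder
        self.gru = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)

    def forward(self, src):
        embedded = self.embedding(src)        # (batch, src_len, emb)
        outputs, hidden = self.gru(embedded)  # outputs: (batch, src_len, 2*hid)
        return outputs, hidden

class Attention(nn.Module):
    """Softmax over encoder states most related to the decoder state."""
    def forward(self, dec_hidden, enc_outputs):
        # dec_hidden: (batch, 2*hid); enc_outputs: (batch, src_len, 2*hid)
        scores = torch.bmm(enc_outputs, dec_hidden.unsqueeze(2)).squeeze(2)
        weights = F.softmax(scores, dim=1)    # attention over source positions
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context                        # (batch, 2*hid)

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Stacked (multi-layer) GRU decoder; its hidden state would be
        # initialized from the encoder's final states in the full model.
        self.gru = nn.GRU(emb_dim + 2 * hid_dim, 2 * hid_dim,
                          num_layers=num_layers, batch_first=True)
        self.attention = Attention()
        self.out = nn.Linear(2 * hid_dim, vocab_size)

    def forward(self, token, hidden, enc_outputs):
        embedded = self.embedding(token)                   # (batch, 1, emb)
        context = self.attention(hidden[-1], enc_outputs)  # context per step
        rnn_input = torch.cat([embedded, context.unsqueeze(1)], dim=2)
        output, hidden = self.gru(rnn_input, hidden)
        return self.out(output.squeeze(1)), hidden         # (batch, vocab)
```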
We created Google Forms with 25 random excerpts and generated questions, asking evaluators to rate each question from 1 to 3 as follows:
- 3 : Question is meaningful and relates to the paragraph
- 2 : Question is more or less meaningful and may relate to the paragraph
- 1 : Question does not carry any meaning
The results of the human evaluation are given below: the mean score across evaluators and their inter-rater agreement (Fleiss' kappa [2]), followed by automatic metrics. A sketch of how such scores can be computed follows the list.
- Mean score: 1.750 (greater than that reported in [1])
- Fleiss' kappa score: 0.238 (fair agreement between evaluators)
- METEOR score: 0.1819
- BLEU score: 0.0216
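A hedged sketch of how such scores could be computed: the rating matrix below is toy data rather than the study's ratings, `statsmodels` is an extra dependency not listed above, and `meteor_score` accepts raw strings in nltk 3.5:

```python
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

nltk.download("wordnet")  # METEOR relies on WordNet synonym matching

reference = "when was beyonce born ?"
generated = "when is beyonce born ?"

# Sentence-level BLEU, with smoothing for short hypotheses
bleu = sentence_bleu([reference.split()], generated.split(),
                     smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([reference], generated)

# Fleiss' kappa over the 1-3 ratings: rows are questions, columns are raters
ratings = [[3, 2, 3], [1, 1, 2], [2, 2, 2]]  # toy data
table, _ = aggregate_raters(ratings)         # counts per category per question
kappa = fleiss_kappa(table)
print(bleu, meteor, kappa)
```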
[1] Zhou, Qingyu, et al. "Neural question generation from text: A preliminary study." National CCF Conference on Natural Language Processing and Chinese Computing. Springer, Cham, 2017.
[2] Fleiss, Joseph L. "Measuring nominal scale agreement among many raters." Psychological bulletin 76.5 (1971): 378.