(*For the README in Indonesian, refer to this one instead.)
A deep-learning-based Indonesian AQG (Automatic Question Generation) system built using a Google-translated SQuAD v2.0 dataset. Our research paper is available here:
This research mainly uses the OpenNMT library for training and inference.
Finally, this README only covers our best models: the RNN-based and Transformer-based models built with OpenNMT. (The self-implemented models and the GPT-2-based model using HuggingFace are not explained here.)
pip install -r requirements.txt
*These requirements do not cover the HuggingFace library either.
You should download the processed dataset (1.i and 2.i) if you wish to reproduce the models.
Put the downloaded processed dataset in data/processed.
However, I also provide links to the original datasets; put those files in data/raw.
- SQuAD v2.0:
- Processed (translated, augmented with linguistic features): processed train & dev set
- Original SQuAD v2.0: train set, dev set
- TyDiQA GoldPassage:
- Processed (translated, augmented with linguistic features): processed train & dev set
- Indonesian only [this script only runs in a Linux-based terminal]:
python src/data/download_tydiqa_goldpassage_indonesian.py
- Original TyDiQA GoldPassage (all languages)
All models in this research utilize the Indonesian part of fastText's word vectors for 157 languages. This word embedding is converted to GloVe format, as OpenNMT only supports GloVe-formatted word embeddings.
You can download the converted GloVe-formatted fastText word embedding here.
Then put the word embedding in models/word-embedding/ft_to_gl_300_id.vec.
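If you prefer to perform the conversion yourself, a minimal sketch is shown below. A fastText .vec file matches the GloVe text format except for its first line, a "<vocab_size> <dim>" header, so dropping that line is sufficient (the input file name cc.id.300.vec is an assumption, based on fastText's published Indonesian vectors):

# Sketch: convert fastText .vec to GloVe format by stripping the header line.
# The input file name is an assumption (fastText's Indonesian 300-d vectors).
tail -n +2 cc.id.300.vec > models/word-embedding/ft_to_gl_300_id.vec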
All notebooks are stored in the notebook directory, and were mainly used for data and method exploration.
You can ignore these notebooks if you seek to reproduce the models.
This step converts the downloaded processed dataset into .txt files containing paragraphs (input) and questions (target).
We preprocess both the SQuAD and TyDiQA datasets, each in cased and uncased variants.
- SQuAD v2.0
- Cased
python src/preprocess/prepare_data.py \
  --dataset_name=squad_id \
  --train_squad_path=data/processed/train-v2.0-translated_fixed_enhanced.json \
  --dev_squad_path=data/processed/dev-v2.0-translated_fixed_enhanced.json \
  --train_val_split=0.9 \
  --src_max_len=70 \
  --tgt_max_len=20 \
  --seed 42
- Uncased
python src/preprocess/prepare_data.py \
  --dataset_name=squad_id \
  --train_squad_path=data/processed/train-v2.0-translated_fixed_enhanced.json \
  --dev_squad_path=data/processed/dev-v2.0-translated_fixed_enhanced.json \
  --train_val_split=0.9 \
  --src_max_len=70 \
  --tgt_max_len=20 \
  --lower \
  --seed 42
- TyDiQA GoldPassage (in the end we only used the dev set, i.e. the files generated and stored in data/processed/test/tydiqa_id*)
- Cased
python src/preprocess/prepare_data.py \
  --dataset_name=tydiqa_id \
  --train_squad_path=data/processed/tydiqa-goldp-v1.1-train-indonesian_prepared_enhanced.json \
  --dev_squad_path=data/processed/tydiqa-goldp-v1.1-dev-indonesian_prepared_enhanced.json \
  --train_val_split=0.9 \
  --src_max_len=70 \
  --tgt_max_len=20 \
  --seed 42
- Uncased
python src/preprocess/prepare_data.py \
  --dataset_name=tydiqa_id \
  --train_squad_path=data/processed/tydiqa-goldp-v1.1-train-indonesian_prepared_enhanced.json \
  --dev_squad_path=data/processed/tydiqa-goldp-v1.1-dev-indonesian_prepared_enhanced.json \
  --train_val_split=0.9 \
  --src_max_len=70 \
  --tgt_max_len=20 \
  --lower \
  --seed 42
After executing these four scripts, you will have 8 files in each of the data/processed/[train|val|test] directories, named:
[squad|tydiqa]_id_split0.9_[uncased|cased]_[source|target].txt
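As a quick sanity check (a sketch, assuming the default output paths above), each split directory should contain exactly 8 such files: 2 datasets × 2 casings × 2 sides.

# Count the generated files per split; each should report 8.
for split in train val test; do
  ls data/processed/$split/*_id_split0.9_*.txt | wc -l
done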
- Create the directory to store prediction results
mkdir -p reports/txts/onmt
- Then you can find all models' training and evaluation scripts in src/onmt/config. These scripts are not meant to be executed directly; instead, open them with a text editor and copy-paste the commands.
We have prepared the configuration scripts to be as self-explanatory as possible. For the complete OpenNMT preprocess/train/inference parameters, check the original documentation.
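For orientation, the commands in those scripts follow the standard OpenNMT-py workflow. The sketch below only illustrates the shape of that workflow with placeholder paths and hyperparameters; it is not the exact configuration behind the paper's models (see src/onmt/config for those).

# Illustration only: placeholder paths and values, assuming OpenNMT-py 1.x.
onmt_preprocess \
  -train_src data/processed/train/squad_id_split0.9_cased_source.txt \
  -train_tgt data/processed/train/squad_id_split0.9_cased_target.txt \
  -valid_src data/processed/val/squad_id_split0.9_cased_source.txt \
  -valid_tgt data/processed/val/squad_id_split0.9_cased_target.txt \
  -save_data data/onmt/squad_id_cased

onmt_train \
  -data data/onmt/squad_id_cased \
  -save_model models/example_gru \
  -encoder_type brnn \
  -rnn_type GRU \
  -copy_attn \
  -coverage_attn \
  -train_steps 20000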
We keep all model configurations. The configurations that produced the best models reported in our paper are:
Model | Configuration Name |
---|---|
BiGRU-3 | |
Cased | gru_45 |
Cased-Copy | gru_43 |
Cased-Copy-Coverage | gru_33 |
Uncased | gru_41 |
Uncased-Copy | gru_39 |
Uncased-Copy-Coverage | gru_37 |
BiLSTM-3 | |
Cased | lstm_44 |
Cased-Copy | lstm_45 |
Cased-Copy-Coverage | lstm_32 |
Uncased | lstm_40 |
Uncased-Copy | lstm_38 |
Uncased-Copy-Coverage | lstm_36 |
Transformer-3 | |
Cased | transformer_11 |
Cased-Copy | transformer_12 |
Uncased | transformer_14 |
Uncased-Copy | transformer_13 |
You can find all logs in:
- Prediction:
reports/txts/onmt/<configuration_name>*_pred.txt
- Evaluation:
reports/txts/onmt/eval_log*.txt
Currently these models depend heavily on a third-party API from Prosa.ai for POS (Part of Speech) tagging and NE (Named Entity) recognition.
Some scripts are provided to run free-input generation, but as the API is not publicly accessible,
you will not be able to use free-text input to generate questions.
If you do have access, however, you can execute this script:
python src/onmt/run_free_generation.py \
--preprocess_output_path=free_input_001.txt \
--uncased \
--pred_output_path=free_input_001_pred.txt \
--model_path=model/final/gru_037_step_16050.pt \
--beam_size=2
- Ferdiant Joshua Muis (Institut Teknologi Bandung)
- Dr. Eng. Ayu Purwarianti, ST.,MT. (Institut Teknologi Bandung & U-CoE AI-VLB)