- This is the code for the MLRC2020 challenge w.r.t. the ACL 2020 paper Improving Multi-hop Question Answering over Knowledge Graphs using Knowledge Base Embeddings[1]
- The code is built upon commit 5d8fdbd4 of [1].
- Minor modifications have been made to 5d8fdbd4 in order to perform the ablation study. For any query relating to the original code[1], please contact Apoorv.
- Knowledge Graph Embedding model
- TuckER
- Tested on {MetaQA_full, MetaQA_half} datasets
- Question embedding models
- ALBERT
- XLNet
- Longformer
- SentenceBERT (SentenceTransformer)
- Tested on {fbwq_full, fbwq_half} datasets
- Python >= 3.7.5, pip
- zip, unzip
- Docker (Recommended)
- PyTorch version 1.3.0a0+24ae9b5. For more info, visit here.
- Docker Image: Cuda-Python[2] can be used. Use the `runtime` tag.
docker run -itd --rm --runtime=nvidia -v /raid/kgdnn/:/raid/kgdnn/ --name embedkgqa__4567 -e NVIDIA_VISIBLE_DEVICES=4,5,6,7 -p 7777:7777 qts8n/cuda-python:runtime
- Alternatively, Docker Image: Embed_KGQA[3] can be used as well. It is built upon [2] and contains all the packages for conducting the experiments.
  - Use the `env` tag for the image without models.
  - Use the `env-models` tag for the image with models.
docker run -itd --rm --runtime=nvidia -v /raid/kgdnn/:/raid/kgdnn/ --name embedkgqa__4567 -e NVIDIA_VISIBLE_DEVICES=4,5,6,7 -p 7777:7777 jishnup/embed_kgqa:env
- All the required packages and models (from the extended study with better performance) are readily available in [3].
- Model location within the docker container: `/raid/mlrc2020models/`
  - `/raid/mlrc2020models/embeddings/` contains the KG embedding models.
  - `/raid/mlrc2020models/qa_models/` contains the QA models.
- The experiments have been done using [2]. The package versions in requirements.txt have been set accordingly. They may differ from those in [1].
- The `KGQA/LSTM` and `KGQA/RoBERTa` directory names have not been changed, to avoid unnecessary confusion w.r.t. the original codebase[1].
- `fbwq_full` and `fbwq_full_new` are the same dataset, but both must exist independently because:
  - The pretrained `ComplEx` model uses `fbwq_full_new` as the dataset name.
  - The trained `SimplE` model uses `fbwq_full` as the dataset name.
- No `fbwq_full_new` dataset was found in the data shared by the author[1], so we went ahead with this setting.
- Also, pretrained qa_models were absent from the shared data. The reproduction results are based on the training scheme used by us.
- For training QA datasets, use `batch_size >= 2`.
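The two dataset-name notes and the batch-size note above can be captured in a few lines of Python. This is only an illustrative sketch: `KGE_DATASET_NAME`, `resolve_dataset_name`, and `check_batch_size` are hypothetical names that do not exist in the repository; only the dataset names and the `batch_size >= 2` constraint come from the notes.

```python
# Hypothetical helpers; only the dataset names and the batch-size
# constraint are taken from the notes above.
KGE_DATASET_NAME = {
    "ComplEx": "fbwq_full_new",  # pretrained ComplEx model
    "SimplE": "fbwq_full",       # SimplE model trained for this study
}


def resolve_dataset_name(kge_model):
    """Return the dataset name that the given KGE model expects."""
    try:
        return KGE_DATASET_NAME[kge_model]
    except KeyError:
        raise ValueError(f"unknown KGE model: {kge_model!r}") from None


def check_batch_size(batch_size):
    """Reject batch sizes below 2 before a QA training run is started."""
    if batch_size < 2:
        raise ValueError("use batch_size >= 2 when training on the QA datasets")
    return batch_size
```

Calling `resolve_dataset_name("SimplE")` yields `fbwq_full`, and `check_batch_size(1)` raises a `ValueError` before any training starts.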
# Clone the repo
git clone https://github.com/jishnujayakumar/MLRC2020-EmbedKGQA && cd "$_"
# Set a new env variable called EMBED_KGQA_DIR with MLRC2020-EmbedKGQA/ directory's absolute path as value
# If using bash shell, run (double quotes expand $(pwd) now, so the repo path itself is stored)
echo "export EMBED_KGQA_DIR=\"$(pwd)\"" >> ~/.bash_profile && source ~/.bash_profile
# Change script permissions
chmod -R 700 scripts/
# Initial setup
./scripts/initial_setup.sh
# Download and unzip the data and pretrained_models from the original EmbedKGQA paper
./scripts/download_artifacts.sh
# Install LibKGE
./scripts/install_libkge.sh
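Once the setup steps above have run, it can be handy to confirm from Python that `EMBED_KGQA_DIR` is set before launching anything. The sketch below is an assumption-laden convenience, not part of the repository: `repo_dirs` and its dictionary keys are hypothetical, while the `KGQA/LSTM`, `KGQA/RoBERTa`, and `scripts` directory names are the ones used in this README.

```python
import os
from pathlib import Path


def repo_dirs(env=os.environ):
    """Resolve the directories used in this README from EMBED_KGQA_DIR.

    The function name and the dictionary keys are illustrative only.
    """
    root = env.get("EMBED_KGQA_DIR")
    if not root:
        raise RuntimeError("EMBED_KGQA_DIR is not set; run the setup steps first")
    root = Path(root)
    return {
        "root": root,
        "lstm": root / "KGQA" / "LSTM",        # MetaQA training code
        "roberta": root / "KGQA" / "RoBERTa",  # WebQSP training code
        "scripts": root / "scripts",
    }
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.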
- Steps to train KG embeddings.
Hyperparameters in the following commands are set w.r.t. [1].
# Method: 1
cd "$EMBED_KGQA_DIR/KGQA/LSTM"
# --gpu: GPU ID, --hops: number of hops, --model: KGE model, --use_cuda: enable CUDA
# (comments cannot follow a trailing backslash, so they are listed here instead)
python main.py --mode train \
--nb_epochs 100 \
--relation_dim 200 \
--hidden_dim 256 \
--gpu 0 \
--freeze 0 \
--batch_size 64 \
--validate_every 4 \
--hops <1/2/3> \
--lr 0.0005 \
--entdrop 0.1 \
--reldrop 0.2 \
--scoredrop 0.2 \
--decay 1.0 \
--model <ComplEx/TuckER> \
--patience 10 \
--ls 0.0 \
--use_cuda True \
--kg_type <half/full>
# Method: 2
# Modify the hyperparameters in the script file w.r.t. your use case
$EMBED_KGQA_DIR/scripts/train_metaQA.sh \
<ComplEx/TuckER> \
<half/full> \
<1/2/3> \
<batch_size> \
<gpu_id> \
<relation_dim>
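When launching many MetaQA runs, the positional arguments of `train_metaQA.sh` are easy to mix up. The following sketch builds the argument list in the order shown above and rejects values outside the listed choices. `metaqa_script_argv` is a hypothetical helper, not part of the repository, and it assumes the script takes exactly these six positional arguments in this order.

```python
import os

MODELS = ("ComplEx", "TuckER")
KG_TYPES = ("half", "full")
HOPS = (1, 2, 3)


def metaqa_script_argv(model, kg_type, hops, batch_size, gpu_id,
                       relation_dim, repo_dir=None):
    """Build the argv list for scripts/train_metaQA.sh (illustrative helper)."""
    if model not in MODELS:
        raise ValueError(f"model must be one of {MODELS}")
    if kg_type not in KG_TYPES:
        raise ValueError(f"kg_type must be one of {KG_TYPES}")
    if hops not in HOPS:
        raise ValueError(f"hops must be one of {HOPS}")
    # Fall back to the environment variable defined during setup.
    repo_dir = repo_dir or os.environ["EMBED_KGQA_DIR"]
    script = os.path.join(repo_dir, "scripts", "train_metaQA.sh")
    return [script, model, kg_type, str(hops), str(batch_size),
            str(gpu_id), str(relation_dim)]
```

The returned list can be passed to `subprocess.run(argv, check=True)`, which avoids shell quoting issues.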
# Method: 1
cd $EMBED_KGQA_DIR/KGQA/RoBERTa;
python main.py --mode train \
--relation_dim 200 \
--que_embedding_model RoBERTa \
--do_batch_norm 0 \
--gpu 0 \
--freeze 1 \
--batch_size 16 \
--validate_every 10 \
--hops webqsp_half \
--lr 0.00002 \
--entdrop 0.0 \
--reldrop 0.0 \
--scoredrop 0.0 \
--decay 1.0 \
--model ComplEx \
--patience 20 \
--ls 0.0 \
--l3_reg 0.001 \
--nb_epochs 200 \
--outfile delete
# Method: 2
# Modify the hyperparameters in the script file w.r.t. your use case
$EMBED_KGQA_DIR/scripts/train_webqsp.sh \
<ComplEx/SimplE> \
<RoBERTa/ALBERT/XLNet/Longformer/SentenceTransformer> \
<half/full> \
<batch_size> \
<gpu_id> \
<relation_dim>
Set the `--mode` parameter to `test` and keep the other hyperparameters the same as those used in training.
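One way to guarantee that a test run uses exactly the training hyperparameters is to derive the test arguments from the training arguments programmatically. The sketch below does this for an argument list; `to_test_args` is a hypothetical helper, and it assumes `--mode` is passed as a separate token followed by its value, as in the commands above.

```python
def to_test_args(train_args):
    """Return a copy of the training arguments with --mode switched to test."""
    test_args = list(train_args)     # leave the caller's list untouched
    i = test_args.index("--mode")    # raises ValueError if --mode is missing
    if i + 1 >= len(test_args):
        raise ValueError("--mode has no value")
    test_args[i + 1] = "test"
    return test_args


train_args = ["--mode", "train", "--hops", "1", "--model", "ComplEx"]
test_args = to_test_args(train_args)
```

Here `test_args` differs from `train_args` only in the value that follows `--mode`.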
- Details about data and pretrained weights.
- Details about dataset creation.
- Presentation for [1] by Apoorv.
Please cite the following if you incorporate our work.
@article{P:2021,
author = {P, Jishnu Jaykumar and Sardana, Ashish},
title = {{[Re] Improving Multi-hop Question Answering over Knowledge Graphs using Knowledge Base Embeddings}},
journal = {ReScience C},
year = {2021},
month = may,
volume = {7},
number = {2},
pages = {{#15}},
doi = {10.5281/zenodo.4834942},
url = {https://zenodo.org/record/4834942/files/article.pdf},
code_url = {https://github.com/jishnujayakumar/MLRC2020-EmbedKGQA},
code_doi = {},
code_swh = {swh:1:dir:c95bc4fec7023c258c7190975279b5baf6ef6725},
data_url = {},
data_doi = {},
review_url = {https://openreview.net/forum?id=VFAwCMdWY7},
type = {Replication},
language = {Python},
domain = {ML Reproducibility Challenge 2020},
keywords = {knowledge graph, embeddings, multi-hop, question-answering, deep learning}
}
The following three options are available for clarifications, comments, or suggestions:
- Join the discussion forum.
- Create an issue.
- Contact Jishnu or Ashish.