This repository contains the code written for the Master Thesis project "Matching Ontologies in the Education Domain with Semantic Similarity". This work was performed in collaboration with the EdTech company Wizenoze, authored by Adrielli T. Lopes Rego and supervised by Dr. Lisa Beinborn and Dr. Thijs Westerveld.
The thesis addresses the task of matching learning objectives across education curricula. In this digital age of information overload, aligning education curricula can facilitate the sharing of education resources and consequently the curation of digital information according to teaching and learning needs. Cross-curriculum alignment is challenging, as learning objectives are often defined and structured differently across curricula.
I propose a model based on semantic similarity, with a focus on finding representations that can capture the semantic relations relevant for matching. In addition, the information contained in lower and higher layers of the curricula is used to enrich the representations.
For further details about the project, please refer to the thesis report `LopesRego_MA_thesis.pdf`.
The experiments run in this project can be reproduced by entering the following commands on the command line, in the order in which they appear:

- `data_exploration.py` to generate descriptive statistics about the data.
- `main.py --model tf-idf` to match learning objectives with TF-IDF encodings.
- `main.py --model ../models/cc.en.300.bin` to match learning objectives with Fasttext encodings.
- `main.py --model sentence-transformers/paraphrase-MiniLM-L6-v2` to match learning objectives with pre-trained SBERT encodings.
- `finetune_sbert.py` to fine-tune SBERT for matching learning objectives.
- `main.py --model ../models/paraphrase-sbert-label-rankingloss-nodup_7 --random_seed 7` to match learning objectives with fine-tuned SBERT and random seed 7. Also run this with random seeds 42 and 13 to obtain the averaged results.
- `main.py --model ../models/paraphrase-sbert-label-title-rankingloss-nodup_7 --features doc_title --random_seed 7` to match learning objectives with fine-tuned SBERT, random seed 7, and document titles added to the candidates. Other document segments to experiment with include `doc_title,doc_sum1sent` and `doc_title,doc_sumnsents`.
- `main.py --model ../models/paraphrase-sbert-label-title-rankingloss-nodup_7 --features doc_title,topic,grade,subject --random_seed 7` to match learning objectives with fine-tuned SBERT, random seed 7, and document titles plus higher layers of the curricula. Other combinations of higher layers include `topic`, `grade`, `subject`, `topic,grade`, `topic,subject` and `grade,subject`.
- `train_ltr.py --model_save ../models/ltr_7_grade,subject,topic.txt --features grade,subject,topic --random_seed 7` to train LambdaMART for the re-ranking module with random seed 7. To get averaged results, repeat this step with random seeds 42 and 13.
- `main.py --model ../models/paraphrase-sbert-label-title-rankingloss-nodup_7 --features doc_title --random_seed 7 --method re-rank --rerank_model ../models/ltr_7_grade,subject,topic.txt --features_rerank grade,subject,topic` to match learning objectives with the re-ranking module (random seed 7) and the higher layers topic, grade and subject. Other combinations of higher layers include `topic`, `grade`, `subject` and `grade,subject`.
- `train_classifier.py` to train a classifier that uses DistilBERT as a cross-encoder for the learning objective text (and document titles in the case of the candidates), topic and subject, one-hot encodings with an embedding layer for age, and a neural network head with softmax that classifies learning objective pairs as either a match or a mismatch (a sketch of this architecture follows the list).
- `main.py --model ../models/model_weights.pth --features doc_title,topic,grade,subject --random_seed 13 --method classification --uncase True` to match learning objectives using the trained classifier. This takes only a subset (n=50) of the anchor learning objectives, due to its low speed. Keep in mind that this can take around 1.5 hours per anchor on a standard GPU.
- `evaluation.py --results ../results/test_paraphrase-sbert-label-title-rankingloss-nodup_7_top5_doc_title,topic,subject.csv --gold ../data/test_query_pairs_7.csv` to generate evaluation scores for the specified results file, given the respective gold file. Repeat this for each file generated by each run of `main.py`, with the appropriate results filepath and gold filepath (a sketch of typical ranking metrics also follows the list).
- `error_analysis.py` to generate the results of the error analysis.
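As a rough illustration of the classifier architecture described above, the sketch below combines a DistilBERT cross-encoder with an age embedding and a softmax head. The hidden sizes, the number of age buckets and the example inputs are assumptions; the actual model is defined in `train_classifier.py`.

```python
# A minimal sketch of a DistilBERT cross-encoder with an age embedding and a
# softmax head. Hidden sizes, the number of age buckets and the feature
# handling are assumptions; the actual model lives in train_classifier.py.
# Topic and subject strings could simply be appended to the text inputs.
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizer

class PairClassifier(nn.Module):
    def __init__(self, num_ages: int = 20, num_classes: int = 2):
        super().__init__()
        self.encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        # An index embedding is equivalent to multiplying a one-hot age vector
        # by an embedding matrix.
        self.age_embedding = nn.Embedding(num_ages, 32)
        self.head = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size + 2 * 32, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, input_ids, attention_mask, anchor_age, candidate_age):
        # Cross-encode the concatenated anchor/candidate text ("[CLS] a [SEP] b").
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = hidden.last_hidden_state[:, 0]  # [CLS]-position representation
        ages = torch.cat([self.age_embedding(anchor_age),
                          self.age_embedding(candidate_age)], dim=-1)
        logits = self.head(torch.cat([cls, ages], dim=-1))
        # Softmax gives match/mismatch probabilities; training would typically
        # apply cross-entropy to the raw logits instead.
        return torch.softmax(logits, dim=-1)

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
batch = tokenizer("label the mitochondrion", "identify cell organelles",
                  return_tensors="pt", padding=True, truncation=True)
model = PairClassifier()
probs = model(batch["input_ids"], batch["attention_mask"],
              torch.tensor([12]), torch.tensor([13]))
```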
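The actual evaluation scores are defined in `evaluation.py`. As an illustration of the kind of ranking metrics typically used for this top-5 matching setting, here is a minimal Recall@k / MRR sketch; the input structure (ranked candidate ids per anchor against a gold set) is hypothetical.

```python
# A minimal sketch of typical ranking metrics for top-k matching. The real
# metrics live in evaluation.py; the input structure here is hypothetical:
# for each anchor, a ranked list of predicted candidate ids and a gold set.
def recall_at_k(ranked: list[str], gold: set[str], k: int = 5) -> float:
    """Fraction of gold matches that appear in the top-k predictions."""
    return len(set(ranked[:k]) & gold) / len(gold) if gold else 0.0

def reciprocal_rank(ranked: list[str], gold: set[str]) -> float:
    """1 / rank of the first correct match, or 0 if none is retrieved."""
    for i, cand in enumerate(ranked, start=1):
        if cand in gold:
            return 1.0 / i
    return 0.0

predictions = {"anchor_1": ["cand_b", "cand_a", "cand_c"]}
gold = {"anchor_1": {"cand_a"}}
r5 = sum(recall_at_k(predictions[a], gold[a]) for a in gold) / len(gold)
mrr = sum(reciprocal_rank(predictions[a], gold[a]) for a in gold) / len(gold)
print(f"Recall@5 = {r5:.2f}, MRR = {mrr:.2f}")
```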
Additionally, `utils.py` contains general helper functions; `generate_search_space.py` contains functions to filter the search space; `match_learning_objectives` contains functions to encode the anchor and candidates, compute cosine similarities and generate the ranking (a sketch of this step follows below); and `ltr.py` contains functions to train the learning-to-rank model and re-rank candidates with the trained model.
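A minimal sketch of that encode-and-rank step, assuming the pre-trained SBERT encoder from the commands above; the variable names and the top-5 cut-off are illustrative, not the exact code in `match_learning_objectives`.

```python
# A minimal sketch of the encode-and-rank step: embed the anchor and the
# candidates, score them by cosine similarity and keep the top-5 candidates.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")

anchor = "Describe the structure of a plant cell."
candidates = [
    "Label the parts of a plant cell.",
    "Explain photosynthesis in leaves.",
    "Compare plant and animal cells.",
]

anchor_emb = model.encode(anchor, convert_to_tensor=True)
candidate_embs = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the anchor and every candidate, highest first.
scores = util.cos_sim(anchor_emb, candidate_embs)[0]
ranking = sorted(zip(candidates, scores.tolist()), key=lambda p: p[1], reverse=True)
for text, score in ranking[:5]:
    print(f"{score:.3f}  {text}")
```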
This folder should contain the data to be used for matching.
- To use the TF-IDF encoder, a special authorized key is needed, because the model was built by the Wizenoze Science Team and is not freely available. If you would like to reproduce this experimental step, please contact adrielli.drica@gmail.com.
- To use the Fasttext encoder, please download the freely available pre-trained embeddings `wiki-news-300d-1M.vec.zip` from https://fasttext.cc/docs/en/english-vectors.html (see the loading sketch after this list).
- To use the pre-trained SBERT encoder, please install the huggingface transformers library, as specified in `requirements.txt`.
- To fine-tune SBERT for matching learning objectives, run `finetune_sbert.py` (see the fine-tuning sketch after this list).
- To train the learning-to-rank model for the re-ranking module, run `train_ltr.py --model_save --features --random_seed`. `--model_save` is the filepath where the trained model is saved and `--features` are the higher layers to be used as features (see the LambdaMART sketch after this list).
- To train the classifier for matching learning objectives with a cross-encoder and a neural head, run `train_classifier.py`.
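For the Fasttext step, a minimal loading sketch follows. That `main.py` reads the `.bin` model via the `fasttext` library is an assumption; the downloaded `.vec` file can alternatively be read with gensim's `KeyedVectors.load_word2vec_format`.

```python
# A minimal sketch of encoding a learning objective with Fasttext vectors.
# Loading the .bin model via the fasttext library is an assumption about how
# main.py consumes ../models/cc.en.300.bin.
import fasttext

model = fasttext.load_model("../models/cc.en.300.bin")
vector = model.get_sentence_vector("describe the water cycle")  # 300-d embedding
print(vector.shape)
```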
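For the fine-tuning step, the model names above (`...rankingloss...`) indicate a ranking loss. The sketch below uses sentence-transformers' `MultipleNegativesRankingLoss` as one such loss; the exact loss, training data and hyperparameters in `finetune_sbert.py` may differ.

```python
# A minimal sketch of fine-tuning SBERT on matched learning objective pairs
# with a ranking loss. The choice of MultipleNegativesRankingLoss, the data
# and the hyperparameters are assumptions; the real setup is in finetune_sbert.py.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")

# Each example pairs an anchor learning objective with a matching candidate.
train_examples = [
    InputExample(texts=["label the parts of a cell", "identify cell organelles"]),
    InputExample(texts=["explain the water cycle", "describe evaporation and rain"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)
model.save("../models/paraphrase-sbert-finetuned-example")
```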
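For the learning-to-rank step, LambdaMART is commonly implemented with LightGBM's `lambdarank` objective, and the `.txt` model files above match LightGBM's text model format; that `train_ltr.py` actually uses LightGBM is nevertheless an assumption. A toy sketch:

```python
# A minimal sketch of training a LambdaMART re-ranker with LightGBM's
# lambdarank objective. The features and group sizes here are toy assumptions;
# in this project the features would be the higher curriculum layers.
import numpy as np
import lightgbm as lgb

# Toy feature matrix: one row per anchor-candidate pair, with binary
# relevance labels and one group per anchor.
X = np.random.rand(8, 3)
y = np.array([1, 0, 0, 0, 1, 0, 0, 0])
group = [4, 4]  # two anchors with four candidates each

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50)
ranker.fit(X, y, group=group)
ranker.booster_.save_model("../models/ltr_example.txt")

# Re-rank a new candidate list for one anchor by predicted relevance.
scores = ranker.predict(np.random.rand(4, 3))
print(np.argsort(scores)[::-1])
```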
This folder stores the results of `evaluation.py` in JSON format.
This folder stores the results of `main.py` in CSV format.