Education Curriculum Alignment

This repository contains the code written for the Master Thesis project "Matching Ontologies in the Education Domain with Semantic Similarity". This work was performed in collaboration with the EdTech company Wizenoze, authored by Adrielli T. Lopes Rego and supervised by Dr. Lisa Beinborn and Dr. Thijs Westerveld.

The thesis addresses the task of matching learning objectives across education curricula. In this digital age of information overload, aligning education curricula can facilitate the sharing of education resources and consequently the curation of digital information according to teaching and learning needs. Cross-curriculum alignment is challenging, as learning objectives are often defined and structured differently across curricula.

I propose a model based on semantic similarity, focusing on finding representations that capture the semantic relations relevant for matching. In addition, the information contained in lower and higher layers of the curricula is used to enrich the representations.

For further details about the project, please refer to the thesis report LopesRego_MA_thesis.pdf.

Folders

/source

The experiments in this project can be reproduced by running the following commands on the command line, in the order in which they appear:

  • data_exploration.py to generate descriptive stats regarding data.

  • main.py --model tf-idf to match learning objectives with TF-IDF encodings.

  • main.py --model ../models/cc.en.300.bin to match learning objectives with Fasttext encodings.

  • main.py --model sentence-transformers/paraphrase-MiniLM-L6-v2 to match learning objectives with pre-trained SBERT encodings.

  • finetune_sbert.py to fine-tune SBERT for matching learning objectives.

  • main.py --model ../models/paraphrase-sbert-label-rankingloss-nodup_7 --random_seed 7 to match learning objectives with fine-tuned SBERT with random seed 7. Also run this with random seeds 42 and 13 to obtain the averaged results.

  • main.py --model ../models/paraphrase-sbert-label-title-rankingloss-nodup_7 --features doc_title --random_seed 7 to match learning objectives with fine-tuned SBERT, random seed 7, and document titles added to candidates. Other document segments to experiment with include doc_title,doc_sum1sent and doc_title,doc_sumnsents.

  • main.py --model ../models/paraphrase-sbert-label-title-rankingloss-nodup_7 --features doc_title,topic,grade,subject --random_seed 7 to match learning objectives with fine-tuned SBERT, random seed 7, and document titles plus higher layers of the curricula. Other combinations of higher layers include topic, grade, subject, topic,grade, topic,subject, grade,subject.

  • train_ltr.py --model_save ../models/ltr_7_grade,subject,topic.txt --features grade,subject,topic --random_seed 7 to train LambdaMART for re-ranking module with random_seed 7. To get averaged results, repeat this step with random seeds 42 and 13.

  • main.py --model ../models/paraphrase-sbert-label-title-rankingloss-nodup_7 --features doc_title --random_seed 7 --method re-rank --rerank_model ../models/ltr_7_grade,subject,topic.txt --features_rerank grade,subject,topic to match learning objectives with re-ranking module (random seed 7) and higher layers topic, grade and subject. Other combinations of higher layers include topic, grade, subject, grade,subject.

  • train_classifier.py to train a classifier that uses DistilBERT as a cross-encoder for the learning objective text (and document titles in the case of the candidates), topic and subject, one-hot encodings with an embedding layer for age, and a neural network with softmax to classify learning objective pairs as either a match or a mismatch.

  • main.py --model ../models/model_weights.pth --features doc_title,topic,grade,subject --random_seed 13 --method classification --uncase True to match learning objectives using the trained classifier. Because the classifier is slow, only a subset (n=50) of the anchor learning objectives is used. Keep in mind that this can take around 1.5 hours for each anchor on a standard GPU.

  • evaluation.py --results ../results/test_paraphrase-sbert-label-title-rankingloss-nodup_7_top5_doc_title,topic,subject.csv --gold ../data/test_query_pairs_7.csv to generate evaluation scores for the specified results file, given the respective gold file. Repeat this for each file generated by each run of main.py, with the appropriate results filepath and gold filepath.

  • error_analysis.py to generate results of error analysis.
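Several of the steps above are repeated for random seeds 7, 42 and 13. The following is a minimal sketch of how those runs could be scripted, assuming the commands are launched with `python` from the /source folder. The helper `build_seed_commands` is hypothetical and not part of this repository, and the assumption that each fine-tuned model is saved as `<prefix>_<seed>` is only inferred from the `_7` example above.

```python
import subprocess

SEEDS = (7, 42, 13)  # the random seeds named in the steps above


def build_seed_commands(model_prefix, seeds=SEEDS, extra_args=()):
    """Return one main.py command (as an argument list) per random seed.

    Assumption (not confirmed by the repository): each fine-tuned model
    is saved as "<model_prefix>_<seed>", following the "_7" example.
    """
    commands = []
    for seed in seeds:
        commands.append([
            "python", "main.py",
            "--model", f"{model_prefix}_{seed}",
            "--random_seed", str(seed),
            *extra_args,
        ])
    return commands


if __name__ == "__main__":
    prefix = "../models/paraphrase-sbert-label-rankingloss-nodup"
    for cmd in build_seed_commands(prefix):
        print(" ".join(cmd))
        # Uncomment to actually launch the run from the /source folder:
        # subprocess.run(cmd, check=True)
```

Extra flags such as `--features doc_title` can be passed through `extra_args`, for example `build_seed_commands(prefix, extra_args=("--features", "doc_title"))`.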

Additionally, utils.py contains general helper functions; generate_search_space.py contains functions to filter the search space; match_learning_objectives contains functions to encode anchors and candidates, compute cosine similarities and generate rankings; and ltr.py contains functions to train the learning-to-rank model and re-rank candidates with the trained model.
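The encode, compare and rank flow described above can be illustrated with a small sketch. This is not the code in this repository: the function name `rank_candidates` and the default `top_k=5` are assumptions for illustration, and the encodings are taken to be plain non-zero NumPy vectors produced by any of the encoders listed earlier.

```python
import numpy as np


def rank_candidates(anchor_vec, candidate_vecs, top_k=5):
    """Rank candidates by cosine similarity to one anchor encoding.

    Returns (indices, scores) of the top_k candidates, best first.
    Assumes all vectors are non-zero.
    """
    anchor = np.asarray(anchor_vec, dtype=float)
    candidates = np.asarray(candidate_vecs, dtype=float)
    # Normalise to unit length so the dot product equals the cosine.
    anchor = anchor / np.linalg.norm(anchor)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = candidates @ anchor
    # Sort by descending similarity and keep the best top_k.
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    indices, scores = rank_candidates(anchor, candidates, top_k=2)
    print(indices, scores)
```

In the toy example, the second candidate points in the same direction as the anchor and is ranked first, followed by the third candidate at a 45 degree angle.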

/data

This folder should contain the data to be used for matching.

/models

  • To use the TF-IDF encoder, a special authorized key is needed. This is because the model was built by the Wizenoze Science Team and is not available for free. If you would like to reproduce this experimental step, please contact adrielli.drica@gmail.com.

  • To use the Fasttext encoder, please download the freely available pre-trained embeddings wiki-news-300d-1M.vec.zip from https://fasttext.cc/docs/en/english-vectors.html.

  • To use the pre-trained SBERT encoder, please install the Hugging Face transformers library, specified in requirements.txt.

  • To fine-tune SBERT for matching learning objectives: finetune_sbert.py

  • To train the learning-to-rank model for the re-ranking module: train_ltr.py --model_save --features --random_seed. --model_save is the filepath where the trained model is saved and --features are the higher layers to be used as features.

  • To train the classifier for matching learning objectives with cross-encoders and a neural head: train_classifier.py

/eval

This folder stores the results of evaluation.py in JSON format.
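Since several experiments are run with seeds 7, 42 and 13 and then averaged, a sketch of that averaging step may be useful. It assumes each JSON file holds a flat mapping from metric name to number; the actual layout written by evaluation.py is not documented here, and the helpers `average_scores` and `load_scores` are hypothetical.

```python
import json
from pathlib import Path
from statistics import mean


def average_scores(score_dicts):
    """Average each metric over several runs.

    Assumes every run is a flat {metric_name: number} mapping; only
    metrics present in all runs are averaged.
    """
    if not score_dicts:
        return {}
    shared = set(score_dicts[0])
    for scores in score_dicts[1:]:
        shared &= set(scores)
    return {name: mean(run[name] for run in score_dicts) for name in sorted(shared)}


def load_scores(paths):
    """Load one JSON score file per path (file layout is an assumption)."""
    return [json.loads(Path(p).read_text()) for p in paths]
```

With three per-seed files at hand, `average_scores(load_scores(paths))` would return one averaged value per metric shared by all three runs.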

/results

This folder stores the results of main.py in CSV format.