Skip to content

This repository is an example for the structure and content that a thesis project may have.

License

Notifications You must be signed in to change notification settings

helenZhangJY/keyword_based_text_classification

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

50 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Keyword-based text classification method

Author: Jingyue Zhang

This repository is part of the 2022 Mater of Text mining thesis "Mapping text to learning objectives: A keyword-based text classification method" by Jingyue Zhang. This thesis project was part of an internship offered by EDIA and was supervised by Hennie van der Vliet, Pia Sommerauer, Mark Breuker and Isa Maks.

Project structure

thesis-project
└───data
│       │   dev_data.csv 
│       │   test_data.csv
│       │   LO_picklefile
│       │   stopwords
└───results
│       │   evaluation.ipynb 
└───code
│   │   crawl_dev_data.ipynb 
│   │   basic_statistics.ipynb
│   │   run_dev_bl.ipynb
│   │   run_dev_w2v.ipynb
│   │   run_test_bl.ipynb
│   │   run_test_w2v.ipynb
└───LICENSE
└───README.md

Thesis report

When the thesis report is officially published on the CLTL website, the link will be shared here afterwards.

Data

In the 'Data' folder you can find two csv files 'dev_data.csv' and 'test_data.csv' with all the gold labels and assigned labels generated by baseline system, word2vec system and EDIA's system.

There are two sub folders: stopwords folder contain English stopwords list and Dutch stopwords list. LO_picklefile folders contain learning objectives pickle files for GCSE Combined Science and SURF Lexical Information Literacy (Vocabulaire Informatievaardigheid).

Code

In this folder you can find 6 jupyter notebook files, in each file you can find the code and outcomes of the code.

  • cwarl_dev_dataset.ipynb: generate development dataset.
  • basic_statistics.ipynb: generate basic statistics on development data and test data.
  • run_dev_bl.ipynb: clean developemnt data, extract keywords from development data, generate labels based on baseline system.
  • run_dev_w2v.ipynb: clean developemnt data, extract keywords from development data, generate labels based on word2vec system.
  • run_test_bl.ipynb: clean test data, extract keywords from test data, generate labels based on baseline system.
  • run_tsst_w2v.ipynb: clean test data, extract keywords from test data, generate labels based on word2vec system.

Resources

Word Embeddings: Due to the large size, the open-source pre-trained Englsih and Dutch word embeddings were not uploaded here. However, to be able to run the word2vec system, you need those word embeddings. These can be downloaded from the following page: http://vectors.nlpl.eu/repository/

References

Keywords Extraction Using TF-IDF Method https://www.kaggle.com/code/rowhitswami/keywords-extraction-using-tf-idf-method/notebook

Cosine Similarity between two sets of vectors? https://math.stackexchange.com/questions/1122423/cosine-similarity-between-two-sets-of-vectors

About

This repository is an example for the structure and content that a thesis project may have.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 100.0%