Public repository for my final year project for Integrated Computer Science in Trinity College Dublin

PinPinIre/Final-Year-Project

Final Year Project

My final year project for the Computer Science course in Trinity College Dublin.

The project assesses the performance of three machine learning algorithms for topic modelling and text clustering.

The algorithms being investigated are:

  • LDA (latent Dirichlet allocation)
  • KNN (K-Nearest Neighbours)
  • Word2Vec

To preprocess the PDF files, run the "src/scripts/process_pdfs.sh" script on the corpus to convert them to plain text. The "src/scripts/sort_corpus.sh" script can then be used to sort the files into directories based on their arXiv topics and generate a log file of their distribution.
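As a rough illustration of the sorting step, here is a minimal Python sketch. It is not the contents of "sort_corpus.sh". It assumes a hypothetical file-name convention in which the topic is the prefix before the first underscore (for example "cs_0001.txt"); the `topic_of` helper, the `sort_corpus` function and the "distribution.log" file name are all made up for this example.

```python
import shutil
from collections import Counter
from pathlib import Path


def topic_of(path: Path) -> str:
    """Hypothetical convention: the topic is the file-name prefix before '_'."""
    return path.stem.split("_", 1)[0]


def sort_corpus(src: Path, dest: Path, log_name: str = "distribution.log") -> Counter:
    """Move each .txt file in src into dest/<topic>/ and log per-topic counts."""
    dest.mkdir(parents=True, exist_ok=True)
    counts = Counter()
    for txt in sorted(src.glob("*.txt")):
        topic = topic_of(txt)
        target_dir = dest / topic
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(txt), str(target_dir / txt.name))
        counts[topic] += 1
    # One "topic<TAB>count" line per topic, most common topic first.
    log_lines = [f"{topic}\t{n}" for topic, n in counts.most_common()]
    (dest / log_name).write_text("\n".join(log_lines) + "\n")
    return counts
```

Calling `sort_corpus(Path("corpus_txt"), Path("corpus_sorted"))` would leave one sub-directory per topic plus the log file; both directory names are placeholders.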

Run the "src/run_algorithm.py" Python script to generate the models. Run the "src/run_similarity.py" Python script to query the models.
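The arguments these scripts take are not documented here. To give a feel for what a similarity query involves, the sketch below shows a nearest-neighbour lookup over bag-of-words vectors using cosine similarity. It uses only the standard library and is not the project's implementation; `vectorise`, `cosine` and `nearest` are hypothetical names chosen for the example.

```python
import math
import re
from collections import Counter


def vectorise(text: str) -> Counter:
    """Lower-case bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(query: str, corpus: dict[str, str], k: int = 3) -> list[tuple[str, float]]:
    """Return the k documents most similar to the query, best match first."""
    q = vectorise(query)
    scored = [(name, cosine(q, vectorise(text))) for name, text in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

An LDA or Word2Vec model would swap the raw term counts for learned topic or embedding vectors, but the ranking step would look much the same.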
