This repository contains the files and information for step 1 of the Kaphta Architecture: text classification of PubMed abstracts on anticancer activity. The classification uses an ensemble method. To build (train and test) the ensemble, the four machine learning algorithms with the best accuracy were selected. The files are described below:
- Rotulated-corpus.rar: labeled corpus of PubMed abstracts used to train and test the machine learning algorithms that make up the ensemble. Save this file in the same folder as the training-and-text-classification-gh.R script, which needs it to run.
- training-and-text-classification-gh.R: R script that builds the ensemble for text classification of PubMed abstracts on anticancer activity.
- db_total_project.db: SQLite database required by all R scripts of the Kaphta architecture steps. It contains tables with the entity dictionary, the full corpus of PubMed abstracts, and the PubMed abstracts classified as positive by the text classification. Save this file in the same folder as the training-and-text-classification-gh.R script, which needs it to run.
- Entities Dictionary: folder with files and details about the entity dictionary created for the Kaphta architecture.
- PubMed-PMID-abstracts-positives.tsv: TSV file with the PubMed abstracts classified as positive by the ensemble-based text classification. Note: these abstracts are also available in the db_total_project.db SQLite file.
For more information about this and other steps of the Kaphta Architecture, see the sections of the Kaphta Web Tool available at https://portal.ifsuldeminas.edu.br/kaphtawebtool/.
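Before running the scripts, it can help to confirm that db_total_project.db is where the script expects it and to see which tables it holds. The following is a minimal sketch, assuming the DBI and RSQLite packages are installed and that the database sits in the current working directory; the helper name `list_sqlite_tables` is hypothetical and is not part of the repository scripts.

```r
library(DBI)      # generic database interface
library(RSQLite)  # SQLite driver for DBI

# Assumption: the database was saved in the current working directory.
db_path <- "db_total_project.db"

# Hypothetical helper: return the table names of a SQLite file.
# It stops if the file is missing, because dbConnect() would otherwise
# silently create a new, empty database at that path.
list_sqlite_tables <- function(path) {
  if (!file.exists(path)) {
    stop("Database file not found: ", path)
  }
  con <- dbConnect(RSQLite::SQLite(), path)
  on.exit(dbDisconnect(con))  # always close the connection
  dbListTables(con)
}

# Only query the project database if it is actually present.
if (file.exists(db_path)) {
  print(list_sqlite_tables(db_path))
}
```

The sketch only lists table names and does not assume any particular table or column layout.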
The table below presents the performance measures obtained when training the supervised machine learning algorithms. The ensemble was built by combining the four classifiers with the best accuracy: LogitBoost, Random Forest, Support Vector Machine, and Maximum Entropy.
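To illustrate what combining four classifiers can look like, here is a minimal base R sketch of a vote-based ensemble. It is an assumption made for illustration, not necessarily the rule used in training-and-text-classification-gh.R: the predictions are made up, and the function `vote_ensemble` and its `min_votes` threshold are hypothetical.

```r
# Made-up predictions from the four classifiers for five abstracts.
predictions <- data.frame(
  logitboost    = c("positive", "negative", "positive", "negative", "positive"),
  random_forest = c("positive", "negative", "negative", "negative", "positive"),
  svm           = c("positive", "positive", "negative", "negative", "positive"),
  maxent        = c("negative", "negative", "positive", "negative", "positive"),
  stringsAsFactors = FALSE
)

# Hypothetical combination rule: label an abstract "positive" when at least
# `min_votes` classifiers predict "positive". With four classifiers a plain
# majority can tie at 2-2, so an explicit threshold avoids ambiguity.
vote_ensemble <- function(preds, min_votes = 3) {
  positive_votes <- rowSums(preds == "positive")  # votes per abstract
  ifelse(positive_votes >= min_votes, "positive", "negative")
}

ensemble_label <- vote_ensemble(predictions)
print(ensemble_label)
```

With `min_votes = 3`, only the first and fifth made-up abstracts are labeled positive. Lowering the threshold to 2 trades precision for recall, since a 2-2 split then counts as positive.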