This repository contains files and information about step 1 of Kaphta Architecture: Text classification of PubMed abstracts on anticancer activity, using the R language.

ramongsilva/Text-classification-of-pubmed-abstracts-on-polyphenols-anticancer-activity

Text classification of PubMed abstracts on anticancer activity

This repository contains the files and information for step 1 of the Kaphta Architecture: text classification of PubMed abstracts on anticancer activity. Classification is based on an ensemble method. The ensemble was built (trained and tested) from the four machine learning algorithms with the best accuracy. The files are described below:

  • Rotulated-corpus.rar: labeled corpus of PubMed abstracts used to train and test the machine learning algorithms that make up the ensemble. Save this file in the same folder as the training-and-text-classification-gh.R script, which needs it to run.
  • training-and-text-classification-gh.R: R script that builds the ensemble for text classification of PubMed abstracts on anticancer activity.
  • db_total_project.db: SQLite database required by all R scripts of the Kaphta Architecture steps. It contains tables with the entity dictionary, the full corpus of PubMed abstracts, and the PubMed abstracts classified as positive by the text classification. Save this file in the same folder as the training-and-text-classification-gh.R script, which needs it to run.
  • Entities Dictionary: folder with files and details about the entity dictionary created for the Kaphta Architecture.
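
Since the table names inside db_total_project.db are not listed here, a quick way to inspect the database before running the scripts is to list its tables. The sketch below assumes the DBI and RSQLite packages are installed; this README does not state which packages the project scripts use, and list_db_tables is a hypothetical helper name.

```r
# Hypothetical helper: list the tables in an SQLite file.
# Assumes the DBI and RSQLite packages are installed.
list_db_tables <- function(db_path = "db_total_project.db") {
  con <- DBI::dbConnect(RSQLite::SQLite(), db_path)
  # Close the connection even if listing the tables fails
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbListTables(con)
}

# Only runs when the database file is in the working directory
if (file.exists("db_total_project.db") &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  print(list_db_tables())
}
```

The file.exists guard keeps dbConnect from silently creating an empty database when the file is missing from the working directory.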

For more information about this and other steps of the Kaphta Architecture, see the sections of the Kaphta Web Tool, available at https://portal.ifsuldeminas.edu.br/kaphtawebtool/.

Results of Text Classification

  • PubMed-PMID-abstracts-positives.tsv: TSV file with the PubMed abstracts classified as positive by the ensemble-based text classification. Note: these abstracts are also available in the db_total_project.db SQLite file.
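
The TSV file can be loaded with base R. This is a minimal sketch: read_positives is a hypothetical helper, and the presence of a header row is an assumption (pass header = FALSE if the file has none). Quote handling is disabled because abstracts often contain quotation marks and apostrophes that would otherwise break parsing.

```r
# Hypothetical helper: read the tab-separated file of positive abstracts.
# Assumption: the first row is a header; set header = FALSE otherwise.
read_positives <- function(path = "PubMed-PMID-abstracts-positives.tsv",
                           header = TRUE) {
  read.delim(path,
             header = header,
             sep = "\t",
             quote = "",              # treat quotes in abstracts as plain text
             stringsAsFactors = FALSE,
             encoding = "UTF-8")
}

# Only runs when the file is in the working directory
if (file.exists("PubMed-PMID-abstracts-positives.tsv")) {
  positives <- read_positives()
  str(positives)
}
```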

Results of training of machine learning algorithms

The table below presents the measures obtained from training the supervised machine learning algorithms. The ensemble was constructed by combining the four classifiers with the best accuracies: LogitBoost, Random Forest, Support Vector Machine, and Maximum Entropy.

Table with results of the training of machine learning algorithms
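
As an illustration of how the labels of four classifiers can be combined, the sketch below uses a simple majority vote in base R. The combination rule is not stated in this README, so the majority rule, the tie-breaking choice, the label names, and the majority_vote helper are all assumptions; the actual rule is in training-and-text-classification-gh.R. The predictions are made-up toy values, not results from the project.

```r
# Hypothetical helper: combine per-classifier labels by majority vote.
# `predictions` holds one column per classifier and one row per abstract.
majority_vote <- function(predictions,
                          positive = "positive",
                          negative = "negative") {
  # Number of classifiers voting "positive" for each abstract
  votes <- rowSums(predictions == positive)
  # Assumption: a strict majority is required, so ties count as negative
  unname(ifelse(votes > ncol(predictions) / 2, positive, negative))
}

# Toy predictions for four abstracts (illustrative values only)
preds <- data.frame(
  logitboost    = c("positive", "negative", "positive", "negative"),
  random_forest = c("positive", "negative", "negative", "negative"),
  svm           = c("positive", "positive", "negative", "negative"),
  maxent        = c("negative", "positive", "positive", "negative"),
  stringsAsFactors = FALSE
)

ensemble_label <- majority_vote(preds)
print(ensemble_label)
```

With an even number of classifiers, 2-2 ties are possible, so any voting scheme needs an explicit tie rule. Sending ties to the negative class favors precision over recall.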
