
CAZy's little helper


A model that predicts the compatibility of scientific literature with the CAZy database. The end goal is to assist biocurators by giving each article a confidence score that assesses how well it meets the criteria required for integration into the database.

Analysis pipeline

  • List of PMIDs/DOIs → PMCIDs (if available, otherwise keep the PMID) → Full text (if available, scraped with Biblio; otherwise only the abstract, fetched via eutils) → Preprocessing → Representation → Classification → % Confidence score.
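The stages above can be sketched as a chain of plain functions. Everything below is a hypothetical illustration: none of these function names come from the repository, and the bodies are trivial stand-ins for the real retrieval, preprocessing, and classification code.

```python
def fetch_text(article_id: str) -> str:
    """Stand-in for full-text retrieval (Biblio) or the abstract fallback (eutils)."""
    return "Characterization of a glycoside hydrolase family"  # placeholder text

def preprocess(text: str) -> str:
    """Trivial stand-in for the real preprocessing step."""
    return text.lower().strip()

def represent(text: str) -> list[str]:
    """Token list as a stand-in for TF-IDF feature vectors."""
    return text.split()

def classify(features: list[str]) -> float:
    """Dummy classifier returning a confidence score in [0, 1]."""
    return 0.9 if "hydrolase" in features else 0.1

def confidence_score(article_id: str) -> float:
    """Run the whole pipeline for one article ID."""
    return classify(represent(preprocess(fetch_text(article_id))))

print(confidence_score("PMC1234567"))
```

In the real pipeline each stage is considerably more involved, but the data flow is the same: one article ID in, one confidence score out.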

Installation

git clone https://github.com/dabane-ghassan/cazy-little-helper.git
cd cazy-little-helper
pip install -r requirements.txt
cd src

Getting started

Predict

usage: python3 predict.py [-h] -i INPUT_PATH [-p ID_POS] [-b BIBLIO_ADD] [-m MODEL]

Arguments:
  -h, --help            show this help message and exit
  -i INPUT_PATH, --input_path INPUT_PATH
                        [REQUIRED] The input data file path, a .csv file with a column of article IDs.
  -p ID_POS, --id_pos ID_POS
                        [OPTIONAL] The index of the ID column in the input file path, default is 0 (first column).
  -b BIBLIO_ADD, --biblio_add BIBLIO_ADD
                        [OPTIONAL] The address of the biblio package on the php server, default is http://10.1.22.212/Biblio
  -m MODEL, --model MODEL
                        [OPTIONAL] The model path to run the predictions; default is CAZy's little helper's pre-trained model (based on Aug 2021 data), '../model/cazy_helper.joblib'

Create Model

usage: python3 create.py [-h] -p OUTPUT_PATH -d DATASET [-b BIBLIO_ADD] [-s VAL_SIZE]

Arguments:
  -h, --help            show this help message and exit
  -p OUTPUT_PATH, --output_path OUTPUT_PATH
                        [REQUIRED] The save path for the new model.
  -d DATASET, --dataset DATASET
                        [REQUIRED] The training dataset, a two column .csv file.
  -b BIBLIO_ADD, --biblio_add BIBLIO_ADD
                        [OPTIONAL] The address of the biblio package on the php server, default is http://10.1.22.212/Biblio
  -s VAL_SIZE, --val_size VAL_SIZE
                        [OPTIONAL] The validation dataset size, default is 0.15
  • Say you're a passionate CAZy researcher who wants to retrain the model on new data to obtain more accurate confidence scores:
mkdir new_model && cd new_model
wget https://raw.githubusercontent.com/dabane-ghassan/cazy-little-helper/main/training/classifier_train.csv

Now you can annotate the training dataset with new PMCIDs (only PMCIDs), adding 1 to the label column if the new article is compatible and 0 otherwise. After this, we can launch the creation of a new model:

python3 create.py -p new_model.joblib -d classifier_train.csv
  • This new model can now be used to make predictions by specifying its path with the -m parameter of the predict CLI.
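Programmatically, appending newly annotated rows to the training set could look like the sketch below. The two-column layout (article ID, binary label) follows the README, but the example IDs are invented and the exact header names inside classifier_train.csv are an assumption.

```python
import csv

# Hypothetical newly annotated articles: 1 = compatible with CAZy criteria, 0 = not.
new_rows = [
    ("PMC8000001", 1),
    ("PMC8000002", 0),
]

# Append to the training file; mode "a" creates it if it doesn't exist yet.
with open("classifier_train.csv", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(new_rows)
```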

Find IDs

  • This last feature is the cherry on top: input any .csv file with a column of article IDs (PMIDs, PMCIDs, or DOIs), preferably a single-column .csv file without a header, and it will be converted to the ID type of your choice.
usage: python3 find.py [-h] -i INPUT_PATH -t ID_TYPE

Arguments:
  -h, --help            show this help message and exit
  -i INPUT_PATH, --input_path INPUT_PATH
                        [REQUIRED] The input ID file path, a .csv file with a column of article IDs.
  -t ID_TYPE, --id_type ID_TYPE
                        [REQUIRED] The type of ID to find, ['PMID', 'PMCID', 'DOI'], uppercase only.
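find.py's internals aren't shown here, but one common way to map between PMIDs, PMCIDs, and DOIs is NCBI's public ID Converter service. The sketch below only builds the request URL (no network call is made); the `tool` and `email` values are illustrative placeholders.

```python
from urllib.parse import urlencode

# Base endpoint of NCBI's PMC ID Converter API.
BASE = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"

def idconv_url(ids, tool="cazy-little-helper", email="you@example.org"):
    """Build a request URL mapping a list of PMIDs/PMCIDs/DOIs to all ID types."""
    params = {
        "ids": ",".join(ids),
        "format": "json",
        "tool": tool,
        "email": email,
    }
    return BASE + "?" + urlencode(params)

print(idconv_url(["PMC3531190", "23193287"]))
```

Fetching that URL returns a JSON record per input ID with its PMID, PMCID, and DOI fields, which is the kind of mapping find.py exposes.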

Under the hood: What is CAZy's little helper?

CAZy's little helper is a TF-IDF/SVM machine learning model: it uses Term Frequency - Inverse Document Frequency (TF-IDF) for text representation and a linear-kernel Support Vector Machine (SVM) for classification.
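The same architecture can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the repository's actual training code: the toy corpus and labels are invented, and the real model is trained on the CAZy dataset and persisted with joblib.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus: 1 = CAZy-compatible, 0 = not.
texts = [
    "characterization of a novel glycoside hydrolase family",
    "crystal structure of a carbohydrate-binding module",
    "clinical trial of a new antihypertensive drug",
    "epidemiology of cardiovascular disease in adults",
]
labels = [1, 1, 0, 0]

# TF-IDF text representation feeding a linear-kernel SVM classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["structure of a glycoside hydrolase enzyme"]))
```

The production model additionally exposes a confidence score per article rather than a bare class label.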

  • Performance:

| Test dataset     | Precision | Recall | F1-score | Support |
|------------------|-----------|--------|----------|---------|
| CAZyDB-          | 0.97      | 0.98   | 0.98     | 1326    |
| CAZyDB+          | 0.81      | 0.74   | 0.77     | 135     |
| Accuracy         |           |        | 0.96     | 1461    |
| Macro average    | 0.89      | 0.86   | 0.88     | 1461    |
| Weighted average | 0.96      | 0.96   | 0.96     | 1461    |
  • Before choosing this particular architecture, a panel of Natural Language Processing (NLP) methods for text classification was tested on the custom-built CAZy dataset: classical text-representation tools such as TF-IDF and word embeddings (Word2Vec), unsupervised topic modeling with LDA (Latent Dirichlet Allocation), and state-of-the-art deep learning approaches such as BERT (Bidirectional Encoder Representations from Transformers). All of these approaches were benchmarked on the validation and test datasets, and their ROC-AUC curves were compared.
Models compared (ROC-AUC curve figures not reproduced here): TF-IDF/SVM, LDA/Random Forest, Word2Vec/SVM, BERT, TF-IDF/Random Forest, Ensemble Classifier*, and TF-IDF/Naive Bayes.

*A soft-voting classifier that relies on two models: LDA/Random Forest and TF-IDF/SVM.

  • Want more information about the dataset and methods? You're more than welcome to take a more extensive look here.

About

This project was part of a 2-month internship at the Architecture et Fonction des Macromolécules Biologiques laboratory (AFMB, Marseille, France), hosted within the Glycogenomics team.

Acknowledgements

First, I would like to thank Dr. Nicolas Terrapon for his patience, precious help, and invaluable supervision, not to mention the opportunity he gave me to work on such an interesting project. In addition, I would like to deeply thank Dr. Philippe Ortet for his precious ideas, wonderful insights, and the guidance and expertise that helped me navigate various complex subjects throughout the project. Last but not least, I would like to thank the whole Glycogenomics team for their appreciable hospitality.

📜 License

MIT Licensed © Ghassan Dabane, 2021.
