Large-scale automatic feature selection for biomarker discovery in high-dimensional OMICs data

mickaelleclercq/BioDiscML

BioDiscML

Large-scale automatic feature selection for biomarker discovery in high-dimensional OMICs data

Short description

Automates the execution of many machine learning algorithms across various optimization and evaluation procedures to identify the best model and signature

Description

Identifying biomarker signatures in omics molecular profiling data is an important challenge for predicting outcomes in a precision medicine context, such as patient disease susceptibility, diagnosis, prognosis and treatment response. To identify these signatures we present BioDiscML (Biomarker Discovery by Machine Learning), a tool that automates the analysis of complex biological datasets using machine learning methods. From a collection of samples and their associated characteristics, i.e. the biomarkers (e.g. gene expression, protein levels, clinico-pathological data), the goal of BioDiscML is to produce a minimal subset of biomarkers and a model that efficiently predicts a specified outcome. To this end, BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting either a categorical or a continuous outcome from highly unbalanced datasets. Finally, BioDiscML also retrieves correlated biomarkers that are not included in the final model, to help interpret the signature. The software automates all machine learning steps, including data pre-processing, feature selection, model selection, and performance evaluation. https://github.com/mickaelleclercq/BioDiscML/

See also BioDiscViz (https://gitlab.com/SBouirdene/biodiscviz.git), which includes consensus feature search, to visualize your results.

Full manuscript: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6532608/

Requirements

Java 8 (https://www.java.com/en/download/)

Program usage

BioDiscML can be started either with a config file or directly from the command line.

By config file

Before executing BioDiscML, a config file must be created. Use the template to create your own; every option is documented in the config.conf file. Examples are available in the Test_datasets folder at: https://github.com/mickaelleclercq/BioDiscML/tree/master/release/Test_datasets

Train a new model

java -jar biodiscml.jar -config config.conf -train

Quick start example of a config file (a plain text file, e.g. config_myData.conf; see the release/Test_datasets folder). This configuration takes a file (myData.csv) as input and names the project myProjectName. The data are sampled (by default 2/3 for training and 1/3 for testing) before the classification procedure, which predicts the myOutcome class. One best model is selected according to the Matthews Correlation Coefficient (MCC) obtained by repeated holdout on the merged train and test sets (TRAIN_TEST_RH_MCC). config.conf file example:

project=myProjectName
trainFile=myData.csv
sampling=true
doClassification=true
classificationClassName=myOutcome
numberOfBestModels=1
numberOfBestModelsSortingMetric=TRAIN_TEST_RH_MCC
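The key=value layout above is simple enough to generate from a script, which can help when launching many projects. The following is a minimal sketch, assuming the file takes one `key=value` pair per line exactly as in the example; the helper name `render_config` is hypothetical and not part of BioDiscML.

```python
def render_config(settings):
    """Render one key=value pair per line, mirroring the example above."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


# Values copied from the quick start example; adapt them to your data.
quick_start = {
    "project": "myProjectName",
    "trainFile": "myData.csv",
    "sampling": "true",
    "doClassification": "true",
    "classificationClassName": "myOutcome",
    "numberOfBestModels": "1",
    "numberOfBestModelsSortingMetric": "TRAIN_TEST_RH_MCC",
}

config_text = render_config(quick_start)
print(config_text, end="")
```

Writing `config_text` to a file named config.conf gives a file that can be passed to the `-config` option shown earlier.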

Resume an execution

Add -resumeTraining=true to the command:

java -jar biodiscml.jar -config config.conf -train -resumeTraining=true

Choose best model(s)

java -jar biodiscml.jar -config config.conf -bestmodel 

Best model selection can be run once training has completed, after it has been stopped, or while it is still running. This command reads the results file. Best models are selected according to the strategy provided in the config file. You can also choose models manually by opening the results file in a spreadsheet program and sorting or filtering the models by your preferred metrics. Each model has an identifier (modelID) that you can pass to the command. Example:

java -jar biodiscml.jar -config config.conf -bestmodel modelID_1 modelID_2
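The manual selection described above can also be scripted. The sketch below assumes the results file is comma-separated with a header row containing an `ID` column and the metric columns listed under Output files; the sample rows and the helper name `top_model_ids` are invented for illustration, so adjust the delimiter and column names if your file differs.

```python
import csv
import io


def top_model_ids(results_csv, metric, n=2):
    """Return the IDs of the n models with the highest value for `metric`."""
    scored = []
    for row in csv.DictReader(results_csv):
        try:
            scored.append((float(row[metric]), row["ID"]))
        except (KeyError, TypeError, ValueError):
            continue  # skip models with a missing or non-numeric metric
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [model_id for _, model_id in scored[:n]]


# Made-up rows for illustration only; real files contain many more columns.
sample = io.StringIO(
    "ID,Classifier,TRAIN_TEST_RH_MCC\n"
    "modelID_1,bayes.NaiveBayes,0.71\n"
    "modelID_2,trees.RandomForest,0.78\n"
    "modelID_3,functions.Logistic,\n"
)

ids = top_model_ids(sample, "TRAIN_TEST_RH_MCC")
command = ["java", "-jar", "biodiscml.jar", "-config", "config.conf", "-bestmodel", *ids]
print(" ".join(command))
```

Models whose metric cell is empty (for example test-set metrics when no test set exists) are skipped rather than ranked last.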

Predict new data

java -jar biodiscml.jar -config config.conf -predict 

Once the best model has been obtained, you can predict new data or evaluate a blind test set that you put aside before training. The file must have the same format and structure as the training input files, and must contain at least all features of the selected best model signature. Features present in the new data file but absent from the model signature are ignored during prediction. If the column of the class to predict is present, BioDiscML also returns error statistics. config.conf file example:

project=myProjectName
newDataFile=myNewData.csv
doClassification=true
classificationClassName=class
modelFile=myBestModel.model
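Because the new data file must contain every feature of the signature, it can be worth checking this before running `-predict`. This is a minimal sketch, assuming features are stored as columns with their names in a comma-separated header row; the feature names, the file content and the helper `missing_features` are hypothetical.

```python
import csv
import io


def missing_features(new_data_csv, signature):
    """Return signature features that are absent from the new data header."""
    header = next(csv.reader(new_data_csv))
    available = set(header)
    return [name for name in signature if name not in available]


# Hypothetical signature and file content, for illustration only.
signature = ["geneA", "geneB", "age"]
new_data = io.StringIO("sampleID,geneA,geneC,age,class\ns1,2.3,0.4,61,case\n")

absent = missing_features(new_data, signature)
if absent:
    print("Missing from new data:", ", ".join(absent))
```

Extra columns such as `geneC` are not reported, which matches the behaviour described above where unused features are ignored.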

By command line

The parameters of the config file can also be passed directly on the command line. Example:

time java -jar biodiscml.jar -train -project=myProject -excluded=excludedColumn \
  -doClassification=true -classificationClassName=class -trainFile=data.csv \
  -debug=true -bootstrapFolds=10 -loocv=false -cpus=10 -computeBestModel=false \
  -classificationFastWay=true -ccmd=bayes.AveragedNDependenceEstimators.A1DE -F 1 -M 1.0 -W

Note that the -ccmd option must be the last option on the command line when classifier parameters follow it.
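When the command is assembled by a script, the ordering constraint on -ccmd is easy to enforce in one place. The following sketch only uses options that appear in the example above; the function `build_train_command` is a hypothetical helper, not part of BioDiscML.

```python
import shlex


def build_train_command(options, ccmd=None):
    """Assemble the argument list, placing -ccmd and its parameters last."""
    args = ["java", "-jar", "biodiscml.jar", "-train"]
    args += [f"-{key}={value}" for key, value in options.items()]
    if ccmd:
        classifier, *params = shlex.split(ccmd)
        args.append(f"-ccmd={classifier}")
        args += params  # classifier parameters must follow -ccmd
    return args


options = {
    "project": "myProject",
    "doClassification": "true",
    "classificationClassName": "class",
    "trainFile": "data.csv",
}
args = build_train_command(
    options, "bayes.AveragedNDependenceEstimators.A1DE -F 1 -M 1.0 -W"
)
print(shlex.join(args))
```

The resulting list can be passed as is to `subprocess.run(args, check=True)`, which avoids shell quoting issues.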

Output files

Note: {project_name} is set in the config.conf file

  • {project_name}_a.*

A csv file and a copy in arff format (Weka input format) are created here. They contain the merged data of the input files, with some adaptations.

  • {project_name}_b.*

A csv file and a copy in arff format (Weka input format) are also created here. They are produced after feature ranking and are a subset of {project_name}_a.*. For a categorical class, features are ranked by information gain, and features with an information gain < 0.0001 are discarded. For a numerical class, RELIEFF is used. Only the best 1000 features, or those with a score greater than 0.0001, are kept.
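To make the ranking criterion concrete, the sketch below computes the textbook information gain, H(class) - H(class | feature), for a feature that is already discrete, and applies the 0.0001 cut-off quoted above. It is an illustration under those assumptions only: it does not reproduce how BioDiscML or Weka discretize numeric features, and the toy data are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(class) - H(class | feature) for a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: `informative` separates the classes, `noise` does not.
labels = ["case", "case", "control", "control"]
informative = ["high", "high", "low", "low"]
noise = ["a", "b", "a", "b"]

THRESHOLD = 0.0001  # cut-off quoted in the text above
for name, values in [("informative", informative), ("noise", noise)]:
    gain = information_gain(values, labels)
    print(name, round(gain, 4), "kept" if gain >= THRESHOLD else "discarded")
```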

  • {project_name}_c.*results.csv

Results file. Summary of all trained models with their evaluation metrics and selected attributes. Use the -bestmodel command to extract models. The column indexes listed in the selected attributes column refer to the {project_name}_b.*csv file. For each model, various evaluations are performed; they are summarized in this table:

| Header | Description |
| --- | --- |
| ID | Model unique identifier. Can be passed as argument for best model selection |
| Classifier | Machine learning classifier name |
| Options | Classifier hyperparameters options |
| OptimizedValue | Optimized criterion used for feature selection procedure |
| SearchMode | Type of feature selection procedure: Forward Stepwise Selection (F); Backward stepwise selection (B); Forward stepwise selection and Backward stepwise elimination (FB); Backward stepwise selection and Forward stepwise elimination (BF); "top k" features |
| nbrOfFeatures | Number of features in the signature |
| TRAIN_10CV_ACC | 10 fold cross validation Accuracy on train set |
| TRAIN_10CV_AUC | 10 fold cross validation Area Under The Curve on train set |
| TRAIN_10CV_AUPRC | 10 fold cross validation Area Under Precision Recall Curve on train set |
| TRAIN_10CV_SEN | 10 fold cross validation Sensitivity on train set |
| TRAIN_10CV_SPE | 10 fold cross validation Specificity on train set |
| TRAIN_10CV_MCC | 10 fold cross validation Matthews Correlation Coefficient on train set |
| TRAIN_10CV_MAE | 10 fold cross validation Mean Absolute Error on train set |
| TRAIN_10CV_BER | 10 fold cross validation Balanced Error Rate on train set |
| TRAIN_10CV_FPR | 10 fold cross validation False Positive Rate on train set |
| TRAIN_10CV_FNR | 10 fold cross validation False Negative Rate on train set |
| TRAIN_10CV_PPV | 10 fold cross validation Positive Predictive value on train set |
| TRAIN_10CV_FDR | 10 fold cross validation False Discovery Rate on train set |
| TRAIN_10CV_Fscore | 10 fold cross validation F-score on train set |
| TRAIN_10CV_kappa | 10 fold cross validation Kappa on train set |
| TRAIN_matrix | 10 fold cross validation Matrix on train set |
| TRAIN_LOOCV_ACC | Leave-One-Out Cross Validation Accuracy on Train set |
| TRAIN_LOOCV_AUC | Leave-One-Out Cross Validation Area Under The Curve on Train set |
| TRAIN_LOOCV_AUPRC | Leave-One-Out Cross Validation Area Under Precision Recall Curve on Train set |
| TRAIN_LOOCV_SEN | Leave-One-Out Cross Validation Sensitivity on Train set |
| TRAIN_LOOCV_SPE | Leave-One-Out Cross Validation Specificity on Train set |
| TRAIN_LOOCV_MCC | Leave-One-Out Cross Validation Matthews Correlation Coefficient on Train set |
| TRAIN_LOOCV_MAE | Leave-One-Out Cross Validation Mean Absolute Error on Train set |
| TRAIN_LOOCV_BER | Leave-One-Out Cross Validation Balanced Error Rate on Train set |
| TRAIN_RH_ACC | Repeated holdout Accuracy on Train set |
| TRAIN_RH_AUC | Repeated holdout Area Under The Curve on Train set |
| TRAIN_RH_AUPRC | Repeated holdout Area Under Precision Recall Curve on Train set |
| TRAIN_RH_SEN | Repeated holdout Sensitivity on Train set |
| TRAIN_RH_SPE | Repeated holdout Specificity on Train set |
| TRAIN_RH_MCC | Repeated holdout Matthews Correlation Coefficient on Train set |
| TRAIN_RH_MAE | Repeated holdout Mean Absolute Error on Train set |
| TRAIN_RH_BER | Repeated holdout Balanced Error Rate on Train set |
| TRAIN_BS_ACC | Bootstrap Accuracy on Train set |
| TRAIN_BS_AUC | Bootstrap Area Under The Curve on Train set |
| TRAIN_BS_AUPRC | Bootstrap Area Under Precision Recall Curve on Train set |
| TRAIN_BS_SEN | Bootstrap Sensitivity on Train set |
| TRAIN_BS_SPE | Bootstrap Specificity on Train set |
| TRAIN_BS_MCC | Bootstrap Matthews Correlation Coefficient on Train set |
| TRAIN_BS_MAE | Bootstrap Mean Absolute Error on Train set |
| TRAIN_BS_BER | Bootstrap Balanced Error Rate on Train set |
| TRAIN_BS.632+ | Bootstrap .632+ rule |
| TEST_ACC | Evaluation Accuracy on test set |
| TEST_AUC | Evaluation Area Under The Curve on test set |
| TEST_AUPRC | Evaluation Area Under Precision Recall Curve on test set |
| TEST_SEN | Evaluation Sensitivity on test set |
| TEST_SPE | Evaluation Specificity on test set |
| TEST_MCC | Evaluation Matthews Correlation Coefficient on test set |
| TEST_MAE | Evaluation Mean Absolute Error on test set |
| TEST_BER | Evaluation Balanced Error Rate on test set |
| TRAIN_TEST_RH_ACC | Repeated holdout Accuracy on merged Train and Test sets |
| TRAIN_TEST_RH_AUC | Repeated holdout Area Under The Curve on merged Train and Test sets |
| TRAIN_TEST_RH_AUPRC | Repeated holdout Area Under Precision Recall Curve on merged Train and Test sets |
| TRAIN_TEST_RH_SEN | Repeated holdout Sensitivity on merged Train and Test sets |
| TRAIN_TEST_RH_SPE | Repeated holdout Specificity on merged Train and Test sets |
| TRAIN_TEST_RH_MCC | Repeated holdout Matthews Correlation Coefficient on merged Train and Test sets |
| TRAIN_TEST_RH_MAE | Repeated holdout Mean Absolute Error on merged Train and Test sets |
| TRAIN_TEST_RH_BER | Repeated holdout Balanced Error Rate on merged Train and Test sets |
| TRAIN_TEST_BS_ACC | Bootstrap Accuracy on merged Train and Test sets |
| TRAIN_TEST_BS_AUC | Bootstrap Area Under The Curve on merged Train and Test sets |
| TRAIN_TEST_BS_AUPRC | Bootstrap Area Under Precision Recall Curve on merged Train and Test sets |
| TRAIN_TEST_BS_SEN | Bootstrap Sensitivity on merged Train and Test sets |
| TRAIN_TEST_BS_SPE | Bootstrap Specificity on merged Train and Test sets |
| TRAIN_TEST_BS_MCC | Bootstrap Matthews Correlation Coefficient on merged Train and Test sets |
| TRAIN_TEST_BS_MAE | Bootstrap Mean Absolute Error on merged Train and Test sets |
| TRAIN_TEST_BS_BER | Bootstrap Balanced Error Rate on merged Train and Test sets |
| TRAIN_TEST_BS_BER_BS.632+ | Bootstrap .632+ rule on merged Train and Test sets |
| AVG_BER | Average of all calculated Balanced Error Rates |
| STD_BER | Standard deviation of the calculated Balanced Error Rates |
| AVG_MAE | Average of all calculated Mean Absolute Errors |
| STD_MAE | Standard deviation of the calculated Mean Absolute Errors |
| AVG_MCC | Average of all calculated Matthews Correlation Coefficients |
| STD_MCC | Standard deviation of the calculated Matthews Correlation Coefficients |
| AttributeList | Selected features. Use the option -bestmodel to generate a report and get the features' full names |

Note that all columns referring to a test set will be empty if no test set has been generated or provided.
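For readers unfamiliar with the abbreviations in the results file, the sketch below computes a few of them from a binary confusion matrix using their standard definitions. It is illustrative only: it assumes a two-class problem in which both classes are present, and it does not claim to reproduce how BioDiscML handles multi-class outcomes or edge cases.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Standard definitions for a binary confusion matrix.

    Assumes at least one positive and one negative sample.
    """
    sen = tp / (tp + fn)  # sensitivity (true positive rate)
    spe = tn / (tn + fp)  # specificity (true negative rate)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "SEN": sen,
        "SPE": spe,
        "BER": 1 - (sen + spe) / 2,  # balanced error rate
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Invented counts, for illustration only.
for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
    print(f"{name}: {value:.3f}")
```

BER and MCC both account for the two classes separately, which is why they are more informative than accuracy on unbalanced datasets.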

  • {project_name}d.{model_name}{model_hyperparameters}_{feature_search_mode}.*details.txt

Detailed information about the model and its performance, with the full signature and correlated features.

  • {project_name}.{model_name}{model_hyperparameters}_{feature_search_mode}.*features.csv

Features retained by the model, in csv format. If a test set has been generated or provided, one file is generated for the train set (*.train_features.csv) and one for both train and test sets (*all_features.csv).

  • {project_name}.{model_name}{model_hyperparameters}_{feature_search_mode}.*corrFeatures.csv

Features retained by the model together with their correlated features, in csv format. If a test set has been generated or provided, one file is generated for the train set (*.train_corrFeatures.csv) and one for both train and test sets (*all_corrfeatures.csv).
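The idea of attaching correlated features to a signature can be illustrated as follows. This README does not state which correlation measure or threshold BioDiscML uses, so the Pearson coefficient and the 0.9 cut-off below are assumptions chosen for the example, and the feature names and values are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant feature: correlation undefined, treat as none
    return cov / sqrt(var_x * var_y)


def correlated_features(data, signature, threshold=0.9):
    """Map each signature feature to other features with |r| >= threshold."""
    result = {}
    for selected in signature:
        result[selected] = [
            other
            for other in data
            if other not in signature
            and abs(pearson(data[selected], data[other])) >= threshold
        ]
    return result


# Toy expression values (invented): geneB tracks geneA, geneC does not.
data = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [5.0, 1.0, 4.0, 2.0],
}
print(correlated_features(data, ["geneA"]))
```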

  • {project_name}.{model_name}{model_hyperparameters}_{feature_search_mode}.*roc.png

Bootstrap ROC curves (EXPERIMENTAL). Must be enabled in the configuration file. If a test set has been generated or provided, a ROC curve picture is generated for both train and test sets.

  • {project_name}.{model_name}{model_hyperparameters}_{feature_search_mode}.*model

Serialized model, compatible with Weka.
