Releases: SOCR/CBDA
Guide to Compressive Big Data Analytics [CBDA]
CBDA R Package Installation
Version 1.0.0 of the CBDA package can be downloaded and installed with the following command:
install.packages("CBDA",repos = 'https://cran.r-project.org/')
The historical CBDA stats (since its publication on CRAN on April 16, 2018) are shown in the figure below.
A comparison with similar packages for November 2018 is shown below.
The documentation and vignettes, as well as the source and binary files can be found on CRAN.
The binary and source files for the CBDA R package can also be found here. You can install them with the following commands.
# Installation from the Windows binary (recommended for Windows systems)
install.packages("/filepath/CBDA_1.0.0.zip", repos = NULL, type = "win.binary")
# Installation from the source (recommended for Macs and Linux systems)
install.packages("/filepath/CBDA_1.0.0.tar.gz", repos = NULL, type = "source")
The packages necessary to run the CBDA algorithm are installed automatically at installation. However, they can also be installed/attached by launching the CBDA_initialization() function (see the example in the R chunk below). If the parameter install is set to TRUE (by default it is set to FALSE), the CBDA_initialization() function installs (if needed) and attaches all the packages necessary to run the CBDA package v1.0.0. This function can be run before any production run or test. The list of packages can be personalized to include extra packages needed for an expanded SL.library or for other user needs. The output shows a table (see the figure below) with a TRUE or FALSE for each package, so the necessary steps can be taken for any package that returns FALSE.
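For example (a minimal sketch, assuming the default package list):

library(CBDA)

# Install (if needed) and attach all the packages required by CBDA v1.0.0;
# install = FALSE (the default) would only attach already-installed packages
CBDA_initialization(install = TRUE)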
N.B.: to eliminate a warning on Windows due to the "doMC" package not being available there (it was intended for Mac), install "doMC" with the following command: install.packages("doMC", repos = "http://R-Forge.R-project.org")
Memory and storage limits to run CBDA
See Memory limits under different OS for the various limitations on memory while running R. As far as CBDA is concerned, a CBDA object can take up to 200-300 MB. The default limits are acceptable for running the CBDA algorithm.
To query or change the memory allocation (on Windows), the memory.limit() function reports or increases the limit in force on the total allocation.
memory.limit(50000) # raise the limit to 50,000 MB (~50 GB)
However, the space needed to save all the workspaces may be as large as 1-5 GB, depending on the number of subsamples. We are working on a new CBDA implementation that reduces the storage requirements.
Background
This document illustrates the first release of the CBDA protocol as described in the manuscript under review entitled "Controlled Feature Selection and Compressive Big Data Analytics: Applications to Big Biomedical and Health Studies" by Simeone Marino, Jiachen Xu, Yi Zhao, Nina Zhou, Yiwang Zhou, Ivo D. Dinov. University of Michigan, Ann Arbor.
The CBDA protocol has been developed in the R environment. Since a large number of smaller training sets are needed for the convergence of the protocol, we created a workflow that runs on the LONI pipeline environment, a free platform for high performance computing that allows the simultaneous submission of hundreds of independent instances/jobs of the CBDA protocol. The methods, software and protocols developed here are openly shared on our GitHub repository. All software, workflows, and datasets are publicly accessible. The CBDA protocol steps are illustrated in Figure 1.
An introduction to version control and LONI is available here: Version Control and LONI.
A simple and clear introduction to the main algorithm used in the CBDA protocol is available here: SuperLearner_Intro.
Main function of the CBDA package -- CBDA()
The CBDA package comprises several functions. The main function is CBDA(), which has all the input specifications needed to run a set of M subsamples from the Big Data [Xtemp, Ytemp].
We assume that the Big Data is already clean and harmonized.
After the necessary data wrangling (i.e., imputation, normalization and rebalancing),
an ensemble predictor (i.e., SuperLearner) is applied to each subsample for training/learning.
The list of algorithms (or wrappers) available from SuperLearner can be displayed by typing listWrappers() (see the example after the list below). By default, the CBDA package operates with the following algorithms (set with their default values, as described in the SuperLearner package):
- SL.glm: wrapper for generalized linear models via glm()
- SL.xgboost: wrapper that supports Extreme Gradient Boosting, a variant of gradient boosted machines (GBM), within SuperLearner
- SL.glmnet: wrapper for penalized regression using the elastic net
- SL.svm: wrapper for Support Vector Machines
- SL.randomForest: wrapper that supports Random Forest
- SL.bartMachine: wrapper that supports Bayesian additive regression trees via the bartMachine package
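The default library above can be inspected and, in principle, extended with any other SuperLearner wrapper. A minimal sketch (the vector name default_library is illustrative, not a CBDA argument):

library(SuperLearner)

# Display all prediction and screening wrappers bundled with SuperLearner
listWrappers()

# The six default CBDA algorithms correspond to these wrapper names
default_library <- c("SL.glm", "SL.xgboost", "SL.glmnet",
                     "SL.svm", "SL.randomForest", "SL.bartMachine")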
The ensemble predictive model is then validated on a fraction alpha of the Big Data.
Each subsample generates a predictive model that is ranked based on performance metrics
(e.g., Mean Square Error (MSE) and Accuracy) during the first validation step.
CBDA input specifications
The array of input specifications comprises the following labels (default values are shown in square brackets):
- Ytemp: the output variable (vector) in the original Big Data
- Xtemp: the input variable (matrix) in the original Big Data
- label: the label appended to the RData workspaces generated within the CBDA calls [= "CBDA_package_test"]
- alpha: percentage of the Big Data to hold off for the Validation Step [= 0.2]
- Kcol_min: lower bound for the percentage of features/columns sampled (used for the Feature Sampling Range - FSR) [= 5]
- Kcol_max: upper bound for the percentage of features/columns sampled (used for the Feature Sampling Range - FSR) [= 15]
- Nrow_min: lower bound for the percentage of cases/rows sampled (used for the Case Sampling Range - CSR) [= 30]
- Nrow_max: upper bound for the percentage of cases/rows sampled (used for the Case Sampling Range - CSR) [= 50]
- misValperc: percentage of missing values to introduce in the Big Data (used just for testing, to mimic real cases) [= 0]
- M: number of Big Data subsets on which to perform Knockoff Filtering and SuperLearner feature mining [= 3000]
- N_cores: number of cores to use in the parallel implementation [= 1, not multicore enabled]
- top: top predictions to select out of the M; this must be < M [= 1000]
- workspace_directory: directory where the results and workspaces are saved [= getwd()]
- max_covs: top features to include in the Validation Step, where nested models are tested [= 100]
- min_covs: minimum number of top features to include in the initial model for the Validation Step; it must be > 2 [= 5]
Sampling ranges for cases (CSR - Case Sampling Range) and features (FSR - Feature Sampling Range) are then defined as follows:
- FSR = [min_FSR, max_FSR]
- CSR = [min_CSR, max_CSR]
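Putting these specifications together, a minimal CBDA() call might look like the sketch below. The argument names and defaults mirror the list above; Ytemp and Xtemp are assumed to already hold the outcome vector and feature matrix of the Big Data.

library(CBDA)

# A sketch of a CBDA() call using the documented arguments; M and top are
# reduced here only to keep a test run short (defaults: M = 3000, top = 1000)
CBDA_object <- CBDA(Ytemp, Xtemp, label = "CBDA_package_test",
                    alpha = 0.2, Kcol_min = 5, Kcol_max = 15,
                    Nrow_min = 30, Nrow_max = 50, misValperc = 0,
                    M = 9, N_cores = 1, top = 5,
                    workspace_directory = getwd(),
                    max_covs = 8, min_covs = 3)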
Secondary CBDA functions
After all the M subsamples have been generated and each predictive model computed, the main CBDA() function calls 4 more functions to perform, respectively:
- CONSOLIDATION (i.e., CBDA_Consolidation()) and ranking of the results, where the top predictive models (top) are selected and the most frequent features (BEST) are ranked and displayed as well
- VALIDATION (i.e., CBDA_Validation()) on the top-ranked features (i.e., up to max_covs features), where nested ensemble predictive models are generated in a bottom-up fashion
- implementation of STOPPING CRITERIA (i.e., CBDA_Stopping_Criteria()) for the best/optimal ensemble predictive model (to avoid overfitting)
- CLEAN UP (i.e., CBDA_CleanUp()), which deletes unnecessary workspaces generated by the CBDA protocol
At the end of a successful CBDA() call, 2 workspaces are generated:
- CBDA_input_specs_label_light.RData, where most of the results of the M subsamples are saved
- CBDA_input_specs_label_light_VALIDATION.RData, where most of the results of the top-ranked subsamples are saved
Throughout the execution of the CBDA protocol, a workspace with the main CBDA specifications is created and loaded whenever necessary (e.g., "CBDA_label_info.RData").
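For example, results from a run with the default label could be reloaded in a later session; the exact file name below is an assumption based on the naming templates above, with label = "CBDA_package_test" substituted in:

# Hypothetical file name, following the CBDA_input_specs_label_light.RData template
load("CBDA_input_specs_CBDA_package_test_light.RData")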
CBDA object
A CBDA object is created (e.g., CBDA_object <- CBDA()) with the following data:
- LearningTable: information on the top features selected and the corresponding predictive models' performances during the learning/training step
- ...
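For example, the learning results could be inspected after a run (a sketch; list-element access via $ is an assumption):

# Run CBDA with its defaults, then inspect the learning/training results
CBDA_object <- CBDA(Ytemp, Xtemp)
CBDA_object$LearningTable  # top features and the corresponding model performances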
CBDA protocol
(as described in the manuscript "Controlled Feature Selection and Compressive Big Data Analytics: Applications to Big Biomedical and Health Studies")
Some useful information
This is the first release of the CBDA protocol as described in the manuscript "Controlled Feature Selection and Compressive Big Data Analytics: Applications to Big Biomedical and Health Studies".
A simple and clear introduction to the main steps of the CBDA protocol implemented in R is available here: CBDA_intro.
Datasets
The protocol has been applied to 3 different datasets: 2 synthetic ones (Null and Binomial) and a real one (the ADNI dataset). The synthetic datasets all have the same number of cases (i.e., 300) but differ in the number of features. The Binomial datasets have only 10 "true" predictive features, while the Null datasets have none.
The Null datasets are organized as follows:
i) Null dataset: # of features = 100
ii) Null dataset 3: # of features = 900
iii) Null dataset 5: # of features = 1500
The Binomial datasets are organized as follows (a minimal generator sketch appears after the list):
i) Binomial dataset: # of features = 100, with true features (10,20,30,40,50,60,70,80,90,100)
ii) Binomial dataset 3: # of features = 900, with true features (1,100,200,300,400,500,600,700,800,900)
iii) Binomial dataset 5: # of features = 1500, with true features (1,100,200,400,600,800,1000,1200,1400,1500).
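For reference, here is a minimal sketch (not the authors' exact generator) of a Binomial-type synthetic dataset: 300 cases, 100 features, with features 10, 20, ..., 100 driving a binary outcome through a logistic model.

set.seed(1)
n <- 300; p <- 100
Xtemp <- matrix(rnorm(n * p), n, p)  # 300 cases x 100 features
true_feats <- seq(10, 100, by = 10)  # the 10 "true" predictive features
Ytemp <- rbinom(n, 1, plogis(rowSums(Xtemp[, true_feats])))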
The complete workflow is available for the Binomial and ADNI datasets.
CBDA input specifications
The array of input specifications comprises the following labels:
- M: number of instances/jobs for each experiment (set to 9,000 in this study)
- misValperc: % of missing values to introduce in the data (just for testing, to mimic real cases)
- min_FSR: lower bound for the % of features/columns sampled
- max_FSR: upper bound for the % of features/columns sampled
- min_CSR: lower bound for the % of cases/rows sampled
- max_CSR: upper bound for the % of cases/rows sampled
The argument file has as many rows as experiments we want to perform on a single dataset. For each dataset, we run 12 different experiments, combining the fraction of missing values (misValperc) with the FSR and CSR (see Table 2). Each row of the argument file has the following values: [M, misValperc, min_FSR, max_FSR, min_CSR, max_CSR].
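For example, a single experiment with M = 9,000, no missing values, FSR = [5, 15] and CSR = [30, 50] would correspond to the following (illustrative) argument-file row:

# [M, misValperc, min_FSR, max_FSR, min_CSR, max_CSR]
9000 0 5 15 30 50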
Sampling ranges for cases (CSR - Case Sampling Range) and features (FSR - Feature Sampling Range) are then defined as follows (see Table 2 in the manuscript for the options we investigated):
- FSR = [min_FSR, max_FSR]
- CSR = [min_CSR, max_CSR]
N.B.: to eliminate a warning due to the "doMC" package not being available on Windows (it was intended for Mac), install "doMC" with the following command: install.packages("doMC", repos = "http://R-Forge.R-project.org")
If for some reason the packages necessary to run the CBDA algorithm are not installed/loaded, you can run the following code to install them (if you don't have them already) and load them. At the end, it shows a table with all the successful steps.
# ipak: install (if needed) and attach a vector of packages,
# returning TRUE/FALSE for each attach attempt
ipak <- function(pkg){
  # Identify the packages that are not installed yet
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  # Install the missing ones
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE, repos = "http://cran.rstudio.com/")
  # Attach each package; the result is the TRUE/FALSE table shown below
  sapply(pkg, require, character.only = TRUE)
}

packages <- c("ggplot2", "plyr", "colorspace", "grid", "data.table", "VIM", "MASS", "Matrix",
              "lme4", "arm", "foreach", "glmnet", "class", "nnet", "mice", "missForest",
              "calibrate", "nnls", "SuperLearner", "plotrix", "TeachingDemos", "plotmo",
              "earth", "parallel", "splines", "gam", "mi",
              "BayesTree", "e1071", "randomForest", "Hmisc", "dplyr", "Amelia", "bartMachine",
              "knockoff", "caret", "smotefamily", "FNN", "doSNOW", "doParallel", "snow", "xgboost")
ipak(packages)
The resulting table looks like the one below.
As soon as it is launched, the progress of the CBDA algorithm is shown on the screen (see below).
CBDA complete set of results for the synthetic datasets (Null and Binomial)
Due to the large number of experiments and the many different specifications, the complete set of results for the 3 Binomial datasets is shown at the following links:
Binomial
Binomial 3
Binomial 5
The complete set of results for the 3 Null datasets are shown at the following links:
Null
Null 3
Null 5
The robustness test results are available here Robustness.
The complete set of results shown in Figure 2 of the manuscript are here.
The figure below shows a set of heatmaps and histograms summarizing all the experiments in the manuscript under review.
ADNI results are shown here ADNI.