The required packages are installed into a new conda environment (including both R and Python dependencies) with the following command:

```bash
conda env create -f requirements_conda.yml
```
⚠️ Using `mamba` is faster and more stable for package installation.
The missing R packages are listed in the `requirements_r.rda` file and can be installed from an R session with the following commands:

```r
load("requirements_r.rda")
for (count in 1:length(installedpackages)) {
  install.packages(installedpackages[count])
}
```
⚠️ For `reticulate`, if asked to create a default Python virtual environment, answer `no` so that the default conda environment is used instead.
- Set `DEBUG` to `FALSE`.
- `N_SIMULATIONS` is set to the range (1, 100).
- With `N_CPU` > 1, parallel processing is used.
- The list of methods contains `marginal`, `permfit`, `cpi`, `cpi_rf`, `gpfi`, `gopfi`, `dgi` and `goi`.
- `n_samples` is set to `1000` and `n_features` is set to `50`.
- `rho_group` lists all the correlation strengths in this experiment: (0, 0.2, 0.5, 0.8).
- The number of permutations/samples `n_perm` is set to `100`.
- The output csv file is found in `results/results_csv`.
- The csv files are prepared with the R script `plot_simulations_all` under `[AUC-type1error-power-time_bars]_blocks_100_grps.csv`.
- The plotting is done in `plots/plot_figure_simulations_grps.ipynb` with:
  - `Figure 1` for Figure 2 in the main text
  - `Power + Time + Prediction scores` for Figure 6 in the supplement
  - `Figure 1 Calibration` for Figure 5 in the supplement
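To illustrate the grouped correlation structure used in these simulations (this is a minimal NumPy sketch under stated assumptions, not the repository's data-generation code), a block covariance matrix with a given intra-group and inter-group correlation can be built as follows; the function name and arguments are hypothetical:

```python
import numpy as np

def block_covariance(n_blocks, block_size, rho_intra, rho_inter):
    """Covariance matrix with rho_intra inside each group and rho_inter across groups."""
    p = n_blocks * block_size
    cov = np.full((p, p), rho_inter)
    for b in range(n_blocks):
        start = b * block_size
        cov[start:start + block_size, start:start + block_size] = rho_intra
    np.fill_diagonal(cov, 1.0)  # unit variance on the diagonal
    return cov

# 5 groups of 10 variables -> 50 features, matching n_features = 50
rng = np.random.default_rng(0)
cov = block_covariance(n_blocks=5, block_size=10, rho_intra=0.8, rho_inter=0.0)
X = rng.multivariate_normal(np.zeros(50), cov, size=1000)  # n_samples = 1000
```

Sweeping `rho_intra` over (0, 0.2, 0.5, 0.8) reproduces the role of `rho_group` described above.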
- We use the `compute_simulations_groups` script.
- The script can be launched with the following command:

```bash
python -u compute_simulations_groups.py --n 1000 --pgrp 100 --nblocks 10 --intra 0.8 --inter 0.8 --conditional 1 --stacking 1 --f 1 --s 100 --njobs 1
```
- `--n`: the number of samples (default `1000`)
- `--pgrp`: the number of variables per group (default `100`)
- `--nblocks`: the number of blocks/groups in the data structure (default `10`)
- `--intra`: the intra correlation inside the groups (default `0.8`)
- `--inter`: the inter correlation between the groups (default `0.8`)
- `--conditional`: the use of CPI (`1`) or PI (`0`)
- `--stacking`: the use of stacking (`1`) or not (`0`)
- `--f`: the first point of the range (default `1`)
- `--s`: the step size, i.e. the range size (default `100`)
- `--njobs`: the serial/parallel implementation under `Joblib` (default `1`)
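The repository excerpt does not show how the flags are parsed; as a minimal sketch (the parser construction below is an assumption, with defaults taken from the list above), the documented flags map onto `argparse` like this:

```python
import argparse

def build_parser():
    """Hypothetical mirror of the documented command-line flags and defaults."""
    parser = argparse.ArgumentParser(description="Group-level simulations")
    parser.add_argument("--n", type=int, default=1000, help="number of samples")
    parser.add_argument("--pgrp", type=int, default=100, help="variables per group")
    parser.add_argument("--nblocks", type=int, default=10, help="blocks/groups in the data")
    parser.add_argument("--intra", type=float, default=0.8, help="intra-group correlation")
    parser.add_argument("--inter", type=float, default=0.8, help="inter-group correlation")
    parser.add_argument("--conditional", type=int, default=1, help="1 = CPI, 0 = PI")
    parser.add_argument("--stacking", type=int, default=1, help="1 = use stacking, 0 = not")
    parser.add_argument("--f", type=int, default=1, help="first point of the range")
    parser.add_argument("--s", type=int, default=100, help="step size (range size)")
    parser.add_argument("--njobs", type=int, default=1, help="Joblib parallelism")
    return parser

# Unspecified flags fall back to the documented defaults
args = build_parser().parse_args(["--intra", "0.5", "--conditional", "0"])
```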
- The output csv file is found in `results/results_csv` under `[AUC-type1error-power-time_bars]_blocks_100_groups_CPI_n_1000_p_1000_1::100_folds_2.csv`.
- The plotting is done in `plots/plot_figure_simulations_grps.ipynb` with `Compare Stacking vs Non Stacking` for Figure 3 in the main text.
- The data are the public data from `UKBB`, which requires signing an agreement before use (any personal data are already removed).
- The `biomarker` is set by default to `age`.
- `n_jobs` stands for serial/parallel computations.
- `k_fold_bbi` stands for the number of folds for the internal cross-validation of the method.
- `k_fold` stands for the number of folds for train/test splitting of the original data.
- The $\underline{representative}$ p-value is 2 * median(p-values) across the 10 folds.
- The $\underline{performance}$ is measured on the 10% test set split per fold.
- The output csv files are found in `results/results_csv` under `Result_UKBB_age_all_imp_10_outer_2_inner_PERF.csv` and `Result_UKBB_age_all_imp_10_outer_2_inner_SIGN.csv`.
- The plotting is done in `plots/plot_figure_simulations_grps.ipynb` with `Figure 3` for Figure 4 in the main text.
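The per-fold aggregation rule above (2 * median across the 10 folds) can be sketched as follows; capping the result at 1.0 is an assumption added here so the output stays a valid p-value:

```python
import numpy as np

def representative_p_value(fold_p_values):
    """Aggregate per-fold p-values as 2 * median, capped at 1.0 (assumption)."""
    return min(1.0, 2.0 * float(np.median(fold_p_values)))

# Example: p-values for one variable group across 10 outer folds
p_vals = [0.01, 0.02, 0.015, 0.03, 0.005, 0.02, 0.01, 0.04, 0.025, 0.018]
rep_p = representative_p_value(p_vals)  # 2 * median = 0.038
```

Doubling the median is a standard correction that keeps the aggregated value a valid p-value under the null.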