A comprehensive computational pipeline for discovering biologically distinct Alzheimer's disease subtypes through multimodal deep learning, validated across four independent cohorts (ADNI, AIBL, HABS, A4).
- Overview
- Repository Structure
- Prerequisites
- Installation
- Data Requirements
- Usage
- Output Files
- Citation
- License
## Overview

This repository implements a novel framework for identifying Alzheimer's disease subtypes using:
- Multimodal Data Integration: APOE genotype, CSF biomarkers, clinical assessments, structural MRI, and PET imaging
- Deep Learning Clustering: Variational Autoencoder (VAE) for unsupervised subtype discovery
- Multi-Cohort Validation: External validation in AIBL, HABS, and A4 cohorts
- Statistical Characterization: Differential analysis, predictive modeling, and meta-analysis
- Biological Interpretation: Neuroimaging endotype characterization and clinical-MRI heterogeneity analysis
Key Features:
- 23-step end-to-end analysis pipeline
- Discovery cohort (ADNI) with 3-cohort external validation
- Random-effects meta-analysis across cohorts
- SHAP-based model interpretability
- Bootstrap validation with 1000 iterations
- Publication-ready visualizations (300 DPI)
## Repository Structure

```
AD_Multimodal_Study/
│
├── Data Preprocessing (Steps 1-6) - Python
│   ├── step1_preprocess_APOE.py            # APOE genotype extraction
│   ├── step2_preprocess_CSF.py             # CSF biomarker integration
│   ├── step3_preprocess_Clinical.py        # Clinical cognitive scores
│   ├── step4_preprocess_sMRI.py            # Structural MRI features
│   ├── step5_preprocess_PET.py             # PET imaging quantification
│   └── step6_create_outcome.py             # AD conversion outcomes
│
├── Cohort Integration & Clustering (Steps 7-9C) - Python
│   ├── step7_integrate_cohorts.py          # Multimodal data integration
│   ├── step8_vae_clustering.py             # VAE deep clustering
│   ├── step9_cross_cohort_analysis.py      # Cross-cohort validation
│   ├── step9B_biomarker_validation.py      # Biomarker validation
│   └── step9C_enrichment_analysis.py       # Pathway enrichment
│
├── Statistical Analysis (Steps 10-13) - R
│   ├── step10_differential_analysis.R      # Limma differential analysis
│   ├── step10B_smd_analysis.R              # Standardized mean differences
│   ├── step11_predictive_modeling.R        # Multi-algorithm ML models
│   ├── step12_cluster_signatures.R         # Cluster signature visualization
│   └── step13_conversion_differential.R    # Converter vs non-converter analysis
│
├── Cluster Validation (Steps 14-15) - R
│   ├── step14_consensus_clustering.R       # Consensus clustering (PAC)
│   ├── step14B_bootstrap_validation.R      # Bootstrap stability (ARI, Jaccard)
│   └── step15_cross_modal_validation.R     # CSF & MRI validation
│
├── External Validation (Steps 16-18) - R
│   ├── step16_habs_validation.R            # HABS cohort validation
│   ├── step17_meta_analysis.R              # Random-effects meta-analysis
│   └── step18_shap_analysis.R              # SHAP explainability
│
├── Discovery-Validation Chain (Steps 19-21) - R
│   ├── step19_adni_discovery.R             # ADNI discovery & classifier
│   ├── step20_aibl_preprocessing.R         # AIBL validation preprocessing
│   └── step21_a4_validation.R              # A4 large-sample validation
│
├── Biological Characterization (Steps 22-23) - R
│   ├── step22_subtype_naming.R             # Biological nomenclature (HP/CD/TAD)
│   └── step23_neuroimaging_endotypes.R     # Clinical-MRI heterogeneity
│
├── Integrated Scripts (Recommended)
│   ├── step10_differential_analysis_INTEGRATED.R   # Limma + SMD combined
│   ├── step14_cluster_validation_INTEGRATED.R      # Consensus + Bootstrap combined
│   ├── step17_meta_analysis_NEW.R                  # Enhanced meta-analysis
│   └── step23_neuroimaging_endotypes_GITHUB.R      # GitHub-ready endotype analysis
│
├── Documentation
│   ├── README.md                           # This file
│   ├── CODE_COMPLETENESS_ASSESSMENT.md     # Comprehensive code review
│   ├── GITHUB_SUBMISSION_CHECKLIST.md      # Pre-submission checklist
│   ├── INTEGRATED_SCRIPTS_SUMMARY.md       # Integration documentation
│   └── STEP23_GITHUB_READY_NOTES.md        # Step 23 specific notes
│
└── Supporting Files
    ├── requirements.txt                    # Python dependencies
    └── ALL_STEPS_GITHUB_READY.md           # Complete file inventory
```
## Prerequisites

- Python: 3.8 or higher
- R: 4.0 or higher
- Operating System: Windows, macOS, or Linux
Python dependencies (`requirements.txt`):

```
numpy>=1.21.0
pandas>=1.3.0
scikit-learn>=1.0.0
tensorflow>=2.8.0
keras>=2.8.0
matplotlib>=3.5.0
seaborn>=0.11.0
```

R packages:

```
# Data manipulation
dplyr, tidyverse
# Statistical analysis
limma, tableone, survival, survminer
# Machine learning
randomForest, caret, glmnet, xgboost
# Clustering & validation
ConsensusClusterPlus, cluster, mclust, mice
# Meta-analysis
meta, metafor
# Interpretability
shap (via reticulate)
# Visualization
ggplot2, pheatmap, patchwork, RColorBrewer, ggrepel
```

## Installation

Clone the repository and install the Python dependencies:

```bash
git clone https://github.com/YOUR_USERNAME/AD_Multimodal_Study.git
cd AD_Multimodal_Study
pip install -r requirements.txt
```

Install the R packages:

```r
# In R console
install.packages(c("dplyr", "tidyverse", "ggplot2", "survival", "survminer",
                   "randomForest", "caret", "glmnet", "xgboost",
                   "pheatmap", "patchwork", "RColorBrewer", "ggrepel",
                   "tableone", "cluster", "mclust", "mice"))

# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("limma", "ConsensusClusterPlus"))

# Meta-analysis packages
install.packages(c("meta", "metafor"))
```

## Data Requirements

All input files should be in CSV format with the following structure:
- APOE file — Required columns: `ID`, `APOE_Genotype` (e.g., `ID=001, APOE_Genotype=E3/E4`)
- CSF file — Required columns: `ID`, `ABETA`, `TAU`, `PTAU`; optional: additional CSF markers
- Clinical file — Required columns: `ID`, `ADAS13`, `CDRSB`, `MMSE_Baseline`, `Age`, `Gender`, `Education`; optional: `FAQTOTAL`, RAVLT scores
- sMRI file — Required columns: `ID` plus MRI features (e.g., `ST102TA`, `ST103CV`); format: FreeSurfer ROI measurements
- PET file — Required columns: `ID` plus regional SUVR values; format: normalized to a reference region
- CDR file — Required columns: `ID`, `Visit_Date`, `CDR`, `AD_Conversion`; purpose: outcome generation
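Because every modality table shares the `ID` column, the integration step can inner-join them into one wide matrix. A minimal sketch of that join logic (the values and group sizes below are invented for illustration; the real merge lives in `step7_integrate_cohorts.py`):

```python
import pandas as pd

# Toy stand-ins for three modality tables, keyed on the shared ID column.
apoe = pd.DataFrame({"ID": ["001", "002", "003"],
                     "APOE_Genotype": ["E3/E4", "E3/E3", "E4/E4"]})
csf = pd.DataFrame({"ID": ["001", "002", "003"],
                    "ABETA": [650.0, 1100.0, 480.0],
                    "TAU": [380.0, 210.0, 450.0],
                    "PTAU": [38.0, 19.0, 52.0]})
clinical = pd.DataFrame({"ID": ["001", "002"],
                         "ADAS13": [18.0, 9.0],
                         "MMSE_Baseline": [25, 29]})

# Inner joins keep only subjects present in every modality.
integrated = (apoe.merge(csf, on="ID", how="inner")
                  .merge(clinical, on="ID", how="inner"))
print(integrated.shape)  # (2, 7): subject 003 lacks clinical data and is dropped
```

Subjects missing an entire modality are dropped by the inner join; the pipeline's `mice` dependency suggests imputation is used for item-level missingness instead.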
Expected directory layout:

```
data/
├── ADNI/
│   ├── ApoE_Genotyping_Results.csv
│   ├── CSF_Biomarkers.csv
│   ├── Clinical_Assessments.csv
│   ├── FreeSurfer_ROI.csv
│   ├── PET_SUVR_Data.csv
│   └── CDR_Longitudinal.csv
├── AIBL/
│   └── AIBL_Baseline_Integrated.csv
├── HABS/
│   └── HABS_Baseline_Integrated.csv
└── A4/
    └── A4_Baseline_Integrated.csv
```
## Usage

Data Preprocessing (Steps 1-6):

```bash
# Run preprocessing scripts sequentially
python step1_preprocess_APOE.py
python step2_preprocess_CSF.py
python step3_preprocess_Clinical.py
python step4_preprocess_sMRI.py
python step5_preprocess_PET.py
python step6_create_outcome.py
```

Output: individual modality CSV files (e.g., `APOE_genetics.csv`, `metabolites.csv`)
Cohort Integration & Clustering (Steps 7-9C):

```bash
python step7_integrate_cohorts.py
python step8_vae_clustering.py
python step9_cross_cohort_analysis.py
python step9B_biomarker_validation.py
```

Output: `Cohort_A_Integrated.csv`, `Cohort_B_Integrated.csv`, `VAE_latent_embeddings.csv`, `cluster_results.csv`
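The VAE itself is defined in `step8_vae_clustering.py`; the sketch below is only a dependency-light stand-in that swaps PCA for the VAE encoder to illustrate the same embed-then-cluster pattern (standardize, project to a low-dimensional latent space, cluster there) on simulated data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Simulate 150 "subjects" with 20 features drawn from three separated groups.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(50, 20)) for m in (0.0, 3.0, 6.0)])

# Standardize, embed (PCA here; a VAE latent space in the real pipeline),
# then cluster in the embedding.
Z = PCA(n_components=5, random_state=42).fit_transform(
    StandardScaler().fit_transform(X))
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(Z)
print(np.bincount(labels))  # three clusters of 50 simulated subjects
```

The VAE replaces PCA to capture non-linear structure across modalities, but the downstream clustering and validation steps consume the latent embeddings in exactly this way.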
Statistical Analysis (Steps 10-13):

```r
# Differential analysis (use integrated version)
source("step10_differential_analysis_INTEGRATED.R")

# Predictive modeling
source("step11_predictive_modeling.R")

# Cluster signatures
source("step12_cluster_signatures.R")

# Conversion analysis
source("step13_conversion_differential.R")
```

Output:
- Differential expression results (`DiffExpr_*.csv`)
- ML model performance (`Model_Performance_Comparison.csv`)
- Signature heatmaps
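The standardized mean difference analysis runs in R (step10B / the integrated script); for readers unfamiliar with the metric, here is the underlying formula (Cohen's d with pooled standard deviation) sketched in Python on toy numbers:

```python
import numpy as np

def smd(x, y):
    """Standardized mean difference (Cohen's d with pooled SD),
    used to compare a feature between two subtypes."""
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) +
                         (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled_sd

# Toy check: two small groups whose means differ by ~1.9 pooled SDs.
a = np.array([10.0, 12.0, 11.0, 13.0, 9.0])
b = np.array([13.0, 15.0, 14.0, 16.0, 12.0])
print(round(smd(a, b), 2))  # -1.9
```

Unlike a p-value, the SMD is sample-size independent, which is why the pipeline reports it alongside the limma results.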
Cluster & External Validation (Steps 14-16):

```r
# Cluster validation (use integrated version)
source("step14_cluster_validation_INTEGRATED.R")

# Cross-modal validation
source("step15_cross_modal_validation.R")

# External cohort validation
source("step16_habs_validation.R")
```

Output:
- Stability metrics (ARI, Jaccard, Silhouette)
- Validation AUC, confusion matrices
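The bootstrap stability check (step14B) is implemented in R; its core idea, sketched in Python on simulated data: re-cluster bootstrap resamples and compare each result to the reference labels with the Adjusted Rand Index (ARI). A mean ARI near 1 indicates a stable cluster structure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Simulated, well-separated data and a reference clustering.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(40, 5)) for m in (0.0, 4.0, 8.0)])
reference = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

aris = []
for b in range(50):  # the real pipeline uses 1000 iterations
    idx = rng.choice(len(X), size=len(X), replace=True)  # bootstrap resample
    boot_labels = KMeans(n_clusters=3, n_init=10, random_state=b).fit_predict(X[idx])
    # ARI is permutation-invariant, so relabeled clusters still score 1.
    aris.append(adjusted_rand_score(reference[idx], boot_labels))

print(round(float(np.mean(aris)), 3))  # near 1.0 for this separable example
```

The Jaccard index reported by the pipeline is computed per cluster rather than globally, but follows the same resample-and-compare design.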
Meta-Analysis & Interpretability (Steps 17-18):

```r
# Enhanced meta-analysis (use new version)
source("step17_meta_analysis_NEW.R")

# SHAP explainability
source("step18_shap_analysis.R")
```

Output:
- Forest plots, funnel plots
- SHAP feature importance plots
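The random-effects pooling is done with the `meta`/`metafor` R packages; as a reference for what they compute, a Python sketch of the DerSimonian-Laird estimator (the cohort effect sizes below are invented for illustration):

```python
import numpy as np

def random_effects_pool(effects, ses):
    """DerSimonian-Laird random-effects pooling of per-cohort effect sizes."""
    effects, ses = np.asarray(effects), np.asarray(ses)
    w = 1.0 / ses**2                         # fixed-effect (inverse-variance) weights
    mu_fe = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - mu_fe) ** 2)   # Cochran's Q heterogeneity statistic
    df = len(effects) - 1
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)            # between-cohort variance estimate
    w_re = 1.0 / (ses**2 + tau2)             # random-effects weights
    mu_re = np.sum(w_re * effects) / np.sum(w_re)
    se_re = np.sqrt(1.0 / np.sum(w_re))
    return mu_re, se_re, tau2

# Illustrative per-cohort effects (e.g., log hazard ratios) with standard errors.
pooled, se, tau2 = random_effects_pool([0.45, 0.60, 0.38], [0.15, 0.20, 0.10])
print(round(pooled, 3), round(se, 3))
```

With only three cohorts the heterogeneity estimate is imprecise (hence the Egger-test warning in the Troubleshooting section), so the forest plots should be read alongside the per-cohort results.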
Discovery-Validation Chain (Steps 19-21):

```r
# ADNI discovery
source("step19_adni_discovery.R")

# AIBL validation
source("step20_aibl_preprocessing.R")

# A4 validation
source("step21_a4_validation.R")
```

Output:
- Trained classifier (`ADNI_Classifier.rds`)
- Validation survival curves
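The discovery-validation pattern is: fit the subtype classifier on ADNI only, then score each untouched external cohort and report discrimination (AUC). The pipeline does this in R with `randomForest`; a Python sketch of the same pattern on simulated stand-in cohorts:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def make_cohort(n):
    """Simulate a cohort with a signal carried by the first two features."""
    X = rng.normal(size=(n, 6))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

X_disc, y_disc = make_cohort(300)   # "ADNI" discovery stand-in
X_val, y_val = make_cohort(150)     # external validation stand-in

# Train only on discovery data, evaluate only on the held-out cohort.
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_disc, y_disc)
auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
print(round(auc, 3))
```

Keeping the external cohorts entirely out of model fitting is what makes the reported AUCs honest estimates of generalization.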
Biological Characterization (Steps 22-23):

```r
# Subtype biological naming
source("step22_subtype_naming.R")

# Neuroimaging endotypes (use GitHub version)
source("step23_neuroimaging_endotypes_GITHUB.R")
```

Output:
- Subtype naming tables (HP, CD, TAD)
- Clinical-MRI heterogeneity analysis
## Output Files

Clustering:
- `cluster_results.csv`: Final cluster assignments
- `VAE_latent_embeddings.csv`: Latent space representations
- `Final_Consensus_Clusters_K3.csv`: Consensus clustering results

Differential analysis:
- `DiffExpr_Clinical_All.csv`: All clinical features
- `DiffExpr_sMRI_Significant.csv`: Significant MRI features
- `SMD_All_Features.csv`: Standardized mean differences

Modeling:
- `Model_Performance_Comparison.csv`: Multi-algorithm performance
- `ADNI_Classifier.rds`: Trained random forest model
- `SHAP_Feature_Importance.csv`: Feature importance rankings

Validation:
- `Bootstrap_Stability_Summary.csv`: ARI, Jaccard indices
- `External_Validation_Performance.csv`: AIBL, HABS, A4 AUC
- `Meta_Analysis_Results.csv`: Pooled effect sizes

Figures:
- `Volcano_*.png`: Volcano plots for each modality
- `Heatmap_*.png`: Clustered heatmaps
- `Figure_Main_Combined.pdf`: 4-panel endotype characterization
- `Fig1_Forest_Plot.png`: Meta-analysis forest plot

Characterization:
- `Subtype_Naming_Tables.csv`: HP/CD/TAD nomenclature
- `Clinical_Homogeneity_Complete.csv`: Clinical feature analysis
- `MRI_Heterogeneity_Complete.csv`: MRI feature analysis
For a cleaner workflow, use the integrated versions:

```r
# Instead of step10 + step10B separately
source("step10_differential_analysis_INTEGRATED.R")

# Instead of step14 + step14B + step14_bootstrap separately
source("step14_cluster_validation_INTEGRATED.R")

# Enhanced meta-analysis with sensitivity analysis
source("step17_meta_analysis_NEW.R")

# GitHub-ready endotype analysis
source("step23_neuroimaging_endotypes_GITHUB.R")
```

For large datasets, enable parallel processing in the R scripts:
```r
# In step14_cluster_validation_INTEGRATED.R
library(parallel)
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)
# ... parallel bootstrap code ...
stopCluster(cl)
```

To analyze your own cohort:
- Format data according to the specifications above
- Run steps 1-6 for preprocessing
- Run steps 7-8 for integration and clustering
- Apply the trained classifier from step 19:

```r
# Load your data
new_cohort <- read.csv("Your_Cohort_Data.csv")

# Load trained classifier
classifier <- readRDS("ADNI_Classifier.rds")

# Predict subtypes
predictions <- predict(classifier, newdata = new_cohort)
```

All scripts use fixed random seeds for reproducibility:
- Python scripts: `np.random.seed(42)`
- R scripts: `set.seed(42)`

To support reproducibility, save session information:

```r
# At end of analysis
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
```

For complete reproducibility, consider using Docker:
```dockerfile
FROM rocker/tidyverse:4.2
RUN apt-get update && apt-get install -y python3-pip
COPY requirements.txt .
RUN pip3 install -r requirements.txt
# ... additional setup ...
```

Issue 1: Missing input files
Error: File 'cluster_results.csv' not found
Solution: Ensure you run preprocessing steps (1-9) before analysis steps (10-23)
Issue 2: Package installation errors
Error: package 'limma' is not available
Solution: Install from Bioconductor:
```r
BiocManager::install("limma")
```

Issue 3: Memory errors in VAE clustering
MemoryError: Unable to allocate array
Solution: Reduce batch size or use fewer features in step8_vae_clustering.py
Issue 4: Convergence warnings in meta-analysis
Warning: Egger test unreliable with < 5 studies
Solution: This is expected with 3 cohorts; interpret cautiously
Typical runtime on a standard workstation (16GB RAM, 8-core CPU):
| Phase | Steps | Time | Memory |
|---|---|---|---|
| Preprocessing | 1-6 | ~10 min | < 2GB |
| VAE Clustering | 7-8 | ~30 min | 4-8GB |
| Statistical Analysis | 10-13 | ~15 min | < 4GB |
| Validation | 14-16 | ~45 min | < 4GB |
| Meta-Analysis | 17-18 | ~5 min | < 2GB |
| Discovery-Validation | 19-21 | ~20 min | < 4GB |
| Characterization | 22-23 | ~10 min | < 2GB |
| Total | 1-23 | ~2.5 hrs | < 8GB |
We welcome contributions! Please follow these guidelines:
- Fork the repository
- Create a feature branch (`git checkout -b feature/YourFeature`)
- Commit changes (`git commit -m 'Add YourFeature'`)
- Push to the branch (`git push origin feature/YourFeature`)
- Open a Pull Request
- Python: Follow PEP 8 style guide
- R: Follow tidyverse style guide
- All comments in English
- Include docstrings/roxygen documentation
- Add unit tests where applicable
## Citation

If you use this code in your research, please cite:

```bibtex
@article{ADMultimodalSubtypes2025,
  title={Multimodal Deep Phenotyping Reveals Biologically Distinct Alzheimer's Disease Subtypes},
  author={Your Name and Collaborators},
  journal={Journal Name},
  year={2025},
  volume={XX},
  pages={XXX-XXX},
  doi={10.XXXX/XXXXX}
}
```

Code repository:

```bibtex
@software{ADMultimodalCode2025,
  title={AD Multimodal Subtype Discovery Pipeline},
  author={Your Name},
  year={2025},
  publisher={GitHub},
  url={https://github.com/YOUR_USERNAME/AD_Multimodal_Study}
}
```

## License

This project is licensed under the MIT License; see the LICENSE file for details.
- ADNI: Alzheimer's Disease Neuroimaging Initiative
- AIBL: Australian Imaging, Biomarker & Lifestyle Flagship Study
- HABS: Harvard Aging Brain Study
- A4: Anti-Amyloid Treatment in Asymptomatic Alzheimer's Disease Study
For questions or collaboration inquiries:
- Email: your.email@institution.edu
- Issues: Please use the GitHub Issues page
- Discussions: Join our GitHub Discussions
- v1.0.0 (December 2025): Initial public release
- 23-step complete pipeline
- Integrated scripts for key analyses
- Comprehensive documentation
- GitHub-ready, SCI journal compliant
🟢 Active Development - This repository is actively maintained and updated.

Last Updated: December 2025
Status: Production-ready, validated across 4 independent cohorts
Code Quality: ✅ GitHub-ready, ✅ SCI journal compliant

⭐ If you find this repository useful, please consider starring it!