A comprehensive computational pipeline for discovering biologically distinct Alzheimer's disease subtypes through multimodal deep learning, validated across four independent cohorts (ADNI, AIBL, HABS, A4).
- Overview
- Repository Structure
- Prerequisites
- Installation
- Data Requirements
- Usage
- Output Files
- Citation
- License
## Overview

This repository implements a novel framework for identifying Alzheimer's disease subtypes using:
- Multimodal Data Integration: APOE genotype, CSF biomarkers, clinical assessments, structural MRI, and PET imaging
- Deep Learning Clustering: Variational Autoencoder (VAE) for unsupervised subtype discovery
- Multi-Cohort Validation: External validation in AIBL, HABS, and A4 cohorts
- Statistical Characterization: Differential analysis, predictive modeling, and meta-analysis
- Biological Interpretation: Neuroimaging endotype characterization and clinical-MRI heterogeneity analysis
Key Features:
- 23-step end-to-end analysis pipeline
- Discovery cohort (ADNI) with 3-cohort external validation
- Random-effects meta-analysis across cohorts
- SHAP-based model interpretability
- Bootstrap validation with 1000 iterations
- Publication-ready visualizations (300 DPI)
## Repository Structure

```
AD_Multimodal_Study/
│
├── Data Preprocessing (Steps 1-6) - Python
│   ├── step1_preprocess_APOE.py            # APOE genotype extraction
│   ├── step2_preprocess_CSF.py             # CSF biomarker integration
│   ├── step3_preprocess_Clinical.py        # Clinical cognitive scores
│   ├── step4_preprocess_sMRI.py            # Structural MRI features
│   ├── step5_preprocess_PET.py             # PET imaging quantification
│   └── step6_create_outcome.py             # AD conversion outcomes
│
├── Cohort Integration & Clustering (Steps 7-9C) - Python
│   ├── step7_integrate_cohorts.py          # Multimodal data integration
│   ├── step8_vae_clustering.py             # VAE deep clustering
│   ├── step9_cross_cohort_analysis.py      # Cross-cohort validation
│   ├── step9B_biomarker_validation.py      # Biomarker validation
│   └── step9C_enrichment_analysis.py       # Pathway enrichment
│
├── Statistical Analysis (Steps 10-13) - R
│   ├── step10_differential_analysis.R      # Limma differential analysis
│   ├── step10B_smd_analysis.R              # Standardized mean differences
│   ├── step11_predictive_modeling.R        # Multi-algorithm ML models
│   ├── step12_cluster_signatures.R         # Cluster signature visualization
│   └── step13_conversion_differential.R    # Converter vs non-converter analysis
│
├── Cluster Validation (Steps 14-15) - R
│   ├── step14_consensus_clustering.R       # Consensus clustering (PAC)
│   ├── step14B_bootstrap_validation.R      # Bootstrap stability (ARI, Jaccard)
│   └── step15_cross_modal_validation.R     # CSF & MRI validation
│
├── External Validation (Steps 16-18) - R
│   ├── step16_habs_validation.R            # HABS cohort validation
│   ├── step17_meta_analysis.R              # Random-effects meta-analysis
│   └── step18_shap_analysis.R              # SHAP explainability
│
├── Discovery-Validation Chain (Steps 19-21) - R
│   ├── step19_adni_discovery.R             # ADNI discovery & classifier
│   ├── step20_aibl_preprocessing.R         # AIBL validation preprocessing
│   └── step21_a4_validation.R              # A4 large-sample validation
│
├── Biological Characterization (Steps 22-23) - R
│   ├── step22_subtype_naming.R             # Biological nomenclature (HP/CD/TAD)
│   └── step23_neuroimaging_endotypes.R     # Clinical-MRI heterogeneity
│
├── Integrated Scripts (Recommended)
│   ├── step10_differential_analysis_INTEGRATED.R   # Limma + SMD combined
│   ├── step14_cluster_validation_INTEGRATED.R      # Consensus + Bootstrap combined
│   ├── step17_meta_analysis_NEW.R                  # Enhanced meta-analysis
│   └── step23_neuroimaging_endotypes_GITHUB.R      # GitHub-ready endotype analysis
│
├── Documentation
│   ├── README.md                           # This file
│   ├── CODE_COMPLETENESS_ASSESSMENT.md     # Comprehensive code review
│   ├── GITHUB_SUBMISSION_CHECKLIST.md      # Pre-submission checklist
│   ├── INTEGRATED_SCRIPTS_SUMMARY.md       # Integration documentation
│   └── STEP23_GITHUB_READY_NOTES.md        # Step 23 specific notes
│
└── Supporting Files
    ├── requirements.txt                    # Python dependencies
    └── ALL_STEPS_GITHUB_READY.md           # Complete file inventory
```
## Prerequisites

- Python: 3.8 or higher
- R: 4.0 or higher
- Operating System: Windows, macOS, or Linux
Python dependencies (`requirements.txt`):

```
numpy>=1.21.0
pandas>=1.3.0
scikit-learn>=1.0.0
tensorflow>=2.8.0
keras>=2.8.0
matplotlib>=3.5.0
seaborn>=0.11.0
```

R packages:

```
# Data manipulation
dplyr, tidyverse
# Statistical analysis
limma, tableone, survival, survminer
# Machine learning
randomForest, caret, glmnet, xgboost
# Clustering & validation
ConsensusClusterPlus, cluster, mclust, mice
# Meta-analysis
meta, metafor
# Interpretability
shap (via reticulate)
# Visualization
ggplot2, pheatmap, patchwork, RColorBrewer, ggrepel
```

## Installation

Clone the repository and install the Python dependencies:

```bash
git clone https://github.com/YOUR_USERNAME/AD_Multimodal_Study.git
cd AD_Multimodal_Study
pip install -r requirements.txt
```

Install the R packages:

```r
# In R console
install.packages(c("dplyr", "tidyverse", "ggplot2", "survival", "survminer",
                   "randomForest", "caret", "glmnet", "xgboost",
                   "pheatmap", "patchwork", "RColorBrewer", "ggrepel",
                   "tableone", "cluster", "mclust", "mice"))

# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("limma", "ConsensusClusterPlus"))

# Meta-analysis packages
install.packages(c("meta", "metafor"))
```

## Data Requirements

All input files should be in CSV format with the following structure:
- APOE file — Required columns: `ID`, `APOE_Genotype` (e.g., `ID=001, APOE_Genotype=E3/E4`)
- CSF file — Required columns: `ID`, `ABETA`, `TAU`, `PTAU`; optional: additional CSF markers
- Clinical file — Required columns: `ID`, `ADAS13`, `CDRSB`, `MMSE_Baseline`, `Age`, `Gender`, `Education`; optional: `FAQTOTAL`, RAVLT scores
- sMRI file — Required columns: `ID` plus MRI features (e.g., `ST102TA`, `ST103CV`); format: FreeSurfer ROI measurements
- PET file — Required columns: `ID` plus regional SUVR values; format: normalized to a reference region
- CDR file — Required columns: `ID`, `Visit_Date`, `CDR`, `AD_Conversion`; purpose: outcome generation
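Because every modality table shares the `ID` column, the integration step can inner-join them into one wide matrix. A minimal sketch of that join logic (the values and group sizes below are invented for illustration; the real merge lives in `step7_integrate_cohorts.py`):

```python
import pandas as pd

# Toy stand-ins for three modality tables, keyed on the shared ID column.
apoe = pd.DataFrame({"ID": ["001", "002", "003"],
                     "APOE_Genotype": ["E3/E4", "E3/E3", "E4/E4"]})
csf = pd.DataFrame({"ID": ["001", "002", "003"],
                    "ABETA": [650.0, 1100.0, 480.0],
                    "TAU": [380.0, 210.0, 450.0],
                    "PTAU": [38.0, 19.0, 52.0]})
clinical = pd.DataFrame({"ID": ["001", "002"],
                         "ADAS13": [18.0, 9.0],
                         "MMSE_Baseline": [25, 29]})

# Inner joins keep only subjects present in every modality.
integrated = (apoe.merge(csf, on="ID", how="inner")
                  .merge(clinical, on="ID", how="inner"))
print(integrated.shape)  # (2, 7): subject 003 lacks clinical data and is dropped
```

Subjects missing an entire modality are dropped by the inner join; the pipeline's `mice` dependency suggests imputation is used for item-level missingness instead.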
Expected directory layout:

```
data/
├── ADNI/
│   ├── ApoE_Genotyping_Results.csv
│   ├── CSF_Biomarkers.csv
│   ├── Clinical_Assessments.csv
│   ├── FreeSurfer_ROI.csv
│   ├── PET_SUVR_Data.csv
│   └── CDR_Longitudinal.csv
├── AIBL/
│   └── AIBL_Baseline_Integrated.csv
├── HABS/
│   └── HABS_Baseline_Integrated.csv
└── A4/
    └── A4_Baseline_Integrated.csv
```
## Usage

Data Preprocessing (Steps 1-6):

```bash
# Run preprocessing scripts sequentially
python step1_preprocess_APOE.py
python step2_preprocess_CSF.py
python step3_preprocess_Clinical.py
python step4_preprocess_sMRI.py
python step5_preprocess_PET.py
python step6_create_outcome.py
```

Output: individual modality CSV files (e.g., `APOE_genetics.csv`, `metabolites.csv`)
Cohort Integration & Clustering (Steps 7-9C):

```bash
python step7_integrate_cohorts.py
python step8_vae_clustering.py
python step9_cross_cohort_analysis.py
python step9B_biomarker_validation.py
```

Output: `Cohort_A_Integrated.csv`, `Cohort_B_Integrated.csv`, `VAE_latent_embeddings.csv`, `cluster_results.csv`
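The VAE itself is defined in `step8_vae_clustering.py`; the sketch below is only a dependency-light stand-in that swaps PCA for the VAE encoder to illustrate the same embed-then-cluster pattern (standardize, project to a low-dimensional latent space, cluster there) on simulated data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Simulate 150 "subjects" with 20 features drawn from three separated groups.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(50, 20)) for m in (0.0, 3.0, 6.0)])

# Standardize, embed (PCA here; a VAE latent space in the real pipeline),
# then cluster in the embedding.
Z = PCA(n_components=5, random_state=42).fit_transform(
    StandardScaler().fit_transform(X))
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(Z)
print(np.bincount(labels))  # three clusters of 50 simulated subjects
```

The VAE replaces PCA to capture non-linear structure across modalities, but the downstream clustering and validation steps consume the latent embeddings in exactly this way.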
Statistical Analysis (Steps 10-13):

```r
# Differential analysis (use integrated version)
source("step10_differential_analysis_INTEGRATED.R")

# Predictive modeling
source("step11_predictive_modeling.R")

# Cluster signatures
source("step12_cluster_signatures.R")

# Conversion analysis
source("step13_conversion_differential.R")
```

Output:
- Differential expression results (`DiffExpr_*.csv`)
- ML model performance (`Model_Performance_Comparison.csv`)
- Signature heatmaps
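The standardized mean difference analysis runs in R (step10B / the integrated script); for readers unfamiliar with the metric, here is the underlying formula (Cohen's d with pooled standard deviation) sketched in Python on toy numbers:

```python
import numpy as np

def smd(x, y):
    """Standardized mean difference (Cohen's d with pooled SD),
    used to compare a feature between two subtypes."""
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) +
                         (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled_sd

# Toy check: two small groups whose means differ by ~1.9 pooled SDs.
a = np.array([10.0, 12.0, 11.0, 13.0, 9.0])
b = np.array([13.0, 15.0, 14.0, 16.0, 12.0])
print(round(smd(a, b), 2))  # -1.9
```

Unlike a p-value, the SMD is sample-size independent, which is why the pipeline reports it alongside the limma results.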
Cluster & External Validation (Steps 14-16):

```r
# Cluster validation (use integrated version)
source("step14_cluster_validation_INTEGRATED.R")

# Cross-modal validation
source("step15_cross_modal_validation.R")

# External cohort validation
source("step16_habs_validation.R")
```

Output:
- Stability metrics (ARI, Jaccard, Silhouette)
- Validation AUC, confusion matrices
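The bootstrap stability check (step14B) is implemented in R; its core idea, sketched in Python on simulated data: re-cluster bootstrap resamples and compare each result to the reference labels with the Adjusted Rand Index (ARI). A mean ARI near 1 indicates a stable cluster structure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Simulated, well-separated data and a reference clustering.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(40, 5)) for m in (0.0, 4.0, 8.0)])
reference = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

aris = []
for b in range(50):  # the real pipeline uses 1000 iterations
    idx = rng.choice(len(X), size=len(X), replace=True)  # bootstrap resample
    boot_labels = KMeans(n_clusters=3, n_init=10, random_state=b).fit_predict(X[idx])
    # ARI is permutation-invariant, so relabeled clusters still score 1.
    aris.append(adjusted_rand_score(reference[idx], boot_labels))

print(round(float(np.mean(aris)), 3))  # near 1.0 for this separable example
```

The Jaccard index reported by the pipeline is computed per cluster rather than globally, but follows the same resample-and-compare design.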
Meta-Analysis & Interpretability (Steps 17-18):

```r
# Enhanced meta-analysis (use new version)
source("step17_meta_analysis_NEW.R")

# SHAP explainability
source("step18_shap_analysis.R")
```

Output:
- Forest plots, funnel plots
- SHAP feature importance plots
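The random-effects pooling is done with the `meta`/`metafor` R packages; as a reference for what they compute, a Python sketch of the DerSimonian-Laird estimator (the cohort effect sizes below are invented for illustration):

```python
import numpy as np

def random_effects_pool(effects, ses):
    """DerSimonian-Laird random-effects pooling of per-cohort effect sizes."""
    effects, ses = np.asarray(effects), np.asarray(ses)
    w = 1.0 / ses**2                         # fixed-effect (inverse-variance) weights
    mu_fe = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - mu_fe) ** 2)   # Cochran's Q heterogeneity statistic
    df = len(effects) - 1
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)            # between-cohort variance estimate
    w_re = 1.0 / (ses**2 + tau2)             # random-effects weights
    mu_re = np.sum(w_re * effects) / np.sum(w_re)
    se_re = np.sqrt(1.0 / np.sum(w_re))
    return mu_re, se_re, tau2

# Illustrative per-cohort effects (e.g., log hazard ratios) with standard errors.
pooled, se, tau2 = random_effects_pool([0.45, 0.60, 0.38], [0.15, 0.20, 0.10])
print(round(pooled, 3), round(se, 3))
```

With only three cohorts the heterogeneity estimate is imprecise (hence the Egger-test warning in the Troubleshooting section), so the forest plots should be read alongside the per-cohort results.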
Discovery-Validation Chain (Steps 19-21):

```r
# ADNI discovery
source("step19_adni_discovery.R")

# AIBL validation
source("step20_aibl_preprocessing.R")

# A4 validation
source("step21_a4_validation.R")
```

Output:
- Trained classifier (`ADNI_Classifier.rds`)
- Validation survival curves
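The discovery-validation pattern is: fit the subtype classifier on ADNI only, then score each untouched external cohort and report discrimination (AUC). The pipeline does this in R with `randomForest`; a Python sketch of the same pattern on simulated stand-in cohorts:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def make_cohort(n):
    """Simulate a cohort with a signal carried by the first two features."""
    X = rng.normal(size=(n, 6))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

X_disc, y_disc = make_cohort(300)   # "ADNI" discovery stand-in
X_val, y_val = make_cohort(150)     # external validation stand-in

# Train only on discovery data, evaluate only on the held-out cohort.
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_disc, y_disc)
auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
print(round(auc, 3))
```

Keeping the external cohorts entirely out of model fitting is what makes the reported AUCs honest estimates of generalization.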
Biological Characterization (Steps 22-23):

```r
# Subtype biological naming
source("step22_subtype_naming.R")

# Neuroimaging endotypes (use GitHub version)
source("step23_neuroimaging_endotypes_GITHUB.R")
```

Output:
- Subtype naming tables (HP, CD, TAD)
- Clinical-MRI heterogeneity analysis
## Output Files

Clustering:
- `cluster_results.csv`: Final cluster assignments
- `VAE_latent_embeddings.csv`: Latent space representations
- `Final_Consensus_Clusters_K3.csv`: Consensus clustering results

Differential analysis:
- `DiffExpr_Clinical_All.csv`: All clinical features
- `DiffExpr_sMRI_Significant.csv`: Significant MRI features
- `SMD_All_Features.csv`: Standardized mean differences

Modeling:
- `Model_Performance_Comparison.csv`: Multi-algorithm performance
- `ADNI_Classifier.rds`: Trained random forest model
- `SHAP_Feature_Importance.csv`: Feature importance rankings

Validation:
- `Bootstrap_Stability_Summary.csv`: ARI, Jaccard indices
- `External_Validation_Performance.csv`: AIBL, HABS, A4 AUC
- `Meta_Analysis_Results.csv`: Pooled effect sizes

Figures:
- `Volcano_*.png`: Volcano plots for each modality
- `Heatmap_*.png`: Clustered heatmaps
- `Figure_Main_Combined.pdf`: 4-panel endotype characterization
- `Fig1_Forest_Plot.png`: Meta-analysis forest plot

Characterization:
- `Subtype_Naming_Tables.csv`: HP/CD/TAD nomenclature
- `Clinical_Homogeneity_Complete.csv`: Clinical feature analysis
- `MRI_Heterogeneity_Complete.csv`: MRI feature analysis
For a cleaner workflow, use the integrated versions:

```r
# Instead of step10 + step10B separately
source("step10_differential_analysis_INTEGRATED.R")

# Instead of step14 + step14B + step14_bootstrap separately
source("step14_cluster_validation_INTEGRATED.R")

# Enhanced meta-analysis with sensitivity analysis
source("step17_meta_analysis_NEW.R")

# GitHub-ready endotype analysis
source("step23_neuroimaging_endotypes_GITHUB.R")
```

For large datasets, enable parallel processing in the R scripts:
```r
# In step14_cluster_validation_INTEGRATED.R
library(parallel)
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)
# ... parallel bootstrap code ...
stopCluster(cl)
```

To analyze your own cohort:
- Format data according to the specifications above
- Run steps 1-6 for preprocessing
- Run steps 7-8 for integration and clustering
- Apply the trained classifier from step 19:

```r
# Load your data
new_cohort <- read.csv("Your_Cohort_Data.csv")

# Load trained classifier
classifier <- readRDS("ADNI_Classifier.rds")

# Predict subtypes
predictions <- predict(classifier, newdata = new_cohort)
```

All scripts use fixed random seeds for reproducibility:
- Python scripts: `np.random.seed(42)`
- R scripts: `set.seed(42)`

To support reproducibility, save session information:

```r
# At end of analysis
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
```

For complete reproducibility, consider using Docker:
```dockerfile
FROM rocker/tidyverse:4.2
RUN apt-get update && apt-get install -y python3-pip
COPY requirements.txt .
RUN pip3 install -r requirements.txt
# ... additional setup ...
```

Issue 1: Missing input files
Error: File 'cluster_results.csv' not found
Solution: Ensure you run preprocessing steps (1-9) before analysis steps (10-23)
Issue 2: Package installation errors
Error: package 'limma' is not available
Solution: Install from Bioconductor:
```r
BiocManager::install("limma")
```

Issue 3: Memory errors in VAE clustering
MemoryError: Unable to allocate array
Solution: Reduce batch size or use fewer features in step8_vae_clustering.py
Issue 4: Convergence warnings in meta-analysis
Warning: Egger test unreliable with < 5 studies
Solution: This is expected with 3 cohorts; interpret cautiously
Typical runtime on a standard workstation (16GB RAM, 8-core CPU):
| Phase | Steps | Time | Memory |
|---|---|---|---|
| Preprocessing | 1-6 | ~10 min | < 2GB |
| VAE Clustering | 7-8 | ~30 min | 4-8GB |
| Statistical Analysis | 10-13 | ~15 min | < 4GB |
| Validation | 14-16 | ~45 min | < 4GB |
| Meta-Analysis | 17-18 | ~5 min | < 2GB |
| Discovery-Validation | 19-21 | ~20 min | < 4GB |
| Characterization | 22-23 | ~10 min | < 2GB |
| Total | 1-23 | ~2.5 hrs | < 8GB |
We welcome contributions! Please follow these guidelines:
- Fork the repository
- Create a feature branch (`git checkout -b feature/YourFeature`)
- Commit changes (`git commit -m 'Add YourFeature'`)
- Push to the branch (`git push origin feature/YourFeature`)
- Open a Pull Request
- Python: Follow PEP 8 style guide
- R: Follow tidyverse style guide
- All comments in English
- Include docstrings/roxygen documentation
- Add unit tests where applicable
## Citation

If you use this code in your research, please cite:

```bibtex
@article{ADMultimodalSubtypes2025,
  title={Multimodal Deep Phenotyping Reveals Biologically Distinct Alzheimer's Disease Subtypes},
  author={Your Name and Collaborators},
  journal={Journal Name},
  year={2025},
  volume={XX},
  pages={XXX-XXX},
  doi={10.XXXX/XXXXX}
}
```

Code repository:

```bibtex
@software{ADMultimodalCode2025,
  title={AD Multimodal Subtype Discovery Pipeline},
  author={Your Name},
  year={2025},
  publisher={GitHub},
  url={https://github.com/YOUR_USERNAME/AD_Multimodal_Study}
}
```

## License

This project is licensed under the MIT License; see the LICENSE file for details.
- ADNI: Alzheimer's Disease Neuroimaging Initiative
- AIBL: Australian Imaging, Biomarker & Lifestyle Flagship Study
- HABS: Harvard Aging Brain Study
- A4: Anti-Amyloid Treatment in Asymptomatic Alzheimer's Disease Study
For questions or collaboration inquiries:
- Email: your.email@institution.edu
- Issues: Please use the GitHub Issues page
- Discussions: Join our GitHub Discussions
- v1.0.0 (December 2025): Initial public release
- 23-step complete pipeline
- Integrated scripts for key analyses
- Comprehensive documentation
- GitHub-ready, SCI journal compliant
🟢 Active Development - This repository is actively maintained and updated.

Last Updated: December 2025
Status: Production-ready, validated across 4 independent cohorts
Code Quality: ✅ GitHub-ready, ✅ SCI journal compliant

⭐ If you find this repository useful, please consider starring it!