Skip to content

vshi2316/AD_Multimodal_Study

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

12 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Alzheimer's Disease Multimodal Deep Phenotyping and Subtype Discovery

A comprehensive computational pipeline for discovering biologically distinct Alzheimer's disease subtypes through multimodal deep learning, validated across four independent cohorts (ADNI, AIBL, HABS, A4).

License Python R


Table of Contents


Overview

This repository implements a novel framework for identifying Alzheimer's disease subtypes using:

  • Multimodal Data Integration: APOE genotype, CSF biomarkers, clinical assessments, structural MRI, and PET imaging
  • Deep Learning Clustering: Variational Autoencoder (VAE) for unsupervised subtype discovery
  • Multi-Cohort Validation: External validation in AIBL, HABS, and A4 cohorts
  • Statistical Characterization: Differential analysis, predictive modeling, and meta-analysis
  • Biological Interpretation: Neuroimaging endotype characterization and clinical-MRI heterogeneity analysis

Key Features:

  • 23-step end-to-end analysis pipeline
  • Discovery cohort (ADNI) with 3-cohort external validation
  • Random-effects meta-analysis across cohorts
  • SHAP-based model interpretability
  • Bootstrap validation with 1000 iterations
  • Publication-ready visualizations (300 DPI)

Repository Structure

AD_Multimodal_Study/
β”‚
β”œβ”€β”€ Data Preprocessing (Steps 1-6) - Python
β”‚   β”œβ”€β”€ step1_preprocess_APOE.py              # APOE genotype extraction
β”‚   β”œβ”€β”€ step2_preprocess_CSF.py               # CSF biomarker integration
β”‚   β”œβ”€β”€ step3_preprocess_Clinical.py          # Clinical cognitive scores
β”‚   β”œβ”€β”€ step4_preprocess_sMRI.py              # Structural MRI features
β”‚   β”œβ”€β”€ step5_preprocess_PET.py               # PET imaging quantification
β”‚   └── step6_create_outcome.py               # AD conversion outcomes
β”‚
β”œβ”€β”€ Cohort Integration & Clustering (Steps 7-9C) - Python
β”‚   β”œβ”€β”€ step7_integrate_cohorts.py            # Multimodal data integration
β”‚   β”œβ”€β”€ step8_vae_clustering.py               # VAE deep clustering
β”‚   β”œβ”€β”€ step9_cross_cohort_analysis.py        # Cross-cohort validation
β”‚   β”œβ”€β”€ step9B_biomarker_validation.py        # Biomarker validation
β”‚   └── step9C_enrichment_analysis.py         # Pathway enrichment
β”‚
β”œβ”€β”€ Statistical Analysis (Steps 10-13) - R
β”‚   β”œβ”€β”€ step10_differential_analysis.R        # Limma differential analysis
β”‚   β”œβ”€β”€ step10B_smd_analysis.R                # Standardized mean difference
β”‚   β”œβ”€β”€ step11_predictive_modeling.R          # Multi-algorithm ML models
β”‚   β”œβ”€β”€ step12_cluster_signatures.R           # Cluster signature visualization
β”‚   └── step13_conversion_differential.R      # Converter vs non-converter analysis
β”‚
β”œβ”€β”€ Cluster Validation (Steps 14-15) - R
β”‚   β”œβ”€β”€ step14_consensus_clustering.R         # Consensus clustering (PAC)
β”‚   β”œβ”€β”€ step14B_bootstrap_validation.R        # Bootstrap stability (ARI, Jaccard)
β”‚   └── step15_cross_modal_validation.R       # CSF & MRI validation
β”‚
β”œβ”€β”€ External Validation (Steps 16-18) - R
β”‚   β”œβ”€β”€ step16_habs_validation.R              # HABS cohort validation
β”‚   β”œβ”€β”€ step17_meta_analysis.R                # Random-effects meta-analysis
β”‚   └── step18_shap_analysis.R                # SHAP explainability
β”‚
β”œβ”€β”€ Discovery-Validation Chain (Steps 19-21) - R
β”‚   β”œβ”€β”€ step19_adni_discovery.R               # ADNI discovery & classifier
β”‚   β”œβ”€β”€ step20_aibl_preprocessing.R           # AIBL validation preprocessing
β”‚   └── step21_a4_validation.R                # A4 large-sample validation
β”‚
β”œβ”€β”€ Biological Characterization (Steps 22-23) - R
β”‚   β”œβ”€β”€ step22_subtype_naming.R               # Biological nomenclature (HP/CD/TAD)
β”‚   └── step23_neuroimaging_endotypes.R       # Clinical-MRI heterogeneity
β”‚
β”œβ”€β”€ Integrated Scripts (Recommended)
β”‚   β”œβ”€β”€ step10_differential_analysis_INTEGRATED.R   # Limma + SMD combined
β”‚   β”œβ”€β”€ step14_cluster_validation_INTEGRATED.R      # Consensus + Bootstrap combined
β”‚   β”œβ”€β”€ step17_meta_analysis_NEW.R                  # Enhanced meta-analysis
β”‚   └── step23_neuroimaging_endotypes_GITHUB.R      # GitHub-ready endotype analysis
β”‚
β”œβ”€β”€ Documentation
β”‚   β”œβ”€β”€ README.md                             # This file
β”‚   β”œβ”€β”€ CODE_COMPLETENESS_ASSESSMENT.md       # Comprehensive code review
β”‚   β”œβ”€β”€ GITHUB_SUBMISSION_CHECKLIST.md        # Pre-submission checklist
β”‚   β”œβ”€β”€ INTEGRATED_SCRIPTS_SUMMARY.md         # Integration documentation
β”‚   └── STEP23_GITHUB_READY_NOTES.md          # Step 23 specific notes
β”‚
└── Supporting Files
    β”œβ”€β”€ requirements.txt                       # Python dependencies
    └── ALL_STEPS_GITHUB_READY.md             # Complete file inventory

Prerequisites

Software Requirements

  • Python: 3.8 or higher
  • R: 4.0 or higher
  • Operating System: Windows, macOS, or Linux

Python Dependencies

numpy>=1.21.0
pandas>=1.3.0
scikit-learn>=1.0.0
tensorflow>=2.8.0
keras>=2.8.0
matplotlib>=3.5.0
seaborn>=0.11.0

R Packages

# Data manipulation
dplyr, tidyverse

# Statistical analysis
limma, tableone, survival, survminer

# Machine learning
randomForest, caret, glmnet, xgboost

# Clustering & validation
ConsensusClusterPlus, cluster, mclust, mice

# Meta-analysis
meta, metafor

# Interpretability
shap (via reticulate)

# Visualization
ggplot2, pheatmap, patchwork, RColorBrewer, ggrepel

Installation

1. Clone the Repository

git clone https://github.com/YOUR_USERNAME/AD_Multimodal_Study.git
cd AD_Multimodal_Study

2. Install Python Dependencies

pip install -r requirements.txt

3. Install R Packages

# In R console
install.packages(c("dplyr", "tidyverse", "ggplot2", "survival", "survminer",
                   "randomForest", "caret", "glmnet", "xgboost",
                   "pheatmap", "patchwork", "RColorBrewer", "ggrepel",
                   "tableone", "cluster", "mclust", "mice"))

# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("limma", "ConsensusClusterPlus"))

# Meta-analysis packages
install.packages(c("meta", "metafor"))

Data Requirements

Input Data Format

All input files should be CSV format with the following structure:

1. APOE Genotyping (ApoE_Genotyping_Results.csv)

  • Required columns: ID, APOE_Genotype
  • Example: ID=001, APOE_Genotype=E3/E4

2. CSF Biomarkers (CSF_*.csv)

  • Required columns: ID, ABETA, TAU, PTAU
  • Optional: Additional CSF markers

3. Clinical Data (Clinical_Assessments.csv)

  • Required columns: ID, ADAS13, CDRSB, MMSE_Baseline, Age, Gender, Education
  • Optional: FAQTOTAL, RAVLT scores

4. Structural MRI (FreeSurfer_*.csv)

  • Required columns: ID, MRI features (e.g., ST102TA, ST103CV, etc.)
  • Format: FreeSurfer ROI measurements

5. PET Imaging (PET_SUVR_Data.csv)

  • Required columns: ID, regional SUVR values
  • Format: Normalized to reference region

6. Longitudinal Data (CDR_Longitudinal.csv)

  • Required columns: ID, Visit_Date, CDR, AD_Conversion
  • Purpose: Outcome generation

Data Organization

data/
β”œβ”€β”€ ADNI/
β”‚   β”œβ”€β”€ ApoE_Genotyping_Results.csv
β”‚   β”œβ”€β”€ CSF_Biomarkers.csv
β”‚   β”œβ”€β”€ Clinical_Assessments.csv
β”‚   β”œβ”€β”€ FreeSurfer_ROI.csv
β”‚   β”œβ”€β”€ PET_SUVR_Data.csv
β”‚   └── CDR_Longitudinal.csv
β”‚
β”œβ”€β”€ AIBL/
β”‚   └── AIBL_Baseline_Integrated.csv
β”‚
β”œβ”€β”€ HABS/
β”‚   └── HABS_Baseline_Integrated.csv
β”‚
└── A4/
    └── A4_Baseline_Integrated.csv

Usage

Quick Start (Recommended Order)

Phase 1: Data Preprocessing (Python)

# Run preprocessing scripts sequentially
python step1_preprocess_APOE.py
python step2_preprocess_CSF.py
python step3_preprocess_Clinical.py
python step4_preprocess_sMRI.py
python step5_preprocess_PET.py
python step6_create_outcome.py

Output: Individual modality CSV files (e.g., APOE_genetics.csv, metabolites.csv)

Phase 2: Cohort Integration & Clustering (Python)

python step7_integrate_cohorts.py
python step8_vae_clustering.py
python step9_cross_cohort_analysis.py
python step9B_biomarker_validation.py

Output:

  • Cohort_A_Integrated.csv, Cohort_B_Integrated.csv
  • VAE_latent_embeddings.csv, cluster_results.csv

Phase 3: Statistical Analysis (R)

# Differential analysis (use integrated version)
source("step10_differential_analysis_INTEGRATED.R")

# Predictive modeling
source("step11_predictive_modeling.R")

# Cluster signatures
source("step12_cluster_signatures.R")

# Conversion analysis
source("step13_conversion_differential.R")

Output:

  • Differential expression results (DiffExpr_*.csv)
  • ML model performance (Model_Performance_Comparison.csv)
  • Signature heatmaps

Phase 4: Validation (R)

# Cluster validation (use integrated version)
source("step14_cluster_validation_INTEGRATED.R")

# Cross-modal validation
source("step15_cross_modal_validation.R")

# External cohort validation
source("step16_habs_validation.R")

Output:

  • Stability metrics (ARI, Jaccard, Silhouette)
  • Validation AUC, confusion matrices

Phase 5: Meta-Analysis & Interpretability (R)

# Enhanced meta-analysis (use new version)
source("step17_meta_analysis_NEW.R")

# SHAP explainability
source("step18_shap_analysis.R")

Output:

  • Forest plots, funnel plots
  • SHAP feature importance plots

Phase 6: Discovery-Validation Chain (R)

# ADNI discovery
source("step19_adni_discovery.R")

# AIBL validation
source("step20_aibl_preprocessing.R")

# A4 validation
source("step21_a4_validation.R")

Output:

  • Trained classifier (ADNI_Classifier.rds)
  • Validation survival curves

Phase 7: Biological Characterization (R)

# Subtype biological naming
source("step22_subtype_naming.R")

# Neuroimaging endotypes (use GitHub version)
source("step23_neuroimaging_endotypes_GITHUB.R")

Output:

  • Subtype naming tables (HP, CD, TAD)
  • Clinical-MRI heterogeneity analysis

Output Files

Key Output Categories

1. Cluster Results

  • cluster_results.csv: Final cluster assignments
  • VAE_latent_embeddings.csv: Latent space representations
  • Final_Consensus_Clusters_K3.csv: Consensus clustering results

2. Differential Analysis

  • DiffExpr_Clinical_All.csv: All clinical features
  • DiffExpr_sMRI_Significant.csv: Significant MRI features
  • SMD_All_Features.csv: Standardized mean differences

3. Predictive Models

  • Model_Performance_Comparison.csv: Multi-algorithm performance
  • ADNI_Classifier.rds: Trained random forest model
  • SHAP_Feature_Importance.csv: Feature importance rankings

4. Validation Metrics

  • Bootstrap_Stability_Summary.csv: ARI, Jaccard indices
  • External_Validation_Performance.csv: AIBL, HABS, A4 AUC
  • Meta_Analysis_Results.csv: Pooled effect sizes

5. Visualizations (300 DPI)

  • Volcano_*.png: Volcano plots for each modality
  • Heatmap_*.png: Clustered heatmaps
  • Figure_Main_Combined.pdf: 4-panel endotype characterization
  • Fig1_Forest_Plot.png: Meta-analysis forest plot

6. Biological Characterization

  • Subtype_Naming_Tables.csv: HP/CD/TAD nomenclature
  • Clinical_Homogeneity_Complete.csv: Clinical feature analysis
  • MRI_Heterogeneity_Complete.csv: MRI feature analysis

Advanced Usage

Option 1: Use Integrated Scripts (Recommended)

For cleaner workflow, use the integrated versions:

# Instead of step10 + step10B separately
source("step10_differential_analysis_INTEGRATED.R")

# Instead of step14 + step14B + step14_bootstrap separately
source("step14_cluster_validation_INTEGRATED.R")

# Enhanced meta-analysis with sensitivity analysis
source("step17_meta_analysis_NEW.R")

# GitHub-ready endotype analysis
source("step23_neuroimaging_endotypes_GITHUB.R")

Option 2: Parallel Processing

For large datasets, enable parallel processing in R scripts:

# In step14_cluster_validation_INTEGRATED.R
library(parallel)
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)
# ... parallel bootstrap code ...
stopCluster(cl)

Option 3: Custom Cohort Analysis

To analyze your own cohort:

  1. Format data according to specifications above
  2. Run steps 1-6 for preprocessing
  3. Run step 7-8 for integration and clustering
  4. Apply the trained classifier from step19:
# Load your data
new_cohort <- read.csv("Your_Cohort_Data.csv")

# Load trained classifier
classifier <- readRDS("ADNI_Classifier.rds")

# Predict subtypes
predictions <- predict(classifier, newdata = new_cohort)

Reproducibility

Random Seeds

All scripts use fixed random seeds for reproducibility:

  • Python scripts: np.random.seed(42)
  • R scripts: set.seed(42)

Session Info

To ensure reproducibility, save session information:

# At end of analysis
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")

Docker Support (Optional)

For complete reproducibility, consider using Docker:

FROM rocker/tidyverse:4.2
RUN apt-get update && apt-get install -y python3-pip
COPY requirements.txt .
RUN pip3 install -r requirements.txt
# ... additional setup ...

Troubleshooting

Common Issues

Issue 1: Missing input files

Error: File 'cluster_results.csv' not found

Solution: Ensure you run preprocessing steps (1-9) before analysis steps (10-23)

Issue 2: Package installation errors

Error: package 'limma' is not available

Solution: Install from Bioconductor:

BiocManager::install("limma")

Issue 3: Memory errors in VAE clustering

MemoryError: Unable to allocate array

Solution: Reduce batch size or use fewer features in step8_vae_clustering.py

Issue 4: Convergence warnings in meta-analysis

Warning: Egger test unreliable with < 5 studies

Solution: This is expected with 3 cohorts; interpret cautiously


Performance Benchmarks

Typical runtime on a standard workstation (16GB RAM, 8-core CPU):

Phase Steps Time Memory
Preprocessing 1-6 ~10 min < 2GB
VAE Clustering 7-8 ~30 min 4-8GB
Statistical Analysis 10-13 ~15 min < 4GB
Validation 14-16 ~45 min < 4GB
Meta-Analysis 17-18 ~5 min < 2GB
Discovery-Validation 19-21 ~20 min < 4GB
Characterization 22-23 ~10 min < 2GB
Total 1-23 ~2.5 hrs < 8GB

Contributing

We welcome contributions! Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/YourFeature)
  3. Commit changes (git commit -m 'Add YourFeature')
  4. Push to branch (git push origin feature/YourFeature)
  5. Open a Pull Request

Code Standards

  • Python: Follow PEP 8 style guide
  • R: Follow tidyverse style guide
  • All comments in English
  • Include docstrings/roxygen documentation
  • Add unit tests where applicable

Citation

If you use this code in your research, please cite:

@article{ADMultimodalSubtypes2025,
  title={Multimodal Deep Phenotyping Reveals Biologically Distinct Alzheimer's Disease Subtypes},
  author={Your Name and Collaborators},
  journal={Journal Name},
  year={2025},
  volume={XX},
  pages={XXX-XXX},
  doi={10.XXXX/XXXXX}
}

Code Repository:

@software{ADMultimodalCode2025,
  title={AD Multimodal Subtype Discovery Pipeline},
  author={Your Name},
  year={2025},
  publisher={GitHub},
  url={https://github.com/YOUR_USERNAME/AD_Multimodal_Study}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

  • ADNI: Alzheimer's Disease Neuroimaging Initiative
  • AIBL: Australian Imaging, Biomarker & Lifestyle Flagship Study
  • HABS: Harvard Aging Brain Study
  • A4: Anti-Amyloid Treatment in Asymptomatic Alzheimer's Disease Study

Contact

For questions or collaboration inquiries:


Version History

  • v1.0.0 (December 2025): Initial public release
    • 23-step complete pipeline
    • Integrated scripts for key analyses
    • Comprehensive documentation
    • GitHub-ready, SCI journal compliant

Project Status

🟒 Active Development - This repository is actively maintained and updated.

Last Updated: December 2025
Status: Production-ready, validated across 4 independent cohorts
Code Quality: βœ… GitHub-ready, βœ… SCI journal compliant, βœ… No Chinese characters


⭐ If you find this repository useful, please consider starring it!

About

Multimodal deep phenotyping reveals biologically distinct Alzheimer's disease subtypes

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published