GestaltM2D-VL: Gestalt Multimodal Model for Mendelian Disease Diagnosis with Vision–Language Integration
GestaltM2D-VL (Gestalt Multimodal Model for Mendelian Disease Diagnosis with Vision–Language Integration)
A diagnosis-first multimodal model that leverages facial images and HPO-encoded clinical text for Mendelian rare disease diagnosis.
Additionally, we train StyleGAN3 on GMDB faces to generate synthetic facial images used for qualitative evaluation and analysis.
- Project Overview
- Dataset: GMDB Ten-Disease Subset
- Data Preprocessing Pipeline
- Part I – GestaltM³D-VL for Multimodal Diagnosis
- Part II – StyleGAN3 Synthesis for Evaluation
- Repository Structure
- Installation
- Data Preparation
- Training & Evaluation
- Figures to Include
- Citation
- License & Disclaimer
This repository implements a disease-diagnosis–centric pipeline on the GMDB ten-disease subset (and optionally CHOP cohorts):
-
Part I – GestaltM³D-VL (Main Contribution: Multimodal Diagnosis)
- A vision–language model based on Qwen-VL that ingests:
- Preprocessed facial images.
- HPO-encoded clinical text (with or without demographics).
- Outputs Mendelian rare disease labels under a long-tailed label space.
- Uses Class-Balanced Focal Loss to focus on rare diseases.
- A vision–language model based on Qwen-VL that ingests:
-
Part II – StyleGAN3-based Synthesis (Support: Evaluation & Visualization)
- Train class-conditional StyleGAN3 on GMDB faces.
- Generate synthetic facial images:
- For qualitative evaluation of disease-specific gestalt.
- For visual case studies comparing predicted diseases and synthetic exemplars.
We use the GestaltMatcher Database (GMDB), which pairs:
- Facial images
- HPO-encoded clinical text
- Demographic information (ethnicity, age, sex)
Overall label space: 528 syndromes/disorders (244 frequent, 284 rare).
In this repository we focus on the ten diseases with the most distinctive facial phenotypes, curated by clinical geneticists.
Key stats for the ten-disease subset:
- 1,847 cases with patient-level splits.
- Test set: only HPO-annotated cases (≈ 818 cases).
- Non-HPO cases are used only as data augmentation during training.
⚠️ Dataset access
GMDB (and CHOP, if used) are not included in this repository.
You must obtain data access following your institutional, IRB, and data-use agreements, then place the files underdata/raw/.
Original facial images in GMDB/CHOP often suffer from:
- Low resolution, blur
- Grayscale or washed-out colors
- Cluttered backgrounds
We build a three-stage preprocessing pipeline shared by both the diagnosis model and StyleGAN3:
-
Color Restoration – DDColor
- Colorize grayscale / washed-out facial images.
-
Face Restoration & Super-Resolution – GFPGAN
- Enhance facial details and upsample low-resolution faces.
-
Face Detection & Normalization – MediaPipe Face Detection
- Detect faces and perform tight cropping.
- Remove background and resize to 224×224 RGB.
- Normalize input space and control token budget for the multimodal LLM.
The script src/data/preprocess_faces.py orchestrates DDColor, GFPGAN, and MediaPipe and writes processed images into data/processed/images.
✅ This is the main part of the project.
All other components (e.g., StyleGAN3) are designed to support or analyze this diagnosis model.
We target Mendelian rare disease diagnosis from:
- Facial gestalt (front-view facial images).
- HPO-encoded clinical text (+/- demographics).
Challenges:
- GMDB is highly long-tailed (many rare diseases with few examples).
- Data is noisy (missing HPO terms, variable image quality).
- Standard cross-entropy tends to overfit head classes and underperform on rare diseases.
GestaltM³D-VL is built on Qwen-VL models:
- Qwen-2-VL-7B-Instruct
- Qwen-2.5-VL-7B-Instruct (default backbone)
- (Optional) Qwen-3-VL-8B-Instruct
Advantages:
- Strong vision–language alignment.
- Good parameter–compute trade-off (7–8B).
- Handles long, noisy clinical text.
We adapt Qwen-VL into a multimodal sequence classifier:
-
Inputs:
image: 224×224 preprocessed facial image.text:- HPO-encoded phenotypes (primary signal).
- Optional demographics (often dropped to reduce bias).
-
Multimodal encoding:
- Construct an instruction-style prompt that embeds HPO terms into natural language.
- Feed image + text into Qwen-VL.
- Obtain the hidden state at a special
[CLS]token (or equivalent pooled representation).
-
Classifier head:
- Apply a small MLP/Linear(D → #classes) head to predict disease logits.
-
Loss:
- Use Class-Balanced Focal Loss to handle severe imbalance.
Let ( n_y ) be the number of samples of class ( y ).
We compute effective class weights ( \alpha_y ) based on the effective number of samples (e.g., Cui et al., CVPR 2019) and adopt focal loss with focusing parameter ( \gamma ):
- Increases the importance of rare diseases.
- Focuses training on hard examples instead of easy head-class samples.
Implementation:
src/models/mm_llm/losses.py
Typical strategy:
- Freeze a large portion of the text backbone.
- Optionally partially freeze the vision backbone.
- Train:
- Multimodal projector layers.
- Top transformer blocks (via LoRA or full finetuning).
- Final classifier head.
Training entry point:
src/models/mm_llm/train_mm_llm.pyscripts/train_mm_llm.sh
Evaluation:
src/models/mm_llm/evaluate_mm_llm.pyscripts/eval_mm_llm.sh
- Multimodal (image + text) > image-only.
- Dropping demographics can maintain or improve performance and reduce explicit reliance on ethnicity/age/sex.
- Upgrading from Qwen-2-VL to Qwen-2.5-VL improves overall diagnosis accuracy and especially rare-class performance.
ℹ️ StyleGAN3 is auxiliary in this repository.
Its role is to support evaluation and visualization of rare disease facial gestalt, not to perform diagnosis itself.
We use StyleGAN3 as a class-conditional face generator trained on GMDB ten-disease faces. Synthetic images are used to:
- Visualize disease-specific facial patterns learned from GMDB.
- Provide synthetic exemplars for:
- Teaching materials
- Qualitative comparison with GestaltM³D-VL predictions
- Optionally stress-test robustness (e.g., more varied poses, ages, or subtle changes).
- Alias-free generation for better geometric consistency.
- Class-conditional setup across ten disease classes (+ optional unaffected class).
- Trained on the same preprocessed 224×224 faces used by GestaltM³D-VL.
Configuration (see configs/stylegan3/ten_disease_default.yaml):
- Dataset: preprocessed GMDB faces.
- Classes: 10 diseases (+/- unaffected).
Run:
bash scripts/train_stylegan3.sh
bash scripts/sample_stylegan3.shThis will populate:
data/processed/stylegan3_samples/
disease_0/
disease_1/
...
with class-conditional synthetic faces.
You can use the synthetic images to:
- Build grids of synthetic faces per disease.
- Compare real patient images & model predictions with:
- Synthetic faces from the predicted disease.
- Support case studies and clinical teaching materials.
We provide an interactive web-based gallery to explore the synthetic facial images generated by our StyleGAN3 model:
🌐 Phenotype-Disease Synthetic Image Database
This gallery features:
- StyleGAN3-generated synthetic faces for all ten disease classes
- Interactive filtering by disease type, age, gender, and ethnicity
- Metadata annotations corresponding to the generation conditions
- Download functionality for research and educational purposes
⚠️ Disclaimer: All images displayed are fully synthetic and generated by StyleGAN3. They do not represent real patients and are intended for research and educational purposes only.
Diagnosis is the core; StyleGAN3 is auxiliary:
.
├── README.md
├── LICENSE
├── requirements.txt
├── environment.yml
├── configs/
│ ├── mm_llm/
│ │ └── qwen2_5_vl_ten_disease.yaml
│ └── stylegan3/
│ └── ten_disease_default.yaml
├── data/
│ ├── raw/
│ ├── processed/
│ │ └── stylegan3_samples/
│ └── splits/
│ └── ten_disease/
│ ├── train.csv
│ ├── val.csv
│ └── test.csv
├── docs/
│ └── figures/
│ ├── overview_pipeline.png
│ ├── example_cases.png
│ ├── dataset_stats.png
│ ├── preprocessing_pipeline.png
│ ├── mm_llm_architecture.png
│ ├── stylegan3_overview.png
│ ├── stylegan3_synthetic_examples.png
│ └── method_block_diagram.png
├── src/
│ ├── data/
│ │ ├── gmdb_dataset.py
│ │ ├── preprocess_faces.py
│ │ └── build_splits.py
│ ├── models/
│ │ ├── mm_llm/
│ │ │ ├── mm_classifier.py
│ │ │ ├── losses.py
│ │ │ ├── train_mm_llm.py
│ │ │ └── evaluate_mm_llm.py
│ │ └── stylegan3/
│ │ ├── train_stylegan3.py
│ │ └── sample_stylegan3.py
│ └── utils/
│ ├── config.py
│ ├── logging_utils.py
│ ├── metrics.py
│ └── seed.py
└── scripts/
├── prepare_gmdb.sh
├── train_mm_llm.sh
├── eval_mm_llm.sh
├── train_stylegan3.sh
└── sample_stylegan3.sh
conda env create -f environment.yml
conda activate gestaltm3d-vlpython -m venv .venv
source .venv/bin/activate # on Windows: .venv\Scripts\activate
pip install -r requirements.txtrequirements.txt (sketch):
torch,torchvision,torchaudiotransformers,accelerate,bitsandbytes(optional)timmopencv-python,mediapipegfpgan,ddcolorscikit-learnpandas,numpymatplotlib,seabornrich,pyyaml
Place GMDB (and optional CHOP) metadata and images under:
data/raw/
gmdb_metadata.csv
images/
patient_*.jpg
The exact file naming depends on your internal export.
Adjust insrc/data/build_splits.pyandsrc/data/preprocess_faces.pyaccordingly.
python -m src.data.build_splits \
--input_meta data/raw/gmdb_metadata.csv \
--output_dir data/splits/ten_diseaseCreates:
train.csvval.csvtest.csv
with columns such as:
image_pathdisease_labelhpo_textdemographics(optional)
python -m src.data.preprocess_faces \
--input_dir data/raw/images \
--meta_csv data/splits/ten_disease/train.csv \
--output_dir data/processed/imagesRepeat or adapt for val/test splits.
# Train GestaltM3D-VL on GMDB ten-disease subset
bash scripts/train_mm_llm.sh
# Evaluate on held-out test set
bash scripts/eval_mm_llm.shControl hyperparameters and backbone in:
configs/mm_llm/qwen2_5_vl_ten_disease.yaml
# Train StyleGAN3 on preprocessed GMDB faces
bash scripts/train_stylegan3.sh
# Sample synthetic faces per disease class
bash scripts/sample_stylegan3.shSynthetic samples are stored in:
data/processed/stylegan3_samples/
placeholderAlso consider citing:
- GestaltMatcher / GMDB / GestaltMML
- Qwen2-VL / Qwen2.5-VL / Qwen3-VL technical reports
- StyleGAN3, DDColor, GFPGAN, MediaPipe
- License: MIT LICENSE.
- Data: This repository does not include patient-identifiable GMDB/CHOP data.
- Usage:
- Any use of real patient data must follow IRB, data-use agreements, and local regulations.
- Synthetic images are intended for research and evaluation only.
This repository is for research and method development only.
It is not a certified medical device and should not be used directly for clinical decision-making without thorough validation and regulatory approval.







