Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining
Molino, D., Caruso, C. M., Ruffini, F., Soda, P., Guarrasi, V. (2025)
Our approach combines:
- A 3D CLIP-style encoder for vision-language alignment between CT volumes and radiology reports.
- A volumetric VAE for latent compression of 3D CT data.
- A Latent Diffusion Model with cross-attention conditioning for controllable text-to-CT generation.
This design enables direct synthesis of anatomically consistent, semantically faithful, and high-resolution CT volumes from textual descriptions.
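At a glance, the three components compose at sampling time as sketched below. This is a schematic sketch only: the dummy modules, tensor shapes, and step count are placeholders standing in for CLIP3D, the cross-attention UNet, and the volumetric VAE decoder, not the released implementation (see the scripts below for the real entry points).

```python
# Schematic sketch of text-to-CT sampling with placeholder modules.
# Shapes, step count, and the Euler update are illustrative only.
import torch
import torch.nn as nn

class DummyTextEncoder(nn.Module):            # stands in for the 3D CLIP text branch
    def forward(self, tokens):                 # tokens: (B, L) token ids
        return torch.randn(tokens.shape[0], tokens.shape[1], 512)  # (B, L, dim) context

class DummyUNet(nn.Module):                    # stands in for the cross-attention UNet
    def forward(self, z, t, context):
        return torch.zeros_like(z)             # predicted velocity (rectified-flow style)

class DummyVAEDecoder(nn.Module):              # stands in for the volumetric VAE decoder
    def forward(self, z):
        return torch.zeros(z.shape[0], 1, 256, 256, 128)  # synthetic CT volume

text_encoder, unet, decoder = DummyTextEncoder(), DummyUNet(), DummyVAEDecoder()

tokens = torch.randint(0, 1000, (1, 77))       # tokenized radiology report (placeholder)
context = text_encoder(tokens)                 # text conditioning for cross-attention
z = torch.randn(1, 4, 64, 64, 32)              # Gaussian noise in the VAE latent space

num_steps = 50                                 # illustrative number of sampling steps
for i in range(num_steps):
    t = torch.tensor(1.0 - i / num_steps)      # time runs from 1 (noise) to 0 (data)
    v = unet(z, t, context)
    z = z - v / num_steps                      # one Euler integration step

ct_volume = decoder(z)                         # decode latents back to voxel space
print(ct_volume.shape)                         # e.g. (1, 1, 256, 256, 128)
```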
We release 1,000 synthetic chest CT scans generated with our model for the VLM3D Challenge.
➡️ Available on Hugging Face: Synthetic Text-to-CT Dataset
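A minimal way to pull the release locally; the repo id below is a placeholder, so use the dataset id behind the link above.

```python
# Sketch: download the synthetic dataset from the Hugging Face Hub.
# The repo_id is a placeholder; use the dataset id from the link above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="dmolino/synthetic-text2ct",  # placeholder id
    repo_type="dataset",
)
print(local_dir)  # local folder with the 1,000 synthetic chest CT scans
```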
- Preprint: arXiv:2506.00633
- Python 3.10.8
- Install dependencies:
```bash
pip install -r requirements.txt
```

Pretrained weights required:
- `autoencoder_epoch273.pt`
- `unet_rflow_200ep.pt`
- `CLIP3D_Finding_Impression_30ep.pt`

You can download them from the Hugging Face Weights repository:
```python
from huggingface_hub import hf_hub_download

repo_id = "dmolino/text2ct-weights"
autoencoder = hf_hub_download(repo_id, "autoencoder_epoch273.pt")
unet = hf_hub_download(repo_id, "unet_rflow_200ep.pt")
clip = hf_hub_download(repo_id, "CLIP3D_Finding_Impression_30ep.pt")
```

Set these paths in the configs:
- `trained_autoencoder_path` -> autoencoder
- `existing_ckpt_filepath` / `model_filename` -> unet
- `clip_weights` -> clip
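If you prefer to set them programmatically, here is a minimal sketch that writes the checkpoint paths into a config file. Which key lives in which JSON under `configs/` is an assumption, so move keys to the file that actually defines them in your setup.

```python
# Minimal sketch: write the downloaded checkpoint paths into a config file.
# Assumption: all three keys live in environment_diff_model_train.json.
import json

def set_config_keys(path, updates):
    """Read a JSON config, update the given keys, and write it back."""
    with open(path) as f:
        cfg = json.load(f)
    cfg.update(updates)
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)

set_config_keys("configs/environment_diff_model_train.json", {
    "trained_autoencoder_path": "./models/autoencoder_epoch273.pt",
    "existing_ckpt_filepath": "./models/unet_rflow_200ep.pt",
    "clip_weights": "./models/CLIP3D_Finding_Impression_30ep.pt",
})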
- `scripts/download_ctrate.py`: download CT-RATE volumes from HF.
- `scripts/preprocess_ctrate.py`: reorient/clip/resample CT-RATE to fixed spacing/shape.
- `scripts/save_embeddings_ctrate.py`: encode reports with CLIP3D and save impressions as npy.
- `scripts/diff_model_create_training_data.py`: extract VAE latent embeddings for CT volumes.
- `scripts/diff_model_train.py`: train the diffusion UNet.
- `scripts/diff_model_infer.py`: batch inference over data lists.
- `scripts/diff_model_demo.py`: one-off generation from a provided report (no precomputed impressions).
We use the CT-RATE dataset (Hugging Face). Helpers provided:
- Download: `scripts/download_ctrate.py` (pull volumes from HF).
- Preprocess: `scripts/preprocess_ctrate.py` (reorient to RAS, clip HU, resample to fixed spacing/shape; sketched below).
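For orientation, the preprocessing stage roughly corresponds to the MONAI transform chain below. The HU window, target spacing, and output shape are placeholders, not the values hard-coded in `scripts/preprocess_ctrate.py`.

```python
# Rough sketch of the reorient/clip/resample pipeline using MONAI transforms.
# HU window, spacing, and output size are placeholders, not the script's values.
from monai.transforms import (
    Compose, LoadImage, Orientation, ScaleIntensityRange, Spacing, Resize,
)

preprocess = Compose([
    LoadImage(image_only=True, ensure_channel_first=True),
    Orientation(axcodes="RAS"),                        # reorient to RAS
    ScaleIntensityRange(a_min=-1000, a_max=1000,       # clip HU to a window (placeholder)
                        b_min=-1000, b_max=1000, clip=True),
    Spacing(pixdim=(1.0, 1.0, 1.5), mode="bilinear"),  # resample to fixed spacing (placeholder)
    Resize(spatial_size=(256, 256, 128)),              # resize to a fixed shape (placeholder)
])

volume = preprocess("dataset/train/example_volume.nii.gz")  # hypothetical file name
print(volume.shape)  # (1, 256, 256, 128)
```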
After download/preprocess, ensure:
- `dataset/` contains the CT volumes.
- `data/train_data_volumes.json` and `data/validation_data_volumes.json` list volumes with relative paths (e.g., `dataset/train/...`).
- `data/train_reports.csv` and `data/validation_reports.csv` contain the text reports (`VolumeName`, `Findings_EN`, `Impressions_EN`).
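A quick, non-authoritative way to check this layout is sketched below. It assumes the data-list JSONs are flat lists of relative volume paths, which may differ from the actual schema.

```python
# Sanity-check sketch for the expected layout. Assumption: the data-list JSONs are
# flat lists of relative volume paths; adapt the loop if the schema differs.
import csv, json, os

with open("data/train_data_volumes.json") as f:
    volumes = json.load(f)
missing = [v for v in volumes if not os.path.exists(v)]  # paths like dataset/train/...
print(f"{len(volumes)} training volumes listed, {len(missing)} missing on disk")

with open("data/train_reports.csv", newline="") as f:
    rows = list(csv.DictReader(f))
expected_cols = {"VolumeName", "Findings_EN", "Impressions_EN"}
print("report columns OK:", expected_cols <= set(rows[0].keys()))
```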
- VAE latent embeddings (CT) – `scripts/diff_model_create_training_data.py`:

```bash
python scripts/diff_model_create_training_data.py \
    --model_def ./configs/config_rflow.json \
    --model_config ./configs/config_diff_model.json \
    --env_config ./configs/environment_diff_model_train.json \
    --num_gpus 1 \
    --index 0
```

Key fields in `environment_diff_model_train.json`:
- `data_base_dir`: set to `dataset`
- `embedding_base_dir`: output folder for latents (e.g., `./embeddings`)
- `trained_autoencoder_path`: `./models/autoencoder_epoch273.pt`
- Report embeddings (CLIP3D) – `scripts/save_embeddings_ctrate.py`:

```bash
python scripts/save_embeddings_ctrate.py \
    --train_json data/train_data_volumes.json \
    --val_json data/validation_data_volumes.json \
    --train_reports data/train_reports.csv \
    --val_reports data/validation_reports.csv \
    --data_base_dir dataset \
    --embedding_base_dir ./embeddings \
    --clip_weights ./models/CLIP3D_Finding_Impression_30ep.pt \
    --report_encoder_model xgem_3D
```
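To spot-check the output of this step, you can load one of the saved impression embeddings; the per-volume `.npy` naming under `./embeddings` is an assumption here.

```python
# Sketch: load one saved report (impression) embedding written by the step above.
# Assumption: embeddings are stored as per-volume .npy files under ./embeddings.
import numpy as np
from pathlib import Path

emb_files = sorted(Path("./embeddings").rglob("*.npy"))
emb = np.load(emb_files[0])
print(emb_files[0].name, emb.shape, emb.dtype)
```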
Train the diffusion UNet – `scripts/diff_model_train.py`:

```bash
python scripts/diff_model_train.py \
    --model_def ./configs/config_rflow.json \
    --model_config ./configs/config_diff_model.json \
    --env_config ./configs/environment_diff_model_train.json \
    --num_gpus 1
```

Weights expected in `models/`:
- `autoencoder_epoch273.pt`
- `unet_rflow_200ep.pt` (or your own checkpoint via `existing_ckpt_filepath`)
Run inference over a data list – `scripts/diff_model_infer.py`:

```bash
python scripts/diff_model_infer.py \
    --model_def ./configs/config_rflow.json \
    --model_config ./configs/config_diff_model.json \
    --env_config ./configs/environment_diff_model_eval.json \
    --num_gpus 1 \
    --index 0 \
    --resize 512
```

Outputs go to `output_dir` set in `environment_diff_model_eval.json`.
Generate a CT volume from a custom report (no precomputed impressions) – `scripts/diff_model_demo.py`:

```bash
python scripts/diff_model_demo.py \
    --model_def ./configs/config_rflow.json \
    --model_config ./configs/config_diff_model.json \
    --env_config ./configs/environment_diff_model_eval.json \
    --num_gpus 1
```

Edit `example_report` inside the script to your text. Output: `predictions/demo.nii.gz`.
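To inspect the generated volume, something like the following works, assuming `nibabel` is installed; the path matches the demo output above.

```python
# Sketch: load and inspect the generated volume (requires nibabel).
import nibabel as nib

img = nib.load("predictions/demo.nii.gz")
vol = img.get_fdata()
print(vol.shape, img.header.get_zooms())  # voxel grid and spacing of the synthetic CT
```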
For questions or collaborations:
Daniele Molino – daniele.molino@unicampus.it
This repository is heavily based on:
- MAISI tutorials (MONAI): https://github.com/Project-MONAI/tutorials/tree/main/generation/maisi
- XGeM: https://github.com/cosbidev/XGeM
