This repository hosts the official project page for our work:
Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining
Molino, D., Caruso, C. M., Ruffini, F., Soda, P., Guarrasi, V. (2025)
Our approach combines:
- A 3D CLIP-style encoder for vision-language alignment between CT volumes and radiology reports.
- A volumetric VAE for latent compression of 3D CT data.
- A latent diffusion model with cross-attention conditioning for controllable text-to-CT generation.
This design enables direct synthesis of anatomically consistent, semantically faithful, and high-resolution CT volumes from textual descriptions.
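As a rough, self-contained illustration of the cross-attention conditioning step named above (not the authors' released implementation, and with hypothetical random matrices standing in for learned projection weights), a single cross-attention layer lets flattened 3D latent tokens (queries) attend to text-report embeddings (keys/values):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, text_tokens, d_head, rng):
    """Single-head cross-attention: latent voxels query text embeddings.

    latent_tokens: (N, d_lat) flattened 3D latent grid (queries)
    text_tokens:   (M, d_txt) report token embeddings (keys/values)
    """
    d_lat = latent_tokens.shape[1]
    d_txt = text_tokens.shape[1]
    # Random projections stand in for learned weights (illustration only).
    Wq = rng.standard_normal((d_lat, d_head)) / np.sqrt(d_lat)
    Wk = rng.standard_normal((d_txt, d_head)) / np.sqrt(d_txt)
    Wv = rng.standard_normal((d_txt, d_head)) / np.sqrt(d_txt)
    Q = latent_tokens @ Wq
    K = text_tokens @ Wk
    V = text_tokens @ Wv
    # Scaled dot-product attention over the text tokens.
    attn = softmax(Q @ K.T / np.sqrt(d_head))
    return attn @ V  # (N, d_head): text-conditioned latent features

rng = np.random.default_rng(0)
latent = rng.standard_normal((4 * 4 * 4, 32))  # e.g. a 4x4x4 latent grid, 32 channels
text = rng.standard_normal((16, 64))           # e.g. 16 report tokens, dim 64
out = cross_attention(latent, text, d_head=8, rng=rng)
print(out.shape)  # (64, 8)
```

In the full model this layer sits inside the denoising U-Net, so each diffusion step can steer the 3D latent toward the content of the radiology report.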
We release 1,000 synthetic chest CT scans generated with our model for the VLM3D Challenge.
➡️ Available on Hugging Face: Synthetic Text-to-CT Dataset
- Preprint: arXiv:2506.00633
The full training and inference code will be made available soon.
Stay tuned for updates! ✨
For questions or collaborations, please reach out to:
Daniele Molino – daniele.molino@unicampus.it