MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning [WACV 2025]
This repository contains the official PyTorch implementation of MegaFusion: https://arxiv.org/abs/2408.11001/
We are standardizing our code and will gradually open-source it in the near future, so please stay tuned.
Project Page
- [2024.10.29] MegaFusion has been accepted to WACV 2025.
- [2024.9.10] A new version of the paper has been uploaded. Please check out the latest version for further technical details, evaluations, and visualizations.
- [2024.8.20] Our pre-print paper has been released on arXiv; we are working on the code and will open-source it shortly.
- Python >= 3.8 (Anaconda or Miniconda recommended)
- PyTorch >= 1.12
- xformers == 0.0.13
- diffusers == 0.13.1
- accelerate == 0.17.1
- transformers == 4.27.4
A suitable conda environment named `megafusion` can be created and activated with:
conda env create -f environment.yaml
conda activate megafusion
Since MegaFusion is designed to extend existing diffusion-based text-to-image models towards higher-resolution generation, we provide official MegaFusion implementations on several representative models, including StableDiffusion, StableDiffusion-XL, DeepFloyd, ControlNet, and IP-Adapter.
First, please download the pre-trained StableDiffusion-1.5 from SDM-1.5. Then, place all pre-trained checkpoints into the corresponding location in the folder ./SDM-MegaFusion/ckpt/stable-diffusion-v1-5/.
Run the inference demo with:
CUDA_VISIBLE_DEVICES=0 accelerate launch inference.py
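Until the full pipeline is open-sourced, the sketch below shows only the generic diffusers loading flow that `inference.py` builds on; the prompt, target resolution, and sampling settings are placeholders, and the MegaFusion-specific refinement itself is not included here.

```python
# Minimal sketch (not the official MegaFusion pipeline): load SDM-1.5 from the
# local checkpoint folder described above and sample beyond the native resolution.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "./SDM-MegaFusion/ckpt/stable-diffusion-v1-5",  # folder layout from above
    torch_dtype=torch.float16,
).to("cuda")

# Placeholder prompt and resolution; the official inference.py applies the
# MegaFusion refinement on top of a pipeline like this one.
image = pipe(
    "a photo of an astronaut riding a horse",
    height=1024, width=1024,  # higher than the native 512x512
    num_inference_steps=50,
).images[0]
image.save("sample.png")
```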
Taking computational overhead into consideration, we only use SDXL-base and discard SDXL-refiner in our project. First, please download the pre-trained StableDiffusion-XL from SDXL-base. Then, place all pre-trained checkpoints into the corresponding location in the folder ./SDXL-MegaFusion/ckpt/.
To be updated soon...
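Until the SDXL code is released, here is a minimal, generic loading sketch under the folder layout above. Note that SDXL pipelines are not available in the diffusers version pinned in the requirements (they landed around diffusers 0.19), so a newer diffusers would be needed; the prompt and resolution are placeholders.

```python
# Minimal sketch (not the official MegaFusion pipeline): load SDXL-base from
# the local checkpoint folder. Requires diffusers >= 0.19 for SDXL support.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "./SDXL-MegaFusion/ckpt",  # folder layout from above
    torch_dtype=torch.float16,
).to("cuda")

# Placeholder prompt; 2048x2048 is above SDXL's native 1024x1024.
image = pipe(
    "a cinematic photo of a lighthouse at dusk",
    height=2048, width=2048,
).images[0]
image.save("sdxl_sample.png")
```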
Taking computational overhead into consideration, we only use the first two stages of DeepFloyd and discard the last stage in our project.
First, please download the pre-trained DeepFloyd models from SDM. Then, place all pre-trained checkpoints into the corresponding location in the folder ./DeepFloyd/ckpt/.
To be updated soon...
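Likewise, until the DeepFloyd inference code is released, the sketch below shows the generic two-stage DeepFloyd IF flow in diffusers (which also requires a newer diffusers than the pinned version); the checkpoint subfolder names under ./DeepFloyd/ckpt/ and the prompt are assumptions.

```python
# Minimal sketch (not the official MegaFusion pipeline): run the first two
# DeepFloyd IF stages, as described above (the third stage is discarded).
import torch
from diffusers import IFPipeline, IFSuperResolutionPipeline

# Assumed local subfolder names mirroring the official IF checkpoints.
stage1 = IFPipeline.from_pretrained(
    "./DeepFloyd/ckpt/IF-I-XL-v1.0", torch_dtype=torch.float16).to("cuda")
stage2 = IFSuperResolutionPipeline.from_pretrained(
    "./DeepFloyd/ckpt/IF-II-L-v1.0", torch_dtype=torch.float16).to("cuda")

prompt = "a watercolor painting of a fox in a snowy forest"  # placeholder
prompt_embeds, negative_embeds = stage1.encode_prompt(prompt)

# Stage I: base 64x64 generation, kept as tensors for stage II.
image = stage1(prompt_embeds=prompt_embeds,
               negative_prompt_embeds=negative_embeds,
               output_type="pt").images

# Stage II: super-resolve to 256x256.
image = stage2(image=image, prompt_embeds=prompt_embeds,
               negative_prompt_embeds=negative_embeds,
               output_type="pil").images[0]
image.save("if_sample.png")
```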
To be updated soon...
To be updated soon...
Our main experiments are conducted on the widely used MS-COCO dataset; you can download it from MS-COCO.
Taking SDM-MegaFusion as an example, you can load the captions and sample images conditioned on them via:
CUDA_VISIBLE_DEVICES=0 accelerate launch synthesize.py
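For reference, loading the MS-COCO captions themselves is straightforward; the sketch below assumes the standard COCO 2014 annotation layout on disk.

```python
# Minimal sketch: read MS-COCO captions to use as text conditions.
# The annotation path follows the standard COCO 2014 layout and is an
# assumption about your local setup.
import json

with open("annotations/captions_val2014.json") as f:
    coco = json.load(f)

# One caption per annotation entry; each image has ~5 captions.
captions = [ann["caption"] for ann in coco["annotations"]]
print(len(captions), captions[0])
```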
We use the widely adopted FID and KID as our main evaluation metrics. In addition, to quantitatively evaluate the semantic correctness of the synthesized results, we also adopt several language-based scores, including CLIP-T, CIDEr, METEOR, and ROUGE.
To be updated soon...
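Until the evaluation code is released, FID and KID can be reproduced with standard tooling; the sketch below uses torchmetrics (an assumption, not necessarily what our scripts use), with random tensors standing in for the real and generated image batches.

```python
# Minimal sketch (not the official evaluation code): FID/KID via torchmetrics.
# Requires torchmetrics with image extras (torch-fidelity) installed.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=50)

# Stand-in batches: replace with real COCO images and synthesized samples,
# loaded as uint8 tensors of shape (N, 3, H, W).
real_batch = torch.randint(0, 255, (100, 3, 299, 299), dtype=torch.uint8)
fake_batch = torch.randint(0, 255, (100, 3, 299, 299), dtype=torch.uint8)

fid.update(real_batch, real=True)
fid.update(fake_batch, real=False)
kid.update(real_batch, real=True)
kid.update(fake_batch, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item(), "+/-", kid_std.item())
```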
In our project, we use a state-of-the-art open-source VLM, MiniGPT-v2, to caption each synthesized image, further evaluating the semantic correctness of higher-resolution generation.
To be updated soon...
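Until the captioning code is released, the language-based scores above can be computed with the standard pycocoevalcap package once MiniGPT-v2 captions are available; the scoring flow sketched below is an assumption, and the texts are toy placeholders.

```python
# Minimal sketch (not the official evaluation code): score VLM-generated
# captions against the original prompts with CIDEr / METEOR / ROUGE-L,
# using the pycocoevalcap package.
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor  # METEOR requires Java
from pycocoevalcap.rouge.rouge import Rouge

# {image_id: [text]} — references are the prompts, hypotheses are the
# MiniGPT-v2 captions of the synthesized images (placeholders here).
refs = {"0": ["a red double-decker bus on a city street"]}
hyps = {"0": ["a red bus driving down a street in a city"]}

for name, scorer in [("CIDEr", Cider()), ("METEOR", Meteor()), ("ROUGE-L", Rouge())]:
    score, _ = scorer.compute_score(refs, hyps)
    print(name, score)
```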
- Release Paper
- Complete Bibtex
- Model Code of SDM-MegaFusion
- Inference Code of SDM-MegaFusion
- Image Caption Code of MiniGPT-v2
- Evaluation Code
- Model Code of SDXL-MegaFusion
- Inference Code of SDXL-MegaFusion
- Model Code of Floyd-MegaFusion
- Inference Code of Floyd-MegaFusion
If you use this code for your research or project, please cite:
@InProceedings{wu2024megafusion,
author = {Wu, Haoning and Shen, Shaocheng and Hu, Qiang and Zhang, Xiaoyun and Zhang, Ya and Wang, Yanfeng},
title = {MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year = {2025},
}
Many thanks to the code bases from diffusers, SimpleSDM, SimpleSDXL, and DeepFloyd.
If you have any questions, please feel free to contact haoningwu3639@gmail.com or shenshaocheng@sjtu.edu.cn.