MM2Latent: Text-to-facial image generation and editing in GANs with multimodal assistance

arXiv Paper

Authors' official PyTorch implementation of "MM2Latent: Text-to-facial image generation and editing in GANs with multimodal assistance", accepted at the Advances in Image Manipulation (AIM) Workshop of ECCV 2024. If you find this code useful for your research, please cite our paper.

MM2Latent: Text-to-facial image generation and editing in GANs with multimodal assistance
Debin Meng, Christos Tzelepis, Ioannis Patras, and Georgios Tzimiropoulos
Advances in Image Manipulation (AIM) Workshop of ECCV 2024.
Abstract: Generating human portraits is a hot topic in image generation, e.g. mask-to-face and text-to-face generation. However, these unimodal generation methods lack controllability. Controllability can be enhanced by exploiting the advantages and complementarities of different modalities: for instance, text excels at controlling diverse attributes, while masks excel at controlling spatial locations. Current state-of-the-art multimodal generation methods are limited by their reliance on extensive hyperparameters, manual operations at inference, substantial computational demands during training and inference, or an inability to edit real images. In this paper, we propose MM2Latent, a practical framework for multimodal image generation and editing. We use StyleGAN2 as our image generator, FaRL for text encoding, and train autoencoders for spatial modalities such as mask, sketch, and 3DMM. We propose a strategy that trains a mapping network to map the multimodal input into the w latent space of StyleGAN. The proposed framework 1) eliminates hyperparameters and manual operations at inference, 2) ensures fast inference speeds, and 3) enables the editing of real images. Extensive experiments demonstrate that our method achieves superior performance in multimodal image generation, surpassing recent GAN- and diffusion-based methods. It also proves effective in multimodal image editing and is faster than GAN- and diffusion-based methods.
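To make the core idea concrete, here is a minimal, hypothetical sketch of a mapping network of the kind the abstract describes: an MLP that takes a text embedding (e.g. from FaRL) concatenated with a spatial-modality embedding (e.g. from a mask/sketch autoencoder) and maps them into StyleGAN2's 512-dimensional w latent space. All dimensions and the architecture are placeholder assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Hypothetical MLP mapping concatenated multimodal embeddings
    into a StyleGAN2-style 512-dim w latent vector."""
    def __init__(self, text_dim=512, spatial_dim=512, w_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + spatial_dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, w_dim),
        )

    def forward(self, text_emb, spatial_emb):
        # Fuse the two modalities by concatenation, then project to w space
        return self.net(torch.cat([text_emb, spatial_emb], dim=-1))

mapper = MappingNetwork()
w = mapper(torch.randn(2, 512), torch.randn(2, 512))  # batch of 2
print(w.shape)  # torch.Size([2, 512])
```

The predicted w vector would then be fed to a frozen StyleGAN2 generator to synthesize the image, so only the lightweight mapper needs training.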

📦 Environment Setup

1. Clone the Repository

git clone https://github.com/Open-Debin/MM2Latent.git
cd MM2Latent
mkdir outsource
cd outsource
git clone https://github.com/omertov/encoder4editing.git
cd ..

2. Recommended Versions

  • Python: 3.8.17
  • CUDA: 12.2.2 (built with gcc-12.2.0)

Other environments may also work, as long as you can successfully run the StyleGAN2 generator.

3. Create Conda Environment

conda env create -f ./environment/environement.yml
conda activate mm2latent  

🛠️ Installation Steps

1. Download Pretrained Models

Download StyleGAN2 Generator and FaRL_ep64 model. After downloading, place both files in the ./models directory.

2. Data Processing - Face Parsing

cd ./face_parsing
python face_rgb2greyseg.py
python face_rgb2sketch.py
cd ..

3. Extract Multimodal Embeddings

cd ./modalities_encoding
python greyseg2embed.py
python sketch2embed.py
cd ..
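Conceptually, these scripts encode each spatial input (grey segmentation map or sketch) to a fixed-size vector and cache it to disk for training. The toy round-trip below illustrates that caching pattern only; the 512-dim size, the `.npy` format, and the random stand-in for the encoder output are assumptions, not what the repository's scripts actually write.

```python
import os
import tempfile
import numpy as np

# Stand-in for an encoder output: in the real pipeline this vector would
# come from the trained spatial-modality autoencoder, not a RNG.
rng = np.random.default_rng(0)
embedding = rng.standard_normal(512).astype(np.float32)

# Cache the embedding to disk so training can load it without re-encoding
out_path = os.path.join(tempfile.mkdtemp(), "example_embed.npy")
np.save(out_path, embedding)

loaded = np.load(out_path)
print(loaded.shape, loaded.dtype)  # (512,) float32
```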

▶️ Run the Demo

Open and run the .ipynb demo files in the ./demo directory using Jupyter Notebook or a similar tool.

Citation

If you find this work useful, please consider citing it:

@article{meng2024mm2latent,
  title={MM2Latent: Text-to-facial image generation and editing in GANs with multimodal assistance},
  author={Meng, Debin and Tzelepis, Christos and Patras, Ioannis and Tzimiropoulos, Georgios},
  journal={arXiv preprint arXiv:2409.11010},
  year={2024}
}

Acknowledgment

This research was supported by the EU's Horizon 2020 programme H2020-951911 AI4Media project.
