MM2Latent: Text-to-facial image generation and editing in GANs with multimodal assistance

arXiv Paper

Authors' official PyTorch implementation of "MM2Latent: Text-to-facial image generation and editing in GANs with multimodal assistance", accepted at the Advances in Image Manipulation (AIM) Workshop of ECCV 2024. If you find this code useful for your research, please cite our paper.

MM2Latent: Text-to-facial image generation and editing in GANs with multimodal assistance
Debin Meng, Christos Tzelepis, Ioannis Patras, and Georgios Tzimiropoulos
Advances in Image Manipulation (AIM) Workshop of ECCV 2024.
Abstract: Generating human portraits is a hot topic in image generation, e.g. mask-to-face and text-to-face generation. However, these unimodal generation methods lack controllability. Controllability can be enhanced by exploiting the advantages and complementarities of different modalities: for instance, text excels at controlling diverse attributes, while masks excel at controlling spatial locations. Current state-of-the-art multimodal generation methods are limited by their reliance on extensive hyperparameters, manual operations at inference, substantial computational demands during training and inference, or an inability to edit real images. In this paper, we propose MM2Latent, a practical framework for multimodal image generation and editing. We use StyleGAN2 as our image generator, FaRL for text encoding, and train autoencoders for spatial modalities such as mask, sketch, and 3DMM. We propose a strategy that trains a mapping network to map the multimodal input into the w latent space of StyleGAN. The proposed framework 1) eliminates hyperparameters and manual operations at inference, 2) ensures fast inference speeds, and 3) enables the editing of real images. Extensive experiments demonstrate that our method achieves superior performance in multimodal image generation, surpassing recent GAN- and diffusion-based methods. It also proves effective in multimodal image editing and is faster than GAN- and diffusion-based methods.
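To make the core idea concrete, here is a minimal, hypothetical sketch of a mapping network of the kind the abstract describes: an MLP that takes a text embedding (e.g. from FaRL) concatenated with a spatial-modality embedding (e.g. from a mask/sketch autoencoder) and maps them into StyleGAN2's 512-dimensional w latent space. All dimensions and the architecture are placeholder assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Hypothetical MLP mapping concatenated multimodal embeddings
    into a StyleGAN2-style 512-dim w latent vector."""
    def __init__(self, text_dim=512, spatial_dim=512, w_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + spatial_dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, w_dim),
        )

    def forward(self, text_emb, spatial_emb):
        # Fuse the two modalities by concatenation, then project to w space
        return self.net(torch.cat([text_emb, spatial_emb], dim=-1))

mapper = MappingNetwork()
w = mapper(torch.randn(2, 512), torch.randn(2, 512))  # batch of 2
print(w.shape)  # torch.Size([2, 512])
```

The predicted w vector would then be fed to a frozen StyleGAN2 generator to synthesize the image, so only the lightweight mapper needs training.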

📦 Environment Setup

1. Clone the Repository

git clone https://github.com/Open-Debin/MM2Latent.git
cd MM2Latent
mkdir outsource
cd outsource
git clone https://github.com/omertov/encoder4editing.git
cd ..

2. Recommended Versions

  • Python: 3.8.17
  • CUDA: 12.2.2 (built with gcc-12.2.0)

Other environments may also work, as long as you can successfully run the StyleGAN2 generator.

3. Create Conda Environment

conda env create -f ./environment/environement.yml
conda activate mm2latent  

🛠️ Installation Steps

1. Download Pretrained Models

Download StyleGAN2 Generator and FaRL_ep64 model. After downloading, place both files in the ./models directory.

2. Data Processing - Face Parsing

cd ./face_parsing
python face_rgb2greyseg.py
python face_rgb2sketch.py
cd ..

3. Extract Multimodal Embeddings

cd ./modalities_encoding
python greyseg2embed.py
python sketch2embed.py
cd ..
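Conceptually, these scripts encode each spatial input (grey segmentation map or sketch) to a fixed-size vector and cache it to disk for training. The toy round-trip below illustrates that caching pattern only; the 512-dim size, the `.npy` format, and the random stand-in for the encoder output are assumptions, not what the repository's scripts actually write.

```python
import os
import tempfile
import numpy as np

# Stand-in for an encoder output: in the real pipeline this vector would
# come from the trained spatial-modality autoencoder, not a RNG.
rng = np.random.default_rng(0)
embedding = rng.standard_normal(512).astype(np.float32)

# Cache the embedding to disk so training can load it without re-encoding
out_path = os.path.join(tempfile.mkdtemp(), "example_embed.npy")
np.save(out_path, embedding)

loaded = np.load(out_path)
print(loaded.shape, loaded.dtype)  # (512,) float32
```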

▶️ Run the Demo

Open and run the .ipynb demo files in the ./demo directory using Jupyter Notebook or a similar tool.

Citation

If you find this work useful, please consider citing it:

@article{meng2024mm2latent,
  title={MM2Latent: Text-to-facial image generation and editing in GANs with multimodal assistance},
  author={Meng, Debin and Tzelepis, Christos and Patras, Ioannis and Tzimiropoulos, Georgios},
  journal={arXiv preprint arXiv:2409.11010},
  year={2024}
}

Acknowledgment

This research was supported by the EU's Horizon 2020 programme H2020-951911 AI4Media project.
