A conditional denoising diffusion probabilistic model (DDPM) trained in latent space to generate paintings in the styles of famous artists. The animation of the latent diffusion process is shown in the figure below.
Fig. 1. The animation of the latent diffusion process.
The model generalizes to different image sizes; see the generated examples below.
Fig. 2. Generated painting in the style of Ivan Aivazovsky.
Fig. 3. Generated painting in the style of Ivan Aivazovsky.
Fig. 4. Generated painting in the style of Ivan Aivazovsky.
Fig. 5. Generated painting in the style of Martiros Saryan.
Fig. 6. Generated painting in the style of Camille Pissarro.
Fig. 7. Generated painting in the style of Pyotr Konchalovsky.
Fig. 8. Generated painting in the style of Pierre Auguste Renoir.
- config.py defines the model hyperparameters.
- dataset.py contains the dataset class.
- generate_features.py contains functions to prepare the dataset.
- models.py contains the implementation of the latent UNet model.
- pipeline.py implements the latent diffusion pipeline.
- train.py trains the LatentUNet model on a single GPU instance.
- evaluate.py evaluates the trained pipeline.
- the inference_example notebook shows inference examples for the developed pipeline.
We use the WikiArt dataset, which contains 81,444 pieces of visual art by various artists. All images are cropped and resized to 512x512 resolution. To convert images into latent representations, we apply the pretrained VQ-VAE from Stability AI's Stable Diffusion model.
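A minimal sketch of this latent-encoding step, assuming the VQ-VAE is loaded through the `VQModel` class from diffusers; the checkpoint name and the 512x512 preprocessing constants are illustrative and may differ from generate_features.py:

```python
import torch
from diffusers import VQModel
from torchvision import transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative checkpoint: any pretrained VQ-VAE loadable via VQModel works here.
vqvae = VQModel.from_pretrained("CompVis/ldm-celebahq-256", subfolder="vqvae").to(device)
vqvae.eval()

preprocess = transforms.Compose([
    transforms.CenterCrop(512),          # crop to a square, as in the dataset prep
    transforms.Resize(512),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),  # map pixel values to [-1, 1]
])

@torch.no_grad()
def image_to_latent(path: str) -> torch.Tensor:
    """Encode one painting into its latent representation."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    # .latents is the continuous encoder output before vector quantization
    return vqvae.encode(image).latents
```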
We adapted the 2D UNet model from the Hugging Face diffusers package by adding three embedding layers that control the painting style: artist name, genre name, and style name. Before adding each style embedding to the time embedding, we pass it through a PreNet module.
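A minimal sketch of this conditioning path; the PreNet internals (assumed here to be a small two-layer MLP) and all dimensions are assumptions, and the actual models.py may differ:

```python
import torch
import torch.nn as nn

class PreNet(nn.Module):
    """Small MLP applied to a style embedding before it joins the time embedding."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class StyleConditioning(nn.Module):
    """Embeds artist/genre/style ids and merges them into the UNet time embedding."""
    def __init__(self, n_artists: int, n_genres: int, n_styles: int, time_dim: int):
        super().__init__()
        self.artist_emb = nn.Embedding(n_artists, time_dim)
        self.genre_emb = nn.Embedding(n_genres, time_dim)
        self.style_emb = nn.Embedding(n_styles, time_dim)
        self.prenets = nn.ModuleList(PreNet(time_dim) for _ in range(3))

    def forward(self, t_emb, artist_id, genre_id, style_id):
        embs = (self.artist_emb(artist_id),
                self.genre_emb(genre_id),
                self.style_emb(style_id))
        # each style embedding passes through its own PreNet, then adds to t_emb
        for prenet, e in zip(self.prenets, embs):
            t_emb = t_emb + prenet(e)
        return t_emb
```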
The network is trained to predict the unscaled noise component with the Huber loss function, which produces better results on this dataset than L2 loss. During evaluation, the generated latent representations are decoded into images with the pretrained VQ-VAE.
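A minimal sketch of one training step under this objective, using the DDPMScheduler from diffusers; the model call signature and batch field names are illustrative, not the exact train.py API:

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

def training_step(latent_unet, batch, device="cuda"):
    latents = batch["latents"].to(device)  # precomputed VQ-VAE latents
    noise = torch.randn_like(latents)      # target: the unscaled noise component
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    # predict the noise, conditioned on artist/genre/style ids
    pred = latent_unet(noisy_latents, timesteps,
                       batch["artist_id"].to(device),
                       batch["genre_id"].to(device),
                       batch["style_id"].to(device))
    # Huber (smooth L1) loss in place of plain MSE
    return F.smooth_l1_loss(pred, noise)
```

At evaluation time, the pipeline runs the reverse diffusion loop in latent space and passes the final latents through the VQ-VAE decoder to obtain images.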