Advancing Our Understanding of Diffusion Models:
A Deep Dive Into "Diffusion Models Already Have a Semantic Latent Space"
In this blog post, we discuss, reproduce, and extend the findings of the ICLR 2023 paper titled "Diffusion Models Already Have a Semantic Latent Space". The paper introduces an algorithm called asymmetric reverse process (Asyrp) to uncover a semantic latent space in frozen diffusion models. The authors showcase the effectiveness of Asyrp in the domain of image editing.
The purpose of this blog post is threefold:
- Help other researchers understand Asyrp's algorithm.
- Verify the authors' claims by reproducing the results.
- Extend the discussion points of the paper.
Diffusion models (DMs) can be effectively used for image editing, i.e., adding target attributes to real images. Multiple ways of achieving the task have been explored in the past, including image guidance [1, 10], classifier guidance [3, 9], and model fine-tuning [7]. These methods, however, fall short either because of the ambiguity and lack of control of the steering direction and magnitude of change or because of the high computational costs they induce. For generative adversarial networks (GANs), one can find editing directions directly in their latent space. By implication, discovering such space in diffusion models could provide the sought-after image editing capabilities. To obtain it, Preechakul et al. [14] suggest adding a latent vector of the original image produced by an additional encoder to the reverse diffusion process. Their approach, however, cannot be applied to pretrained diffusion models as training must be performed with the added encoder already in place. To provide an alternative that solves the issue, Kwon et al. [8] propose Asyrp, an algorithm finding editing directions from the latent space in pretrained diffusion models.
Most current DMs implement a U-Net architecture, an autoencoder-like network used to predict the noise added at a particular stage of the diffusion process. Asyrp discovers semantic meaning in the bottleneck of the U-Net. Augmentation of this bottleneck,
Asyrp is trained to minimize a weighted loss function consisting of directional CLIP loss and reconstruction loss. To support the results of their method, Kwon et al. [8] perform extensive qualitative and limited quantitative experiments. The metrics that they evaluate their methodology on are directional CLIP similarity and segmentation consistency.
In order to test the performance and generalizability of the proposed algorithm, we reproduce their main results, both qualitative and quantitative, on the CelebA-HQ [6] dataset, introduce an additional quantitative metric, the Frechet Inception Distance (FID), and propose architectural changes to the neural network producing the semantic latent space used for editing. In particular, we suggest the use of a transformer-based network and perform an extensive ablation study on it. Also, since latent diffusion models (LDMs) currently represent the state of the art in image generation [16], we consider it vital to investigate whether the method could be applied to them in order to obtain meaningful attribute edits of the original images.
Over the past few years, we have observed a surge in popularity of generative models due to their proven ability to create realistic and novel content. DMs are a powerful new family of these models which has been shown to outperform other alternatives such as variational autoencoders (VAEs) and generative adversarial networks (GANs) on image synthesis [3]. The basic idea behind them is to gradually add noise to the input data during the forward process and then train a neural network to recover the original data step-by-step in the reverse process. The Asyrp paper's authors chose to base their work on Denoising Diffusion Probabilistic Models (DDPM) [11] and its successors, a widely-used algorithm that effectively implements this concept. In DDPMs the forward process
To run the process in reverse starting from a sample
Figure 1. The Markov process of diffusing noise and denoising [5]. |
In DDPM
One major improvement on this algorithm was the Denoising Diffusion Implicit Model (DDIM) [17]. In DDIM an alternative non-Markovian noising process is used instead of Equation 1 as shown in Equation 6. Down the line this leads to a change in the way an arbitrary step is sampled in the reverse process to Equation 7, with
Equation 7 was the starting point for the Asyrp paper; however, the authors reformulated it as shown in Equation 8. Why this is convenient will become apparent in the next section. In this formulation
In practice, this boils down to training one neural network
Note: For a thorough introduction to diffusion models, we would like to highlight an outstanding blog post by Lilian Weng.
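As a rough illustration of the recap above, the following NumPy sketch shows closed-form forward noising and a deterministic DDIM-style reverse update (no added noise). The linear beta schedule, the function names, and the use of the true noise as a stand-in for a trained network's prediction are all assumptions made for illustration; they are not taken from the Asyrp code base.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, noise):
    """Closed-form forward noising: sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps."""
    a_t = alpha_bar[t]
    return np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * noise

def ddim_step(x_t, eps_pred, t, t_prev, alpha_bar):
    """Deterministic DDIM update from step t to the earlier step t_prev."""
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    # "Predicted x_0" term.
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)
    # "Direction pointing to x_t" term.
    direction = np.sqrt(1.0 - a_prev) * eps_pred
    return np.sqrt(a_prev) * x0_pred + direction

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((3, 8, 8))      # toy "image"
noise = rng.standard_normal(x0.shape)
x_t = q_sample(x0, 500, alpha_bar, noise)

# Using the true noise in place of a network prediction, one DDIM step
# lands on the forward-noised sample at the earlier step.
x_prev = ddim_step(x_t, noise, 500, 400, alpha_bar)
```

In a real model, `eps_pred` would come from the trained noise-prediction network rather than from the known noise, so the reverse step would only approximately recover the less noisy sample.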
This returns us to the original goal of the Asyrp paper, i.e. to manipulate the semantic latent space of images generated from Gaussian noise with a pretrained and frozen diffusion model to edit them. To achieve this the authors propose an asymmetric reverse process (Asyrp) in which they alter the way an arbitrary step is sampled in the reverse process to Equation 10.
As can be seen the noise estimate used to predict
But that raises an important question: How to edit the predicted noise in a meaningful way such that the change in the image reflects the semantic change that the user wants?
In practice, most state-of-the-art diffusion models use the U-Net architecture to approximate
The neural network,
Figure 2. Asyrp training visualization. |
CLIP (Contrastive Language-Image Pretraining) [15] is a multi-modal, zero-shot model that predicts the most relevant caption for an image. It consists of a text encoder and an image encoder (both relying on a transformer architecture) that encode the data into a multimodal embedding space. The encoders are jointly trained on a dataset of images and their true textual descriptions, using a contrastive loss function. This loss function aims to maximize the cosine similarity of images and their corresponding text and minimize the similarity between images and texts that do not occur together.
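The shared embedding space described above is what makes a directional loss possible. The sketch below shows the commonly used form of a directional CLIP loss, one minus the cosine similarity between the image-space change and the text-space change. Random vectors stand in for real CLIP embeddings, and the function and variable names are hypothetical; the exact weighting used by Asyrp is not reproduced here.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def directional_clip_loss(img_src, img_edit, txt_src, txt_tgt):
    """1 - cos(delta_image, delta_text), computed on precomputed embeddings."""
    delta_img = img_edit - img_src   # how the image embedding moved
    delta_txt = txt_tgt - txt_src    # how the text embedding should move
    return 1.0 - cosine_similarity(delta_img, delta_txt)

rng = np.random.default_rng(0)
dim = 512  # assumed embedding size, for illustration only
img_src, txt_src, txt_tgt = (rng.standard_normal(dim) for _ in range(3))

# An edit that moves the image embedding along the text direction ...
aligned_edit = img_src + 0.5 * (txt_tgt - txt_src)
# ... and one that moves it in the opposite direction.
opposite_edit = img_src - 0.5 * (txt_tgt - txt_src)

loss_aligned = directional_clip_loss(img_src, aligned_edit, txt_src, txt_tgt)
loss_opposite = directional_clip_loss(img_src, opposite_edit, txt_src, txt_tgt)
```

The loss is close to 0 for a perfectly aligned edit and close to 2 for an edit in the opposite direction, independent of how large the edit is.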
For the neural network used for predicting
This leads to the loss function that Asyrp is trained to minimize in Equation 13, where
Figure 3 visualizes the generative process of Asyrp intuitively. As shown by the green box on the left, the process only changes
Figure 3. Asymmetric reverse process (Asyrp) visualization [8]. |
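The asymmetry can be sketched in a few lines: the predicted-image term is computed from a noise estimate obtained with a shifted bottleneck, while the direction term keeps the unmodified estimate. The toy noise predictor below is only a stand-in for a frozen U-Net, and all names (`toy_eps_model`, `asyrp_step`, `delta_h`) are hypothetical; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def toy_eps_model(x_t, delta_h=None):
    """Stand-in for a frozen U-Net with an optional shift of its bottleneck."""
    h = 0.1 * x_t              # pretend this is the bottleneck feature map
    if delta_h is not None:
        h = h + delta_h        # shift in h-space
    return np.tanh(h)          # pretend decoder producing the noise estimate

def asyrp_step(x_t, a_t, a_prev, delta_h):
    """One deterministic reverse step with an asymmetric edit."""
    eps_orig = toy_eps_model(x_t)
    eps_edit = toy_eps_model(x_t, delta_h)
    # Predicted x_0: uses the *edited* noise estimate.
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps_edit) / np.sqrt(a_t)
    # Direction pointing to x_t: uses the *original* noise estimate.
    direction = np.sqrt(1.0 - a_prev) * eps_orig
    return np.sqrt(a_prev) * x0_pred + direction

rng = np.random.default_rng(0)
x_t = rng.standard_normal((3, 8, 8))
a_t, a_prev = 0.5, 0.6         # assumed cumulative alphas at t and t-1

x_plain = asyrp_step(x_t, a_t, a_prev, np.zeros_like(x_t))
x_edit = asyrp_step(x_t, a_t, a_prev, 0.3 * np.ones_like(x_t))
```

With a zero shift the step reduces to the ordinary deterministic DDIM update, which is a convenient sanity check when wiring such an edit into an existing sampler.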
The original architecture of the neural network,
Figure 4. Original architecture of the neural network |
Figure 5. Our transformer-based architecture of the neural network |
The input and output of the module is an embedding of size
We propose two ways of reinterpreting the data to get these sequences. We either interpret the channel dimensions of the image as the token dimension, resulting in a sequence length of
The temporal information about the denoising step is integrated into the original model by first linearly projecting the timestep embedding and then adding it to the embedding that was processed by the input module. We investigate the integration of the temporal embedding by changing this addition to a multiplication. Additionally, we test integrating the temporal embedding using an adjusted adaptive group norm.
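To make the three options concrete, here is a minimal NumPy sketch of additive, multiplicative, and adaptive-group-norm style integration of a projected timestep embedding. The tensor sizes, the random projection weights, and the scale-and-shift form of the adaptive group norm are assumptions for illustration; the "adjusted" variant used in our experiments may differ in its details.

```python
import numpy as np

def group_norm(h, num_groups, eps=1e-5):
    """Normalize a (B, C, H, W) array over each group of channels."""
    b, c, height, width = h.shape
    grouped = h.reshape(b, num_groups, -1)
    mean = grouped.mean(axis=2, keepdims=True)
    var = grouped.var(axis=2, keepdims=True)
    return ((grouped - mean) / np.sqrt(var + eps)).reshape(b, c, height, width)

def integrate_add(h, t_proj):
    """Add the projected timestep embedding to every spatial position."""
    return h + t_proj[:, :, None, None]

def integrate_mul(h, t_proj):
    """Multiply instead of add."""
    return h * t_proj[:, :, None, None]

def integrate_ada_group_norm(h, scale, shift, num_groups=32):
    """Group-normalize, then apply a timestep-dependent scale and shift."""
    normed = group_norm(h, num_groups)
    return normed * (1.0 + scale[:, :, None, None]) + shift[:, :, None, None]

rng = np.random.default_rng(0)
batch, channels, size, t_dim = 2, 64, 8, 128   # assumed toy sizes
h = rng.standard_normal((batch, channels, size, size))
t_emb = rng.standard_normal((batch, t_dim))
w_single = 0.02 * rng.standard_normal((t_dim, channels))
w_double = 0.02 * rng.standard_normal((t_dim, 2 * channels))

t_proj = t_emb @ w_single                               # (B, C)
scale, shift = np.split(t_emb @ w_double, 2, axis=1)    # two (B, C) halves

out_add = integrate_add(h, t_proj)
out_mul = integrate_mul(h, t_proj)
out_ada = integrate_ada_group_norm(h, scale, shift)
```

All three variants preserve the shape of the feature map, so they can be swapped without touching the surrounding layers.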
We experiment with two ways of normalizing the aggregated output of the encoder: group norm, where the mean and standard deviation are computed at group level (32 groups), and instance norm, where they are computed per channel for each sample individually. A SiLU activation function is applied to this embedding before it is passed through the final output layer. We examine this activation function by swapping it out for a GELU and a simple ReLU.
In order to evaluate the performance of diffusion models when it comes to image editing, besides qualitative results and conducting user studies [8, 7], the following metrics are generally used: Directional CLIP similarity (
The directional CLIP similarity score measures how well the change from the original to the edited image aligns, in the CLIP embedding space, with the change from the source to the target text description. It is mathematically computed as
Segmentation consistency is a metric that was introduced to evaluate the consistency of network predictions on video sequences. In the image editing setting, it compares the segmentation maps of the reference and the edited image by computing their mean intersection over union (mIoU). It follows that high SC scores do not necessarily indicate good image content modification, as can be seen in Figure 6. This example shows how the metric fails to evaluate editing performance: the DiffusionCLIP model tries to preserve structure and shape in the image, while Asyrp allows more changes that lead to the desired attribute alterations.
Figure 6. Segmentation masks of the original, Asyrp-edited, and DiffusionCLIP-edited images used to compute segmentation consistency for the "smiling" attribute [8]. |
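The mean intersection over union behind the segmentation consistency score can be sketched as follows. The tiny label maps and the choice to skip classes absent from both maps are assumptions for illustration; the original authors' exact computation is not documented.

```python
import numpy as np

def mean_iou(seg_a, seg_b, num_classes):
    """Mean intersection over union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        a, b = seg_a == c, seg_b == c
        union = np.logical_or(a, b).sum()
        if union == 0:
            continue  # class absent from both maps, skip it
        ious.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(ious))

# Toy 4x4 segmentation map with three classes.
ref = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [2, 2, 2, 2],
                [2, 2, 2, 2]])
edited = ref.copy()
edited[0, 1] = 1   # one pixel of class 0 is relabelled as class 1

score_identical = mean_iou(ref, ref, num_classes=3)
score_edited = mean_iou(ref, edited, num_classes=3)
```

An edit that leaves every region in place scores 1.0 regardless of whether the attribute actually changed, which is the weakness discussed above.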
The FID metric compares the distribution of the edited images with the distribution of the reference images in a feature space. Lower FID scores correspond to better image editing. To compute the image features, one commonly employs the Inception-v3 model [18]. In particular, the model's activations of the last layer prior to the output classification layer are calculated for a set of edited and source images. The mean and the covariance of the activations are then computed so that each set can be modelled as a multivariate Gaussian, and the FID is the Frechet distance between these two Gaussians.
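For two Gaussians with means $\mu_1, \mu_2$ and covariances $\Sigma_1, \Sigma_2$, the Frechet distance has the closed form $\lVert \mu_1 - \mu_2 \rVert^2 + \mathrm{Tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\big)$. The sketch below evaluates this expression on arbitrary feature matrices; random features stand in for Inception-v3 activations, and the function name is hypothetical.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (up to numerical error).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 8))   # stand-in for Inception features

fid_same = frechet_distance(feats, feats)
# Shifting every feature by 3 changes only the mean term: 8 * 3**2 = 72.
fid_shifted = frechet_distance(feats, feats + 3.0)
```

In practice the features are 2048-dimensional and estimated from many images, so scores computed on small sets, such as the 100-image subsets used in this post, should be read as relative rather than absolute values.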
We begin by reproducing the qualitative and quantitative results of the original paper. To stay within our computational budget, we restrict our efforts to the CelebA-HQ [6] dataset. Our experiments are based on the original implementation; however, we found that some of the features required for successful reproduction, especially those relating to quantitative evaluation, are missing from the repository. Otherwise, we follow the computational set-up specified by the original authors. Specifically, we use hyperparameter values presented in Table 1, which were recovered from [8, Table 2] and [8, Table 3]. Across all experiments, we use
Label | Source text | Target text | | | Domain | |
---|---|---|---|---|---|---|
smiling | "face" | "smiling face" | 0.8 | 513 | IN | |
sad | "face" | "sad face" | 0.8 | 513 | IN | |
angry | "face" | "angry face" | 0.8 | 512 | IN | |
tanned | "face" | "tanned face" | 0.8 | 512 | IN | |
man | "a person" | "a man" | 0.8 | 513 | IN | |
woman | "a person" | "a woman" | 0.8 | 513 | IN | |
young | "person" | "young person" | 0.8 | 515 | IN | |
curly hair | "person" | "person with curly hair" | 0.8 | 499 | IN | |
nicolas | "Person" | "Nicolas Cage" | 0.8 | 461 | UN | |
pixar | "Human" | "3D render in the style of Pixar" | 0.8 | 446 | UN | |
neanderthal | "Human" | "Neanderthal" | 1.2 | 490 | UN | |
modigliani | "photo" | "Painting in Modigliani style" | 0.8 | 403 | UN | |
frida | "photo" | "self-portrait by Frida Kahlo" | 0.8 | 321 | UN | |
Table 1. Hyperparameter settings of reproducibility experiments. The "domain" column corresponds to the attribute being in-domain (IN) or unseen-domain (UN). |
Figure 7 shows that the results obtained in the original paper and presented in [8, Figure 4] can be successfully reproduced and that editing in the h-space results in visually convincing image generation for in-domain attributes (i.e., attributes that can be directly observed in the training data of the frozen diffusion model). Nevertheless, we must stress that the methodology does not necessarily isolate attribute changes and particular edits may also result in other unintended changes. To give an example, edits for the "curly hair" attribute result in severe facial transformations and appear to overlap with the "smiling" attribute (see the second and the third row of Figure 7).
Figure 7. Editing results for in-domain attributes. |
Figures 8 and 9 depict the results of our reproducibility experiments focused on unseen-domain attributes (i.e., attributes that cannot be observed in the training data) originally presented in [8, Figure 5]. In Figure 8, we use the full
Figure 8. Editing results for unseen-domain attributes. |
Figure 9. Editing results for unseen-domain attributes with reduced editing strength ( |
To assess the performance of Asyrp quantitatively, we reproduce evaluation results originally presented in [8, Table 4] and compute the directional CLIP score for the same three in-domain attributes ("smiling", "sad", "tanned") and two unseen-domain attributes ("pixar", "neanderthal") on a set of 100 images from the test set. The original code does not implement either of the evaluation metrics or experiments, meaning we do not know which images were used for the calculations. We choose to take the first 100 images in terms of image IDs. The results are reported in Table 2. Unlike the original authors, we supply standard deviations, which show that the results are quite unstable. More importantly, there are clear differences in the achieved results that we cannot easily explain. Nevertheless, we observe a stable trend of higher scores when lowering the editing strength to half as expected because of the decreased impact on
Metric | Smiling (IN) | Sad (IN) | Tanned (IN) | Pixar (UN) | Neanderthal (UN) |
---|---|---|---|---|---|
Original | 0.921 | 0.964 | 0.991 | 0.956 | 0.805 |
Reproduced | 0.955 (0.048) | 0.993 (0.037) | 0.933 (0.040) | 0.931 (0.032) | 0.913 (0.035) |
Reproduced | 0.969 (0.047) | 0.999 (0.035) | 0.973 (0.036) | 0.942 (0.031) | 0.952 (0.035) |
Table 2. Directional CLIP score. Standard deviations are reported in parentheses. |
We do not implement the segmentation consistency score due to its shortcomings described in the previous section and also the absence of information on choices made by the original authors with respect to its calculation. To make up for it, we compute FID scores which should represent a more meaningful choice in the context of image editing. From the FID scores presented in Table 3, one can observe worsening performance when Asyrp performs editing in directions requiring more substantial changes of the original image. As expected, lowering the editing strength also significantly and consistently reduces the impact in terms of FID. Naturally, the distance between the reconstructed and the edited images is consistently lower than the distance between original and edited images.
Metric | Smiling (IN) | Sad (IN) | Tanned (IN) | Pixar (UN) | Neanderthal (UN) | |
---|---|---|---|---|---|---|
89.2 | 92.9 | 100.5 | 125.8 | 125.8 | ||
73.7 | 70.6 | 73.7 | 89.3 | 74.8 | ||
68.8 | 60.5 | 81.7 | 96.9 | 137.3 | ||
44.4 | 43.7 | 49.7 | 61.0 | 71.7 | ||
Table 3. Frechet Inception Distance ( |
In Figure 10, we present our reproduction of [8, Figure 7] visualizing the linearity of the learned editing directions. For the "smiling" attribute, it is clearly viable to go in the opposite direction of
Figure 10. Image edits for the "smiling" attribute with editing strength in the range from -1 to 1. |
Figure 11 reproduces the results originally presented in [8, Figure 17]. When reconstructing images using a diffusion model with a relatively small number of time steps used for generation, we observe a severe loss in texture resulting in smoothed-out faces with limited details. For training, we used 40 time steps during generation. At inference time, we tried to increase this number to 1,000 and found that it is possible to generate additional texture improving the results at the cost of computation time.
Figure 11. Comparison of generated images for the "smiling" attribute with 40 and 1000 time steps during generation. |
The editing directions found through the Asyrp algorithm depend on the knowledge of attributes contained in CLIP. We observe in the output results that these editing directions are often highly biased. Individuals frequently change gender, skin color, and eye color when edited with a direction that does not explicitly contain that change. For example, the Pixar editing direction changes the eye color of the source images to blue and often changes dark skin to white skin. This effect likely results from the model not being able to disentangle these concepts, and it limits how useful these directions are in various image editing contexts. We have included some examples of these biased editing directions in Figure 12. Furthermore, in Table 4 we show that the performance of the editing directions is significantly better for Caucasian faces than for non-Caucasian faces.
Metric | Race | Smiling (IN) | Sad (IN) | Tanned (IN) | Pixar (UN) | Neanderthal (UN) |
---|---|---|---|---|---|---|
Caucasian | 81.3 | 76.2 | 75.9 | 103.5 | 94.1 | |
Non-caucasian | 138.4 | 123.3 | 157.3 | 185.2 | 177.6 | |
Caucasian | 54.3 | 55.1 | 66.4 | 76.6 | 111.0 | |
Non-caucasian | 88.3 | 84.1 | 121.1 | 142.0 | 186.8 | |
Table 4. Frechet Inception Distance for Caucasian and non-Caucasian individuals. |
Figure 12. Bias in the CLIP editing directions for the "pixar" attribute. |
While the reproduction results show that the general method works well, we set out to investigate further improvements by running an ablation study. As previously mentioned in the fourth section, adjustments to the model architecture could provide further gains in terms of CLIP similarity, flexibility, and transferability. In this section, we conduct several ablations in order to gain a deeper understanding of the Asyrp method, aiming to identify its limitations and explore potential improvements.
As described in the model architecture section and shown in Figure 5, the Asyrp method can be broken down into multiple submodules: the two encoder modules, a temporal embedding module, and an activation function module. In this section, we look more closely at these modules and propose several adjustments, which we compare to the original implementation. The best modules are picked based on the lowest directional CLIP loss, which is inversely related to the directional CLIP similarity, as explained in the evaluation section.
As discussed in the architecture section, the 1x1 convolutional layers can be replaced by transformer-based blocks. However, "transformer" is a broad term, and here we show the ablations we ran to arrive at the final architecture. Firstly, it is important to consider the number of epochs. The original architecture was only trained for one epoch; however, this might not be suitable for transformer-based blocks, as they typically take longer to train. We present all our results for one to four epochs since this hyperparameter holds significant importance in our study.
Next, an important architectural decision for the transformer blocks was the number of heads to use. However, we quickly found that our main constraint here is the computational cost. More heads lead to better performance but also require training for more epochs. Therefore, we decided to stick to one head for the remainder of the ablations, unless stated otherwise. Figure 13 visually shows the results for different numbers of heads for the "pixar" attribute.
Figure 13. The effect of the number of transformer heads on the "pixar" attribute for the pixel-channel transformer architecture. |
Lastly, as mentioned in the architecture section, there are four ways to interpret the bottleneck feature map to get the input sequences for the transformer blocks. In Figure 14, we compare the different variants for the "neanderthal" attribute. For the remainder of the ablations, we picked the pixel-channel dual transformer block because it achieves the lowest directional CLIP loss, as shown in Figure 15.
Figure 14. The effect of the input sequence type for the "neanderthal" attribute across pixel-channel (pc), channel-pixel (cp), pixel (p), channel (c), and convolution-based (conv) architectures. |
Figure 15. The effect of input sequence type on the directional CLIP loss curve during training. |
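The two basic ways of turning a bottleneck feature map into a token sequence can be sketched with plain reshapes. The toy tensor sizes and the function names are assumptions for illustration; the pixel-channel and channel-pixel variants compared above would apply blocks over both kinds of sequence in turn.

```python
import numpy as np

def to_pixel_tokens(h):
    """(B, C, H, W) -> (B, H*W, C): one token per spatial position."""
    b, c, height, width = h.shape
    return h.reshape(b, c, height * width).transpose(0, 2, 1)

def to_channel_tokens(h):
    """(B, C, H, W) -> (B, C, H*W): one token per channel."""
    b, c, height, width = h.shape
    return h.reshape(b, c, height * width)

def from_pixel_tokens(tokens, height, width):
    """Inverse of to_pixel_tokens."""
    b, _, c = tokens.shape
    return tokens.transpose(0, 2, 1).reshape(b, c, height, width)

rng = np.random.default_rng(0)
h = rng.standard_normal((2, 16, 4, 4))    # toy bottleneck feature map

pixel_tokens = to_pixel_tokens(h)         # sequence length H*W, token size C
channel_tokens = to_channel_tokens(h)     # sequence length C, token size H*W
restored = from_pixel_tokens(pixel_tokens, 4, 4)
```

Because self-attention cost grows quadratically with sequence length, which of the two views is cheaper depends on whether the map has more channels or more spatial positions.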
Figure 16 shows that AdaGroupNorm slightly outperforms the other temporal embedding modules. Both the normalization and the activation function have no effect on the directional CLIP loss, thus we decided to keep them the same as the original paper for the remainder of the ablations (SiLU and GroupNorm).
Figure 16. The effect of the temporal embedding module (left), normalization module (middle), and activation function module (right) on the directional CLIP loss curve during training. |
Based on the results of our ablation study, we conclude that an optimal architecture consists of (1) pixel-channel DualTransformer blocks, (2) AdaGroupNorm temporal embedding module, (3) GroupNorm normalization, and (4) SiLU activation function. In Table 5, we compare the performance of the model to the original implementation in terms of
Model | Smiling (IN) | Sad (IN) | Tanned (IN) | Pixar (UN) | Neanderthal (UN) |
---|---|---|---|---|---|
Original | 89.2 | 92.9 | 100.5 | 125.8 | 125.8 |
Ours | 84.3 | 88.8 | 82.2 | 83.7 | 87.0 |
Table 5. Comparison of Frechet Inception Distance for in-domain (IN) and unseen-domain (UN) attributes between the original model and our best model. |
As detailed in the reproduction section, retraining for a single attribute already requires a significant amount of time, even with the hyperparameters known. If the method were to be used in practice, it would not be realistic to tune hyperparameters from scratch for every new attribute. Therefore, we looked into how the model performs with a standard set of parameters instead. Note that the original paper uses stochastic gradient descent and a very high learning rate to train, which notoriously requires comparatively more tuning than an Adam optimizer.
This is convenient, as the transformer modules are trained with an Adam optimizer anyway. We also tried to use Adam to optimize the original architecture, but this gave very poor results. To demonstrate the significance of hyperparameters, we trained both the original architecture optimized with SGD and the transformer-based architecture on a new attribute using non-tuned standard parameters. Figure 17 shows the results for the attribute "goblin", highlighting that the non-tuned transformer-based approach performs relatively better.
Figure 17. Comparison of convolution-based and transformer-based architecture output for a new "goblin" attribute without hyperparameter tuning. |
During inference, an interesting hyperparameter is the editing strength and its relation to the number of heads. It appears that as the number of heads increases, the magnitude of editing strength needed decreases. In other words, we see a trend where better models can edit more subtly. While using such models might be computationally infeasible in practice right now, this does hint that good editing directions exist in the bottleneck. The results for different editing strengths are shown in Figure 18.
Figure 18. The effect of the editing strength when using pixel-channel transformer with various numbers of heads on the "pixar" attribute. |
The significant cost of training a new model for each editing direction makes the application of this model in many practical tasks prohibitively expensive in terms of compute power and ease of use. While it would not eliminate this problem entirely, good transfer performance would alleviate it somewhat. We show that transfer learning is possible for our pixel-channel architecture by retraining it on a different editing direction, and that this is significantly faster than training a new direction from scratch. Figure 19 shows the result of retraining a model trained on the "pixar" attribute on the "modigliani" attribute. We can see that, even after a significant number of steps, the model previously trained on a different attribute still has a lower loss than a model trained from scratch.
Figure 19. Retraining from a different attribute trains faster than training from scratch. Left: the loss curve; right: the results after 2000 steps.
Lastly, in this blog post we set out to investigate whether Asyrp can also be applied on top of a latent diffusion model. Since LDMs currently represent the state of the art in image generation [16], it is reasonable to find out whether modifications in the h-space lead to meaningful attribute edits in the original images. Conveniently, DDIM, the algorithm on which Asyrp was built, is also the algorithm behind LDMs. However, the diffusion process runs in the latent space instead of the pixel space. A separate VQ-VAE is trained [19], where the encoder
However, to calculate the directional CLIP loss, both the reference and the generated image are needed, but the whole point of LDMs is that these are not computed at every step. One approach to still use the Asyrp algorithm could be to retrain CLIP for LDM latents instead of images, but this is beyond our scope. Therefore, we investigated another approach in which the images are computed from the latents by running the decoder
GIF 1. Running the VQ-VAE decoder on the latent at every time step. |
That being said, this section is called future research for a reason. Sadly, the original codebase was not very modular, which made applying Asyrp to another DM or LDM infeasible within the scope of this project. We therefore decided to leave this as future research.
The Asyrp model presented in the original paper and thoroughly explained in the Discovering Semantic Latent Space and Model Architecture sections successfully discovers a semantic latent space in the bottleneck of frozen diffusion models, which allows high-quality image editing. This is supported by our reproduction study, which was conducted on the CelebA-HQ dataset using the pretrained DDPM model. The figures in the Reproduction of the Experiments section highlight the editing abilities of the model for both in-domain (e.g., smiling, sad, angry) and unseen-domain (e.g., Pixar, Neanderthal, Frida) attributes. For the quantitative evaluation, we used the directional CLIP score, as this was reported in the original paper, and the FID score. In the reproduction study, both metrics show better results for in-domain editing and agree with the original findings that Neanderthal is the hardest editing direction. The best results are for "sad" and "smiling". The discovered semantic latent space has the properties of linearity and consistency across timesteps, which are validated by our reproduction experiments.
We explored the limitations of the original model and discovered two main problems. First, it is heavily biased: the section Bias of the Asyrp model shows that the model performs much better when editing images of Caucasian people than of non-Caucasian people, and that individuals frequently change gender and eye color. Second, the model needs retraining and a different hyperparameter configuration for each attribute.
We further investigated the capabilities of Asyrp by changing its architecture from convolutional layers to a transformer encoder, as presented in the Model Architecture section. We then conducted an ablation study on this new architecture and showed the impact of distinct ways of attending to the bottleneck feature map, different ways of aggregating the temporal encodings, and various normalization methods and activation functions. Evaluating both qualitatively and quantitatively, we concluded that our best model outperforms the original Asyrp, as shown in the Ablation study section. We obtained a better FID score than the original model, and the figures show that our model captures and edits more fine-grained features, which has a stronger impact on the quality of the edited image.
- Jonathan: Initial setup of codebase & implementation of architecture ablation studies, implementation of transformer architectures, implementation of FID metric, training of ablation models, wrote novel model architecture. Loss plots
- Ana: did ablation study for activation functions and normalization, bias research, model architecture diagram, wrote Image Editing with Diffusion Models (partly), Discovering Semantic Latent Space (training loss part), Model Architecture, Evaluating Diffusion Models and Concluding Remarks.
- Luc: did DM vs LDM research and notebook; wrote header, Image Editing Using Diffusion Models (partly), Recap on Diffusion Models, the Discovering Semantic Latent Space, Bias in Editing Directions (partly), Ablation Study, and Further Research: Latent Diffusion Models.
- Eric: Structure of the repository, implementations, reproducibility experiment configurations, executions, visualizations, and analysis for "Reproduction of the Experiments", collaboration on the initial implementation of the transformer-based architecture, help as needed with run executions, figures, and tables for other sections.
[1] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models. In: CVF International Conference on Computer Vision (ICCV). 2021, pp. 14347–14356.
[2] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. “Arcface: Additive angular margin loss for deep face recognition”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019, pp. 4690–4699.
[3] Prafulla Dhariwal and Alexander Nichol. “Diffusion models beat gans on image synthesis”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 8780–8794.
[4] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. “StyleGAN-NADA: CLIP-guided domain adaptation of image generators”. In: ACM Transactions on Graphics (TOG) 41.4 (2022), pp. 1–13.
[5] Jonathan Ho, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 6840–6851.
[6] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. “Progressive growing of gans for improved quality, stability, and variation”. In: arXiv preprint arXiv:1710.10196 (2017).
[7] Gwanghyun Kim and Jong Chul Ye. “Diffusionclip: Text-guided image manipulation using diffusion models”. In: (2021).
[8] Mingi Kwon, Jaeseok Jeong, and Youngjung Uh. “Diffusion models already have a semantic latent space”. In: arXiv preprint arXiv:2210.10960 (2022).
[9] Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, and Trevor Darrell. “More control for free! image synthesis with semantic diffusion guidance”. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023, pp. 289–299.
[10] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. “Sdedit: Image synthesis and editing with stochastic differential equations”. In: arXiv preprint arXiv:2108.01073 (2021).
[11] Alexander Quinn Nichol and Prafulla Dhariwal. “Improved denoising diffusion probabilistic models”. In: International Conference on Machine Learning. PMLR. 2021, pp. 8162–8171.
[12] Yong-Hyun Park, Mingi Kwon, Junghyo Jo, and Youngjung Uh. “Unsupervised Discovery of Semantic Latent Directions in Diffusion Models”. In: arXiv preprint arXiv:2302.12469 (2023).
[13] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. “Styleclip: Text-driven manipulation of stylegan imagery”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 2085–2094.
[14] Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. “Diffusion autoencoders: Toward a meaningful and decodable representation”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, pp. 10619-10629.
[15] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. “Learning transferable visual models from natural language supervision”. In: International conference on machine learning. PMLR. 2021, pp. 8748–8763.
[16] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. 2021. arXiv: 2112.10752 [cs.CV].
[17] Jiaming Song, Chenlin Meng, and Stefano Ermon. “Denoising diffusion implicit models”. In: arXiv preprint arXiv:2010.02502 (2020).
[18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna. Rethinking the Inception Architecture for Computer Vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 2818-2826, doi: 10.1109/CVPR.2016.308.
[19] Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu. "Neural Discrete Representation Learning". Advances in neural information processing systems 30 (2017).