In this project I propose an implementation and possible lightweight variants of DiffiT, a paper from NVIDIA Labs published in 2023 (DiffiT: Diffusion Vision Transformers for Image Generation), which achieves SOTA results in image generation using a diffusion model built on Vision Transformers (ViTs) and introduces a novel feature: the Time-dependent Multi-head Self-Attention (TMSA) mechanism. DiffiT achieves a new SOTA FID score of 1.73 on the ImageNet-256 dataset!
In this implementation I also present a possible lightweight version of the architecture that uses Depthwise Separable Convolutions and MetaFormers (in particular PoolFormers). The former lets us reduce the high parameter count of the architecture without losing much accuracy, while the latter is a way to mitigate the heavy resource demands of the transformer architecture.
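To give an idea of why Depthwise Separable Convolutions save parameters, here is a minimal PyTorch sketch (module and variable names are mine, not taken from the actual DiffiT/DiffiP code):

```python
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution followed by a pointwise (1x1) convolution.

    Compared to a standard Conv2d(in_ch, out_ch, k), the parameter count drops
    roughly from in_ch * out_ch * k * k to in_ch * k * k + in_ch * out_ch.
    """

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, padding: int = 1):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        # Pointwise: 1x1 conv mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```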
Diffusion models have emerged as a powerful class of probabilistic models for generating complex data distributions. They operate by simulating a diffusion process in which data is gradually transformed into noise through a series of stochastic steps. The generative process is the reverse of this diffusion: starting from pure noise, the data is gradually refined to approximate the original distribution. The foundational concept of diffusion models can be traced back to physics, particularly the study of thermodynamic processes in which systems evolve over time towards equilibrium.
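As a reminder of what the forward (noising) step looks like in practice, here is a minimal DDPM-style sketch; the function name and the precomputed `alphas_cumprod` schedule are illustrative, not specific to DiffiT:

```python
import torch


def forward_diffusion(x0: torch.Tensor, t: torch.Tensor, alphas_cumprod: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) * x_0, (1 - a_bar_t) * I).

    x0:             clean images, shape (B, C, H, W)
    t:              integer timesteps, shape (B,)
    alphas_cumprod: cumulative products of (1 - beta_t), shape (T,)
    """
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise
    # The network is trained to predict `noise` given (xt, t).
    return xt, noise
```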
Vision Transformers (ViTs) bring the transformative power of the Transformer architecture, originally designed for natural language processing, to computer vision. Unlike traditional convolutional neural networks (CNNs), which have dominated the vision landscape for decades (and are still widely used for a variety of tasks, especially in low-resource and constrained settings), Vision Transformers use self-attention mechanisms to model global relationships between different parts of the same image.
This novel mechanism allows the transformer to gather spatio-temporal information about the input data, thanks to its time dependency. Let $x_s$ denote a spatial token and $x_t$ the time token. Queries, keys and values are then computed as

$$q_s = x_s W_{qs} + x_t W_{qt}$$
$$k_s = x_s W_{ks} + x_t W_{kt}$$
$$v_s = x_s W_{vs} + x_t W_{vt}$$

where $W_{qs}, W_{qt}, W_{ks}, W_{kt}, W_{vs}, W_{vt}$ are learnable spatial and temporal projection weights.
The equations listed above are equivalent to a linear projection of each spatial token concatenated with a linear projection of the time token. As a result, queries, keys and values are all linear functions of both the time and spatial tokens.
This is a really good thing: in this way, queries, keys and values can adapt their behaviour to the timestep! In other words, the attention module can learn to act differently depending on where in the diffusion process it is 🙂.
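The following is a minimal PyTorch sketch of how such a time-dependent attention module could look; the module and variable names are mine, and the relative positional bias and windowing used in the paper are omitted for brevity:

```python
import torch
import torch.nn as nn


class TimeDependentSelfAttention(nn.Module):
    """Sketch of Time-dependent Multi-head Self-Attention (TMSA).

    Queries, keys and values are linear functions of BOTH the spatial tokens
    x_s and the (broadcast) time token x_t, so the attention behaviour can
    change with the diffusion timestep.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        # Separate spatial and temporal projections for q, k, v
        self.qkv_spatial = nn.Linear(dim, 3 * dim, bias=False)
        self.qkv_time = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_s: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
        # x_s: (B, N, dim) spatial tokens; x_t: (B, dim) time token
        B, N, dim = x_s.shape
        # Sum of spatial and temporal projections, broadcast over tokens
        qkv = self.qkv_spatial(x_s) + self.qkv_time(x_t).unsqueeze(1)
        qkv = qkv.reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]          # each (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, dim)
        return self.proj(out)
```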
In addition to replicating the architecture—a task that has proven to be quite challenging—I have shifted the project towards an optimization perspective. I have analyzed various optimization metrics that can be computed for architectures and have undertaken the following initiatives:
- Firstly, I developed a variant of the DiffiT architecture based on grayscale images. As expected, the performance difference from the RGB version was minimal.
- Secondly, I created a new variant, named DiffiP, which incorporates the PoolFormer in place of the traditional Transformer and Depthwise Separable Convolutions instead of "classical" ones. I tried to leverage the MetaFormer paradigm to optimize the network without a big loss in accuracy (see the sketch after this list).
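To make the MetaFormer/PoolFormer idea concrete, here is a minimal sketch of the pooling token mixer that replaces self-attention; the class name is mine and this is not the exact code of this repository:

```python
import torch
import torch.nn as nn


class PoolingTokenMixer(nn.Module):
    """PoolFormer-style token mixer: average pooling replaces self-attention.

    Following the MetaFormer idea, spatial information is mixed with a cheap
    pooling operation; subtracting the input keeps only the pooling residual,
    as in the PoolFormer paper.
    """

    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map in channels-first layout
        return self.pool(x) - x
```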