Frequently asked questions
Unpaint uses Stable Diffusion models to generate images based on user-provided textual descriptions, called prompts. These models have been exposed to a vast array of labeled images and trained to remove noise from them. During this process, the models learn the visual patterns associated with certain labels, and later they can synthesize those patterns even if the input is pure noise.
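To illustrate the core idea only (this is not Unpaint's or Stable Diffusion's actual training code; the real denoising network is a large UNet that is also conditioned on the text prompt and the noise level), a heavily simplified toy sketch of the training objective in Python could look like this:

```python
import torch
import torch.nn.functional as F

# Toy stand-in for the real denoising network (a large UNet in Stable Diffusion).
denoiser = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

clean_images = torch.rand(8, 3, 64, 64)    # a batch of (fake) training images
noise = torch.randn_like(clean_images)     # random Gaussian noise
noisy_images = clean_images + 0.5 * noise  # corrupt the images with noise

# The network is trained to predict the noise that was added; a model that can do
# this reliably can iteratively "clean up" even an input that is pure noise.
predicted_noise = denoiser(noisy_images)
loss = F.mse_loss(predicted_noise, noise)
loss.backward()
optimizer.step()
```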
For example, the prompt `a lone pine tree on a hill` can be drawn by Stable Diffusion, as it has learned the visual concepts of `pine tree`, `hill` and `lone` by being exposed to a large number of photos, drawings and other images which showcase one or more of these.
As Stable Diffusion is open-source, there are a large number of Stable Diffusion models specialized in certain areas, such as generating high-quality landscapes, comic book style art or realistic character designs.
Unpaint itself does not contain any Stable Diffusion model, and as such it cannot draw images by itself; we must provide a valid Stable Diffusion model which Unpaint can load.
Even the simplest Stable Diffusion images are generated using an involved, multi-stage process:
- Text tokenization: the user-provided text is split up into machine-readable integer numbers called tokens. Most tokens represent a human word; for example, `space` and `water` are represented by the tokens 7857 and 2505 respectively.
- Text encoding: each token is converted to a vector of numbers, which can be thought of as coordinates of the concept in an n-dimensional space, somewhat like representing a city's location on a map using latitude and longitude.
- Denoising: we generate an "image" of random noise, then iteratively try to remove the noise from the image. The process is guided by the text prompt describing what should appear in it. A good analogy is looking at clouds and trying to see patterns resembling animals in them.
- VAE decoding: to increase efficiency, images are encoded into a so-called _latent space_. Latent space allows processing images at a much lower resolution, where each value represents a block of pixels in the final image. VAE stands for Variational Autoencoder, a type of neural network which has been trained to compress and decompress images with minimal visual changes.
Each of these steps uses a different kind of model (neural network) with its own set of weights; the inputs and outputs must be in a specific format for each step to work, and several mathematical transformations must be performed between the steps. These image-generation pipelines are implemented by Unpaint.
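For illustration, a rough Python sketch of the same stages using the Hugging Face diffusers and transformers libraries might look like the snippet below. Note that this is not Unpaint's implementation: the `CompVis/stable-diffusion-v1-4` checkpoint, the step count and the image size are arbitrary example choices, and classifier-free guidance is omitted for brevity.

```python
import torch
from diffusers import AutoencoderKL, PNDMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "CompVis/stable-diffusion-v1-4"  # example checkpoint in the diffusers layout
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
scheduler = PNDMScheduler.from_pretrained(model_id, subfolder="scheduler")

# 1. Text tokenization: turn the prompt into integer tokens
text_input = tokenizer(["a lone pine tree on a hill"], padding="max_length",
                       max_length=tokenizer.model_max_length, truncation=True,
                       return_tensors="pt")

# 2. Text encoding: turn the tokens into embedding vectors
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids)[0]

# 3. Denoising: start from pure noise in latent space and iteratively clean it up,
#    guided by the text embeddings
scheduler.set_timesteps(25)
latents = torch.randn((1, unet.config.in_channels, 64, 64)) * scheduler.init_noise_sigma
for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# 4. VAE decoding: convert the final latents back into a full-resolution image
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample
image = (image / 2 + 0.5).clamp(0, 1)  # map from [-1, 1] to [0, 1] pixel values
```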
As a precaution to avoid vandalism, these safety options (described below) cannot be disabled in Unpaint right now.
Stable Diffusion models were trained on billions of images collected from the Internet. Due to the sheer size of the data required for training, it was not possible for humans to verify each image at the time. As such, some of these images include data which is generally considered inappropriate, such as images displaying nudity or adult content. As the models were trained on these images, they have learned these visual concepts and can reproduce them as well.
Without any safeguards, these outputs are produced even when the user did not ask for them explicitly; for example, a non-descriptive prompt like `something` or `picture` can produce explicit content from time to time.
To combat this behavior, Unpaint currently provides two mechanisms:
- The safety prompt option adds some keywords to the negative prompt, such as `nsfw` and `nudity`. This helps guide the models away from unintended content.
- The safety checker option performs a safety check after the image is generated, but before it is displayed to the user or saved to disk. It validates that the image contains no inappropriate content.
These two options work best together: the safety prompt helps avoid triggering the safety checker accidentally.
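Unpaint implements these checks natively; purely as an illustration of the two ideas (the checkpoint name and prompts below are arbitrary examples, not Unpaint's actual code), the equivalent behavior with the Python diffusers library would look roughly like this:

```python
from diffusers import StableDiffusionPipeline

# Example checkpoint; Stable Diffusion pipelines in diffusers ship with a built-in
# safety checker, which plays the role of Unpaint's safety checker option.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

result = pipe(
    "a portrait photo on a sunny beach",
    negative_prompt="nsfw, nudity",  # the safety prompt idea: steer generation away from these concepts
)

image = result.images[0]                   # the generated (and possibly blacked-out) image
flagged = result.nsfw_content_detected[0]  # the safety checker's verdict for this image
```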
Some additional notes:
- The safety checker is not perfect: it will sometimes flag images which are safe, and it might allow a more convoluted unsafe result to pass through. This is due to the limitations of the current technology and will improve in the future.
- Outputs are more likely to be flagged as they get more similar to inappropriate images, e.g. a safe beach scene or an average lingerie commercial might be flagged by the system.