Merge pull request #14 from Stability-AI/stablewaheed

Added quantization example
Stability-AI · Jan 16, 2025 · 21bbb57
2 parents e65b914 + 52d3342
commit 21bbb57
Showing 17 changed files with 320 additions and 37 deletions.
diff --git a/README.md b/README.md
@@ -4,3 +4,29 @@ A collection of code samples for working with Stability AI's models. This repo w
 ![Image-to-Image](./images/screenshot_image_to_image.png)
 
 ![Inpainting](./images/screenshot_inpainting.png)
+
+## Stable Diffusion 3.5 Inference Speeds
+|Model|Inference Speed (seconds)|GPU|
+|-----|-------------------------|---|
+|SD3.5 M|4 s|NVIDIA H100 GPU with 80 GB of VRAM|
+|[4-Bit Quantized SD3.5 L](/sd35-text-to-image-quantized-gradio/)|18 s|NVIDIA H100 GPU with 80 GB of VRAM|
+|SD3.5 L|7 s|NVIDIA H100 GPU with 80 GB of VRAM|
+
+## Stable Diffusion 3.5 Prompt Tuning Using Guidance Scale
+The [guidance_scale](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.guidance_scale) parameter has a significant impact on image generation with Stable Diffusion 3.5 models:
+> A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality
+
+Image quality can vary drastically with the `guidance_scale` value. The screenshots below show recommended `guidance_scale` settings for three Stable Diffusion 3.5 models:
+* [Stable Diffusion 3.5 Large](https://huggingface.co/stabilityai/stable-diffusion-3.5-large) (SD3.5 L)
+  * [Sample code](./sd35-text-to-image-gradio/app.py)
+* [4-Bit Quantized Stable Diffusion 3.5 Large](https://huggingface.co/stabilityai/stable-diffusion-3.5-large) (NF4 SD3.5 L)
+  * NF4: [Normal Floating Point 4](https://huggingface.co/docs/diffusers/v0.32.2/en/quantization/bitsandbytes#normal-float-4-nf4)
+  * [Sample code](./sd35-text-to-image-quantized-gradio/app.py)
+* [Stable Diffusion 3.5 Medium](https://huggingface.co/stabilityai/stable-diffusion-3.5-medium) (SD3.5 M)
+
+### Guidance Scale Examples
+|Model|[guidance_scale](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.guidance_scale) (float 1-10)|Example|
+|-----|--------------|-------|
+|SD3.5 L|`guidance_scale=2.5`|![sd3.5 L guidance_scale=2.5](./images/guidance-scale-examples/sd3.5%20L%20guidance_scale=2.5.png)|
+|NF4 SD3.5 L|`guidance_scale=7.5`|![nf4 sd3.5 L guidance_scale=7.5](./images/guidance-scale-examples/nf4%20sd3.5%20L%20guidance_scale=7.5.png)|
+|SD3.5 M|`guidance_scale=5.0`|![sd3.5 M guidance_scale=5](./images/guidance-scale-examples/sd3.5%20M%20guidance_scale=5.png)|
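
Review note: the recommended values in the table above could be applied in code roughly as follows. This is a minimal sketch, not code from this PR: `RECOMMENDED_GUIDANCE_SCALE`, `build_generation_kwargs`, and `generate` are hypothetical names, and `pipe` is assumed to be an already loaded `StableDiffusion3Pipeline` (or any callable that accepts the same keyword arguments).

```python
# Recommended guidance_scale values, copied from the README table above.
RECOMMENDED_GUIDANCE_SCALE = {
    "SD3.5 L": 2.5,
    "NF4 SD3.5 L": 7.5,
    "SD3.5 M": 5.0,
}


def build_generation_kwargs(model_label, prompt, negative_prompt=""):
    """Build keyword arguments for a pipeline call (hypothetical helper)."""
    guidance_scale = RECOMMENDED_GUIDANCE_SCALE[model_label]
    # The README table documents guidance_scale as a float from 1 to 10.
    if not 1.0 <= guidance_scale <= 10.0:
        raise ValueError(f"guidance_scale out of range: {guidance_scale}")
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "guidance_scale": guidance_scale,
    }


def generate(pipe, model_label, prompt, negative_prompt=""):
    """Call an already loaded pipeline and return the first image."""
    kwargs = build_generation_kwargs(model_label, prompt, negative_prompt)
    return pipe(**kwargs).images[0]
```

Keeping the lookup separate from the pipeline call means the recommended values can be checked without loading any model weights.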
diff --git a/images/guidance-scale-examples/nf4 sd3.5 L guidance_scale=7.5.png b/images/guidance-scale-examples/nf4 sd3.5 L guidance_scale=7.5.png
diff --git a/images/guidance-scale-examples/sd3.5 L guidance_scale=2.5.png b/images/guidance-scale-examples/sd3.5 L guidance_scale=2.5.png
diff --git a/images/guidance-scale-examples/sd3.5 M guidance_scale=5.png b/images/guidance-scale-examples/sd3.5 M guidance_scale=5.png
diff --git a/sd35-image-to-image-flask/README.md b/sd35-image-to-image-flask/README.md
@@ -1,7 +1,7 @@
 # Stable Diffusion 3.5 Image-to-Image Python Flask App
 This repo folder is for making a simple Stable Diffusion 3.5 Image-to-Image API, using Python Flask
 
-**Estimated Inference Speed:** 23 seconds for Stable Diffusion 3.5 Large on an NVIDIA H100 GPU
+**Estimated Inference Speed:** 7 seconds for Stable Diffusion 3.5 Large on an NVIDIA H100 GPU
 
 **[Postman](https://www.postman.com/downloads/) Screenshot:**
 ![Postman Screenshot](./images/postman_screenshot.png)

diff --git a/sd35-image-to-image-gradio/README.md b/sd35-image-to-image-gradio/README.md
@@ -1,11 +1,11 @@
 # Stable Diffusion 3.5 Image-to-Image in Gradio
 Gradio demo of [image-to-image](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/img2img) using Stable Diffusion 3.5 Medium
 
-**Estimated Inference Speed:** 23 seconds for Stable Diffusion 3.5 Large on an NVIDIA H100 GPU
+**Estimated Inference Speed:** 7 seconds for Stable Diffusion 3.5 Large on an NVIDIA H100 GPU
 
 Full documentation is available on Hugging Face: [Stable Diffusion Image-to-image](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/img2img)
 
-### Screen Shot
+### Screenshot
 ![Screenshot](./images/screenshot.png)
 
 ## Quick Start
@@ -66,28 +66,32 @@ init_image = init_image.resize((640, 1536))
 ```
 #### 1536x1536
 
-  ![1536x1536](./images/input-image-size-examples/1536x1536.png)
+![1536x1536](./images/input-image-size-examples/1536x1536.png)
 
 #### 640x640
 
-  ![640x640](./images/input-image-size-examples/640x640.png)
+![640x640](./images/input-image-size-examples/640x640.png)
 
 #### 64x64
 
-  ![64x64](./images/input-image-size-examples/64x64.png)
+![64x64](./images/input-image-size-examples/64x64.png)
 
 #### 20x20
 
-  ![20x20](./images/input-image-size-examples/20x20.png)
+![20x20](./images/input-image-size-examples/20x20.png)
 
 #### 1x1536
 
-  ![1x1536](./images/input-image-size-examples/1x1536.png)
+**NOTE:** The error occurs because the [Pillow](https://pypi.org/project/pillow/) [PIL.Image.resize()](https://github.com/Stability-AI/stability-ai-toolkit/blob/main/sd35-image-to-image-gradio/app.py#L56) method fails on these resize dimensions. Developers should test whether SD3.5 image-to-image can tolerate these dimensions.
+
+![1x1536](./images/input-image-size-examples/1x1536.png)
 
 #### 5x12
 
-  ![5x12](./images/input-image-size-examples/5x12.png)
+**NOTE:** The error occurs because the [Pillow](https://pypi.org/project/pillow/) [PIL.Image.resize()](https://github.com/Stability-AI/stability-ai-toolkit/blob/main/sd35-image-to-image-gradio/app.py#L56) method fails on these resize dimensions. Developers should test whether SD3.5 image-to-image can tolerate these dimensions.
+
+![5x12](./images/input-image-size-examples/5x12.png)
 
 #### 640x1536
 
-  ![640x1536](./images/input-image-size-examples/640x1536.png)
+![640x1536](./images/input-image-size-examples/640x1536.png)
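
Review note: the failing sizes above (1x1536 and 5x12) suggest validating the target dimensions before calling `init_image.resize(...)`. The sketch below is one possible guard, not code from this PR. `snap_dimensions` is a hypothetical helper, and the choice of 16 as both the step and the minimum side length is an assumption that should be tested against the pipeline rather than taken as a documented requirement.

```python
def snap_dimensions(width, height, multiple=16):
    """Round each side to the nearest multiple, never going below one multiple.

    The default of 16 is an assumption made for illustration only.
    """
    def snap(value):
        if value < 1:
            raise ValueError(f"dimension must be positive, got {value}")
        # Integer arithmetic rounds halves up and avoids float rounding surprises.
        rounded = (value + multiple // 2) // multiple * multiple
        return max(multiple, rounded)

    return snap(width), snap(height)
```

A caller could then write `init_image = init_image.resize(snap_dimensions(1, 1536))`, which requests 16x1536 instead of 1x1536.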
diff --git a/sd35-image-to-image-gradio/example_prompts.txt b/sd35-image-to-image-gradio/example_prompts.txt
@@ -2,18 +2,18 @@ positive prompt:
 Replace the soldiers with elves holding bows and arrows
 
 positive prompt:
-Replace the soldiers with elves holding crossbows, first-person-shooter screen shot, 4k
+Replace the soldiers with elves holding crossbows, first-person-shooter screenshot, 4k
 The elves are wearing hoods
 
 
 positive prompt:
-Replace the soldiers with elves holding bows and arrows, first-person-shooter screen shot, 4k
+Replace the soldiers with elves holding bows and arrows, first-person-shooter screenshot, 4k
 The elves are wearing hoods
 There is a dragon flying in the sky
 
 
 positive prompt:
-Replace the soldiers with elves holding bows and arrows, video game screen shot, 4k
+Replace the soldiers with elves holding bows and arrows, video game screenshot, 4k
 The elves are wearing hoods. There is one dragon flying in the sky
 
 negative prompt:

diff --git a/sd35-inpainting-gradio/README.md b/sd35-inpainting-gradio/README.md
@@ -1,9 +1,9 @@
 # Stable Diffusion 3.5 Inpainting in Gradio
 Gradio demo of inpainting using Stable Diffusion 3.5 Large
 
-**Estimated Inference Speed:** 23 seconds for Stable Diffusion 3.5 Large on an NVIDIA H100 GPU
+**Estimated Inference Speed:** 7 seconds for Stable Diffusion 3.5 Large on an NVIDIA H100 GPU
 
-### Screen Shot
+### Screenshot
 ![screenshot.png](./images/screenshot.png)
 
 #### Input Image and Gradio ImageMask

diff --git a/sd35-text-to-image-gradio/README.md b/sd35-text-to-image-gradio/README.md
@@ -3,9 +3,9 @@ Gradio demo of [text-to-image](https://huggingface.co/docs/diffusers/api/pipelin
 
 Full documentation is available on Hugging Face: [Stable Diffusion Text-to-image](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img)
 
-**Estimated Inference Speed:** 23 seconds for Stable Diffusion 3.5 Large on an NVIDIA H100 GPU
+**Estimated Inference Speed:** 7 seconds for Stable Diffusion 3.5 Large on an NVIDIA H100 GPU
 
-### Screen Shot
+### Screenshot
 ![Screenshot](./images/screenshot.png)
 
 ## Quick Start

diff --git a/sd35-text-to-image-gradio/app.py b/sd35-text-to-image-gradio/app.py
@@ -20,7 +20,6 @@
 import torch
 import os
 
-from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
 from diffusers import StableDiffusion3Pipeline
 from huggingface_hub import login
 
@@ -43,6 +42,16 @@ def login_to_hugging_face(self):
             login()
             print("\nWARNING: To avoid the Hugging Face login prompt in the future, please set the HF_TOKEN environment variable:\n\n    export HF_TOKEN=<YOUR HUGGING FACE USER ACCESS TOKEN>\n")
 
+    def _check_shader(self):
+        if torch.backends.mps.is_available():
+            device = "mps"
+        elif torch.cuda.is_available():
+            device = "cuda"
+        else:
+            device = "cpu"
+
+        return device
+
     def _predict(self, guidance_scale, prompt, negative_prompt, progress=gr.Progress(track_tqdm=True)):
         images = self._pipe(
             prompt=prompt,
@@ -65,26 +74,11 @@ def _start_gradio(self):
         ).launch(debug=True, share=True)
 
     def start_text_to_image(self):
-        model_id = "stabilityai/stable-diffusion-3.5-large"
-
-        nf4_config = BitsAndBytesConfig(
-            load_in_4bit=True,
-            bnb_4bit_quant_type="nf4",
-            bnb_4bit_compute_dtype=torch.bfloat16
-        )
-        model_nf4 = SD3Transformer2DModel.from_pretrained(
-            model_id,
-            subfolder="transformer",
-            quantization_config=nf4_config,
-            torch_dtype=torch.bfloat16
-        )
-
         self._pipe = StableDiffusion3Pipeline.from_pretrained(
-            model_id, 
-            transformer=model_nf4,
-            torch_dtype=torch.bfloat16
+            "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
         )
-        self._pipe.enable_model_cpu_offload()
+        device = self._check_shader()
+        self._pipe.to(device)
 
         self._start_gradio()
         return 0
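
Review note: the device choice added in `_check_shader` can be separated from the PyTorch queries so that the ordering (MPS first, then CUDA, then CPU) is testable on a machine without a GPU. This is a sketch under that assumption. `pick_device` and `detect_device` are hypothetical names, not part of this PR.

```python
def pick_device(mps_available, cuda_available):
    """Return a device string using the same priority as _check_shader."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def detect_device():
    """Ask PyTorch which backends are available, then pick a device."""
    # Imported lazily so pick_device() stays usable without PyTorch installed.
    import torch

    return pick_device(
        torch.backends.mps.is_available(),
        torch.cuda.is_available(),
    )
```

With this split, `self._pipe.to(detect_device())` would behave like the code in the diff, while `pick_device` can be unit tested with plain booleans.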

diff --git a/sd35-text-to-image-gradio/example_prompts.txt b/sd35-text-to-image-gradio/example_prompts.txt
@@ -2,4 +2,11 @@ positive prompt:
 Children's birthday party
 
 negative prompt:
-No birthday cake
+No birthday cake
+
+
+positive prompt:
+A group of elves hunting a dragon, 4k cinema
+
+negative prompt:
+No green grass
diff --git a/sd35-text-to-image-gradio/images/screenshot.png b/sd35-text-to-image-gradio/images/screenshot.png
diff --git a/sd35-text-to-image-quantized-gradio/README.md b/sd35-text-to-image-quantized-gradio/README.md
@@ -0,0 +1,106 @@
+# 4-Bit Quantized Stable Diffusion 3.5 Text-to-Image in Gradio
+Gradio demo of [text-to-image](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img) using 4-bit quantized Stable Diffusion 3.5 Large
+
+Full documentation is available on Hugging Face: [Stable Diffusion Text-to-image](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img)
+
+**Estimated Inference Speed:** 18 seconds for quantized Stable Diffusion 3.5 Large on an NVIDIA H100 GPU
+
+### Screenshot
+![Screenshot](./images/screenshot.png)
+
+## Quick Start
+1. In a web browser, log in to Hugging Face and register your name and email
+   to gain access to [stable-diffusion-3.5-large](https://huggingface.co/stabilityai/stable-diffusion-3.5-large)
+2. Create a new Hugging Face [user access token](https://huggingface.co/docs/hub/en/security-tokens),
+   which records that you completed the registration form
+3. Clone this repo to your machine and change into the directory for this demo:
+   ```
+   cd ./stability-ai-toolkit/sd35-text-to-image-quantized-gradio
+   ```
+4. Set up the app in a Python virtual environment:
+
+   ```
+   python -m venv <your_environment_name>
+   source <your_environment_name>/bin/activate
+   ```
+5. Set your `HF_TOKEN` inside your virtual environment
+   ```
+   export HF_TOKEN=<Hugging Face user access token>
+   ```
+6. Install dependencies
+   ```
+   pip install -r requirements.txt
+   ```
+
+   NOTE: Read [requirements.txt](./requirements.txt) for
+   [macOS PyTorch installation instructions](https://developer.apple.com/metal/pytorch/)
+
+   TL;DR:
+   ```
+   # Inside your virtual environment
+   pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
+   ```
+7. Start the app
+   ```
+   python app.py
+   ```
+8. Open the UI in a web browser: [http://127.0.0.1:7861](http://127.0.0.1:7861)
+
+## How to Quantize Stable Diffusion 3.5 Large
+### [With Quantization](./app.py)
+```
+import torch
+from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
+from diffusers import StableDiffusion3Pipeline
+...
+model_id = "stabilityai/stable-diffusion-3.5-large"
+
+nf4_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.bfloat16
+)
+model_nf4 = SD3Transformer2DModel.from_pretrained(
+    model_id,
+    subfolder="transformer",
+    quantization_config=nf4_config,
+    torch_dtype=torch.bfloat16
+)
+
+pipe = StableDiffusion3Pipeline.from_pretrained(
+    model_id, 
+    transformer=model_nf4,
+    torch_dtype=torch.bfloat16
+)
+pipe.enable_model_cpu_offload()
+```
+### [Without Quantization](/sd35-text-to-image-gradio/app.py)
+```
+import torch
+from diffusers import StableDiffusion3Pipeline
+...
+model_id = "stabilityai/stable-diffusion-3.5-large"
+
+pipe = StableDiffusion3Pipeline.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16
+)
+```
+
+## Why Use Quantized Stable Diffusion 3.5 Large
+
+**NOTE:** There is a **SIGNIFICANT IMPROVEMENT** in **NEGATIVE PROMPTING** accuracy when using 4-bit quantized Stable Diffusion 3.5 Large
+
+Many use cases for [Stable Diffusion 3.5 Large](https://huggingface.co/stabilityai/stable-diffusion-3.5-large) (SD3.5 L) need the model's capabilities without its large memory footprint:
+* 4-bit quantization of SD3.5 L allows it to load onto GPUs with limited VRAM
+* 4-bit quantization makes it easier to offload certain parts of model execution to the CPU, further reducing GPU memory usage
+* The decrease in generated image quality is often acceptable, given the lower cost of reduced VRAM requirements
+* Users working on their own computer with a consumer GPU (or Apple Silicon with an integrated GPU) benefit most from quantization
+* [Stable Diffusion 3.5 Medium](https://huggingface.co/stabilityai/stable-diffusion-3.5-medium) (SD3.5 M) is an alternative: it has fewer parameters than Large, and its inference is faster than both SD3.5 L and quantized SD3.5 L
+
+### Stable Diffusion 3.5 Inference Speeds
+|Model|Inference Speed (seconds)|GPU|
+|-----|-------------------------|---|
+|SD3.5 M|4 s|NVIDIA H100 GPU with 80 GB of VRAM|
+|[4-Bit Quantized SD3.5 L](/sd35-text-to-image-quantized-gradio/)|18 s|NVIDIA H100 GPU with 80 GB of VRAM|
+|SD3.5 L|7 s|NVIDIA H100 GPU with 80 GB of VRAM|
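
Review note: the PR does not say how the inference speeds in the table above were measured. One way to reproduce comparable numbers is to time repeated calls and report the median, as in the sketch below. `time_inference` is a hypothetical helper, and the warm-up and run counts are arbitrary choices rather than the method used for the table.

```python
import statistics
import time


def time_inference(generate, runs=3, warmup=1):
    """Return the median wall-clock seconds of generate() over `runs` calls."""
    for _ in range(warmup):
        # Warm-up calls absorb one-time setup costs and are not timed.
        generate()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

With a loaded pipeline named `pipe`, a call might look like `time_inference(lambda: pipe(prompt="Children's birthday party", guidance_scale=2.5))`. Running the same prompt and settings for each model keeps the comparison fair.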