
NVFP4 quant #91

Open

GokaNik wants to merge 6 commits into kandinskylab:main from GokaNik:nvfp4_quant

Conversation


GokaNik commented on Dec 8, 2025

Title: Add NVFP4 weights-only quantization support to fit Kandinsky-5 Pro into 24 GB VRAM

Summary

This PR adds support for weights-only NVFP4 quantization for the model so that it can run on GPUs with 24 GB VRAM (with offload=True).

Changes

  • Integrated NVFP4 weights-only quantization using NVIDIA ModelOpt.
  • Added script create_nvfp4_weights.py to generate NVFP4-quantized checkpoints from the original weights.
  • Re-enabled torch.compile in kandinsky/models/dit.py, which significantly improves NVFP4-quantized runs; the baseline model path has also been verified to work with it (a minimal sketch follows this list).
  • Extended model loading to support:
    • model_type="quantized"
    • quantized_model_path="kandinsky-5/K5Pro_nvfp4.pth"
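
As referenced in the torch.compile item above, a minimal sketch of the idea; the real change lives in kandinsky/models/dit.py, and the dit variable below is an assumption for illustration:

import torch

# Assumed: `dit` is the already-constructed DiT module from kandinsky/models/dit.py.
# torch.compile captures and optimizes the forward pass; the first call pays a
# one-time compilation cost, which shows up as longer pipeline initialization.
dit = torch.compile(dit)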

New dependency

Before using NVFP4 quantization, install NVIDIA ModelOpt:

pip install -U "nvidia-modelopt[all]"

Supported modes (24 GB)

On a 24 GB GPU, the NVFP4-quantized weights allow the following modes to run with offload=True:

  • 5s sft sd

    • 768×512
    • 512×768
    • 512×512
  • 10s sft sd

    • 512×512

Note: all listed configurations fit into 24 GB VRAM only when running with offload=True.

How to generate NVFP4 weights

Use the provided script to create a quantized checkpoint from the original model weights:

python create_nvfp4_weights.py

The script produces, e.g.:

kandinsky-5/K5Pro_nvfp4.pth
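
For orientation, here is a minimal sketch of what a weights-only NVFP4 conversion with ModelOpt can look like. It is not the contents of create_nvfp4_weights.py: the stand-in model, the exact ModelOpt preset, the config tweak, and the save call are all assumptions for illustration.

import copy

import torch
import torch.nn as nn

import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq

# Stand-in for the real DiT, just to show the mechanics; the actual script
# builds the Kandinsky-5 DiT and loads the original weights instead.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda().eval()

# Start from ModelOpt's NVFP4 preset and disable activation quantizers so that
# only the weights are quantized (treating this exact recipe as an assumption
# about what the PR does).
cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
cfg["quant_cfg"]["*input_quantizer"] = {"enable": False}

# Weights-only quantization needs no activation calibration data,
# so no forward_loop is passed to mtq.quantize here.
model = mtq.quantize(model, cfg)

# Persist the quantized model; the real script writes kandinsky-5/K5Pro_nvfp4.pth,
# which is then consumed via model_type="quantized".
mto.save(model, "K5Pro_nvfp4_demo.pth")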

How to run the quantized model

In your config or launch script, select the quantized model type and point to the generated checkpoint, for example:

pipe = get_T2V_pipeline(
    device_map=device_map,
    conf_path=args.config,
    offload=True,
    magcache=args.magcache,
    quantized_qwen=args.qwen_quantization,
    attention_engine=args.attention_engine,
    model_type="quantized",
    quantized_model_path="kandinsky-5/K5Pro_nvfp4.pth",
)


GokaNik (author) commented on Dec 10, 2025

For reference, here are some performance and VRAM measurements of the NVFP4-quantized model with FlashAttention-3 on an H200 GPU. Both baseline and quantized runs use FlashAttention-3 and offload=True.


10s sft sd — speed (512×512)

| Metric / run | Base (full-precision) | Quant (NVFP4) | Difference (Quant vs Base) |
| --- | --- | --- | --- |
| Initialization (pipeline load) | 91.004 s | 267.822 s | +194.2% (slower) |
| Generation 1 | 693.033 s | 713.246 s | +2.92% |
| Generation 2 | 636.360 s | 665.650 s | +4.60% |
| Generation 3 | 621.406 s | 656.830 s | +5.70% |
| Generation 4 | 618.401 s | 652.396 s | +5.50% |
| Total generation time (4 videos) | 2569.200 s (42 m 49 s) | 2688.122 s (44 m 48 s) | +4.63% |
| Avg time per generation | 642.30 s | 672.03 s | +4.63% |
| Total (init + 4 generations) | 2660.204 s (44 m 20 s) | 2955.944 s (49 m 15 s) | +11.12% |

10s sft sd — VRAM behavior (512×512)

Below is the VRAM usage over time for the quantized 10s sft sd 512×512 run:

[Figure: VRAM usage over time, quantized 10s sft sd 512×512 run]

The maximum VRAM usage in this run is 24,215.8 MB (~23.6 GB), i.e. just below the 24 GB cap. This confirms that the 10s sft sd, 512×512, offload=True, NVFP4 configuration fits into 24 GB of VRAM as claimed in the PR description.
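
For completeness, a VRAM-over-time trace like the one above can be collected with a simple polling loop run alongside generation. The sketch below uses pynvml and is an assumption about the measurement setup, not the script actually used for these plots:

import csv
import time

import pynvml

# Poll total used VRAM on GPU 0 once per second and append it to a CSV,
# which can later be plotted as "VRAM usage over time".
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

with open("vram_usage.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["t_seconds", "used_mb"])
    start = time.time()
    try:
        while True:
            used_mb = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2
            writer.writerow([round(time.time() - start, 1), round(used_mb, 1)])
            f.flush()
            time.sleep(1.0)
    except KeyboardInterrupt:
        pass

pynvml.nvmlShutdown()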


5s sft sd — speed (512×768)

| Metric / run | Base (full-precision) | Quant (NVFP4) | Difference (Quant vs Base) |
| --- | --- | --- | --- |
| Initialization (pipeline load) | 89.709 s | 258.589 s | +188.3% (slower) |
| Generation 1 | 649.211 s | 656.750 s | +1.16% |
| Generation 2 | 600.772 s | 616.600 s | +2.64% |
| Generation 3 | 591.094 s | 619.643 s | +4.83% |
| Generation 4 | 593.580 s | 618.540 s | +4.20% |
| Total generation time (4 videos) | 2434.657 s (40 m 34 s) | 2511.533 s (41 m 51 s) | +3.16% |
| Avg time per generation | 608.66 s | 627.88 s | +3.16% |
| Total (init + 4 generations) | 2524.366 s (42 m 4 s) | 2770.122 s (46 m 10 s) | +9.74% |

In short, NVFP4 quantization keeps per-generation time within roughly 3–5% of the baseline; the main overhead is in the first initialization due to compilation and quantized weights loading.


5s sft sd — VRAM behavior (512×768)

Below is the VRAM usage over time for an earlier quantized run on the same H200:

[Figure: GPU VRAM usage over time, quantized 5s sft sd 512×768 run on the same H200]

There is a transient decoder peak of ~35.5 GB when the full GPU memory is available. The decoder allocates memory “as needed”, so when we artificially constrain the available VRAM to 24 GB, it adjusts its allocations and still fits without OOM, which matches the supported 24 GB modes in the PR description.
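
One way to emulate a 24 GB cap on a much larger GPU such as the H200 (~141 GB of HBM) is PyTorch's per-process memory fraction limit; it only bounds allocations made through PyTorch's caching allocator, and whether these runs used this mechanism or something else is an assumption:

import torch

# Cap this process's PyTorch allocations on GPU 0 at roughly 24 GB.
# On an H200 with ~141 GB of HBM, the resulting fraction is ~0.17.
total_bytes = torch.cuda.get_device_properties(0).total_memory
torch.cuda.set_per_process_memory_fraction((24 * 1024**3) / total_bytes, device=0)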
