
NVFP4 quant #91

Open

GokaNik wants to merge 6 commits into kandinskylab:main from GokaNik:nvfp4_quant

Conversation


GokaNik commented on Dec 8, 2025

Title: Add NVFP4 weights-only quantization support to fit Kandinsky-5 Pro into 24 GB VRAM

Summary

This PR adds support for weights-only NVFP4 quantization for the model so that it can run on GPUs with 24 GB VRAM (with offload=True).

Changes

  • Integrated NVFP4 weights-only quantization using NVIDIA ModelOpt.
  • Added script create_nvfp4_weights.py to generate NVFP4-quantized checkpoints from the original weights.
  • Re-enabled torch.compile in kandinsky/models/dit.py, which significantly improves NVFP4-quantized runs; the baseline model path has also been verified to work with it (a minimal sketch follows this list).
  • Extended model loading to support:
    • model_type="quantized"
    • quantized_model_path="kandinsky-5/K5Pro_nvfp4.pth"
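
As referenced in the torch.compile item above, a minimal sketch of the idea; the real change lives in kandinsky/models/dit.py, and the dit variable below is an assumption for illustration:

import torch

# Assumed: `dit` is the already-constructed DiT module from kandinsky/models/dit.py.
# torch.compile captures and optimizes the forward pass; the first call pays a
# one-time compilation cost, which shows up as longer pipeline initialization.
dit = torch.compile(dit)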

New dependency

Before using NVFP4 quantization, install NVIDIA ModelOpt:

pip install -U "nvidia-modelopt[all]"

Supported modes (24 GB)

On a 24 GB GPU, the NVFP4-quantized weights allow the following modes to run with offload=True:

  • 5s sft sd

    • 768×512
    • 512×768
    • 512×512
  • 10s sft sd

    • 512×512

Note: all listed configurations fit into 24 GB VRAM only when running with offload=True.

How to generate NVFP4 weights

Use the provided script to create a quantized checkpoint from the original model weights:

python create_nvfp4_weights.py

The script produces, e.g.:

kandinsky-5/K5Pro_nvfp4.pth
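
For orientation, here is a minimal sketch of what a weights-only NVFP4 conversion with ModelOpt can look like. It is not the contents of create_nvfp4_weights.py: the stand-in model, the exact ModelOpt preset, the config tweak, and the save call are all assumptions for illustration.

import copy

import torch
import torch.nn as nn

import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq

# Stand-in for the real DiT, just to show the mechanics; the actual script
# builds the Kandinsky-5 DiT and loads the original weights instead.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda().eval()

# Start from ModelOpt's NVFP4 preset and disable activation quantizers so that
# only the weights are quantized (treating this exact recipe as an assumption
# about what the PR does).
cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
cfg["quant_cfg"]["*input_quantizer"] = {"enable": False}

# Weights-only quantization needs no activation calibration data,
# so no forward_loop is passed to mtq.quantize here.
model = mtq.quantize(model, cfg)

# Persist the quantized model; the real script writes kandinsky-5/K5Pro_nvfp4.pth,
# which is then consumed via model_type="quantized".
mto.save(model, "K5Pro_nvfp4_demo.pth")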

How to run the quantized model

In your config or launch script, select the quantized model type and point to the generated checkpoint, for example:

pipe = get_T2V_pipeline(
    device_map=device_map,
    conf_path=args.config,
    offload=True,
    magcache=args.magcache,
    quantized_qwen=args.qwen_quantization,
    attention_engine=args.attention_engine,
    model_type="quantized",
    quantized_model_path="kandinsky-5/K5Pro_nvfp4.pth",
)


GokaNik (author) commented on Dec 10, 2025

For reference, here are some performance and VRAM measurements of the NVFP4-quantized model with FlashAttention-3 on an H200 GPU. Both baseline and quantized runs use FlashAttention-3 and offload=True.


10s sft sd — speed (512×512)

| Metric / run | Base (full-precision) | Quant (NVFP4) | Difference (Quant vs Base) |
| --- | --- | --- | --- |
| Initialization (pipeline load) | 91.004 s | 267.822 s | +194.2% (slower) |
| Generation 1 | 693.033 s | 713.246 s | +2.92% |
| Generation 2 | 636.360 s | 665.650 s | +4.60% |
| Generation 3 | 621.406 s | 656.830 s | +5.70% |
| Generation 4 | 618.401 s | 652.396 s | +5.50% |
| Total generation time (4 videos) | 2569.200 s (42 m 49 s) | 2688.122 s (44 m 48 s) | +4.63% |
| Avg time per generation | 642.30 s | 672.03 s | +4.63% |
| Total (init + 4 generations) | 2660.204 s (44 m 20 s) | 2955.944 s (49 m 15 s) | +11.12% |

10s sft sd — VRAM behavior (512×512)

Below is the VRAM usage over time for the quantized 10s sft sd 512×512 run:

[Figure: VRAM usage over time, quantized 10s sft sd 512×512 run]

The maximum VRAM usage in this run is 24,215.8 MB (~23.6 GB), i.e. just below the 24 GB cap. This confirms that the 10s sft sd, 512×512, offload=True, NVFP4 configuration fits into 24 GB of VRAM as claimed in the PR description.
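
For completeness, a VRAM-over-time trace like the one above can be collected with a simple polling loop run alongside generation. The sketch below uses pynvml and is an assumption about the measurement setup, not the script actually used for these plots:

import csv
import time

import pynvml

# Poll total used VRAM on GPU 0 once per second and append it to a CSV,
# which can later be plotted as "VRAM usage over time".
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

with open("vram_usage.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["t_seconds", "used_mb"])
    start = time.time()
    try:
        while True:
            used_mb = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2
            writer.writerow([round(time.time() - start, 1), round(used_mb, 1)])
            f.flush()
            time.sleep(1.0)
    except KeyboardInterrupt:
        pass

pynvml.nvmlShutdown()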


5s sft sd — speed (512×768)

| Metric / run | Base (full-precision) | Quant (NVFP4) | Difference (Quant vs Base) |
| --- | --- | --- | --- |
| Initialization (pipeline load) | 89.709 s | 258.589 s | +188.3% (slower) |
| Generation 1 | 649.211 s | 656.750 s | +1.16% |
| Generation 2 | 600.772 s | 616.600 s | +2.64% |
| Generation 3 | 591.094 s | 619.643 s | +4.83% |
| Generation 4 | 593.580 s | 618.540 s | +4.20% |
| Total generation time (4 videos) | 2434.657 s (40 m 34 s) | 2511.533 s (41 m 51 s) | +3.16% |
| Avg time per generation | 608.66 s | 627.88 s | +3.16% |
| Total (init + 4 generations) | 2524.366 s (42 m 4 s) | 2770.122 s (46 m 10 s) | +9.74% |

In short, NVFP4 quantization keeps per-generation time within roughly 3–5% of the baseline; the main overhead is in the first initialization due to compilation and quantized weights loading.


5s sft sd — VRAM behavior (512×768)

Below is the VRAM usage over time for an earlier quantized run on the same H200:

[Figure: GPU VRAM usage over time, quantized 5s sft sd 512×768 run on the same H200]

There is a transient decoder peak of ~35.5 GB when the full GPU memory is available. The decoder allocates memory “as needed”, so when we artificially constrain the available VRAM to 24 GB, it adjusts its allocations and still fits without OOM, which matches the supported 24 GB modes in the PR description.
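
One way to emulate a 24 GB cap on a much larger GPU such as the H200 (~141 GB of HBM) is PyTorch's per-process memory fraction limit; it only bounds allocations made through PyTorch's caching allocator, and whether these runs used this mechanism or something else is an assumption:

import torch

# Cap this process's PyTorch allocations on GPU 0 at roughly 24 GB.
# On an H200 with ~141 GB of HBM, the resulting fraction is ~0.17.
total_bytes = torch.cuda.get_device_properties(0).total_memory
torch.cuda.set_per_process_memory_fraction((24 * 1024**3) / total_bytes, device=0)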
