This document provides ComfyUI integration patterns for quantization. These examples show how the quantization implementations in this workspace integrate with ComfyUI's inference runtime.
Related files in this workspace:
- `quant_ops.py` - Layout system matching ComfyUI's QuantizedTensor interface
- `convert_to_quant.py` - Generates ComfyUI-compatible quantized models
- `MANUAL.md` - Complete usage guide
- Development: This workspace develops quantization methods (INT8 algorithms, learned rounding)
- Output: Generates `.safetensors` files with `.comfy_quant` metadata compatible with ComfyUI
- Runtime: ComfyUI loads these models using its `quant_ops.py` (mirrored in this workspace)
- Testing: Load quantized models in ComfyUI to validate quality and performance
Compatibility: The QuantizedTensor and layout system in quant_ops.py matches ComfyUI's quantization interface.
QuantizedTensor class structure:

```python
import torch
from convert_to_quant.comfy.quant_ops import QuantizedTensor

class QuantizedTensor(torch.Tensor):
    _qdata: torch.Tensor    # Quantized data storage
    _layout_type: str       # Layout identifier (e.g., "TensorCoreFP8Layout")
    _layout_params: dict    # Scale, orig_dtype, etc.

    @classmethod
    def from_float(cls, tensor, layout_type, **kwargs):
        """Create a quantized tensor from a float tensor."""
        pass

    def dequantize(self):
        """Convert back to the original dtype."""
        pass
```

Creating custom quantization layouts:
```python
import torch
from convert_to_quant.comfy.quant_ops import (
    QuantizedLayout,
    QuantizedTensor,
    register_layout_op,
)

class MyCustomLayout(QuantizedLayout):
    """Custom quantization layout for a specific use case."""

    @classmethod
    def quantize(cls, tensor, scale=None, dtype=torch.int8, **kwargs):
        """
        Quantize a float tensor.

        Args:
            tensor: Input float tensor
            scale: Quantization scale (computed if None)
            dtype: Target quantized dtype

        Returns:
            Tuple of (quantized_data, layout_params_dict)
        """
        if scale is None:
            scale = tensor.abs().max() / 127
        qdata = (tensor / scale).round().clamp(-128, 127).to(dtype)
        layout_params = {
            "scale": scale,
            "orig_dtype": tensor.dtype,
        }
        return qdata, layout_params

    @staticmethod
    def dequantize(qdata, scale, orig_dtype, **kwargs):
        """Dequantize back to the original dtype."""
        return qdata.to(orig_dtype) * scale

# Register a custom operation handler for the layout
@register_layout_op(torch.ops.aten.linear.default, "MyCustomLayout")
def my_custom_linear(func, args, kwargs):
    """
    Custom linear operation for MyCustomLayout tensors.

    Args:
        func: Original torch function
        args: Positional arguments (input, weight, bias)
        kwargs: Keyword arguments
    """
    input_tensor = args[0]
    weight = args[1]
    bias = args[2] if len(args) > 2 else None
    # Dequantize the weight if needed
    if isinstance(weight, QuantizedTensor):
        weight = weight.dequantize()
    # Perform the operation in full precision
    return torch.nn.functional.linear(input_tensor, weight, bias)
```

Mixed precision operations factory:
```python
from convert_to_quant.comfy.quant_ops import QuantizedTensor

# Note: ComfyUI's mixed_precision_ops can be configured to use these tensors
```

Models quantized with this workspace can be loaded directly in ComfyUI:
```python
# ComfyUI automatically detects .comfy_quant metadata and creates QuantizedTensor wrappers
# No special loading code needed - just use the normal model loader
# The quantized model will have:
# - weight: QuantizedTensor (int8)
# - weight_scale: float32 tensor
# - input_scale: float32 tensor (for INT8)
# - .comfy_quant: metadata tensor
```

- Load model in ComfyUI using standard loader nodes
- Generate test images with known prompts
- Compare outputs with original (non-quantized) model
- Check metrics: visual quality, inference speed, memory usage
- Test edge cases: long prompts, high resolution, multiple batches
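The output-comparison step above can be scripted with a simple pixel-level metric. A minimal, framework-free sketch (hypothetical helpers, not part of this workspace) that computes MSE and PSNR between an original and a quantized model's output:

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length pixel sequences."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the original."""
    err = mse(a, b)
    if err == 0:
        return float("inf")
    return 10 * math.log10(max_val ** 2 / err)

# Toy pixel values standing in for rendered outputs
orig = [0.10, 0.50, 0.90, 0.30]
quant = [0.11, 0.49, 0.91, 0.29]
print(round(psnr(orig, quant), 1))  # → 40.0
```

Higher PSNR means the quantized model's output is closer to the original; values above roughly 35 dB are typically hard to distinguish visually.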
Implementation reference: See convert_to_quant.py bias correction logic for quality optimization techniques.
Instead of loading LoRA separately at inference time, you can merge it into the base model and quantize the result:
```python
from convert_to_quant.quantization import convert_to_int8

# Merge single LoRA and quantize
convert_to_int8(
    input_file="base_model.safetensors",
    output_file="merged_quantized.safetensors",
    comfy_quant=True,
    merge_lora_path="style_lora.safetensors",
    merge_lora_scale=1.0,
    optimizer="quip",
)

# Merge multiple LoRAs with automatic dampening
convert_to_int8(
    input_file="base_model.safetensors",
    output_file="merged_multi_lora.safetensors",
    comfy_quant=True,
    merge_lora_paths=["style_lora.safetensors", "character_lora.safetensors"],
    merge_lora_scale=1.0,
    merge_lora_dampen=True,
    optimizer="quip",
)
```

```bash
# Merge and quantize with QuIP
convert_to_quant -i base_model.safetensors \
    --merge-lora style_lora.safetensors \
    --optimizer quip \
    --comfy_quant

# Merge multiple LoRAs with custom scale
convert_to_quant -i base_model.safetensors \
    --merge-loras lora1.safetensors lora2.safetensors \
    --merge-lora-scale 0.8 \
    --comfy_quant
```

Process large models on GPUs with limited VRAM by offloading heavy operations to CPU:
```bash
# Auto-detect best streaming settings based on available VRAM
convert_to_quant -i large_model.safetensors --streaming-mode auto --comfy_quant

# Aggressive streaming for <8GB VRAM GPUs
convert_to_quant -i large_model.safetensors --streaming-mode aggressive --comfy_quant
```

Use BF16 precision for internal calculations on Ampere+ GPUs (RTX 30/40 series) to speed up quantization:
```bash
# Enable BF16 compute (auto mode uses BF16 for large tensors)
convert_to_quant -i model.safetensors --bf16-compute auto --comfy_quant
```

Extreme memory savings (75-90%) for very large layers (e.g., 16k+ dimensions) using gradient checkpointing-style recomputation:
```bash
# Enable checkpointed LDLQ for large layers
convert_to_quant -i model.safetensors --optimizer quip --quip-checkpointed --comfy_quant
```

Analyze a model before quantization to see which layers will be processed:
```bash
# Show analysis of layers and expected quantization formats
convert_to_quant -i model.safetensors --dry-run analyze --flux2
```

Generate a JSON template for fine-grained per-layer quantization control:
```bash
# Create a template JSON file based on the model structure
convert_to_quant -i model.safetensors --dry-run create-template
```

Modify metadata or remove/add tensors in an already quantized file:
```bash
# Remove specific tensors and update metadata
convert_to_quant -i quantized_model.safetensors --edit-quant \
    --remove-keys "layer1.weight_scale,layer2.weight_scale" \
    --save-quant-metadata
```

- Single file deployment - No separate LoRA loading needed at inference
- Faster inference - No runtime LoRA computation overhead
- Better quantization quality - QuIP can optimize for the merged weights rather than base + adapter separately
- Simpler workflow - One quantized file contains everything needed
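Conceptually, merging folds the low-rank LoRA update into the base weight before quantization, W' = W + scale · (B · A), so the optimizer sees the final weights. A toy, framework-free sketch of the merge step (illustrative only, not the converter's actual code):

```python
def matmul(A, B):
    """Naive matrix multiply for small nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def merge_lora(W, B, A, scale=1.0):
    """Fold a low-rank LoRA update into the base weight: W' = W + scale * (B @ A)."""
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

W = [[1.0, 0.0], [0.0, 1.0]]   # 2x2 base weight
B = [[1.0], [0.0]]             # 2x1 up-projection (rank 1)
A = [[0.5, 0.5]]               # 1x2 down-projection (rank 1)
print(merge_lora(W, B, A))     # → [[1.5, 0.5], [0.0, 1.0]]
```

After the merge, the LoRA factors are no longer needed at inference, which is what enables the single-file deployment listed above.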
When merging multiple LoRAs, automatic dampening prevents over-saturation:
| LoRA Index | Scale Applied | Description |
|---|---|---|
| 1st | 1.0 × `--merge-lora-scale` | Full strength |
| 2nd | 0.9 × `--merge-lora-scale` | 10% reduction |
| 3rd | 0.81 × `--merge-lora-scale` | 19% reduction |
| nth | 0.9^(n-1) × `--merge-lora-scale` | Progressive dampening |
Disable with `--merge-lora-dampen=False` if you want equal weighting.
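The dampening schedule in the table is a simple geometric decay of the merge scale; assuming that formula, the per-LoRA scales could be computed as:

```python
def dampened_scales(n_loras, base_scale=1.0, factor=0.9):
    """Scale applied to the n-th LoRA: base_scale * factor**(n-1)."""
    return [base_scale * factor ** i for i in range(n_loras)]

# Scales for three merged LoRAs: approximately 1.0, 0.9, 0.81
print(dampened_scales(3))
```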
- Develop in this workspace: Implement new quantization format in quant_ops.py
- Test with converter: Use convert_to_quant.py to quantize models
- Validate in ComfyUI: Load and test quantized models
- Document findings: Record results in research notes
- Refine implementation: Iterate based on quality/performance metrics
Example: The INT8 block-wise layout was developed using this workflow.
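As a rough illustration of what a block-wise layout does, here is a minimal, framework-free sketch that quantizes a vector with one symmetric INT8 scale per block (a simplified stand-in for the actual implementation in quant_ops.py):

```python
def quantize_blockwise(values, block_size=4):
    """Symmetric per-block INT8 quantization: one scale per block of values."""
    qdata, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) / 127 or 1.0  # avoid zero scale for all-zero blocks
        scales.append(scale)
        qdata.extend(max(-128, min(127, round(v / scale))) for v in block)
    return qdata, scales

def dequantize_blockwise(qdata, scales, block_size=4):
    """Invert the per-block quantization."""
    return [q * scales[i // block_size] for i, q in enumerate(qdata)]

x = [0.5, -1.0, 0.25, 0.75, 10.0, -20.0, 5.0, 0.0]
q, s = quantize_blockwise(x)
x_hat = dequantize_blockwise(q, s)
max_err = max(abs(a - b) for a, b in zip(x, x_hat))
assert max_err <= max(s) / 2 + 1e-9  # round-to-nearest error is at most half a step
```

Per-block scales keep a single outlier (like the -20.0 above) from inflating the quantization step for the rest of the tensor, which is the main motivation for block-wise layouts.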