Describe the bug
RuntimeError: Quantization failed for Qwen/Qwen3.5-4B with exit code 1
Calibrating for FP8_DEFAULT_CFG (effective-bits: 8.0): 100%|██████████| 43/43 [00:17<00:00, 2.49it/s]
/opt/Model-Optimizer/modelopt/torch/quantization/plugins/huggingface.py:1194: UserWarning: AutoQuantize: Huggingface model detected - Enabling gradient checkpointing.
Disable gradient checkpointing after AutoQuantize if this is not desired!
warnings.warn(
AutoQuantize: Enabling gradient for param embed_tokens.weight.
Estimating auto_quantize scores: 0%| | 0/10 [00:00<?, ?it/s] | reserved: 1.35e+05 | max_reserved: 1.35e+05
Traceback (most recent call last):
File "/opt/Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 1215, in <module>
main(args)
File "/opt/Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 1193, in main
quantize_main(
File "/opt/Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 878, in quantize_main
auto_quantize(
File "/opt/Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 278, in auto_quantize
language_model, _ = mtq.auto_quantize(
^^^^^^^^^^^^^^^^^^
File "/opt/Model-Optimizer/modelopt/torch/quantization/model_quant.py", line 495, in auto_quantize
searcher.search(model, constraints, config=search_config) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/Model-Optimizer/modelopt/torch/opt/searcher.py", line 151, in search
self.before_search()
File "/opt/Model-Optimizer/modelopt/torch/quantization/algorithms.py", line 637, in before_search
self.estimate_sensitivity_scores()
File "/opt/Model-Optimizer/modelopt/torch/quantization/algorithms.py", line 970, in estimate_sensitivity_scores
self._estimate_auto_quantize_scores(is_param_grad_enabled)
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/Model-Optimizer/modelopt/torch/quantization/algorithms.py", line 935, in _estimate_auto_quantize_scores
self._run_func(
File "/opt/Model-Optimizer/modelopt/torch/quantization/algorithms.py", line 580, in _run_func
func(self.model, data)
File "/opt/Model-Optimizer/modelopt/torch/quantization/algorithms.py", line 821, in forward_backward_step
loss = self.config["loss_func"](output, data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 263, in loss_func
return output.loss
^^^^^^^^^^^
- ?
Steps/Code to reproduce bug
Run `hf_ptq.py` with `--qformat nvfp4,fp8 --auto_quantize_bits 4` and any Qwen3.5 model.
See https://huggingface.co/docs/transformers/model_doc/qwen3_5
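A command line that should reproduce this; only `--qformat` and `--auto_quantize_bits` are taken from the report, the model-path flag name is assumed from the example's usual interface:

```
# Hypothetical invocation; adjust the checkpoint flag to match hf_ptq.py's CLI.
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path Qwen/Qwen3.5-4B \
    --qformat nvfp4,fp8 \
    --auto_quantize_bits 4
```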
- ?
Expected behavior
Who can help?
- ?
System information
- Container used (if applicable): ?
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ?
- CPU architecture (x86_64, aarch64): ?
- GPU name (e.g. H100, A100, L40S): ?
- GPU memory size: ?
- Number of GPUs: ?
- Library versions (if applicable):
- Python: ?
- ModelOpt version or commit hash: ?
- CUDA: ?
- PyTorch: ?
- Transformers: ?
- TensorRT-LLM: ?
- ONNXRuntime: ?
- TensorRT: ?
- Any other details that may help: ?