[Multi-Modifier] Scoped apply quantization config #432
Merged
Conversation
FYI #428. Also touches some apply logic and adds more scheme merging.
kylesayrs previously approved these changes (Sep 15, 2025)
LGTM!
rahul-tuli reviewed (Sep 16, 2025)
rahul-tuli previously approved these changes (Sep 16, 2025)
Good job! LGTM! 🚀
dsikka requested changes (Sep 16, 2025)
kylesayrs previously approved these changes (Sep 18, 2025)
dsikka approved these changes (Sep 18, 2025)
brian-dellabetta added a commit to vllm-project/llm-compressor that referenced this pull request (Sep 22, 2025):
…atus (#1772)

SUMMARY:
Prerequisites:
- neuralmagic/compressed-tensors#432

This allows for multi-modifier support by scoping the application of quantization config/status to only the modules in the model that match the given targets/ignore configuration, rather than all modules. Initialization of observers is moved to `on_start` (instead of `on_initialize`) to match their removal in `on_end` (and not `on_finalize`). This prevents collisions during the multi-modifier lifecycle.

- [x] Update AWQ
- [x] Update QuantizationModifier
- [x] Update QuantizationMixin
- [x] Update GPTQ
- [x] No other quantization modifiers exist

TEST PLAN:
- Tests were added to neuralmagic/compressed-tensors#432 to confirm correct application of multiple modifiers.
- Added an example in this PR to show how AWQ and GPTQ can be applied heterogeneously to a model, along with a small README. Logs show alternating AWQ and GPTQ messages for the `"sequential"` pipeline, and correct behavior for `"independent"` pipelines.
- [Model checkpoint](https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-selfattn-w8a8-mlp-w4a16-sequential/tree/main) for the sequential pipeline shows correct application of W8A8 to self_attn layers and W4A16 to mlp layers. config.json and safetensors weights all look as expected.

Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
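For context, here is a minimal sketch of the kind of heterogeneous recipe this example describes, assuming llm-compressor's `AWQModifier`, `GPTQModifier`, and `oneshot` entrypoints; the target regexes, calibration settings, and the `pipeline` choice are illustrative assumptions, not values copied from the referenced PR:

```python
# Sketch of a heterogeneous multi-modifier recipe: GPTQ W8A8 on self-attention
# projections and AWQ W4A16 on MLP projections (regexes and calibration
# settings below are illustrative assumptions, not taken from the PR).
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = [
    # GPTQ W8A8 scoped to the self-attention projections only
    GPTQModifier(
        targets=["re:.*self_attn.*"],
        scheme="W8A8",
        ignore=["lm_head"],
    ),
    # AWQ W4A16 scoped to the MLP projections only
    AWQModifier(
        targets=["re:.*mlp.*"],
        scheme="W4A16",
        ignore=["lm_head"],
    ),
]

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    pipeline="sequential",  # "independent" also exercises the multi-modifier path
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

With the scoped config/status application from compressed-tensors#432, each modifier only initializes observers and attaches quantization parameters on the modules its targets match, so the two passes do not collide.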
In order to support multi-modifier recipes (e.g. AWQ+W4A16 on self_attn layers and FP8_DYNAMIC on mlp layers), quantization config and status must be applied only to the modules scoped to the modifier, not all at once. This updates `apply_quantization_config` so that quantization_config and quantization_status are applied just to the target modules rather than changed globally across all modules. For proper target prioritization, `apply_quantization_status` is performed regardless of the model's current status. Without these changes, `test_target_prioritization` will fail.
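As a rough illustration of this scoping, below is a minimal sketch that applies two configs in series to a toy model via `apply_quantization_config`; the toy module names and the exact quantization args are assumptions made for illustration:

```python
# Minimal sketch: apply two scoped quantization configs in series.
# The toy model and the quantization args below are illustrative assumptions.
import torch
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationConfig,
    QuantizationScheme,
    apply_quantization_config,
)


class ToyBlock(torch.nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        # Submodule names chosen so the "self_attn" / "mlp" regexes below match
        self.self_attn = torch.nn.ModuleDict({"q_proj": torch.nn.Linear(hidden, hidden)})
        self.mlp = torch.nn.ModuleDict({"up_proj": torch.nn.Linear(hidden, hidden)})


model = ToyBlock()

# First pass: a W8A8-style scheme scoped to the self-attention projections
attn_config = QuantizationConfig(
    config_groups={
        "group_0": QuantizationScheme(
            targets=["re:.*self_attn.*"],
            weights=QuantizationArgs(num_bits=8, strategy="channel", symmetric=True),
            input_activations=QuantizationArgs(num_bits=8, strategy="token", dynamic=True),
        )
    },
)
apply_quantization_config(model, attn_config)

# Second pass: a W4A16-style scheme scoped to the MLP projections. With the
# scoped apply logic described above, this should leave the qparams already
# attached to the self_attn modules untouched.
mlp_config = QuantizationConfig(
    config_groups={
        "group_0": QuantizationScheme(
            targets=["re:.*mlp.*"],
            weights=QuantizationArgs(num_bits=4, strategy="group", group_size=32),
        )
    },
)
apply_quantization_config(model, mlp_config)
```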
Other small changes:
- Added `test_multi_apply_quantization_config` to make sure the application of multiple quantization configs in series works correctly -- shapes are correct and unused parameters are correctly removed.
- Removed `override_quantization_status` in favor of the more general `patch_attr` (a short usage sketch appears after this list).
- Removed `infer_quantization_status`, which is no longer meaningful at the model level. It is also no longer needed because the module's current status isn't checked.
- Added an `ALL_QPARAM_NAMES` constant so that parameters related to quantization can be cleared from modules during init.
- Removed `"quant_method": "sparseml"` in favor of `"compressed-tensors"`.
- Deprecated `compress_quantized_weights` and `apply_quantization_status`. We can remove `compress_quantized_weights` and references to it in examples/notebooks in a follow-up PR.

Merge in conjunction with
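For reference, here is a tiny illustration of the `patch_attr` pattern mentioned above; the placeholder object is hypothetical, and this only assumes that `patch_attr(obj, attr, value)` is a context manager exported from `compressed_tensors.utils` that restores the original attribute value on exit:

```python
# Placeholder object standing in for a module with a quantization_status field.
from types import SimpleNamespace

from compressed_tensors.utils import patch_attr

module = SimpleNamespace(quantization_status="initialized")

with patch_attr(module, "quantization_status", "calibration"):
    print(module.quantization_status)  # "calibration" inside the context

print(module.quantization_status)  # restored to "initialized" on exit
```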