[OpenVINO] Refactor CLI quantization #1525
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hi @echarlaix @helena-intel @IlyasMoutawwakil @rkazants. Please take a look at this PR. The diff is quite big, but I've prepared an extensive PR description to hopefully ease the review process. Thanks!
Pull request overview
This PR refactors the CLI quantization flow for OpenVINO model exports by moving quantization logic from the command layer to the export layer. The changes enable better default quantization config matching based on model IDs and simplify future enhancements.
Key Changes:
- Quantization application moved from `optimum/commands/export/openvino.py` to `optimum/exporters/openvino/__main__.py`
- `OVBaseModel._prepare_quantization_config()` decoupled into three methods: `_resolve_default_quantization_config()`, `_quantization_config_from_dict()`, and `_preprocess_quantization_config()`
- Quantization now uses temporary directories to avoid partial output on failure
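The three decoupled methods could be sketched roughly as follows; the method names mirror the PR, but the `OVBaseModelSketch` class, the signatures, and the bodies are illustrative assumptions, not the actual optimum-intel implementation:

```python
from typing import Optional

class OVBaseModelSketch:
    # Assumed lookup table standing in for the real default-config matching.
    _DEFAULT_INT4_CONFIGS = {"some-org/some-model": {"bits": 4, "sym": False}}

    @classmethod
    def _resolve_default_quantization_config(cls, model_id: str) -> Optional[dict]:
        # Match a default int4 config based on the model id.
        return cls._DEFAULT_INT4_CONFIGS.get(model_id)

    @staticmethod
    def _quantization_config_from_dict(config_dict: dict) -> dict:
        # Build a config object from a plain dict (here: a defensive copy).
        return dict(config_dict)

    def _preprocess_quantization_config(self, config: dict) -> dict:
        # Apply model-specific updates, e.g. attaching a tokenizer id.
        config.setdefault("tokenizer", "hypothetical-tokenizer-id")
        return config
```
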
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| `optimum/intel/openvino/modeling_base.py` | Refactored quantization config preparation and application methods |
| `optimum/exporters/openvino/__main__.py` | Added generic quantization application logic and helper functions |
| `optimum/commands/export/openvino.py` | Removed data-aware quantization handling logic |
| `optimum/intel/openvino/quantization.py` | Updated to save OV config via model object |
| `tests/openvino/test_exporters_cli.py` | Updated task names and added test skip logic |
| `tests/openvino/test_quantization.py` | Updated stack frame depth for assertion |
| `optimum/intel/openvino/modeling_*.py` | Added `_preprocess_quantization_config()` implementations for various model types |
echarlaix
left a comment
Thanks for the PR @nikita-savelyevv!!
```python
from ...intel.openvino.configuration import _GPTOSSQuantizationConfig
from ...intel.openvino.utils import TemporaryDirectory
```

if possible would replace with absolute imports

```suggestion
from optimum.intel.openvino.configuration import _GPTOSSQuantizationConfig
from optimum.intel.openvino.utils import TemporaryDirectory
```
```python
from optimum.intel.openvino.quantization import _weight_only_quantization
from ...intel.openvino.configuration import _GPTOSSQuantizationConfig
```

why not ?

```suggestion
from optimum.intel.openvino.configuration import _GPTOSSQuantizationConfig
```
```python
else:
    try:
        model_cls_name = _HEAD_TO_AUTOMODELS[task.replace("-with-past", "")]
        if model_cls_name == "OVModelForFeatureExtraction" and library_name == "sentence_transformers":
```

wouldn't

```suggestion
if library_name == "sentence_transformers":
```

be equivalent ?
```python
# Step 3. Apply quantization
with TemporaryDirectory() as tmpdir:
    # Save quantized model to a temporary directory to avoid conflicts when reading and writing from the same directory
    model._apply_quantization(
```

why not give the quantization_config when calling from_pretrained above instead ?
```python
# TODO: Remove GPT-OSS workaround when possible
quantization_config = None if ov_config is None else ov_config.quantization_config
is_generic_quantization = quantization_config and not isinstance(quantization_config, _GPTOSSQuantizationConfig)
if is_generic_quantization:
```

here the difference is that quantization will be applied after loading the model with an OVModel class, is this correct? Can you add a comment to specify it? The current workflow is still a bit confusing to me, especially for the case where we use OVModelForXxx to export and quantize a model: does it mean that we will create a new instance of OVModelForXxx in `main_export`?

```python
main_export(
```
```python
@staticmethod
def _apply_quantization(
    model: "OVBaseModel",
```

```python
def _preprocess_quantization_config(
```

wondering if we should replace with a more explicit name like `_set_quantization_config_processor` (or do you think that other steps/modifications not related to this will be added in the future?)
What does this PR do?
Changes
This PR refactors the way quantization is applied during model export via the CLI.
- Quantization application logic is moved from `optimum/commands/export/openvino.py` to `optimum/exporters/openvino/__main__.py`. This serves both practical (the final `task` value is available only in `__main__.py`) and style (ticket 176928) reasons. When a quantization config is explicitly provided (i.e. via the `--weight-format` or `--quant-mode` options), the export flow now looks like this:
  1. The model is exported and loaded as the corresponding `OVModelFor*` class.
  2. Quantization is applied through `model._apply_quantization()` -- the entry point introduced in [OpenVINO] Refactor from_pretrained quantization #1520.
  3. The quantized model is saved to the `output` directory.
- `OVBaseModel._prepare_quantization_config()` is decoupled into:
  - `_resolve_default_quantization_config()` -- matches the default int4 config based on model id
  - `_quantization_config_from_dict()` -- called inside `_apply_quantization()`
  - `_preprocess_quantization_config()` -- applies model-specific updates to the config, such as setting tokenizer/processor.

Reason for changes
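The export flow described above can be illustrated with a minimal sketch. Here `DummyOVModel`, the `export_and_quantize` helper, and all signatures are assumptions for illustration, not the real optimum-intel API:

```python
import os
import shutil
import tempfile

class DummyOVModel:
    """Stand-in for an OVModelFor* class; the real API may differ."""

    def _apply_quantization(self, quantization_config, save_directory):
        # Pretend to quantize: write a single model file to save_directory.
        with open(os.path.join(save_directory, "openvino_model.xml"), "w") as f:
            f.write(str(quantization_config))

def export_and_quantize(model, quantization_config, output):
    # Quantize into a temporary directory first, so a failure
    # mid-quantization never leaves partial files in `output`.
    with tempfile.TemporaryDirectory() as tmpdir:
        model._apply_quantization(quantization_config, save_directory=tmpdir)
        # Move the fully quantized artifacts into `output` only on success.
        os.makedirs(output, exist_ok=True)
        for name in os.listdir(tmpdir):
            shutil.move(os.path.join(tmpdir, name), os.path.join(output, name))
```
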
Since such matching needs to be done during export via both the CLI and the API, for both data-free and data-aware quantization, and for every low-precision data type, some unification was needed. The problem with the current CLI approach is that the data-free and data-aware quantization paths differ: the data-free one is applied in `__main__.py::main_export()`, while data-aware quantization is applied directly in `openvino.py::run()` through a `from_pretrained` call. Because of this, it is complicated to match a default ignored scope for the data-free case. For an example of how default ignored scope matching will be implemented based on the changes in this PR, please take a look here.
- Having a self-contained `main_export()` call similar to optimum-onnx allows programmatic usage by third-party components like Olive (ticket 176928). With the changes in this PR, a significant amount of logic is moved inside `main_export`; only the quantization config creation part is left to be moved.

Potential drawbacks
- Memory usage during data-free weight-only quantization of multi-model pipelines increases, since quantized submodels are no longer serialized one at a time; see the memory profile for `stabilityai/stable-diffusion-3.5-large` exported with `--weight-format int8` below. Single-model pipelines are not affected since there is no "accumulation" effect. Among multi-model pipelines, VLMs are not affected because they are already quantized the "new" way at this moment.
In the future I believe it should be possible to improve this by introducing immediate serialization logic into `OVQuantizer.quantize()` when `save_directory` is provided. This will however require an API for mapping submodel names to their paths, similar to `OVBaseModel.ov_models`.
- Peak storage required during quantization is now `original_model_size + quantized_model_size`, while before, for data-free weight-only quantization, it was `original_model_size + largest_quantized_submodel_size`. The immediate serialization logic suggested above would help with this too, but again it requires some additional work. In general, I don't see this as a major blocker considering the fact that both drawbacks already hold for data-aware quantization as of this moment.
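As a toy illustration of the two peak-storage formulas above (all sizes are made-up numbers, not measurements):

```python
# Toy sizes in GB for a hypothetical three-submodel pipeline; the numbers
# exist purely to illustrate the two peak-storage formulas above.
original_model_size = 10.0
quantized_submodel_sizes = [3.0, 1.0, 0.5]

# After this PR: all quantized submodels exist alongside the original.
peak_after = original_model_size + sum(quantized_submodel_sizes)

# Before (data-free weight-only): only the largest quantized submodel
# coexists with the original at any point in time.
peak_before = original_model_size + max(quantized_submodel_sizes)

print(peak_after, peak_before)  # 14.5 13.0
```
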
Update
I was actually able to implement the aforementioned immediate serialization logic, and it demonstrates the same memory profile as the "Before" case in the figure above. We can proceed with the implementation of this approach after this PR is merged. Diff: nikita-savelyevv@9c260c9.
Related tickets
175336, 176390, 176928
Before submitting