
Conversation

Collaborator

@nikita-savelyevv commented Nov 19, 2025

What does this PR do?

Changes

This PR refactors how quantization is applied during model export via the CLI.

  1. Quantization application logic is moved from optimum/commands/export/openvino.py to optimum/exporters/openvino/__main__.py. This serves both a practical reason (the final task value is only available in __main__.py) and a style reason (ticket 176928). When a quantization config is explicitly provided (i.e. via the --weight-format or --quant-mode options), the export flow now looks like this:
    1. The original floating point model is exported to a temporary directory. This avoids confusion when quantization fails: no intermediate floating point model ends up at the target location.
    2. The model is loaded from the temporary directory using an OVModelFor* class.
    3. Quantization is run via model._apply_quantization() -- the entry point introduced in [OpenVINO] Refactor from_pretrained quantization #1520.
    4. The quantized model is saved to another temporary directory, to avoid loading from and saving to the same location.
    5. The loaded model object is unloaded and the quantized model's files are moved to the floating point model's temporary directory.
    6. The final model's files are moved to the target output directory.
  2. There are also additional changes to OVBaseModel: OVBaseModel._prepare_quantization_config() is decoupled into
    1. _resolve_default_quantization_config() -- matches a default int4 config based on the model id
    2. _quantization_config_from_dict() -- called inside _apply_quantization()
    3. _preprocess_quantization_config() -- applies model-specific updates to the config, such as setting a tokenizer/processor.
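
The six-step export flow above can be sketched in pure Python as follows. This is a minimal sketch: the helper names (export_fp_model, load_model, quantize_and_save, export_with_quantization) are hypothetical stand-ins for the real machinery in __main__.py, which uses OVModelFor* classes and model._apply_quantization().

```python
import shutil
import tempfile
from pathlib import Path


# Hypothetical stand-ins for the real export and quantization machinery.
def export_fp_model(model_id: str, dst: str) -> None:
    (Path(dst) / "openvino_model.xml").write_text(f"fp32 model of {model_id}")


def load_model(src: str) -> str:
    return (Path(src) / "openvino_model.xml").read_text()


def quantize_and_save(model: str, dst: str) -> None:
    (Path(dst) / "openvino_model.xml").write_text(model.replace("fp32", "int8"))


def export_with_quantization(model_id: str, output_dir: str) -> None:
    output = Path(output_dir)
    with tempfile.TemporaryDirectory() as fp_dir:
        # Step 1: export the floating point model to a temporary directory so
        # a failed quantization never leaves an intermediate model at the target.
        export_fp_model(model_id, fp_dir)
        # Steps 2-3: load the model back and quantize it.
        model = load_model(fp_dir)
        with tempfile.TemporaryDirectory() as q_dir:
            # Step 4: save the quantized model to a second temporary directory
            # to avoid reading and writing the same location at once.
            quantize_and_save(model, q_dir)
            # Step 5: unload the model and overwrite the fp files with the
            # quantized ones.
            del model
            for f in Path(q_dir).iterdir():
                target = Path(fp_dir) / f.name
                if target.exists():
                    target.unlink()
                shutil.move(str(f), str(target))
        # Step 6: move the final files to the target output directory.
        output.mkdir(parents=True, exist_ok=True)
        for f in Path(fp_dir).iterdir():
            shutil.move(str(f), str(output / f.name))
```

Both temporary directories are cleaned up automatically, so on failure the target location stays untouched.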

Reason for changes

  1. This logic is required for implementing default ignored scope (ticket 175336) and default dynamic quantization group size (ticket 176390) matching based on model id.
    Since such matching needs to happen during export via both the CLI and the API, for both data-free and data-aware quantization, and for every low-precision data type, some unification was needed. The problem with the current CLI approach is that the data-free and data-aware quantization paths differ: data-free quantization is applied in __main__.py::main_export(), while data-aware quantization is applied directly in openvino.py::run() through a from_pretrained call. Because of this, matching a default ignored scope for the data-free case is complicated.
    For an example of how default ignored scope matching will be implemented on top of the changes in this PR, please take a look here.
  2. There is a request to encapsulate the OpenVINO export path behind a single main_export() call, similar to optimum-onnx, to allow programmatic usage by third-party components like Olive (ticket 176928). With the changes in this PR, a significant amount of logic is moved inside main_export; only the quantization config creation part remains to be moved.
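
For illustration, the model-id based matching of defaults mentioned in point 1 can be sketched like this. The table contents and the model id are made up, and resolve_default_quantization_config is only a hypothetical stand-in for _resolve_default_quantization_config():

```python
# Hypothetical defaults table keyed by model id; in optimum-intel the real
# per-model int4 defaults live in the configuration module.
_DEFAULT_INT4_CONFIGS = {
    "example-org/example-7b": {"bits": 4, "group_size": 128, "ratio": 0.8},
}


def resolve_default_quantization_config(model_id: str, weight_format: str) -> dict:
    """Return a model-specific default config when one is registered,
    otherwise a generic config for the requested weight format."""
    if weight_format == "int4" and model_id in _DEFAULT_INT4_CONFIGS:
        return dict(_DEFAULT_INT4_CONFIGS[model_id])
    return {"bits": 4 if weight_format == "int4" else 8}
```

With a single resolution point like this, both the CLI and the API paths (and both data-free and data-aware quantization) pick up the same per-model defaults.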

Potential drawbacks

  1. Peak RAM may increase during model export with explicit data-free weight-only quantization of multimodal pipelines (except VLMs). Before, most of these pipelines were quantized per submodel here, i.e. each quantized submodel was written to disk right away. Now all quantized submodels have to be gathered in memory before being written to disk. That said, peak RAM is most commonly reached during model export, not quantization. See for example the figure for stabilityai/stable-diffusion-3.5-large exported with --weight-format int8 below.
(Figure: peak RAM profiles for stabilityai/stable-diffusion-3.5-large, before (sd_3_5_large_main) vs. after (sd_3_5_large_branch).)

Single-model pipelines are not affected since there is no "accumulation" effect. Among multi-model pipelines, VLMs are not affected because they are already quantized the "new" way.

In the future it should be possible to improve this by introducing immediate serialization logic in OVQuantizer.quantize() when save_directory is provided. However, this will require an API for mapping submodel names to their paths, similar to OVBaseModel.ov_models.
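
The immediate-serialization idea can be sketched as follows. This is a pure-Python stand-in, not the actual OVQuantizer.quantize() implementation; the save_directory behavior and the name-to-path mapping are the assumptions being illustrated:

```python
from pathlib import Path
from typing import Optional


def quantize_pipeline(submodels: dict, save_directory: Optional[str] = None) -> dict:
    """Quantize submodels one by one. When save_directory is given, each
    quantized submodel is serialized immediately and dropped, so at most one
    quantized submodel is held in memory at a time; otherwise all of them
    accumulate in memory, as in the flow described above."""
    quantized = {}
    for name, model in submodels.items():
        q = f"int8({model})"  # stand-in for actual submodel quantization
        if save_directory is not None:
            # Immediate serialization: requires a mapping from submodel name
            # to file path, similar to OVBaseModel.ov_models.
            (Path(save_directory) / f"{name}.xml").write_text(q)
        else:
            quantized[name] = q
    return quantized
```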

  2. The same behavior affects the peak free disk space needed during export. Because we save to a temporary directory first, the maximum disk footprint is original_model_size + quantized_model_size, while before, for data-free weight-only quantization, it was original_model_size + largest_quantized_submodel_size.

The immediate serialization logic suggested above would help here too, but again it requires some additional work. In general, I don't see this as a major blocker, considering that both drawbacks already hold for data-aware quantization as of this moment.
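
The disk-footprint difference can be made concrete with illustrative numbers (made up, not measurements):

```python
# Illustrative sizes in GB for a three-submodel pipeline.
original_total = 16.0                  # exported floating point model
quantized_submodels = [2.0, 4.0, 2.0]  # quantized sizes of the submodels

# New flow: the full quantized copy coexists with the original in temp dirs.
peak_new = original_total + sum(quantized_submodels)
# Previous data-free flow: each quantized submodel was written right away, so
# only the largest one coexisted with the original files.
peak_old = original_total + max(quantized_submodels)
print(peak_new, peak_old)  # 24.0 20.0
```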

Update

I was actually able to implement the aforementioned immediate serialization logic, and it demonstrates the same memory profile as the "Before" case in the figure above. We can proceed with this approach after this PR is merged. Diff: nikita-savelyevv@9c260c9.

Related tickets

175336, 176390, 176928

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@nikita-savelyevv force-pushed the ns/refactor-cli-quantization branch from aa0e0c2 to 3639a3f on November 25, 2025 09:05
@nikita-savelyevv force-pushed the ns/refactor-cli-quantization branch from d45807f to 6c0660a on November 25, 2025 21:06
@nikita-savelyevv marked this pull request as ready for review on November 26, 2025 16:28
Collaborator Author

nikita-savelyevv commented Nov 26, 2025

Hi @echarlaix @helena-intel @IlyasMoutawwakil @rkazants . Please take a look at this PR. The diff is quite big, but I've prepared an extensive PR description to hopefully ease the review process. Thanks!


Copilot AI left a comment


Pull request overview

This PR refactors the CLI quantization flow for OpenVINO model exports by moving quantization logic from the command layer to the export layer. The changes enable better default quantization config matching based on model IDs and simplify future enhancements.

Key Changes:

  • Quantization application moved from optimum/commands/export/openvino.py to optimum/exporters/openvino/__main__.py
  • OVBaseModel._prepare_quantization_config() decoupled into three methods: _resolve_default_quantization_config(), _quantization_config_from_dict(), and _preprocess_quantization_config()
  • Quantization now uses temporary directories to avoid partial output on failure

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Summary per file:

  • optimum/intel/openvino/modeling_base.py -- Refactored quantization config preparation and application methods
  • optimum/exporters/openvino/__main__.py -- Added generic quantization application logic and helper functions
  • optimum/commands/export/openvino.py -- Removed data-aware quantization handling logic
  • optimum/intel/openvino/quantization.py -- Updated to save OV config via model object
  • tests/openvino/test_exporters_cli.py -- Updated task names and added test skip logic
  • tests/openvino/test_quantization.py -- Updated stack frame depth for assertion
  • optimum/intel/openvino/modeling_*.py -- Added _preprocess_quantization_config() implementations for various model types


Collaborator

@echarlaix echarlaix left a comment


Thanks for the PR @nikita-savelyevv!!

Comment on lines +204 to +205
from ...intel.openvino.configuration import _GPTOSSQuantizationConfig
from ...intel.openvino.utils import TemporaryDirectory

if possible would replace with absolute imports

Suggested change:
- from ...intel.openvino.configuration import _GPTOSSQuantizationConfig
- from ...intel.openvino.utils import TemporaryDirectory
+ from optimum.intel.openvino.configuration import _GPTOSSQuantizationConfig
+ from optimum.intel.openvino.utils import TemporaryDirectory


from optimum.intel.openvino.quantization import _weight_only_quantization

from ...intel.openvino.configuration import _GPTOSSQuantizationConfig

why not ?

Suggested change:
- from ...intel.openvino.configuration import _GPTOSSQuantizationConfig
+ from optimum.intel.openvino.configuration import _GPTOSSQuantizationConfig

else:
try:
model_cls_name = _HEAD_TO_AUTOMODELS[task.replace("-with-past", "")]
if model_cls_name == "OVModelForFeatureExtraction" and library_name == "sentence_transformers":

wouldn't

Suggested change:
- if model_cls_name == "OVModelForFeatureExtraction" and library_name == "sentence_transformers":
+ if library_name == "sentence_transformers":

be equivalent ?

# Step 3. Apply quantization
with TemporaryDirectory() as tmpdir:
# Save quantized model to a temporary directory to avoid conflicts when reading and writing from the same directory
model._apply_quantization(

why not give the quantization_config when calling from_pretrained above instead ?

# TODO: Remove GPT-OSS workaround when possible
quantization_config = None if ov_config is None else ov_config.quantization_config
is_generic_quantization = quantization_config and not isinstance(quantization_config, _GPTOSSQuantizationConfig)
if is_generic_quantization:

here the difference is that quantization will be applied after loading the model with an OVModel class, is this correct? Can you add a comment to specify it? The current workflow is still a bit confusing to me, especially for the case where we use OVModelForXxx to export and quantize a model: does it mean that we will create a new instance of OVModelForXxx in main_export?

@staticmethod
def _apply_quantization(
model: "OVBaseModel",
def _preprocess_quantization_config(

wondering if we should replace with more explicit name like _set_quantization_config_processor (or do you think that other steps/modifications not related to this will be added in the future?)
