
Conversation

Collaborator

@nikita-savelyevv commented Nov 19, 2025

What does this PR do?

Changes

This PR refactors how quantization is applied during model export via the CLI.

  1. Quantization application logic is moved from optimum/commands/export/openvino.py to optimum/exporters/openvino/__main__.py. This serves both a practical reason (the final task value is only available in __main__.py) and a style reason (ticket 176928). When a quantization config is explicitly provided (i.e. via the --weight-format or --quant-mode options), the export flow now looks like this:
    1. The original floating point model is exported to a temporary directory. This avoids confusion when quantization fails: no intermediate floating point model ends up at the target location.
    2. The model is loaded from the temporary directory using an OVModelFor* class.
    3. Quantization is run via model._apply_quantization() -- the entry point introduced in [OpenVINO] Refactor from_pretrained quantization #1520.
    4. The quantized model is saved to another temporary directory, to avoid loading from and saving to the same location.
    5. The loaded model object is unloaded and the quantized model's files are moved to the floating point model's temporary directory.
    6. The final model's files are moved to the target output directory.
  2. There are also additional changes to OVBaseModel: OVBaseModel._prepare_quantization_config() is decoupled into
    1. _resolve_default_quantization_config() -- matches a default int4 config based on the model id
    2. _quantization_config_from_dict() -- called inside _apply_quantization()
    3. _preprocess_quantization_config() -- applies model-specific updates to the config, such as setting a tokenizer/processor.
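
The six-step export flow above can be sketched in pure Python as follows. This is a minimal sketch: the helper names (export_fp_model, load_model, quantize_and_save, export_with_quantization) are hypothetical stand-ins for the real machinery in __main__.py, which uses OVModelFor* classes and model._apply_quantization().

```python
import shutil
import tempfile
from pathlib import Path


# Hypothetical stand-ins for the real export and quantization machinery.
def export_fp_model(model_id: str, dst: str) -> None:
    (Path(dst) / "openvino_model.xml").write_text(f"fp32 model of {model_id}")


def load_model(src: str) -> str:
    return (Path(src) / "openvino_model.xml").read_text()


def quantize_and_save(model: str, dst: str) -> None:
    (Path(dst) / "openvino_model.xml").write_text(model.replace("fp32", "int8"))


def export_with_quantization(model_id: str, output_dir: str) -> None:
    output = Path(output_dir)
    with tempfile.TemporaryDirectory() as fp_dir:
        # Step 1: export the floating point model to a temporary directory so
        # a failed quantization never leaves an intermediate model at the target.
        export_fp_model(model_id, fp_dir)
        # Steps 2-3: load the model back and quantize it.
        model = load_model(fp_dir)
        with tempfile.TemporaryDirectory() as q_dir:
            # Step 4: save the quantized model to a second temporary directory
            # to avoid reading and writing the same location at once.
            quantize_and_save(model, q_dir)
            # Step 5: unload the model and overwrite the fp files with the
            # quantized ones.
            del model
            for f in Path(q_dir).iterdir():
                target = Path(fp_dir) / f.name
                if target.exists():
                    target.unlink()
                shutil.move(str(f), str(target))
        # Step 6: move the final files to the target output directory.
        output.mkdir(parents=True, exist_ok=True)
        for f in Path(fp_dir).iterdir():
            shutil.move(str(f), str(output / f.name))
```

Both temporary directories are cleaned up automatically, so on failure the target location stays untouched.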

Reason for changes

  1. This logic is required for implementing default ignored scope (ticket 175336) and default dynamic quantization group size (ticket 176390) matching based on model id.
    Since such matching needs to happen during export via both the CLI and the API, for both data-free and data-aware quantization, and for every low-precision data type, some unification was needed. The problem with the current CLI approach is that the data-free and data-aware quantization paths differ: data-free quantization is applied in __main__.py::main_export(), while data-aware quantization is applied directly in openvino.py::run() through a from_pretrained call. Because of this, matching a default ignored scope for the data-free case is complicated.
    For an example of how default ignored scope matching will be implemented on top of the changes in this PR, please take a look here.
  2. There is a request to encapsulate the OpenVINO export path behind a single main_export() call, similar to optimum-onnx, to allow programmatic usage by third-party components like Olive (ticket 176928). With the changes in this PR, a significant amount of logic is moved inside main_export; only the quantization config creation part remains to be moved.
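
For illustration, the model-id based matching of defaults mentioned in point 1 can be sketched like this. The table contents and the model id are made up, and resolve_default_quantization_config is only a hypothetical stand-in for _resolve_default_quantization_config():

```python
# Hypothetical defaults table keyed by model id; in optimum-intel the real
# per-model int4 defaults live in the configuration module.
_DEFAULT_INT4_CONFIGS = {
    "example-org/example-7b": {"bits": 4, "group_size": 128, "ratio": 0.8},
}


def resolve_default_quantization_config(model_id: str, weight_format: str) -> dict:
    """Return a model-specific default config when one is registered,
    otherwise a generic config for the requested weight format."""
    if weight_format == "int4" and model_id in _DEFAULT_INT4_CONFIGS:
        return dict(_DEFAULT_INT4_CONFIGS[model_id])
    return {"bits": 4 if weight_format == "int4" else 8}
```

With a single resolution point like this, both the CLI and the API paths (and both data-free and data-aware quantization) pick up the same per-model defaults.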

Potential drawbacks

  1. Peak RAM may increase during model export with explicit data-free weight-only quantization of multimodal pipelines (except VLMs). Before, most of these pipelines were quantized per submodel here, i.e. each quantized submodel was written to disk right away. Now all quantized submodels have to be gathered in memory before being written to disk. That said, peak RAM is most commonly reached during model export, not quantization. See for example the figure for stabilityai/stable-diffusion-3.5-large exported with --weight-format int8 below.
(Figure: peak RAM profiles for stabilityai/stable-diffusion-3.5-large, before (sd_3_5_large_main) vs. after (sd_3_5_large_branch).)

Single-model pipelines are not affected since there is no "accumulation" effect. Among multi-model pipelines, VLMs are not affected because they are already quantized the "new" way.

In the future it should be possible to improve this by introducing immediate serialization logic in OVQuantizer.quantize() when save_directory is provided. However, this will require an API for mapping submodel names to their paths, similar to OVBaseModel.ov_models.
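
The immediate-serialization idea can be sketched as follows. This is a pure-Python stand-in, not the actual OVQuantizer.quantize() implementation; the save_directory behavior and the name-to-path mapping are the assumptions being illustrated:

```python
from pathlib import Path
from typing import Optional


def quantize_pipeline(submodels: dict, save_directory: Optional[str] = None) -> dict:
    """Quantize submodels one by one. When save_directory is given, each
    quantized submodel is serialized immediately and dropped, so at most one
    quantized submodel is held in memory at a time; otherwise all of them
    accumulate in memory, as in the flow described above."""
    quantized = {}
    for name, model in submodels.items():
        q = f"int8({model})"  # stand-in for actual submodel quantization
        if save_directory is not None:
            # Immediate serialization: requires a mapping from submodel name
            # to file path, similar to OVBaseModel.ov_models.
            (Path(save_directory) / f"{name}.xml").write_text(q)
        else:
            quantized[name] = q
    return quantized
```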

  2. The same behavior affects the peak free disk space needed during export. Because we save to a temporary directory first, the maximum disk footprint is original_model_size + quantized_model_size, while before, for data-free weight-only quantization, it was original_model_size + largest_quantized_submodel_size.

The immediate serialization logic suggested above would help here too, but again it requires some additional work. In general, I don't see this as a major blocker, considering that both drawbacks already hold for data-aware quantization as of this moment.
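
The disk-footprint difference can be made concrete with illustrative numbers (made up, not measurements):

```python
# Illustrative sizes in GB for a three-submodel pipeline.
original_total = 16.0                  # exported floating point model
quantized_submodels = [2.0, 4.0, 2.0]  # quantized sizes of the submodels

# New flow: the full quantized copy coexists with the original in temp dirs.
peak_new = original_total + sum(quantized_submodels)
# Previous data-free flow: each quantized submodel was written right away, so
# only the largest one coexisted with the original files.
peak_old = original_total + max(quantized_submodels)
print(peak_new, peak_old)  # 24.0 20.0
```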

Update

I was actually able to implement the aforementioned immediate serialization logic, and it demonstrates the same memory profile as the "Before" case in the figure above. We can proceed with this approach after this PR is merged. Diff: nikita-savelyevv@9c260c9.

Related tickets

175336, 176390, 176928

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@nikita-savelyevv force-pushed the ns/refactor-cli-quantization branch from aa0e0c2 to 3639a3f on November 25, 2025 09:05
@nikita-savelyevv force-pushed the ns/refactor-cli-quantization branch from d45807f to 6c0660a on November 25, 2025 21:06
@nikita-savelyevv marked this pull request as ready for review on November 26, 2025 16:28
Collaborator Author

nikita-savelyevv commented Nov 26, 2025

Hi @echarlaix @helena-intel @IlyasMoutawwakil @rkazants . Please take a look at this PR. The diff is quite big, but I've prepared an extensive PR description to hopefully ease the review process. Thanks!


Copilot AI left a comment


Pull request overview

This PR refactors the CLI quantization flow for OpenVINO model exports by moving quantization logic from the command layer to the export layer. The changes enable better default quantization config matching based on model IDs and simplify future enhancements.

Key Changes:

  • Quantization application moved from optimum/commands/export/openvino.py to optimum/exporters/openvino/__main__.py
  • OVBaseModel._prepare_quantization_config() decoupled into three methods: _resolve_default_quantization_config(), _quantization_config_from_dict(), and _preprocess_quantization_config()
  • Quantization now uses temporary directories to avoid partial output on failure

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Summary per file:

  • optimum/intel/openvino/modeling_base.py -- Refactored quantization config preparation and application methods
  • optimum/exporters/openvino/__main__.py -- Added generic quantization application logic and helper functions
  • optimum/commands/export/openvino.py -- Removed data-aware quantization handling logic
  • optimum/intel/openvino/quantization.py -- Updated to save OV config via model object
  • tests/openvino/test_exporters_cli.py -- Updated task names and added test skip logic
  • tests/openvino/test_quantization.py -- Updated stack frame depth for assertion
  • optimum/intel/openvino/modeling_*.py -- Added _preprocess_quantization_config() implementations for various model types


Collaborator

@echarlaix echarlaix left a comment


Thanks for the PR @nikita-savelyevv!!

Comment on lines +204 to +205
from ...intel.openvino.configuration import _GPTOSSQuantizationConfig
from ...intel.openvino.utils import TemporaryDirectory

if possible would replace with absolute imports

Suggested change:
- from ...intel.openvino.configuration import _GPTOSSQuantizationConfig
- from ...intel.openvino.utils import TemporaryDirectory
+ from optimum.intel.openvino.configuration import _GPTOSSQuantizationConfig
+ from optimum.intel.openvino.utils import TemporaryDirectory


from optimum.intel.openvino.quantization import _weight_only_quantization

from ...intel.openvino.configuration import _GPTOSSQuantizationConfig

why not ?

Suggested change:
- from ...intel.openvino.configuration import _GPTOSSQuantizationConfig
+ from optimum.intel.openvino.configuration import _GPTOSSQuantizationConfig

else:
try:
model_cls_name = _HEAD_TO_AUTOMODELS[task.replace("-with-past", "")]
if model_cls_name == "OVModelForFeatureExtraction" and library_name == "sentence_transformers":

wouldn't

Suggested change:
- if model_cls_name == "OVModelForFeatureExtraction" and library_name == "sentence_transformers":
+ if library_name == "sentence_transformers":

be equivalent ?

# Step 3. Apply quantization
with TemporaryDirectory() as tmpdir:
# Save quantized model to a temporary directory to avoid conflicts when reading and writing from the same directory
model._apply_quantization(

why not give the quantization_config when calling from_pretrained above instead ?

# TODO: Remove GPT-OSS workaround when possible
quantization_config = None if ov_config is None else ov_config.quantization_config
is_generic_quantization = quantization_config and not isinstance(quantization_config, _GPTOSSQuantizationConfig)
if is_generic_quantization:

here the difference is that quantization will be applied after loading the model with an OVModel class, is this correct? Can you add a comment to specify it? The current workflow is still a bit confusing to me, especially for the case where we use OVModelForXxx to export and quantize a model: does it mean that we will create a new instance of OVModelForXxx in main_export?

@staticmethod
def _apply_quantization(
model: "OVBaseModel",
def _preprocess_quantization_config(

wondering if we should replace with more explicit name like _set_quantization_config_processor (or do you think that other steps/modifications not related to this will be added in the future?)
