
[OV] Introduce --quant-mode cli argument enabling full quantization via optimum-cli #1061

Merged

Conversation

nikita-savelyevv
Collaborator

@nikita-savelyevv commented Dec 10, 2024

What does this PR do?

  • Introduce the --quant-mode export parameter. Currently the only supported value is int8, meaning int8 weights and activations. In the future it can be extended to other combinations, for example int8/fp8.
  • Introduce weight_format and activation_format fields to OVQuantizationConfig. Currently they are effectively unused, as int8/int8 is the only supported option.
  • Enable quantization of speech-to-text pipelines via optimum-cli. An example command is below.
optimum-cli export openvino -m openai/whisper-tiny.en --quant-mode int8 --dataset librispeech --num-samples 32 --smooth-quant-alpha 0.9 ./whisper-tiny-en
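
For reference, a minimal sketch of how the resulting export could be consumed afterwards (not part of this PR; it assumes the preprocessor files are saved alongside the model in `./whisper-tiny-en`):

```python
# Minimal usage sketch: load the int8-quantized export produced by the command above.
# Assumes the preprocessor/tokenizer files were exported to the same directory.
from optimum.intel import OVModelForSpeechSeq2Seq
from transformers import AutoProcessor, pipeline

model = OVModelForSpeechSeq2Seq.from_pretrained("./whisper-tiny-en")
processor = AutoProcessor.from_pretrained("./whisper-tiny-en")

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)
```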

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@nikita-savelyevv nikita-savelyevv marked this pull request as ready for review December 10, 2024 20:11
@@ -155,6 +163,12 @@ Models larger than 1 billion parameters are exported to the OpenVINO format with
</Tip>


Besides weight-only quantization, you can also apply full model quantization, including activations, by setting `--quant-mode` to `int8`. This will quantize both weights and activations of Linear, Convolution and some other layers to int8. Currently this is only supported for speech-to-text models. Please see the example below.
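The referenced example is presumably the same command as in the PR description:

optimum-cli export openvino -m openai/whisper-tiny.en --quant-mode int8 --dataset librispeech --num-samples 32 --smooth-quant-alpha 0.9 ./whisper-tiny-en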
Collaborator

@echarlaix, @IlyasMoutawwakil, the idea is to expose the capabilities of OVQuantizer in the optimum-cli.
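
For illustration, a rough sketch of the Python-side quantization flow that the new CLI option wraps (hypothetical; the field names mirror the CLI flags and may not match the exact OVQuantizationConfig signature):

```python
# Hypothetical sketch of the Python-side equivalent of the CLI command above.
# Field names mirror the CLI flags (--dataset, --num-samples, --smooth-quant-alpha)
# and may differ from the actual API.
from optimum.intel import OVModelForSpeechSeq2Seq, OVQuantizationConfig

q_config = OVQuantizationConfig(dataset="librispeech", num_samples=32, smooth_quant_alpha=0.9)
model = OVModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-tiny.en", export=True, quantization_config=q_config
)
model.save_pretrained("./whisper-tiny-en")
```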

@AlexKoff88
Collaborator

@echarlaix, @IlyasMoutawwakil, the PR is ready for your review

@nikita-savelyevv
Collaborator Author

@echarlaix @IlyasMoutawwakil could you please review this PR sometime this week? I'm on vacation starting next week. Thanks!

@AlexKoff88
Collaborator

@eaidova, can you please take a look as well?

Collaborator

@echarlaix left a comment

Looks good, thanks @nikita-savelyevv. I added a minor comment about an argument name (no worries, we can take care of this part in case you're away at that time; it won't block merging).

@@ -66,6 +67,10 @@ Optional arguments:
on your local machine arbitrary code present in the model repository.
--weight-format {fp32,fp16,int8,int4,mxfp4,nf4}
The weight format of the exported model.
--quant-mode {int8}
Collaborator

As mentioned this morning by @IlyasMoutawwakil, I'm also finding quant-mode not super clear. Any possibility to rename it to something like activation-format, or something even more suitable? @AlexKoff88, let me know if you think of something that could work (+ apologies, as it'll likely make you reiterate what you said this morning).

Collaborator

No problem, @echarlaix. The issue with activation-format is that it would give users too much freedom to mix weight and activation data types in combinations that we don't support, e.g. W-FP8/A-INT8, W-INT4/A-FP8, etc. We target --weight-format for weight-only quantization, which corresponds to the nncf.compress_weights() API, while --quant-mode controls the type of static full model quantization and corresponds to the nncf.quantize(..., mode=) parameter, e.g. INT8, FP8, etc.
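
In NNCF terms, the split described above looks roughly like this (illustrative sketch only; exact signatures may differ between NNCF versions):

```python
# Illustrative sketch of the weight-only vs. full quantization split described above.
# `model` is an already-loaded model and `calibration_dataset` an nncf.Dataset
# built from sample inputs; both are placeholders here.
import nncf

# --weight-format -> weight-only compression: weights are converted to a lower
# precision (e.g. int8/int4), activations stay in floating point.
compressed_model = nncf.compress_weights(model)

# --quant-mode -> full static quantization: weights and activations are quantized,
# which requires calibration data; the target precision is selected via the
# `mode` argument (the default corresponds to INT8).
quantized_model = nncf.quantize(model, calibration_dataset, mode=None)
```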

Collaborator

@AlexKoff88 Dec 20, 2024

BTW, we don't assume that --weight-format and --quant-mode are used together, only separately.

@AlexKoff88 AlexKoff88 merged commit ea6fa42 into huggingface:main Dec 20, 2024
22 checks passed
@AlexKoff88
Collaborator

Merging as we have a few approvals and no strong objections.
