[OV] Introduce --quant-mode cli argument enabling full quantization via optimum-cli #1061
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
docs/source/openvino/export.mdx
Outdated
@@ -155,6 +163,12 @@ Models larger than 1 billion parameters are exported to the OpenVINO format with
</Tip>

Besides weight-only quantization, you can also apply full model quantization, including activations, by setting `--quant-mode` to `int8/int8`. This will quantize both the weights and activations of Linear, Convolutional and some other layers to int8. Currently this is only supported for speech-to-text models. Please see the example below.
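A hypothetical invocation for a speech-to-text model might look like the following (the model name, output directory, and the `--dataset`/`--num-samples` calibration flags are illustrative assumptions, not taken from this PR; note that the help text later in the thread lists `int8`, rather than `int8/int8`, as the supported value):

optimum-cli export openvino --model openai/whisper-tiny --quant-mode int8 \
    --dataset librispeech --num-samples 32 whisper-tiny-int8-ov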
@echarlaix, @IlyasMoutawwakil, the idea is to expose the capabilities of OVQuantizer in the optimum-cli.
@echarlaix, @IlyasMoutawwakil, the PR is ready for your review.
@echarlaix @IlyasMoutawwakil could you please review this PR some time this week? I'm on vacation starting next week. Thanks!
@eaidova, could you please take a look as well?
Looks good, thanks @nikita-savelyevv. I added a minor comment about an argument name (no worries, we can take care of this part in case you're away at the time; it won't block merging).
@@ -66,6 +67,10 @@ Optional arguments:
                        on your local machine arbitrary code present in the model repository.
  --weight-format {fp32,fp16,int8,int4,mxfp4,nf4}
                        The weight format of the exported model.
  --quant-mode {int8}
As mentioned this morning by @IlyasMoutawwakil, I'm also finding `quant-mode` not super clear. Any possibility to rename it to something like `activation-format`, or something even better suited? @AlexKoff88 let me know if you think of something that could work (+ apologies, as it'll likely make you reiterate what you said this morning).
No problem, @echarlaix. The issue with `activation-format` is that it would give users the freedom to mix weight and activation data types in combinations that we don't support, e.g. W-FP8/A-INT8, W-INT4/A-FP8, etc. We target `--weight-format` for weight-only quantization, and it corresponds to the `nncf.compress_weights()` API, while `--quant-mode` controls the type of static full-model quantization and corresponds to the `nncf.quantize(..., mode=)` parameter, e.g. `INT8`, `FP8`, etc.
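As a rough illustration of that mapping (a minimal sketch, assuming a recent NNCF release, an OpenVINO model `ov_model`, and an iterable `calibration_samples`; none of this code is from the PR):

import nncf

# --weight-format path: weight-only compression, no calibration data needed.
compressed_model = nncf.compress_weights(ov_model, mode=nncf.CompressWeightsMode.INT8_ASYM)

# --quant-mode path: static full-model quantization of weights and activations,
# which requires calibration data. The mode argument selects e.g. an FP8 flavor;
# for plain int8 quantization it is typically left at its default.
calibration_dataset = nncf.Dataset(calibration_samples)
quantized_model = nncf.quantize(ov_model, calibration_dataset, mode=nncf.QuantizationMode.FP8_E4M3)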
BTW, we don't assume that `--weight-format` and `--quant-mode` are used together, only separately.
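Concretely, that means two separate flows on the command line (model and output names here are placeholders, not from the PR):

# Weight-only quantization (weights to int8, activations untouched):
optimum-cli export openvino --model <model_id> --weight-format int8 <output_dir>

# Full static quantization (weights and activations to int8):
optimum-cli export openvino --model <model_id> --quant-mode int8 <output_dir>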
Merging as we have a few approvals and no strong objections.
What does this PR do?

- Introduce `--quant-mode` export parameter. Currently the only supported value is `int8` for int8 weights and activations. In the future it can be extended to other combinations, for example `int8/fp8`.
- Add `weight_format` and `activation_format` fields to `OVQuantizationConfig`. Currently they are not used, as `int8/int8` is the only supported option (see the sketch below).
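A sketch of what those new fields might look like in use (hypothetical keyword arguments inferred from the PR description, not verified against the final API):

from optimum.intel import OVQuantizationConfig

# Per the PR description, OVQuantizationConfig gains weight_format and
# activation_format fields; int8/int8 is the only supported combination so far.
quantization_config = OVQuantizationConfig(weight_format="int8", activation_format="int8")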
Before submitting