
[OV] Introduce --quant-mode cli argument enabling full quantization via optimum-cli #1061

Merged

Conversation

nikita-savelyevv
Collaborator

@nikita-savelyevv commented Dec 10, 2024

What does this PR do?

  • Introduce the --quant-mode export parameter. Currently the only supported value is int8, meaning int8 weights and activations. In the future it can be extended to other combinations, for example int8/fp8.
  • Introduce weight_format and activation_format fields to OVQuantizationConfig. Currently they are effectively unused, as int8/int8 is the only supported option.
  • Enable quantization of speech-to-text pipelines via optimum-cli. An example command is below.
optimum-cli export openvino -m openai/whisper-tiny.en --quant-mode int8 --dataset librispeech --num-samples 32 --smooth-quant-alpha 0.9 ./whisper-tiny-en
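
For reference, a minimal sketch of how the resulting export could be consumed afterwards (not part of this PR; it assumes the preprocessor files are saved alongside the model in `./whisper-tiny-en`):

```python
# Minimal usage sketch: load the int8-quantized export produced by the command above.
# Assumes the preprocessor/tokenizer files were exported to the same directory.
from optimum.intel import OVModelForSpeechSeq2Seq
from transformers import AutoProcessor, pipeline

model = OVModelForSpeechSeq2Seq.from_pretrained("./whisper-tiny-en")
processor = AutoProcessor.from_pretrained("./whisper-tiny-en")

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)
```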

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@nikita-savelyevv nikita-savelyevv marked this pull request as ready for review December 10, 2024 20:11
@@ -155,6 +163,12 @@ Models larger than 1 billion parameters are exported to the OpenVINO format with
</Tip>


Besides weight-only quantization, you can also apply full model quantization, including activations, by setting `--quant-mode` to `int8`. This will quantize both weights and activations of Linear, Convolution and some other layers to int8. Currently this is only supported for speech-to-text models. Please see the example below.
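The referenced example is presumably the same command as in the PR description:

optimum-cli export openvino -m openai/whisper-tiny.en --quant-mode int8 --dataset librispeech --num-samples 32 --smooth-quant-alpha 0.9 ./whisper-tiny-en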
Collaborator

@echarlaix, @IlyasMoutawwakil, the idea is to expose the capabilities of OVQuantizer in the optimum-cli.
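
For illustration, a rough sketch of the Python-side quantization flow that the new CLI option wraps (hypothetical; the field names mirror the CLI flags and may not match the exact OVQuantizationConfig signature):

```python
# Hypothetical sketch of the Python-side equivalent of the CLI command above.
# Field names mirror the CLI flags (--dataset, --num-samples, --smooth-quant-alpha)
# and may differ from the actual API.
from optimum.intel import OVModelForSpeechSeq2Seq, OVQuantizationConfig

q_config = OVQuantizationConfig(dataset="librispeech", num_samples=32, smooth_quant_alpha=0.9)
model = OVModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-tiny.en", export=True, quantization_config=q_config
)
model.save_pretrained("./whisper-tiny-en")
```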

@AlexKoff88
Collaborator

@echarlaix, @IlyasMoutawwakil, the PR is ready for your review

@nikita-savelyevv
Collaborator Author

@echarlaix @IlyasMoutawwakil could you please review this PR sometime this week? I'm on vacation starting next week. Thanks!

@AlexKoff88
Collaborator

@eaidova, can you please take a look as well?

Collaborator

@echarlaix left a comment

Looks good, thanks @nikita-savelyevv. I added a minor comment about an argument name (no worries, we can take care of this part in case you're away at that time; it won't block merging).

@@ -66,6 +67,10 @@ Optional arguments:
on your local machine arbitrary code present in the model repository.
--weight-format {fp32,fp16,int8,int4,mxfp4,nf4}
The weight format of the exported model.
--quant-mode {int8}
Collaborator

As mentioned this morning by @IlyasMoutawwakil, I'm also finding quant-mode not super clear. Any possibility to rename it to something like activation-format, or something even more suitable? @AlexKoff88, let me know if you think of something that could work (+ apologies, as it'll likely make you reiterate what you said this morning).

Collaborator

No problem, @echarlaix. The issue with activation-format is that it would give users too much freedom to mix weight and activation data types in combinations that we don't support, e.g. W-FP8/A-INT8, W-INT4/A-FP8, etc. We target --weight-format for weight-only quantization, which corresponds to the nncf.compress_weights() API, while --quant-mode controls the type of static full model quantization and corresponds to the nncf.quantize(..., mode=) parameter, e.g. INT8, FP8, etc.
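
In NNCF terms, the split described above looks roughly like this (illustrative sketch only; exact signatures may differ between NNCF versions):

```python
# Illustrative sketch of the weight-only vs. full quantization split described above.
# `model` is an already-loaded model and `calibration_dataset` an nncf.Dataset
# built from sample inputs; both are placeholders here.
import nncf

# --weight-format -> weight-only compression: weights are converted to a lower
# precision (e.g. int8/int4), activations stay in floating point.
compressed_model = nncf.compress_weights(model)

# --quant-mode -> full static quantization: weights and activations are quantized,
# which requires calibration data; the target precision is selected via the
# `mode` argument (the default corresponds to INT8).
quantized_model = nncf.quantize(model, calibration_dataset, mode=None)
```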

Collaborator

@AlexKoff88 Dec 20, 2024

BTW, we don't assume that --weight-format and --quant-mode are used together, only separately.

@AlexKoff88 AlexKoff88 merged commit ea6fa42 into huggingface:main Dec 20, 2024
22 checks passed
@AlexKoff88
Collaborator

Merging as we have a few approvals and no strong objections.
