Support weight-only quantization with quantized operators in intel-extension-for-transformers. #455
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Force-pushed from fbf9ddf to 9d03415
…tension-for-transformers
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
Force-pushed from ecaac6e to ed873c9
Hi @echarlaix, I rebased the code onto the main branch. Please review it, thanks very much!
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
Force-pushed from bee33e1 to de190fd
Hi @echarlaix, it seems the intel-extension-for-transformers build failed in the pre-CI test. Before executing the `python setup.py install` command, we should first install the dependency packages with `pip install -r requirements.txt` inside intel-extension-for-transformers. Could you review it? Thanks!
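For clarity, here is a minimal sketch of the intended build order, assuming the intel-extension-for-transformers repository has already been cloned and the commands run from its root directory:

```bash
# Sketch of the build sequence described above (assumption: we are in the
# root of a cloned intel-extension-for-transformers checkout).
pip install -r requirements.txt  # install dependency packages first
python setup.py install          # then build and install the package
```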
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
Thanks a lot for your work @PenghuiCheng
Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>
* [OV]: Fixed inference after 4-bit weight compression
* Fixed issue
* Update optimum/intel/openvino/modeling_decoder.py
* Applied comments
* Fixed issue when request is None
Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>

* Updated docs with load_in_4bit
* Update documentation
* typo
Co-authored-by: Ella Charlaix <ella@huggingface.co>

* fix compatibility for latest transformers release
* update setup
* fix test input size
* fix prepare generation for llama models

* deprecate compression options
* style
* fix configuration
* Update CLI argument
* update documentation
* deprecate torch nn modules for ov quantizer
* fix ov config for fp32 models
* fix format
* Add check for configuration
* fix ratio default value for SD models
* add quantization_config argument for OVModel
* remove commented line
* Update docs/source/inference.mdx
* add default config for causal LM
* fix warning message
Co-authored-by: Alexander Kozlov <alexander.kozlov@intel.com>
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>
Force-pushed from 2a905cc to 5ddd360
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
Looks great, thanks @PenghuiCheng!
Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
What does this PR do?
The intel-extension-for-transformers package implements weight-only quantization operators with the jblas kernel, so this PR integrates weight-only quantization through intel-extension-for-transformers.
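As an illustration of what this enables, below is a minimal sketch of weight-only quantization through the intel-extension-for-transformers API. The class and parameter names here (`WeightOnlyQuantConfig`, the example checkpoint) are assumptions based on the intel-extension-for-transformers interface of this period, not necessarily the exact API surface added by this PR; consult the merged documentation for the final interface.

```python
# Hedged sketch: loading a causal LM with weight-only quantization via
# intel-extension-for-transformers. Names are illustrative assumptions.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,   # drop-in replacement for the transformers class
    WeightOnlyQuantConfig,  # configuration for weight-only quantization
)

model_id = "facebook/opt-125m"  # illustrative checkpoint; any causal LM works

# The default config applies round-to-nearest (RTN) weight-only quantization;
# the quantized matmuls run on the jblas kernels mentioned above.
quant_config = WeightOnlyQuantConfig()

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,  # weights quantized, activations untouched
)

inputs = tokenizer("Weight-only quantization reduces model size by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```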
Before submitting