Support weight-only quantization with quantized operators in intel-extension-for-transformers. #455

Merged
69 commits
86d378f
Support weight-only quantization with quantized operators in intel-ex…
PenghuiCheng Oct 16, 2023
ca58fa5
Update code style
PenghuiCheng Oct 16, 2023
4837b2f
Update readme for weight-only quantization example
PenghuiCheng Oct 16, 2023
25b2664
Update code
PenghuiCheng Oct 19, 2023
a36584b
Adapt intel-extension-for-transformers 1.3 API change
PenghuiCheng Dec 19, 2023
9ebc5a9
Support weight-only quantization with quantized operators in intel-ex…
PenghuiCheng Oct 16, 2023
d0f1c71
Update code
PenghuiCheng Oct 19, 2023
ed873c9
rebase code on main branch
PenghuiCheng Jan 16, 2024
de190fd
Update example
PenghuiCheng Jan 17, 2024
59a1f81
merge from main branch
PenghuiCheng Feb 21, 2024
416b528
Update optimum/intel/neural_compressor/quantization.py
PenghuiCheng Mar 13, 2024
d62964a
[OV]: Fixed inference after 4 bit weight compression (#569)
AlexKoff88 Feb 21, 2024
65d5a97
Updated docs with load_in_4bit (#558)
AlexKoff88 Feb 21, 2024
bddd203
Update Transformers dependency requirements (#571)
echarlaix Feb 21, 2024
70a6373
Fix compatibility for latest transformers release (#570)
echarlaix Feb 22, 2024
1437b1b
Deprecate compression options (#565)
echarlaix Feb 27, 2024
32e8fa2
Add default quantization int4 config for Mixtral-8x7B (#576)
ljaljushkin Feb 28, 2024
e3f009b
Update stable diffusion example requirements (#579)
helena-intel Feb 29, 2024
6621611
Fix collecting duplicate tensors in quantization calibration dataset …
nikita-savelyevv Mar 1, 2024
ca33bed
Save an openvino config summarizing all information related to quanti…
echarlaix Mar 1, 2024
22bc3d0
Fix warning (#582)
echarlaix Mar 4, 2024
b56021e
Add reference to the temporary directory for windows fix (#581)
echarlaix Mar 4, 2024
8e68d38
Fix documentation (#583)
echarlaix Mar 4, 2024
26977f8
Add llama test model to cover MQA (#585)
jiqing-feng Mar 6, 2024
7516637
Include nncf in openvino extra (#586)
eaidova Mar 6, 2024
246c829
Fix title documentation (#588)
echarlaix Mar 6, 2024
4c481e6
Update OpenVINO documentation links in README.md (#587)
kblaszczak-intel Mar 7, 2024
126a581
Fix default int8 quantization for CLI (#592)
echarlaix Mar 7, 2024
67fad65
Change model output parameter to last_hidden_states for IPEXModel (#589)
jiqing-feng Mar 8, 2024
0bcffeb
Add IPEX model patcher (#567)
jiqing-feng Mar 8, 2024
52ae0a3
Updates weight quantization section in the docs (#593)
AlexKoff88 Mar 8, 2024
b751766
Remove accelerate and onnxruntime from required dependencies (#590)
echarlaix Mar 8, 2024
7674e33
Fix OpenVINO image classification examples (#598)
echarlaix Mar 11, 2024
1e73450
Fix weights compression for OPenVINO models (#596)
eaidova Mar 11, 2024
dc14a2b
Fix default ov config (#600)
echarlaix Mar 11, 2024
de243f0
Add warning for transformers>=4.38 and OpenVINO 2024.0 (#599)
helena-intel Mar 12, 2024
345f9e5
Add hybrid quantization for StableDiffusion pipelines (#584)
l-bat Mar 12, 2024
f68486b
Show device name in _print_compiled_model_properties (#541)
helena-intel Mar 12, 2024
00cd903
Update code with comments
PenghuiCheng Mar 13, 2024
6b95933
Fixed pylint error
PenghuiCheng Mar 13, 2024
7bb1827
Merge from main branch
PenghuiCheng Mar 13, 2024
5d90b52
Update optimum/intel/neural_compressor/configuration.py
PenghuiCheng Mar 13, 2024
e804df3
Fixed example and UT for weight-only quantization
PenghuiCheng Mar 13, 2024
82c27dd
Fixed pre-ci test error
PenghuiCheng Mar 13, 2024
3ca3f60
Fixed pre-ci test error
PenghuiCheng Mar 13, 2024
0cc7c00
Fixed UT and examples error
PenghuiCheng Mar 17, 2024
3d28d4a
Merge remote-tracking branch 'upstream/main' into penghuic/weight_onl…
PenghuiCheng Mar 17, 2024
9ec53ce
Fixed pre-CI error
PenghuiCheng Mar 17, 2024
66d45c2
Fixed UT error
PenghuiCheng Mar 18, 2024
4347cee
Update tests/openvino/test_modeling_basic.py
PenghuiCheng Mar 23, 2024
68d6e90
Update examples/neural_compressor/language-modeling/README.md
PenghuiCheng Mar 23, 2024
032b0ef
Update examples/neural_compressor/language-modeling/run_clm.py
PenghuiCheng Mar 23, 2024
6a6a97c
Update examples/neural_compressor/language-modeling/run_clm.py
PenghuiCheng Mar 23, 2024
8e90ac8
Update examples/neural_compressor/language-modeling/run_clm.py
PenghuiCheng Mar 23, 2024
88760bc
Update examples/neural_compressor/language-modeling/run_clm.py
PenghuiCheng Mar 23, 2024
f51266a
Update examples/neural_compressor/language-modeling/run_clm.py
PenghuiCheng Mar 23, 2024
f970272
Load weight-only quantized model with INCModelForCausalLM
PenghuiCheng Mar 24, 2024
e5558b0
Merge from main branch and update code style
PenghuiCheng Mar 24, 2024
5ddd360
Changed parameters name for GPTQ in example
PenghuiCheng Mar 25, 2024
721dd3b
Changed parameters order in INCQuantizer.quantize
PenghuiCheng Mar 25, 2024
ac9aee8
Fixed UT error
PenghuiCheng Mar 25, 2024
d7bd27e
Update examples/neural_compressor/text-generation/run_generation.py
PenghuiCheng Mar 26, 2024
19bdf0f
Update optimum/intel/neural_compressor/quantization.py
PenghuiCheng Mar 26, 2024
dd981df
Update optimum/intel/neural_compressor/quantization.py
PenghuiCheng Mar 26, 2024
94f1ac5
Merge remote-tracking branch 'upstream/main' into penghuic/weight_onl…
PenghuiCheng Mar 27, 2024
af07192
Update import message
PenghuiCheng Mar 27, 2024
9c24871
Limit intel-extension-for-transformers version
PenghuiCheng Mar 27, 2024
1331cdc
Limit torch version for weight-only quantization
PenghuiCheng Mar 27, 2024
638f516
Fixed doc building error
PenghuiCheng Mar 27, 2024
7 changes: 6 additions & 1 deletion .github/workflows/test_inc.yml
@@ -30,8 +30,13 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install cmake
pip install py-cpuinfo
pip install torch==2.1.0 torchaudio==2.1.0 torchvision==0.16 --extra-index-url https://download.pytorch.org/whl/cpu
pip install .[neural-compressor,diffusers,tests]
pip install intel-extension-for-pytorch
pip install intel-extension-for-pytorch==2.1.100
pip install intel-extension-for-transformers==1.3.2
pip install peft
- name: Test with Pytest
run: |
pytest tests/neural_compressor/
2 changes: 1 addition & 1 deletion examples/neural_compressor/language-modeling/README.md
@@ -97,4 +97,4 @@ respectively `dynamic`, `static`, `weight_only` or `aware_training`.

The flag `--verify_loading` can be passed along to verify that the resulting quantized model can be loaded correctly.

> **_Note:_** `weight_only` quantization_approach requires neural-compressor >= 2.3
> **_Note:_** `weight_only` quantization_approach requires `neural-compressor` >= 2.3 and `intel-extension-for-transformers` >= 1.3.
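
The version floor in the note above can be checked before running the example. The following is a minimal sketch, not part of this PR: the helper names `version_tuple`, `meets_minimum` and `check_weight_only_requirements` are hypothetical, and the comparison only looks at the leading numeric components of each version string.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '1.3.2' or '2.1.0+cpu' into a tuple of leading integers such as (1, 3, 2)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings numerically, padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_weight_only_requirements() -> list:
    """Return a list of problems; an empty list means both minimums are met."""
    # Minimums taken from the README note; the function itself is illustrative.
    minimums = {"neural-compressor": "2.3", "intel-extension-for-transformers": "1.3"}
    problems = []
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed (need >= {minimum})")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{package} {installed} is older than {minimum}")
    return problems
```

Calling `check_weight_only_requirements()` and printing the returned list before launching `run_clm.py` with `--quantization_approach weight_only` would surface a missing or outdated package early.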
2 changes: 2 additions & 0 deletions examples/neural_compressor/language-modeling/requirements.txt
@@ -3,3 +3,5 @@ torch >= 1.9
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
intel-extension-for-transformers >= 1.3
peft
95 changes: 66 additions & 29 deletions examples/neural_compressor/language-modeling/run_clm.py
@@ -57,6 +57,14 @@
from transformers.utils.versions import require_version

from optimum.intel.neural_compressor import INCModelForCausalLM, INCQuantizer, INCTrainer
from optimum.intel.utils.import_utils import (
INTEL_EXTENSION_FOR_TRANSFORMERS_IMPORT_ERROR,
is_intel_extension_for_transformers_available,
)


if is_intel_extension_for_transformers_available():
from intel_extension_for_transformers.transformers.utils.config import WeightOnlyQuantConfig


os.environ["CUDA_VISIBLE_DEVICES"] = ""
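
The hunk above guards an optional import behind an availability check. As a rough illustration of that pattern, and not the implementation in `optimum.intel.utils.import_utils`, a check of this kind could be sketched with the standard library alone. `is_package_available`, `require_package` and `MISSING_BACKEND_ERROR` are hypothetical names.

```python
import importlib.util


def is_package_available(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    # Intended for top-level names; dotted names would import parent packages.
    return importlib.util.find_spec(module_name) is not None


# Hypothetical message template, modelled on the `.format(...)` call in the script.
MISSING_BACKEND_ERROR = "{0} requires the `{1}` package, which was not found in this environment."


def require_package(module_name: str, feature: str) -> None:
    """Raise ImportError with an actionable message when an optional backend is absent."""
    if not is_package_available(module_name):
        raise ImportError(MISSING_BACKEND_ERROR.format(feature, module_name))
```

Keeping the check separate from the import lets the module load on machines without the optional dependency and defers the error to the point where the feature is actually requested.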
@@ -143,7 +151,9 @@ class OptimizationArguments:
)
quantization_approach: str = field(
default="dynamic",
metadata={"help": "Quantization approach. Supported approach are static, dynamic and aware_training."},
metadata={
"help": "Quantization approach. Supported approach are static, dynamic aware_training and weight_only."
},
)
smooth_quant: bool = field(
default=False,
@@ -196,9 +206,13 @@ class OptimizationArguments:
default=False,
metadata={"help": "Whether or not to verify the loading of the quantized model."},
)
bits: int = field(
default=8,
metadata={"help": "Bits for weight only quantization, 1-8 bits."},
bits: str = field(
default="4",
metadata={"help": "Bits number of weight for weight only quantization. 1~8 bits."},
)
weight_dtype: str = field(
default="int4_clip",
metadata={"help": "weight dtype for weight only quantization."},
)
group_size: int = field(
default=-1,
@@ -214,10 +228,29 @@ class OptimizationArguments:
)
quantization_methodology: str = field(
default="RTN",
metadata={"help": "Quantization methodology for weight only quantization. Choose from 'RTN' and 'GPTQ'."},
)
damp_percent: float = field(
default=0.01,
metadata={
"help": "Quantization methodology for weight only quantization. Choose from 'RTN', 'AWQ' and 'GPTQ'."
"help": "Percentage of Hessian's diagonal values average, which will be added to Hessian's diagonal to increase numerical stability, used for GPTQ quantization"
},
)
gptq_block_size: int = field(
default=128,
metadata={"help": "Block size. sub weight matrix size to run GPTQ."},
)
num_calibration_samples: int = field(
default=128, metadata={"help": "Number of examples to use for the GPTQ calibration step."}
)
use_max_length: bool = field(
default=False,
metadata={"help": "Set all sequence length to be same length of args.gptq_pad_max_length"},
)
pad_max_length: int = field(
default=2048,
metadata={"help": "Calibration dataset sequence max length, this should align with your model config"},
)


@dataclass
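
The `damp_percent` help text describes adding a fraction of the mean Hessian diagonal back onto the diagonal. A small worked sketch of that arithmetic follows, written in plain Python on nested lists. It follows the wording of the help text only; how neural-compressor applies damping internally is not shown in this diff, and `dampen_hessian` is a hypothetical name.

```python
def dampen_hessian(hessian, damp_percent=0.01):
    """Return a copy of a square matrix with damp_percent * mean(diag) added to its diagonal."""
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    damp = damp_percent * mean_diag
    return [
        [hessian[i][j] + (damp if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


# A singular 2x2 matrix (determinant 0) becomes invertible after damping.
singular = [[1.0, 1.0], [1.0, 1.0]]
damped = dampen_hessian(singular, damp_percent=0.01)
determinant = damped[0][0] * damped[1][1] - damped[0][1] * damped[1][0]
```

With the default of 0.01, each diagonal entry of `singular` grows by 0.01, so `determinant` becomes positive. This is the sense in which the extra term improves numerical stability when a matrix like this has to be inverted.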
@@ -625,26 +658,30 @@ def compute_metrics(eval_preds):
else:
recipes = {}
if optim_args.quantization_approach == "weight_only":
op_type_dict = {
".*": {
"weight": {
"bits": optim_args.bits,
"group_size": optim_args.group_size,
"scheme": optim_args.weight_only_scheme,
"algorithm": optim_args.quantization_methodology,
},
},
}
if not is_intel_extension_for_transformers_available():
raise ImportError(INTEL_EXTENSION_FOR_TRANSFORMERS_IMPORT_ERROR.format("WeightOnly quantization"))
if optim_args.apply_pruning or optim_args.apply_distillation:
raise ValueError("Weight only quantization and pruning or distillation cannot be combined.")
if optim_args.quantization_methodology == "GPTQ":
gptq_args = {
"pad_max_length": block_size,
algorithm_args = {
"act_order": False,
"percdamp": optim_args.damp_percent,
"block_size": optim_args.gptq_block_size,
"nsamples": optim_args.num_calibration_samples,
"use_max_length": optim_args.use_max_length,
"pad_max_length": optim_args.pad_max_length,
}
recipes.update({"gptq_args": gptq_args})
quantization_config = WeightOnlyQuantConfig(
weight_dtype=optim_args.weight_dtype,
group_size=optim_args.group_size,
scheme=optim_args.weight_only_scheme,
algorithm=optim_args.quantization_methodology,
algorithm_args=algorithm_args if optim_args.quantization_methodology == "GPTQ" else None,
)
else:
op_type_dict = {}
quantization_config = PostTrainingQuantConfig(
approach=optim_args.quantization_approach, op_type_dict=op_type_dict, recipes=recipes
)
quantization_config = PostTrainingQuantConfig(
approach=optim_args.quantization_approach, recipes=recipes
)

if optim_args.apply_pruning:
if optim_args.end_step is None:
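
In the hunk above, `algorithm_args` is only assigned inside the GPTQ branch, and the later conditional expression avoids reading it for other methodologies. One way to make that explicit is a small helper that returns `None` unless GPTQ is selected. This is a sketch under assumptions: `GPTQOptions` is a hypothetical stand-in for the relevant `OptimizationArguments` fields, with defaults copied from the diff, and `build_algorithm_args` is not part of the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GPTQOptions:
    """Stand-in for the GPTQ-related fields of OptimizationArguments."""

    quantization_methodology: str = "RTN"
    damp_percent: float = 0.01
    gptq_block_size: int = 128
    num_calibration_samples: int = 128
    use_max_length: bool = False
    pad_max_length: int = 2048


def build_algorithm_args(opts: GPTQOptions) -> Optional[dict]:
    """Return the GPTQ argument dict from the diff, or None for other methodologies."""
    if opts.quantization_methodology != "GPTQ":
        return None
    return {
        "act_order": False,
        "percdamp": opts.damp_percent,
        "block_size": opts.gptq_block_size,
        "nsamples": opts.num_calibration_samples,
        "use_max_length": opts.use_max_length,
        "pad_max_length": opts.pad_max_length,
    }
```

The result could then be passed directly as the `algorithm_args` value, which removes the need to repeat the methodology check at the call site.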
@@ -732,15 +769,15 @@ def compute_metrics(eval_preds):
quantizer.quantize(
quantization_config=quantization_config,
save_directory=training_args.output_dir,
calibration_dataset=train_dataset
if optim_args.quantization_approach in ["static", "weight_only"]
else None,
batch_size=1 # batch_size > 1 for GPTQ is WIP
if optim_args.quantization_approach == "weight_only" and optim_args.quantization_methodology == "GPTQ"
else training_args.per_device_train_batch_size,
weight_only=True if optim_args.quantization_approach == "weight_only" else False,
calibration_dataset=(
train_dataset if optim_args.quantization_approach in ["static", "weight_only"] else None
),
batch_size=(
1 if optim_args.quantization_approach == "weight_only" else training_args.per_device_train_batch_size
),
)
trainer.model = quantizer._quantized_model

if optim_args.apply_quantization and optim_args.verify_loading:
loaded_model = INCModelForCausalLM.from_pretrained(training_args.output_dir)
tokens = tokenizer("This is a sample input", return_tensors="pt")
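
The two parenthesised conditionals passed to `quantizer.quantize` above can be read as one small decision rule: a calibration dataset is supplied only for `static` and `weight_only`, and the batch size is forced to 1 for `weight_only`. A minimal sketch of that rule as a standalone function follows; `select_calibration_inputs` is a hypothetical name, not something the PR adds.

```python
def select_calibration_inputs(approach, train_dataset, per_device_train_batch_size):
    """Mirror the calibration_dataset / batch_size selection from the quantize() call."""
    needs_calibration = approach in ("static", "weight_only")
    calibration_dataset = train_dataset if needs_calibration else None
    batch_size = 1 if approach == "weight_only" else per_device_train_batch_size
    return calibration_dataset, batch_size


# Usage with a placeholder dataset object.
dataset = ["sample-0", "sample-1"]
weight_only_inputs = select_calibration_inputs("weight_only", dataset, 8)
dynamic_inputs = select_calibration_inputs("dynamic", dataset, 8)
```

Here `weight_only_inputs` carries the dataset with a batch size of 1, while `dynamic_inputs` carries no dataset and keeps the configured batch size of 8.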
4 changes: 1 addition & 3 deletions examples/neural_compressor/text-generation/run_generation.py
@@ -368,9 +368,7 @@ def calibration_fn(p_model):

args.length = adjust_length_to_model(
args.length,
max_sequence_length=model.config.max_position_embeddings
if hasattr(model.config, "max_position_embeddings")
else 0,
max_sequence_length=getattr(model.config, "max_position_embeddings", 0),
)
logger.info(args)
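
The change above swaps a `hasattr` conditional for a `getattr` call with a default. The two forms behave the same for a plain attribute, as this short sketch shows. `SimpleNamespace` stands in for a model config here, which is an assumption made for illustration only.

```python
from types import SimpleNamespace


def max_len_hasattr(config):
    """Original form: explicit hasattr check with a fallback of 0."""
    return config.max_position_embeddings if hasattr(config, "max_position_embeddings") else 0


def max_len_getattr(config):
    """Replacement form: getattr with a default of 0."""
    return getattr(config, "max_position_embeddings", 0)


with_limit = SimpleNamespace(max_position_embeddings=2048)
without_limit = SimpleNamespace()
```

Both functions return 2048 for `with_limit` and 0 for `without_limit`; the `getattr` form simply performs the lookup once and is shorter to read.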

2 changes: 1 addition & 1 deletion optimum/intel/neural_compressor/__init__.py
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from ..utils.import_utils import is_diffusers_available
from ..utils.import_utils import is_diffusers_available, is_intel_extension_for_transformers_available
from .configuration import INCConfig
from .modeling_base import (
INCModel,
31 changes: 30 additions & 1 deletion optimum/intel/neural_compressor/modeling_base.py
@@ -43,7 +43,12 @@
from optimum.intel.generation import BaseModelForCausalLM

from ...modeling_base import OptimizedModel
from ..utils.import_utils import _torch_version, is_torch_version
from ..utils.import_utils import (
_torch_version,
is_intel_extension_for_transformers_available,
is_torch_version,
requires_backends,
)
from .configuration import INCConfig
from .utils import WEIGHTS_NAME

@@ -63,6 +68,11 @@
"""


if is_intel_extension_for_transformers_available():
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM as ITREX_WOQ_MODEL
from intel_extension_for_transformers.transformers.utils import WeightOnlyQuantConfig


class INCModel(OptimizedModel):
auto_model_class = AutoModel
export_feature = "feature-extraction"
@@ -131,6 +141,25 @@ def _from_pretrained(
model_save_dir = Path(model_cache_path).parent
inc_config = None
msg = None
try:
requires_backends(cls, ["intel_extension_for_transformers"])
quantization_config = WeightOnlyQuantConfig.from_pretrained(model_id)
if getattr(
quantization_config, "algorithm", None
) is not None and quantization_config.algorithm.lower() in ["rtn", "gptq", "awq", "autoaround"]:
return ITREX_WOQ_MODEL.from_pretrained(
pretrained_model_name_or_path=model_id,
use_auth_token=use_auth_token,
revision=revision,
force_download=force_download,
cache_dir=cache_dir,
local_files_only=local_files_only,
subfolder=subfolder,
trust_remote_code=trust_remote_code,
**kwargs,
)
except EnvironmentError:
msg = "The model is not quantized with weight-only quantization."
try:
inc_config = INCConfig.from_pretrained(model_id)
if not is_torch_version("==", inc_config.torch_version):
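
The new block in `_from_pretrained` tries the weight-only path first and falls back to the regular INC loading path when no weight-only config is found. The control flow can be sketched in isolation as below. All three loader callables are hypothetical placeholders supplied by the caller, not the real `WeightOnlyQuantConfig` or `ITREX_WOQ_MODEL` APIs; only the algorithm list is taken from the diff.

```python
from types import SimpleNamespace

# Algorithm names taken from the diff.
WOQ_ALGORITHMS = ("rtn", "gptq", "awq", "autoaround")


def is_weight_only_config(quantization_config) -> bool:
    """True when the config names one of the weight-only algorithms (case-insensitive)."""
    algorithm = getattr(quantization_config, "algorithm", None)
    return algorithm is not None and algorithm.lower() in WOQ_ALGORITHMS


def load_model(model_id, load_woq_config, load_woq_model, load_inc_model):
    """Try the weight-only loader first, then fall back to the regular loader."""
    try:
        config = load_woq_config(model_id)
        if is_weight_only_config(config):
            return load_woq_model(model_id)
    except EnvironmentError:
        pass  # No weight-only config was found for this model.
    return load_inc_model(model_id)


def _missing_config(model_id):
    # Simulates a checkpoint that has no weight-only quantization config.
    raise EnvironmentError(f"no weight-only config for {model_id}")


woq_result = load_model(
    "demo",
    lambda _id: SimpleNamespace(algorithm="GPTQ"),
    lambda _id: "woq-model",
    lambda _id: "inc-model",
)
fallback_result = load_model("demo", _missing_config, lambda _id: "woq-model", lambda _id: "inc-model")
```

In this sketch `woq_result` comes from the weight-only loader and `fallback_result` from the regular one. A config whose `algorithm` is missing or unrecognised also falls through to the regular loader, which matches the shape of the `getattr(...) is not None and ... in [...]` test in the diff.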