Support weight-only quantization with quantized operators in intel-extension-for-transformers. #455

Merged
Changes from 45 commits
Commits (69)
86d378f
Support weight-only quantization with quantized operators in intel-ex…
PenghuiCheng Oct 16, 2023
ca58fa5
Update code style
PenghuiCheng Oct 16, 2023
4837b2f
Update readme for weight-only quantization example
PenghuiCheng Oct 16, 2023
25b2664
Update code
PenghuiCheng Oct 19, 2023
a36584b
Adapt intel-extension-for-transformers 1.3 API change
PenghuiCheng Dec 19, 2023
9ebc5a9
Support weight-only quantization with quantized operators in intel-ex…
PenghuiCheng Oct 16, 2023
d0f1c71
Update code
PenghuiCheng Oct 19, 2023
ed873c9
rebase code on main branch
PenghuiCheng Jan 16, 2024
de190fd
Update example
PenghuiCheng Jan 17, 2024
59a1f81
merge from main branch
PenghuiCheng Feb 21, 2024
416b528
Update optimum/intel/neural_compressor/quantization.py
PenghuiCheng Mar 13, 2024
d62964a
[OV]: Fixed inference after 4 bit weight compression (#569)
AlexKoff88 Feb 21, 2024
65d5a97
Updated docs with load_in_4bit (#558)
AlexKoff88 Feb 21, 2024
bddd203
Update Transformers dependency requirements (#571)
echarlaix Feb 21, 2024
70a6373
Fix compatibility for latest transformers release (#570)
echarlaix Feb 22, 2024
1437b1b
Deprecate compression options (#565)
echarlaix Feb 27, 2024
32e8fa2
Add default quantization int4 config for Mixtral-8x7B (#576)
ljaljushkin Feb 28, 2024
e3f009b
Update stable diffusion example requirements (#579)
helena-intel Feb 29, 2024
6621611
Fix collecting duplicate tensors in quantization calibration dataset …
nikita-savelyevv Mar 1, 2024
ca33bed
Save an openvino config summarizing all information related to quanti…
echarlaix Mar 1, 2024
22bc3d0
Fix warning (#582)
echarlaix Mar 4, 2024
b56021e
Add reference to the temporary directory for windows fix (#581)
echarlaix Mar 4, 2024
8e68d38
Fix documentation (#583)
echarlaix Mar 4, 2024
26977f8
Add llama test model to cover MQA (#585)
jiqing-feng Mar 6, 2024
7516637
Include nncf in openvino extra (#586)
eaidova Mar 6, 2024
246c829
Fix title documentation (#588)
echarlaix Mar 6, 2024
4c481e6
Update OpenVINO documentation links in README.md (#587)
kblaszczak-intel Mar 7, 2024
126a581
Fix default int8 quantization for CLI (#592)
echarlaix Mar 7, 2024
67fad65
Change model output parameter to last_hidden_states for IPEXModel (#589)
jiqing-feng Mar 8, 2024
0bcffeb
Add IPEX model patcher (#567)
jiqing-feng Mar 8, 2024
52ae0a3
Updates weight quantization section in the docs (#593)
AlexKoff88 Mar 8, 2024
b751766
Remove accelerate and onnxruntime from required dependencies (#590)
echarlaix Mar 8, 2024
7674e33
Fix OpenVINO image classification examples (#598)
echarlaix Mar 11, 2024
1e73450
Fix weights compression for OPenVINO models (#596)
eaidova Mar 11, 2024
dc14a2b
Fix default ov config (#600)
echarlaix Mar 11, 2024
de243f0
Add warning for transformers>=4.38 and OpenVINO 2024.0 (#599)
helena-intel Mar 12, 2024
345f9e5
Add hybrid quantization for StableDiffusion pipelines (#584)
l-bat Mar 12, 2024
f68486b
Show device name in _print_compiled_model_properties (#541)
helena-intel Mar 12, 2024
00cd903
Update code with comments
PenghuiCheng Mar 13, 2024
6b95933
Fixed pylint error
PenghuiCheng Mar 13, 2024
7bb1827
Merge from main branch
PenghuiCheng Mar 13, 2024
5d90b52
Update optimum/intel/neural_compressor/configuration.py
PenghuiCheng Mar 13, 2024
e804df3
Fixed example and UT for weight-only quantization
PenghuiCheng Mar 13, 2024
82c27dd
Fixed pre-ci test error
PenghuiCheng Mar 13, 2024
3ca3f60
Fixed pre-ci test error
PenghuiCheng Mar 13, 2024
0cc7c00
Fixed UT and examples error
PenghuiCheng Mar 17, 2024
3d28d4a
Merge remote-tracking branch 'upstream/main' into penghuic/weight_onl…
PenghuiCheng Mar 17, 2024
9ec53ce
Fixed pre-CI error
PenghuiCheng Mar 17, 2024
66d45c2
Fixed UT error
PenghuiCheng Mar 18, 2024
4347cee
Update tests/openvino/test_modeling_basic.py
PenghuiCheng Mar 23, 2024
68d6e90
Update examples/neural_compressor/language-modeling/README.md
PenghuiCheng Mar 23, 2024
032b0ef
Update examples/neural_compressor/language-modeling/run_clm.py
PenghuiCheng Mar 23, 2024
6a6a97c
Update examples/neural_compressor/language-modeling/run_clm.py
PenghuiCheng Mar 23, 2024
8e90ac8
Update examples/neural_compressor/language-modeling/run_clm.py
PenghuiCheng Mar 23, 2024
88760bc
Update examples/neural_compressor/language-modeling/run_clm.py
PenghuiCheng Mar 23, 2024
f51266a
Update examples/neural_compressor/language-modeling/run_clm.py
PenghuiCheng Mar 23, 2024
f970272
Load weight-only quantized model with INCModelForCausalLM
PenghuiCheng Mar 24, 2024
e5558b0
Merge from main branch and update code style
PenghuiCheng Mar 24, 2024
5ddd360
Changed parameters name for GPTQ in example
PenghuiCheng Mar 25, 2024
721dd3b
Changed parameters order in INCQuantizer.quantize
PenghuiCheng Mar 25, 2024
ac9aee8
Fixed UT error
PenghuiCheng Mar 25, 2024
d7bd27e
Update examples/neural_compressor/text-generation/run_generation.py
PenghuiCheng Mar 26, 2024
19bdf0f
Update optimum/intel/neural_compressor/quantization.py
PenghuiCheng Mar 26, 2024
dd981df
Update optimum/intel/neural_compressor/quantization.py
PenghuiCheng Mar 26, 2024
94f1ac5
Merge remote-tracking branch 'upstream/main' into penghuic/weight_onl…
PenghuiCheng Mar 27, 2024
af07192
Update import message
PenghuiCheng Mar 27, 2024
9c24871
Limit intel-extension-for-transformers version
PenghuiCheng Mar 27, 2024
1331cdc
Limit torch version for weight-only quantization
PenghuiCheng Mar 27, 2024
638f516
Fixed doc building error
PenghuiCheng Mar 27, 2024
2 changes: 1 addition & 1 deletion examples/neural_compressor/language-modeling/README.md
@@ -97,4 +97,4 @@ respectively `dynamic`, `static`, `weight_only` or `aware_training`.

The flag `--verify_loading` can be passed along to verify that the resulting quantized model can be loaded correctly.

> **_Note:_** `weight_only` quantization_approach requires neural-compressor >= 2.3
> **_Note:_** `weight_only` quantization_approach requires neural-compressor >= 2.3 and intel-extension-for-transformers >= 1.3.
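For illustration, the `weight_only` approach described in this README corresponds to the new `WeightOnlyQuantConfig` path introduced by this PR. A minimal sketch, assuming intel-extension-for-transformers >= 1.3 and neural-compressor >= 2.3 are installed; the model name and output directory are placeholders, and the parameter names mirror the `WeightOnlyQuantConfig(...)` call in `run_clm.py` below:

```python
# Minimal weight-only quantization sketch (RTN needs no calibration data).
# Model name and save directory are placeholders, not taken from this PR.
from transformers import AutoModelForCausalLM

from intel_extension_for_transformers.transformers.utils.config import WeightOnlyQuantConfig
from optimum.intel.neural_compressor import INCQuantizer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")  # placeholder model

# Values follow the RTN defaults mentioned in quantization.py: group_size=32, scheme="sym".
quantization_config = WeightOnlyQuantConfig(
    weight_dtype="int8",
    group_size=32,
    scheme="sym",
    algorithm="RTN",
)

quantizer = INCQuantizer.from_pretrained(model)
quantizer.quantize(quantization_config=quantization_config, save_directory="clm_weight_only")
```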
2 changes: 2 additions & 0 deletions examples/neural_compressor/language-modeling/requirements.txt
@@ -3,3 +3,5 @@ torch >= 1.9
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
intel-extension-for-transformers >= 1.3
peft
51 changes: 26 additions & 25 deletions examples/neural_compressor/language-modeling/run_clm.py
@@ -57,8 +57,12 @@
from transformers.utils.versions import require_version

from optimum.intel.neural_compressor import INCModelForCausalLM, INCQuantizer, INCTrainer
from optimum.intel.utils.import_utils import is_intel_extension_for_transformers_available


if is_intel_extension_for_transformers_available():
from intel_extension_for_transformers.transformers.utils.config import WeightOnlyQuantConfig

os.environ["CUDA_VISIBLE_DEVICES"] = ""

# Will error if the minimal version of Transformers is not installed. Remove at your own risks.
@@ -196,9 +200,9 @@ class OptimizationArguments:
default=False,
metadata={"help": "Whether or not to verify the loading of the quantized model."},
)
bits: int = field(
default=8,
metadata={"help": "Bits for weight only quantization, 1-8 bits."},
weight_dtype: str = field(
default="int8",
metadata={"help": "weight dtype for weight only quantization."},
)
group_size: int = field(
default=-1,
@@ -625,26 +629,23 @@ def compute_metrics(eval_preds):
else:
recipes = {}
if optim_args.quantization_approach == "weight_only":
op_type_dict = {
".*": {
"weight": {
"bits": optim_args.bits,
"group_size": optim_args.group_size,
"scheme": optim_args.weight_only_scheme,
"algorithm": optim_args.quantization_methodology,
},
},
}
if optim_args.quantization_methodology == "GPTQ":
gptq_args = {
"pad_max_length": block_size,
}
recipes.update({"gptq_args": gptq_args})
if not is_intel_extension_for_transformers_available():
raise ImportError(
"Didn't find out intel-etension-for-transformers package. "
"Please install packages: pip install intel-etension-for-transformers and pip install peft."
)
if optim_args.apply_pruning or optim_args.apply_distillation:
raise ValueError("Weight only quantization and pruning or distillation cannot be combined.")
quantization_config = WeightOnlyQuantConfig(
weight_dtype=optim_args.weight_dtype,
group_size=optim_args.group_size,
scheme=optim_args.weight_only_scheme,
algorithm=optim_args.quantization_methodology,
)
else:
op_type_dict = {}
quantization_config = PostTrainingQuantConfig(
approach=optim_args.quantization_approach, op_type_dict=op_type_dict, recipes=recipes
)
quantization_config = PostTrainingQuantConfig(
approach=optim_args.quantization_approach, recipes=recipes
)

if optim_args.apply_pruning:
if optim_args.end_step is None:
@@ -735,12 +736,12 @@ def compute_metrics(eval_preds):
calibration_dataset=train_dataset
if optim_args.quantization_approach in ["static", "weight_only"]
else None,
batch_size=1 # batch_size > 1 for GPTQ is WIP
if optim_args.quantization_approach == "weight_only" and optim_args.quantization_methodology == "GPTQ"
batch_size=1
if optim_args.quantization_approach == "weight_only"
else training_args.per_device_train_batch_size,
weight_only=True if optim_args.quantization_approach == "weight_only" else False,
)
trainer.model = quantizer._quantized_model

if optim_args.apply_quantization and optim_args.verify_loading:
loaded_model = INCModelForCausalLM.from_pretrained(training_args.output_dir)
tokens = tokenizer("This is a sample input", return_tensors="pt")
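The `--verify_loading` check above reloads the quantized model through `INCModelForCausalLM`. A short sketch of that reload step, assuming the model and tokenizer were saved to the placeholder directory used in the earlier sketch:

```python
# Reload a weight-only quantized model the way run_clm.py's --verify_loading path does.
# Assumes the tokenizer was saved alongside the model in the output directory (placeholder name).
from transformers import AutoTokenizer

from optimum.intel.neural_compressor import INCModelForCausalLM

loaded_model = INCModelForCausalLM.from_pretrained("clm_weight_only")
tokenizer = AutoTokenizer.from_pretrained("clm_weight_only")

tokens = tokenizer("This is a sample input", return_tensors="pt")
outputs = loaded_model(**tokens)  # forward pass on the quantized model
```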
4 changes: 2 additions & 2 deletions optimum/intel/neural_compressor/configuration.py
@@ -35,7 +35,7 @@ class INCConfig(BaseConfig):

def __init__(
self,
quantization: Optional[Union[Dict, _BaseQuantizationConfig]] = None,
quantization: Optional[Union[Dict, _BaseQuantizationConfig, "WeightOnlyQuantConfig"]] = None,
pruning: Optional[Union[Dict, _BaseQuantizationConfig]] = None,
distillation: Optional[Union[Dict, _BaseQuantizationConfig]] = None,
save_onnx_model: bool = False,
@@ -50,7 +50,7 @@ def __init__(
self.save_onnx_model = save_onnx_model

@staticmethod
def _create_quantization_config(config: Union[Dict, _BaseQuantizationConfig]):
def _create_quantization_config(config):
# TODO : add activations_dtype and weights_dtype
if isinstance(config, _BaseQuantizationConfig):
approach = _quantization_model[config.approach]
184 changes: 105 additions & 79 deletions optimum/intel/neural_compressor/quantization.py
@@ -58,6 +58,7 @@
from ..utils.import_utils import (
_ipex_version,
_neural_compressor_version,
is_intel_extension_for_transformers_available,
is_ipex_version,
is_neural_compressor_version,
)
@@ -76,6 +77,14 @@
from .utils import INCDataLoader, _cfgs_to_fx_cfgs


if is_intel_extension_for_transformers_available():
from intel_extension_for_transformers.llm.quantization.utils import convert_to_quantized_model
from intel_extension_for_transformers.transformers.utils.config import WeightOnlyQuantConfig

Config = Union[PostTrainingQuantConfig, WeightOnlyQuantConfig]
else:
Config = PostTrainingQuantConfig

logger = logging.getLogger(__name__)

NEURAL_COMPRESSOR_MINIMUM_VERSION = "2.1.0"
@@ -143,8 +152,8 @@ def from_pretrained(cls, model: PreTrainedModel, **kwargs):

def quantize(
self,
quantization_config: "PostTrainingQuantConfig",
save_directory: Union[str, Path],
quantization_config: Config = None,
calibration_dataset: Dataset = None,
batch_size: int = 8,
data_collator: Optional[DataCollator] = None,
@@ -157,7 +166,7 @@ def quantize(
Quantize a model given the optimization specifications defined in `quantization_config`.

Args:
quantization_config (`PostTrainingQuantConfig`):
quantization_config (`Union[PostTrainingQuantConfig, WeightOnlyQuantConfig]`):
The configuration containing the parameters related to quantization.
save_directory (`Union[str, Path]`):
The directory where the quantized model should be saved.
@@ -177,30 +186,36 @@
save_directory.mkdir(parents=True, exist_ok=True)
save_onnx_model = kwargs.pop("save_onnx_model", False)

if save_onnx_model and isinstance(self._original_model, ORTModel):
if save_onnx_model and (isinstance(self._original_model, ORTModel) or weight_only):
save_onnx_model = False
logger.warning("Model provided is an ONNX model, `save_onnx_model` is set to False")

default_name = WEIGHTS_NAME if not isinstance(self._original_model, ORTModel) else ONNX_WEIGHTS_NAME
calibration_dataloader = None
self._set_task()

if weight_only:
if weight_only or not isinstance(quantization_config, PostTrainingQuantConfig):
# check neural-compressor version
if is_neural_compressor_version("<", NEURAL_COMPRESSOR_WEIGHT_ONLY_MINIMUM_VERSION):
raise ImportError(
f"Found an incompatible version of neural-compressor. Found version {_neural_compressor_version}, "
f"but only version {NEURAL_COMPRESSOR_WEIGHT_ONLY_MINIMUM_VERSION} or higher supports weight-only quantization."
)
if not is_intel_extension_for_transformers_available():
raise ImportError(
"Didn't find out intel-etension-for-transformers package. "
"Please install packages: pip install intel-etension-for-transformers and pip install peft."
)

# If op_type_dict of quantization_config is not defined, it will use default values for weight-only quantization:
# {"bits": 4, "group_size": 32, "scheme": "sym", "algorithm": "RTN"}
if isinstance(quantization_config.op_type_dict, dict) and len(quantization_config.op_type_dict) > 0:
algo = []
for _, val in quantization_config.op_type_dict.items():
algo += val.get("weight", {}).get("algorithm", ["RTN"])
else:
if quantization_config is None:
quantization_config = WeightOnlyQuantConfig()
algo = ["RTN"]
elif isinstance(quantization_config, WeightOnlyQuantConfig):
algo = quantization_config.algorithm
else:
raise TypeError(
f"For weight-only quantization, `quantization_config` should be an instance of `WeightOnlyQuantConfig`, but got: {type(quantization_config)} instead."
)

if calibration_dataset is None and ("GPTQ" in algo or "AWQ" in algo):
raise ValueError(
@@ -217,6 +232,9 @@ def quantize(
data_collator=data_collator,
use_label=False if "GPTQ" in algo else True,
)
quantization_config.calib_dataloader = calibration_dataloader

save_onnx_model = False

elif INCQuantizationMode(quantization_config.approach) == INCQuantizationMode.STATIC:
# Since PyTorch fx trace does not really require an example_inputs, only need calibration_dataset or calibration_fn here.
@@ -249,7 +267,8 @@ def quantize(
save_onnx_model = False

if (
quantization_config.backend == "ipex"
isinstance(quantization_config, PostTrainingQuantConfig)
and quantization_config.backend == "ipex"
and is_ipex_version("<", IPEX_MINIMUM_VERSION)
and "generation" in self.task
):
Expand All @@ -258,76 +277,83 @@ def quantize(
f"but only version {IPEX_MINIMUM_VERSION} or higher is supported."
)

if isinstance(self._original_model.config, PretrainedConfig):
self._original_model.config.backend = quantization_config.backend

if isinstance(self._original_model, ORTModel):
# TODO : enable seq2seq models
if isinstance(self._original_model, ORTModelForConditionalGeneration):
raise RuntimeError("ORTModelForConditionalGeneration not supported for quantization")

if isinstance(self._original_model, ORTModelForCausalLM):
model_or_path = self._original_model.onnx_paths
if len(model_or_path) > 1:
raise RuntimeError(
f"Too many ONNX model files were found in {self._original_model.onnx_paths}, only `use_cache=False` is supported"
)
model_or_path = str(model_or_path[0])
default_name = ONNX_DECODER_NAME
else:
model_or_path = str(self._original_model.model_path)
if not isinstance(quantization_config, PostTrainingQuantConfig):
self._quantized_model = convert_to_quantized_model(self._original_model, quantization_config)
# Save the quantized model
output_path = save_directory.joinpath(file_name or default_name)
self._quantized_model.save_pretrained(output_path)
else:
model_or_path = self._original_model

compressed_model = fit(
model_or_path,
conf=quantization_config,
calib_dataloader=calibration_dataloader,
eval_func=self.eval_fn,
calib_func=self.calibration_fn,
)

if not hasattr(compressed_model, "_model") or compressed_model._model is None:
raise RuntimeError(
"The maximum number of trials specified has been reached and no quantized model meeting the specified"
" accuracy tolerance has been found. Either the tolerance or the number of trials need to be increased."
if isinstance(self._original_model.config, PretrainedConfig):
self._original_model.config.backend = quantization_config.backend

if isinstance(self._original_model, ORTModel):
# TODO : enable seq2seq models
if isinstance(self._original_model, ORTModelForConditionalGeneration):
raise RuntimeError("ORTModelForConditionalGeneration not supported for quantization")

if isinstance(self._original_model, ORTModelForCausalLM):
model_or_path = self._original_model.onnx_paths
if len(model_or_path) > 1:
raise RuntimeError(
f"Too many ONNX model files were found in {self._original_model.onnx_paths}, only `use_cache=False` is supported"
)
model_or_path = str(model_or_path[0])
default_name = ONNX_DECODER_NAME
else:
model_or_path = str(self._original_model.model_path)
else:
model_or_path = self._original_model

compressed_model = fit(
model_or_path,
conf=quantization_config,
calib_dataloader=calibration_dataloader,
eval_func=self.eval_fn,
calib_func=self.calibration_fn,
)

if isinstance(self._original_model.config, PretrainedConfig):
# If backend is IPEX, then the quantized model is JIT model which will drop the config attribute,
# so need set config from original_model.
model_config = copy.deepcopy(self._original_model.config)
model_config.torch_dtype = "int8"
if isinstance(compressed_model, IPEXModel):
model_config.torchscript = True
model_config.backend = "ipex"
elif not isinstance(compressed_model, ONNXModel):
compressed_model._model.config = model_config
model_config.save_pretrained(save_directory)

self._quantized_model = compressed_model._model

if save_onnx_model:
model_type = self._original_model.config.model_type.replace("_", "-")
model_name = getattr(self._original_model, "name", None)
onnx_config_class = TasksManager.get_exporter_config_constructor(
exporter="onnx",
model=self._original_model,
task=self.task,
model_type=model_type,
model_name=model_name,
)
onnx_config = onnx_config_class(self._original_model.config)
compressed_model.eval()
output_onnx_path = save_directory.joinpath(ONNX_WEIGHTS_NAME)
# Export the compressed model to the ONNX format
self._onnx_export(compressed_model, onnx_config, output_onnx_path)

output_path = save_directory.joinpath(file_name or default_name)
# Save the quantized model
self._save_pretrained(compressed_model, output_path)
quantization_config = INCConfig(quantization=quantization_config, save_onnx_model=save_onnx_model)
quantization_config.save_pretrained(save_directory)
if not hasattr(compressed_model, "_model") or compressed_model._model is None:
raise RuntimeError(
"The maximum number of trials specified has been reached and no quantized model meeting the specified"
" accuracy tolerance has been found. Either the tolerance or the number of trials need to be increased."
)

if isinstance(self._original_model.config, PretrainedConfig):
# If backend is IPEX, then the quantized model is JIT model which will drop the config attribute,
# so need set config from original_model.
model_config = copy.deepcopy(self._original_model.config)
model_config.torch_dtype = "int8"
if isinstance(compressed_model, IPEXModel):
model_config.torchscript = True
model_config.backend = "ipex"
elif not isinstance(compressed_model, ONNXModel):
compressed_model._model.config = model_config
model_config.save_pretrained(save_directory)

self._quantized_model = compressed_model._model

if save_onnx_model:
model_type = self._original_model.config.model_type.replace("_", "-")
model_name = getattr(self._original_model, "name", None)
onnx_config_class = TasksManager.get_exporter_config_constructor(
exporter="onnx",
model=self._original_model,
task=self.task,
model_type=model_type,
model_name=model_name,
)
onnx_config = onnx_config_class(self._original_model.config)
compressed_model.eval()
output_onnx_path = save_directory.joinpath(ONNX_WEIGHTS_NAME)
# Export the compressed model to the ONNX format
self._onnx_export(compressed_model, onnx_config, output_onnx_path)

output_path = save_directory.joinpath(file_name or default_name)
# Save the quantized model
self._save_pretrained(compressed_model, output_path)
quantization_config = INCConfig(quantization=quantization_config, save_onnx_model=save_onnx_model)
quantization_config.save_pretrained(save_directory)
return self._quantized_model

@staticmethod
def _save_pretrained(model: Union[PyTorchModel, IPEXModel], output_path: str):
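Condensed, the dispatch added to `INCQuantizer.quantize` routes a `WeightOnlyQuantConfig` through intel-extension-for-transformers' `convert_to_quantized_model`, while `PostTrainingQuantConfig` keeps the existing neural-compressor `fit` path. A simplified sketch of that branching (ONNX export, config saving, and error handling omitted; `quantize_dispatch` is an illustrative helper, not part of the optimum-intel API, and the neural-compressor import paths follow its 2.x layout):

```python
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.quantization import fit

from intel_extension_for_transformers.llm.quantization.utils import convert_to_quantized_model


def quantize_dispatch(quantizer, quantization_config, save_directory, calibration_dataloader=None):
    """Illustrative condensation of the branching in INCQuantizer.quantize."""
    if not isinstance(quantization_config, PostTrainingQuantConfig):
        # Weight-only path (RTN/AWQ/GPTQ); GPTQ and AWQ additionally need calibration data.
        if calibration_dataloader is not None:
            quantization_config.calib_dataloader = calibration_dataloader
        quantized_model = convert_to_quantized_model(quantizer._original_model, quantization_config)
        quantized_model.save_pretrained(save_directory)
        return quantized_model

    # Static / dynamic post-training quantization via neural-compressor.
    compressed_model = fit(
        quantizer._original_model,
        conf=quantization_config,
        calib_dataloader=calibration_dataloader,
        eval_func=quantizer.eval_fn,
        calib_func=quantizer.calibration_fn,
    )
    return compressed_model._model
```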