[Torch][WeightCompression] Add Scale Estimation data-aware support #3179
base: develop
Conversation
weight compression build - 291
The proposed example can be added as a follow-up PR; it will be excluded from this PR.
def get_compress_fn(config):
    def _forward_fn(inputs):
        if len(inputs) == 3:
            tensor, scale, zero_point = inputs
            tensor, scale, zero_point = Tensor(tensor), Tensor(scale), Tensor(zero_point)
        else:
            tensor, scale = inputs
            tensor, scale = Tensor(tensor), Tensor(scale)
            zero_point = None
        quantized = calculate_quantized_weight(tensor, scale=scale, zero_point=zero_point, config=config)
        return quantized.data

    return _forward_fn


def get_compress_decompress_fn(config):
    def _forward_fn(inputs):
        if len(inputs) == 3:
            tensor, scale, zero_point = inputs
            tensor, scale, zero_point = Tensor(tensor), Tensor(scale), Tensor(zero_point)
        else:
            tensor, scale = inputs
            tensor, scale = Tensor(tensor), Tensor(scale)
            zero_point = None
        quantized = calculate_quantized_weight(tensor, scale=scale, zero_point=zero_point, config=config)
        dequantized = do_int_dequantization(quantized, scale=scale, zero_point=zero_point)
        return dequantized.data

    return _forward_fn
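For context, a usage sketch of these helpers (illustrative only; the import paths and the WeightCompressionConfig constructor below are assumptions about the NNCF layout, not code verified against this PR):

import numpy as np

from nncf import CompressWeightsMode
from nncf.quantization.algorithms.weight_compression.config import WeightCompressionConfig

# Hypothetical config: per-channel asymmetric int4 compression.
config = WeightCompressionConfig(mode=CompressWeightsMode.INT4_ASYM, group_size=-1)

weight = np.random.rand(32, 32).astype(np.float32)
scale = np.abs(weight).max(axis=1, keepdims=True) / 7.0  # toy per-row scale
zero_point = np.zeros_like(scale)

forward_fn = get_compress_decompress_fn(config)
dequantized = forward_fn((weight, scale, zero_point))  # compress-decompress round trip
print(np.abs(weight - dequantized).max())  # resulting quantization error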
Should be removed in #2727
Results from conformance job:
ptq performance job - 36
Performance numbers with adding
Locally, I tested the PR on CUDA: it works and shows the same accuracy as the CPU for tinyllama, with a significantly reduced compression time. Some changes are needed to enable
@@ -1094,7 +1094,7 @@ def __init__(self, scale: torch.Tensor, zero_point: torch.Tensor, result_dtype:
         """
         super().__init__()
         self.register_buffer("_scale", scale.type(dtype=torch.float16))
-        self.register_buffer("_zero_point", self.pack_weight(zero_point))
+        self.register_buffer("_zero_point", self.pack_weight(zero_point.type(dtype=torch.uint8)))
@alexsu52 please take a look
@kshpv, I would like to understand your motivation.
The motivation is that zero_point is used in the Scale Estimation quantization formula as a float type, not as a low-precision type. See:
https://github.com/openvinotoolkit/nncf/blob/develop/nncf/quantization/algorithms/weight_compression/scale_estimation.py#L226
https://github.com/openvinotoolkit/nncf/blob/develop/nncf/quantization/algorithms/weight_compression/openvino_backend.py#L377
For OpenVINO, the conversion from float precision to low precision is somewhat hidden: it is applied in the model transformation step - https://github.com/openvinotoolkit/nncf/blob/develop/nncf/quantization/algorithms/weight_compression/openvino_backend.py#L252
So I did essentially the same for Torch and convert zero_point to the expected dtype in the model transformation step.
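A simplified illustration of this point (the formula below is a sketch of asymmetric dequantization, not the exact NNCF implementation):

import torch

# During Scale Estimation the (de)quantization formula keeps zero_point in float,
# so the error being minimized is computed in full precision:
def dequantize_float_zp(w_q: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor) -> torch.Tensor:
    return (w_q - zero_point) * scale  # zero_point is a float tensor here

# Only when the compressed model is built is zero_point cast to the storage dtype,
# mirroring what the OV backend does inside its model transformation step:
zero_point_fp = torch.tensor([[8.0], [7.0]])
zero_point_packed = zero_point_fp.to(torch.uint8)  # applied at transformation time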
tests/openvino/native/models.py (Outdated)
@@ -1185,3 +1185,18 @@ def _create_ov_model(self):
         model = ov.Model([sin_result, cos_result], [position_ids])
         return model
+
+
+class MLP(OVReferenceModel):
To be honest, it is not an MLP: there is only one layer.
done
        return LinearModel(torch.arange(0, 32 * 32, dtype=torch.float32).reshape(32, 32))

    @staticmethod
    def get_scale_estimation_ref():
Is it possible to move the reference to a JSON file, like in this test:
ref_stats_path = get_actual_reference_for_current_openvino(
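For illustration, a minimal sketch of what a JSON-based reference check could look like (the file layout, the helper, and the tolerance are assumptions, not the existing test infrastructure):

import json
import numpy as np

def check_against_json_reference(actual_scales: dict, ref_path: str, atol: float = 1e-5) -> None:
    # Hypothetical reference file: {"node_name": [scale values]} dumped by a previous run.
    with open(ref_path, "r", encoding="utf-8") as f:
        expected = json.load(f)
    assert actual_scales.keys() == expected.keys()
    for name, values in expected.items():
        assert np.allclose(actual_scales[name], np.asarray(values), atol=atol), name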
Reduced the number of scales. It should look much better now.
# prepare dataset with one input tensor
input = np.arange(0, 32 * 32, dtype=np.float32).reshape(1, 32, 32)
input[0, 15] *= 100  # make one channel relatively higher
I would expect this to be reflected in the references, but I don't see anything outstanding. Maybe the value should be higher.
I think this test is intended to check the difference between the Torch and OV backends; it does not aim to check the algorithm's correctness. However, I agree that your proposal is good. I can add a new test that checks the error after quantization and demonstrates that the outlier channel has the lowest error compared to the others.
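A rough sketch of such a test (the error metric, the reduction axis, and the channel index are illustrative assumptions):

import numpy as np

def check_outlier_channel_has_lowest_error(original_weight: np.ndarray, decompressed_weight: np.ndarray) -> None:
    # Hypothetical check: per-channel relative error of the compressed weight.
    abs_error = np.mean(np.abs(original_weight - decompressed_weight), axis=1)
    rel_error = abs_error / np.mean(np.abs(original_weight), axis=1)
    outlier_channel = 15  # the channel scaled by 100 in the dataset above
    assert np.argmin(rel_error) == outlier_channel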
    def _reduce_out_of_place(self, x: List[Tensor]) -> List[Tensor]:
        x = x[0]
Not for this PR.
Should this function really take a list? It seems like it always contains one element. Is RawReducer the only reducer that consumes the whole list, while the rest work with a single element?
Can we derive all these "one-element" classes from a base class that defines a method taking a single element? IMO, it would be clearer when one element is expected and when it is not.
@daniil-lyakhov
Reducers were designed to receive several inputs and to produce several outputs as well. For now there are no such reducers, but a possible use case is quantization error: (fp32 input, int8 input) -> diff.
We could create a class for that, but it would make the hierarchy tree more complicated: we would then have to introduce a method like _reduce_out_of_place_one_input, and I don't think this would make the code more readable.
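For reference, a minimal sketch of the base-class option being discussed (class and method names are illustrative stand-ins, not the actual NNCF reducer classes; torch.Tensor stands in for nncf.tensor.Tensor):

from abc import ABC, abstractmethod
from typing import List

import torch

Tensor = torch.Tensor

class TensorReducerSketch(ABC):
    @abstractmethod
    def _reduce_out_of_place(self, x: List[Tensor]) -> List[Tensor]:
        ...

class OneInputReducerSketch(TensorReducerSketch):
    # Unpacks the single-element list so subclasses only deal with one tensor.
    def _reduce_out_of_place(self, x: List[Tensor]) -> List[Tensor]:
        return [self._reduce_out_of_place_one_input(x[0])]

    @abstractmethod
    def _reduce_out_of_place_one_input(self, x: Tensor) -> Tensor:
        ...

class MeanVarianceReducerSketch(OneInputReducerSketch):
    def _reduce_out_of_place_one_input(self, x: Tensor) -> Tensor:
        # Illustrative reduction: variance over the last axis, then mean.
        return x.var(dim=-1).mean()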
    @staticmethod
    def check_weights(model: torch.nn.Module, ref_ids: List[int]) -> None:
        low_precision_nodes = {f"{i}_weight" for i in ref_ids}
Please move the model names from the testing code to the model:
all_names = model.get_weight_names_in_exec_order()
low_precision_nodes = list(map(lambda i: all_names[i], ref_ids))
for op_name, op in model.nncf.external_op.items():
    for name in low_precision_nodes:
        if name in op_name:
            assert isinstance(op, INT4SymmetricWeightsDecompressor)
done
Changes
Added data-aware support for the Torch backend for WeightCompression with Scale Estimation.
Introduced support for MeanVarianceReducer, MaxVarianceReducer, and MeanAbsMaxReducer.
Incorporated torch.inference_mode() context for WeightCompression.
Reason for changes
These changes enable the utilization of data-aware Scale Estimation for the Torch backend, specifically leveraging CUDA devices for improved performance.
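A minimal sketch of the statistics-collection loop that benefits from the torch.inference_mode() context and a CUDA device (the function and its arguments are illustrative, not the actual NNCF code):

import torch

def collect_statistics(model: torch.nn.Module, dataloader) -> None:
    # Data-aware compression only needs forward passes, so autograd bookkeeping
    # can be disabled; this noticeably speeds up collection, especially on CUDA.
    device = next(model.parameters()).device
    model.eval()
    with torch.inference_mode():
        for batch in dataloader:
            model(batch.to(device))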
Related tickets
Ticket ID: 158974
Tests
Added a template for WeightCompression tests for both Torch and OV backends, covering data-aware and Scale Estimation scenarios.
Extended the test scope to include tinyllama_data_aware and tinyllama_scale_estimation_per_channel for Torch.
Added a new test case tinyllama_scale_estimation_group_size_64 for both Torch and OV backends.
Performance Metrics
Note: All CUDA results are obtained locally on a single RTX 3090.