[Torch][WeightCompression] Add Scale Estimation data-aware support #3179

Open

wants to merge 44 commits into base: develop
Conversation

kshpv (Collaborator) commented Jan 8, 2025

Changes

- Added data-aware support for the Torch backend for WeightCompression with Scale Estimation.
- Introduced support for MeanVarianceReducer, MaxVarianceReducer, and MeanAbsMaxReducer.
- Incorporated a torch.inference_mode() context for WeightCompression.

Reason for changes

These changes enable data-aware Scale Estimation for the Torch backend and, in particular, allow it to leverage CUDA devices for improved performance.
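A minimal usage sketch of the enabled flow (illustrative only; `model` and `calibration_samples` are placeholders, and the parameter values are not taken from this PR):

import nncf
import torch

# `model` (a torch.nn.Module) and `calibration_samples` (an iterable of example
# inputs on the model's device) are hypothetical placeholders for this sketch.
model = model.cuda()  # optional: run the data-aware algorithms on a CUDA device
dataset = nncf.Dataset(calibration_samples)

compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    ratio=0.8,              # fraction of weights compressed to int4
    group_size=64,          # quantization group size
    dataset=dataset,        # data-aware path: statistics are collected on these samples
    scale_estimation=True,  # enable the Scale Estimation algorithm
)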

Related tickets

Ticket ID: 158974

Tests

- Added a template for WeightCompression tests covering data-aware and Scale Estimation scenarios for both the Torch and OV backends.
- Extended the test scope to include tinyllama_data_aware and tinyllama_scale_estimation_per_channel for Torch.
- Added a new test case, tinyllama_scale_estimation_group_size_64, for both the Torch and OV backends.

Performance Metrics

Note: all CUDA results were obtained locally on a single RTX 3090.

| Model | Backend | Metric Name | Metric Value | Num int4 | Num int8 | Compression Time (from Performance Job) | RAM MiB (from Performance Job) |
|---|---|---|---|---|---|---|---|
| tinyllama_data_aware | OV | Similarity | 0.8577 | 94 | 124 | 0:01:28 | 8545 |
| tinyllama_data_aware | TORCH | Similarity | 0.8577 | 94 | 124 | 0:02:15 | 1225 |
| tinyllama_data_aware | TORCH (CUDA) | Similarity | 0.8577 | 94 | 124 | 0:00:28 | - |
| tinyllama_scale_estimation_per_channel | OV | Similarity | 0.8139 | 188 | 124 | 0:02:57 | 8681 |
| tinyllama_scale_estimation_per_channel | TORCH | Similarity | 0.8139 | 188 | 124 | 0:03:25 | 5472 |
| tinyllama_scale_estimation_per_channel | TORCH (CUDA) | Similarity | 0.8139 | 188 | 124 | 0:00:35 | - |
| tinyllama_scale_estimation_group_size_64 | OV | Similarity | 0.8566 | 94 | 124 | 0:04:17 | 8681 |
| tinyllama_scale_estimation_group_size_64 | TORCH | Similarity | 0.8566 | 94 | 124 | 0:04:01 | 5575 |
| tinyllama_scale_estimation_group_size_64 | TORCH (CUDA) | Similarity | 0.8566 | 94 | 124 | 0:00:36 | - |

@kshpv requested a review from a team as a code owner on January 8, 2025 09:53
The github-actions bot added the NNCF PT, NNCF Common, experimental, NNCF OpenVINO, and NNCF PTQ labels on Jan 8, 2025
@kshpv changed the title to "[Torch][WeightCompression] Add Scale Estimation support" on Jan 8, 2025
@kshpv marked this pull request as draft on January 8, 2025 09:54
@MaximProshin changed the title from "[Torch][WeightCompression] Add Scale Estimation support" to "[Torch][WeightCompression] Add Scale Estimation data-aware support" on Jan 8, 2025
kshpv (Collaborator, Author) commented Jan 8, 2025

weight compression build - 291

kshpv (Collaborator, Author) commented Jan 8, 2025

The proposed example can be added in a follow-up PR; it will be excluded from this PR.

Comment on lines 64 to 92
def get_compress_fn(config):
    def _forward_fn(inputs):
        # Wrap raw torch tensors into NNCF Tensors; zero_point is optional
        # (it is not used for symmetric quantization).
        if len(inputs) == 3:
            tensor, scale, zero_point = inputs
            tensor, scale, zero_point = Tensor(tensor), Tensor(scale), Tensor(zero_point)
        else:
            tensor, scale = inputs
            tensor, scale = Tensor(tensor), Tensor(scale)
            zero_point = None
        quantized = calculate_quantized_weight(tensor, scale=scale, zero_point=zero_point, config=config)
        return quantized.data

    return _forward_fn


def get_compress_decompress_fn(config):
    def _forward_fn(inputs):
        if len(inputs) == 3:
            tensor, scale, zero_point = inputs
            tensor, scale, zero_point = Tensor(tensor), Tensor(scale), Tensor(zero_point)
        else:
            tensor, scale = inputs
            tensor, scale = Tensor(tensor), Tensor(scale)
            zero_point = None
        # Quantize and immediately dequantize to reproduce the compression error.
        quantized = calculate_quantized_weight(tensor, scale=scale, zero_point=zero_point, config=config)
        dequantized = do_int_dequantization(quantized, scale=scale, zero_point=zero_point)
        return dequantized.data

    return _forward_fn
kshpv (Collaborator, Author):
Should be removed in #2727

kshpv (Collaborator, Author) commented Jan 20, 2025

Results from conformance job:

| Model | Backend | Metric name | Metric value | Metric diff | Num int4 | Num int8 | Compr. time | Total time | RAM MiB |
|---|---|---|---|---|---|---|---|---|---|
| tinyllama_data_aware | TORCH | Similarity | 0.8577 | -0.1423 | 94 | 124 | 1:54:26 | 2:18:31 | 6912 |
| tinyllama_data_aware | OV | Similarity | 0.8577 | -0.1423 | 94 | 124 | 0:02:36 | 0:17:57 | 13828 |
| tinyllama_scale_estimation_per_channel | OV | Similarity | 0.8139 | -0.1861 | 188 | 124 | 0:07:56 | 0:24:14 | 13881 |
| tinyllama_scale_estimation_per_channel | TORCH | Similarity | 0.8139 | -0.1861 | 188 | 124 | 2:40:47 | 3:05:47 | 55553 |

kshpv (Collaborator, Author) commented Jan 20, 2025

ptq performance job - 36

kshpv (Collaborator, Author) commented Jan 20, 2025

Performance numbers after adding torch.no_grad():

| Model | Num int4 | Num int8 | Compr. time | Stat. collection time | Mixed-Precision search time | Apply Compression time | Total time | RAM MiB | RAM MiB System |
|---|---|---|---|---|---|---|---|---|---|
| tinyllama_data_aware | 94 | 124 | 0:01:26 | 0:01:16 | 0:00:03 | 0:00:02 | 0:02:15 | 1225 | 1201 |
| tinyllama_int4_data_free | 114 | 84 | 0:00:09 | | 0:00:03 | 0:00:02 | 0:00:47 | 5617 | 1401 |
| tinyllama_int8_data_free | 0 | 312 | 0:00:06 | | | 0:00:02 | 0:00:39 | 5585 | 1356 |
| tinyllama_scale_estimation_per_channel | 188 | 124 | 0:02:50 | 0:01:20 | 0:00:03 | 0:00:01 | 0:03:25 | 5472 | 1311 |
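For illustration, the kind of guard being measured here looks roughly like the following; `run_statistics_and_compression` is a hypothetical stand-in for the Torch backend's compression path, not an NNCF function:

import torch

def run_statistics_and_compression(model, compress_fn):
    # Disabling autograd during statistics collection and compression avoids
    # building a computation graph, which cuts both runtime and peak memory.
    # The PR ultimately uses torch.inference_mode(); torch.no_grad() gives a
    # similar effect but still allows the outputs to participate in autograd.
    with torch.inference_mode():
        return compress_fn(model)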

kshpv (Collaborator, Author) commented Jan 20, 2025

Locally, I tested the PR on CUDA: it works and shows the same accuracy as the CPU for tinyllama, with a greatly reduced compression time. Some changes are still needed to enable CUDA for conformance, as well as to enable jobs with GPUs, so I think that can be done in follow-up PRs.

@@ -1094,7 +1094,7 @@ def __init__(self, scale: torch.Tensor, zero_point: torch.Tensor, result_dtype:
         """
         super().__init__()
         self.register_buffer("_scale", scale.type(dtype=torch.float16))
-        self.register_buffer("_zero_point", self.pack_weight(zero_point))
+        self.register_buffer("_zero_point", self.pack_weight(zero_point.type(dtype=torch.uint8)))
ljaljushkin (Contributor) commented Jan 21, 2025

@alexsu52 please take a look

Contributor:

@kshpv, I would like to understand your motivation.

kshpv (Collaborator, Author) commented Jan 22, 2025

The motivation is that zero_point is used in the Scale Estimation quantization formula as a float type, not a low-precision type. See:
https://github.com/openvinotoolkit/nncf/blob/develop/nncf/quantization/algorithms/weight_compression/scale_estimation.py#L226
https://github.com/openvinotoolkit/nncf/blob/develop/nncf/quantization/algorithms/weight_compression/openvino_backend.py#L377
For OpenVINO, the conversion from float precision to low precision is somewhat hidden: it is applied in the model transformation step - https://github.com/openvinotoolkit/nncf/blob/develop/nncf/quantization/algorithms/weight_compression/openvino_backend.py#L252
So I basically did the same for Torch and convert zero_point to the expected dtype in the model transformation step.
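A condensed sketch of the idea (the class below is illustrative, not the actual NNCF decompressor):

import torch

class WeightsDecompressorSketch(torch.nn.Module):
    def __init__(self, scale: torch.Tensor, zero_point: torch.Tensor):
        super().__init__()
        # Scale Estimation operates on a floating-point zero_point; the cast to
        # the low-precision storage dtype happens only here, at model
        # transformation time, mirroring what the OpenVINO backend does when it
        # builds the compressed graph.
        self.register_buffer("_scale", scale.type(dtype=torch.float16))
        self.register_buffer("_zero_point", zero_point.type(dtype=torch.uint8))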

@@ -1185,3 +1185,18 @@ def _create_ov_model(self):

         model = ov.Model([sin_result, cos_result], [position_ids])
         return model
+
+
+class MLP(OVReferenceModel):
Collaborator:

To be honest, it is not an MLP; there is only one layer.

kshpv (Collaborator, Author):

done

        return LinearModel(torch.arange(0, 32 * 32, dtype=torch.float32).reshape(32, 32))

    @staticmethod
    def get_scale_estimation_ref():
Collaborator:

Is it possible to move the reference to a JSON file, as in this test?

    ref_stats_path = get_actual_reference_for_current_openvino(

kshpv (Collaborator, Author):

Reduced the number of scales. It should look much better now.


# prepare a dataset with one input tensor
input = np.arange(0, 32 * 32, dtype=np.float32).reshape(1, 32, 32)
input[0, 15] *= 100  # make one channel noticeably larger than the others
Contributor:

I would expect this to be reflected in the references, but I don't see anything outstanding. Maybe the value should be higher.

kshpv (Collaborator, Author):

I think this test intends to check the difference between the Torch and OV backends; it does not aim to check the algorithm's correctness. However, I agree that your proposal is good. I can add a new test that checks the error after quantization and demonstrates that the outlier channel has the lowest error compared to the others.
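A rough sketch of what such a check could look like (the helper names and the error metric are hypothetical, not part of this PR):

import numpy as np

def per_input_channel_error(weight: np.ndarray, dequantized_weight: np.ndarray) -> np.ndarray:
    # Mean absolute reconstruction error per input channel (columns of the
    # [out, in] weight matrix).
    return np.mean(np.abs(weight - dequantized_weight), axis=0)

def check_outlier_channel_error(weight, dequantized_weight, outlier_channel=15):
    errors = per_input_channel_error(weight, dequantized_weight)
    # Scale Estimation weights the quantization error by activation magnitude,
    # so the boosted input channel should be reconstructed most accurately.
    assert int(np.argmin(errors)) == outlier_channel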

Comment on lines +467 to +468

    def _reduce_out_of_place(self, x: List[Tensor]) -> List[Tensor]:
        x = x[0]
Contributor:

Not for this PR: should this function really take a list? It seems like it always contains one element. Is RawReducer the only reducer that consumes the whole list, while the rest work with a single element? Could we derive all these "one-element" classes from a base class that defines a method taking a single element? IMO, it would be clearer when one element is expected and when it is not.
@daniil-lyakhov

Collaborator:

Reducers were designed to receive several inputs and to produce several outputs as well. For now there are no such reducers, but a possible use case is a quantization-error reducer: (fp32 input, int8 input) -> diff.

We could create a class for that, but it would make the hierarchy tree more complicated; we would then have to introduce a method like _reduce_out_of_place_one_input, and I don't think that would make the code more readable.
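For reference, under the current list-in/list-out contract one of the new "one-element" reducers looks roughly like this (an illustrative sketch, not the actual NNCF class):

from typing import List, Tuple

import torch

class MeanVarianceReducerSketch:
    def __init__(self, reduction_axes: Tuple[int, ...]):
        self._reduction_axes = reduction_axes

    def _reduce_out_of_place(self, x: List[torch.Tensor]) -> List[torch.Tensor]:
        # Only the first (and only) input tensor is used by this reducer.
        t = x[0]
        # Variance over the reduction axes, averaged into a single statistic.
        return [torch.mean(torch.var(t, dim=self._reduction_axes))]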

@kshpv requested review from ljaljushkin and alexsu52 on January 22, 2025 14:48

    @staticmethod
    def check_weights(model: torch.nn.Module, ref_ids: List[int]) -> None:
        low_precision_nodes = {f"{i}_weight" for i in ref_ids}
Contributor:

Please move the model names from the testing code into the model:

all_names = model.get_weight_names_in_exec_order()
low_precision_nodes = list(map(lambda i: all_names[i], ref_ids))
for op_name, op in model.nncf.external_op.items():
    for name in low_precision_nodes:
        if name in op_name:
            assert isinstance(op, INT4SymmetricWeightsDecompressor)

kshpv (Collaborator, Author):

done
