[Torch][WeightCompression] Add Scale Estimation data-aware support #3179

Open

wants to merge 44 commits into base: develop
Conversation

kshpv (Collaborator) commented Jan 8, 2025

Changes

- Added data-aware support for the Torch backend for WeightCompression with Scale Estimation.
- Introduced support for MeanVarianceReducer, MaxVarianceReducer, and MeanAbsMaxReducer.
- Incorporated a torch.inference_mode() context for WeightCompression.

Reason for changes

These changes enable data-aware Scale Estimation for the Torch backend and, in particular, allow it to leverage CUDA devices for improved performance.
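A minimal usage sketch of the enabled flow (illustrative only; `model` and `calibration_samples` are placeholders, and the parameter values are not taken from this PR):

import nncf
import torch

# `model` (a torch.nn.Module) and `calibration_samples` (an iterable of example
# inputs on the model's device) are hypothetical placeholders for this sketch.
model = model.cuda()  # optional: run the data-aware algorithms on a CUDA device
dataset = nncf.Dataset(calibration_samples)

compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    ratio=0.8,              # fraction of weights compressed to int4
    group_size=64,          # quantization group size
    dataset=dataset,        # data-aware path: statistics are collected on these samples
    scale_estimation=True,  # enable the Scale Estimation algorithm
)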

Related tickets

Ticket ID: 158974

Tests

- Added a template for WeightCompression tests covering data-aware and Scale Estimation scenarios for both the Torch and OV backends.
- Extended the test scope to include tinyllama_data_aware and tinyllama_scale_estimation_per_channel for Torch.
- Added a new test case, tinyllama_scale_estimation_group_size_64, for both the Torch and OV backends.

Performance Metrics

Note: all CUDA results were obtained locally on a single RTX 3090.

| Model | Backend | Metric Name | Metric Value | Num int4 | Num int8 | Compression Time (from Performance Job) | RAM MiB (from Performance Job) |
|---|---|---|---|---|---|---|---|
| tinyllama_data_aware | OV | Similarity | 0.8577 | 94 | 124 | 0:01:28 | 8545 |
| tinyllama_data_aware | TORCH | Similarity | 0.8577 | 94 | 124 | 0:02:15 | 1225 |
| tinyllama_data_aware | TORCH (CUDA) | Similarity | 0.8577 | 94 | 124 | 0:00:28 | - |
| tinyllama_scale_estimation_per_channel | OV | Similarity | 0.8139 | 188 | 124 | 0:02:57 | 8681 |
| tinyllama_scale_estimation_per_channel | TORCH | Similarity | 0.8139 | 188 | 124 | 0:03:25 | 5472 |
| tinyllama_scale_estimation_per_channel | TORCH (CUDA) | Similarity | 0.8139 | 188 | 124 | 0:00:35 | - |
| tinyllama_scale_estimation_group_size_64 | OV | Similarity | 0.8566 | 94 | 124 | 0:04:17 | 8681 |
| tinyllama_scale_estimation_group_size_64 | TORCH | Similarity | 0.8566 | 94 | 124 | 0:04:01 | 5575 |
| tinyllama_scale_estimation_group_size_64 | TORCH (CUDA) | Similarity | 0.8566 | 94 | 124 | 0:00:36 | - |

@kshpv requested a review from a team as a code owner on January 8, 2025 09:53
The github-actions bot added the NNCF PT, NNCF Common, experimental, NNCF OpenVINO, and NNCF PTQ labels on Jan 8, 2025
@kshpv changed the title to "[Torch][WeightCompression] Add Scale Estimation support" on Jan 8, 2025
@kshpv marked this pull request as draft on January 8, 2025 09:54
@MaximProshin changed the title from "[Torch][WeightCompression] Add Scale Estimation support" to "[Torch][WeightCompression] Add Scale Estimation data-aware support" on Jan 8, 2025
kshpv (Collaborator, Author) commented Jan 8, 2025

weight compression build - 291

kshpv (Collaborator, Author) commented Jan 8, 2025

The proposed example can be added in a follow-up PR; it will be excluded from this PR.

Comment on lines 64 to 92
def get_compress_fn(config):
    def _forward_fn(inputs):
        # Wrap raw torch tensors into NNCF Tensors; zero_point is optional
        # (it is not used for symmetric quantization).
        if len(inputs) == 3:
            tensor, scale, zero_point = inputs
            tensor, scale, zero_point = Tensor(tensor), Tensor(scale), Tensor(zero_point)
        else:
            tensor, scale = inputs
            tensor, scale = Tensor(tensor), Tensor(scale)
            zero_point = None
        quantized = calculate_quantized_weight(tensor, scale=scale, zero_point=zero_point, config=config)
        return quantized.data

    return _forward_fn


def get_compress_decompress_fn(config):
    def _forward_fn(inputs):
        if len(inputs) == 3:
            tensor, scale, zero_point = inputs
            tensor, scale, zero_point = Tensor(tensor), Tensor(scale), Tensor(zero_point)
        else:
            tensor, scale = inputs
            tensor, scale = Tensor(tensor), Tensor(scale)
            zero_point = None
        # Quantize and immediately dequantize to reproduce the compression error.
        quantized = calculate_quantized_weight(tensor, scale=scale, zero_point=zero_point, config=config)
        dequantized = do_int_dequantization(quantized, scale=scale, zero_point=zero_point)
        return dequantized.data

    return _forward_fn
kshpv (Collaborator, Author):
Should be removed in #2727

kshpv (Collaborator, Author) commented Jan 20, 2025

Results from conformance job:

| Model | Backend | Metric name | Metric value | Metric diff | Num int4 | Num int8 | Compr. time | Total time | RAM MiB |
|---|---|---|---|---|---|---|---|---|---|
| tinyllama_data_aware | TORCH | Similarity | 0.8577 | -0.1423 | 94 | 124 | 1:54:26 | 2:18:31 | 6912 |
| tinyllama_data_aware | OV | Similarity | 0.8577 | -0.1423 | 94 | 124 | 0:02:36 | 0:17:57 | 13828 |
| tinyllama_scale_estimation_per_channel | OV | Similarity | 0.8139 | -0.1861 | 188 | 124 | 0:07:56 | 0:24:14 | 13881 |
| tinyllama_scale_estimation_per_channel | TORCH | Similarity | 0.8139 | -0.1861 | 188 | 124 | 2:40:47 | 3:05:47 | 55553 |

kshpv (Collaborator, Author) commented Jan 20, 2025

ptq performance job - 36

kshpv (Collaborator, Author) commented Jan 20, 2025

Performance numbers after adding torch.no_grad():

| Model | Num int4 | Num int8 | Compr. time | Stat. collection time | Mixed-Precision search time | Apply Compression time | Total time | RAM MiB | RAM MiB System |
|---|---|---|---|---|---|---|---|---|---|
| tinyllama_data_aware | 94 | 124 | 0:01:26 | 0:01:16 | 0:00:03 | 0:00:02 | 0:02:15 | 1225 | 1201 |
| tinyllama_int4_data_free | 114 | 84 | 0:00:09 | | 0:00:03 | 0:00:02 | 0:00:47 | 5617 | 1401 |
| tinyllama_int8_data_free | 0 | 312 | 0:00:06 | | | 0:00:02 | 0:00:39 | 5585 | 1356 |
| tinyllama_scale_estimation_per_channel | 188 | 124 | 0:02:50 | 0:01:20 | 0:00:03 | 0:00:01 | 0:03:25 | 5472 | 1311 |
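For illustration, the kind of guard being measured here looks roughly like the following; `run_statistics_and_compression` is a hypothetical stand-in for the Torch backend's compression path, not an NNCF function:

import torch

def run_statistics_and_compression(model, compress_fn):
    # Disabling autograd during statistics collection and compression avoids
    # building a computation graph, which cuts both runtime and peak memory.
    # The PR ultimately uses torch.inference_mode(); torch.no_grad() gives a
    # similar effect but still allows the outputs to participate in autograd.
    with torch.inference_mode():
        return compress_fn(model)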

kshpv (Collaborator, Author) commented Jan 20, 2025

Locally, I tested the PR on CUDA: it works and shows the same accuracy as the CPU for tinyllama, with a greatly reduced compression time. Some changes are still needed to enable CUDA for conformance, as well as to enable jobs with GPUs, so I think that can be done in follow-up PRs.

@@ -1094,7 +1094,7 @@ def __init__(self, scale: torch.Tensor, zero_point: torch.Tensor, result_dtype:
         """
         super().__init__()
         self.register_buffer("_scale", scale.type(dtype=torch.float16))
-        self.register_buffer("_zero_point", self.pack_weight(zero_point))
+        self.register_buffer("_zero_point", self.pack_weight(zero_point.type(dtype=torch.uint8)))
ljaljushkin (Contributor) commented Jan 21, 2025

@alexsu52 please take a look

Contributor:

@kshpv, I would like to understand your motivation.

kshpv (Collaborator, Author) commented Jan 22, 2025

The motivation is that zero_point is used in the Scale Estimation quantization formula as a float type, not a low-precision type. See:
https://github.com/openvinotoolkit/nncf/blob/develop/nncf/quantization/algorithms/weight_compression/scale_estimation.py#L226
https://github.com/openvinotoolkit/nncf/blob/develop/nncf/quantization/algorithms/weight_compression/openvino_backend.py#L377
For OpenVINO, the conversion from float precision to low precision is somewhat hidden: it is applied in the model transformation step - https://github.com/openvinotoolkit/nncf/blob/develop/nncf/quantization/algorithms/weight_compression/openvino_backend.py#L252
So I basically did the same for Torch and convert zero_point to the expected dtype in the model transformation step.
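A condensed sketch of the idea (the class below is illustrative, not the actual NNCF decompressor):

import torch

class WeightsDecompressorSketch(torch.nn.Module):
    def __init__(self, scale: torch.Tensor, zero_point: torch.Tensor):
        super().__init__()
        # Scale Estimation operates on a floating-point zero_point; the cast to
        # the low-precision storage dtype happens only here, at model
        # transformation time, mirroring what the OpenVINO backend does when it
        # builds the compressed graph.
        self.register_buffer("_scale", scale.type(dtype=torch.float16))
        self.register_buffer("_zero_point", zero_point.type(dtype=torch.uint8))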

@@ -1185,3 +1185,18 @@ def _create_ov_model(self):

         model = ov.Model([sin_result, cos_result], [position_ids])
         return model
+
+
+class MLP(OVReferenceModel):
Collaborator:

To be honest, it is not an MLP; there is only one layer.

kshpv (Collaborator, Author):

done

        return LinearModel(torch.arange(0, 32 * 32, dtype=torch.float32).reshape(32, 32))

    @staticmethod
    def get_scale_estimation_ref():
Collaborator:

Is it possible to move the reference to a JSON file, as in this test?

    ref_stats_path = get_actual_reference_for_current_openvino(

kshpv (Collaborator, Author):

Reduced the number of scales. It should look much better now.


# prepare a dataset with one input tensor
input = np.arange(0, 32 * 32, dtype=np.float32).reshape(1, 32, 32)
input[0, 15] *= 100  # make one channel noticeably larger than the others
Contributor:

I would expect this to be reflected in the references, but I don't see anything outstanding. Maybe the value should be higher.

kshpv (Collaborator, Author):

I think this test intends to check the difference between the Torch and OV backends; it does not aim to check the algorithm's correctness. However, I agree that your proposal is good. I can add a new test that checks the error after quantization and demonstrates that the outlier channel has the lowest error compared to the others.
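A rough sketch of what such a check could look like (the helper names and the error metric are hypothetical, not part of this PR):

import numpy as np

def per_input_channel_error(weight: np.ndarray, dequantized_weight: np.ndarray) -> np.ndarray:
    # Mean absolute reconstruction error per input channel (columns of the
    # [out, in] weight matrix).
    return np.mean(np.abs(weight - dequantized_weight), axis=0)

def check_outlier_channel_error(weight, dequantized_weight, outlier_channel=15):
    errors = per_input_channel_error(weight, dequantized_weight)
    # Scale Estimation weights the quantization error by activation magnitude,
    # so the boosted input channel should be reconstructed most accurately.
    assert int(np.argmin(errors)) == outlier_channel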

Comment on lines +467 to +468

    def _reduce_out_of_place(self, x: List[Tensor]) -> List[Tensor]:
        x = x[0]
Contributor:

Not for this PR: should this function really take a list? It seems like it always contains one element. Is RawReducer the only reducer that consumes the whole list, while the rest work with a single element? Could we derive all these "one-element" classes from a base class that defines a method taking a single element? IMO, it would be clearer when one element is expected and when it is not.
@daniil-lyakhov

Collaborator:

Reducers were designed to receive several inputs and to produce several outputs as well. For now there are no such reducers, but a possible use case is a quantization-error reducer: (fp32 input, int8 input) -> diff.

We could create a class for that, but it would make the hierarchy tree more complicated; we would then have to introduce a method like _reduce_out_of_place_one_input, and I don't think that would make the code more readable.
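For reference, under the current list-in/list-out contract one of the new "one-element" reducers looks roughly like this (an illustrative sketch, not the actual NNCF class):

from typing import List, Tuple

import torch

class MeanVarianceReducerSketch:
    def __init__(self, reduction_axes: Tuple[int, ...]):
        self._reduction_axes = reduction_axes

    def _reduce_out_of_place(self, x: List[torch.Tensor]) -> List[torch.Tensor]:
        # Only the first (and only) input tensor is used by this reducer.
        t = x[0]
        # Variance over the reduction axes, averaged into a single statistic.
        return [torch.mean(torch.var(t, dim=self._reduction_axes))]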

@kshpv requested review from ljaljushkin and alexsu52 on January 22, 2025 14:48

    @staticmethod
    def check_weights(model: torch.nn.Module, ref_ids: List[int]) -> None:
        low_precision_nodes = {f"{i}_weight" for i in ref_ids}
Contributor:

Please move the model names from the testing code into the model:

all_names = model.get_weight_names_in_exec_order()
low_precision_nodes = list(map(lambda i: all_names[i], ref_ids))
for op_name, op in model.nncf.external_op.items():
    for name in low_precision_nodes:
        if name in op_name:
            assert isinstance(op, INT4SymmetricWeightsDecompressor)

kshpv (Collaborator, Author):

done
