Add tensor caching option to InferRequestWrapper #632
Conversation
@AlexKoff88 please take a look
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```
@@ -82,21 +82,62 @@ def batch_size(self):

class InferRequestWrapper:
    def __init__(self, request, data_cache=None):
        """
        Wrapper class for OV InferRequest or CompiledModel objects that collects inputs which they were called with to
```
We introduced the `.prepare_inputs()` method for decoder models some time ago, which can be used to collect inputs (and it is actually used). @nikita-savelyevv, can you please check if we still need this workaround with patching `InferRequest`?
@AlexKoff88 as I understand, `.prepare_inputs()` is implemented only for `OVModelForCausalLM`, but this is not the only model type for which quantization is supported. For example, BERT is an instance of `OVModelForSequenceClassification`, which does not have `.prepare_inputs()` implemented, so we still have to rely on `InferRequestWrapper`.
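To illustrate the point above, here is a minimal sketch of collecting calibration inputs for a model without `.prepare_inputs()` by wrapping its request object. The import path, the `model.request` attribute, and the example model ID are assumptions for illustration, not exact code from this PR:

```python
# Hedged sketch: collect calibration inputs for OVModelForSequenceClassification
# by wrapping its request object (attribute name and import path are assumptions).
from transformers import AutoTokenizer
from optimum.intel import OVModelForSequenceClassification
from optimum.intel.openvino.quantization import InferRequestWrapper  # assumed import path

model_id = "bert-base-uncased"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)

collected_inputs = []
model.request = InferRequestWrapper(model.request, collected_inputs)

for text in ["example sentence one", "example sentence two"]:
    model(**tokenizer(text, return_tensors="pt"))

# collected_inputs now holds the inputs each inference call was made with
```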
Changes:
- Use `InferRequestWrapper` with `apply_caching=True`, which lowers memory requirements for quantization. Please see more details at huggingface/optimum-intel#632.
- Embed calibration data collection into a `try - finally` block. This is helpful to unwrap the model back in case data collection fails.
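A rough sketch of what those two changes look like together, assuming the model exposes its decoder request as `ov_model.decoder_with_past.request` and that the wrapper's constructor accepts an `apply_caching` flag as described in this PR (the import path is also an assumption):

```python
# Hedged sketch of the notebook-side usage: wrap the decoder-with-past request
# with caching enabled, collect calibration data, and always unwrap afterwards.
from optimum.intel.openvino.quantization import InferRequestWrapper  # assumed import path

def collect_calibration_data(ov_model, calibration_dataset):
    """Collect decoder-with-past inputs while keeping memory use low via tensor caching."""
    calibration_data = []
    original_request = ov_model.decoder_with_past.request  # assumed attribute layout
    ov_model.decoder_with_past.request = InferRequestWrapper(
        original_request, calibration_data, apply_caching=True
    )
    try:
        for sample in calibration_dataset:
            ov_model.generate(sample["input_features"])
    finally:
        # Unwrap the model even if data collection fails part-way through
        ov_model.decoder_with_past.request = original_request
    return calibration_data
```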
What does this PR do?
Problem
The `InferRequestWrapper` object is imported and used in the whisper notebook at openvino_notebooks. After PR #577 was merged, the memory requirements for quantizing the whisper large-v2 model increased dramatically. This leads to OOM when running quantization on a decently sized calibration dataset. For example, I get OOM on a dataset of 30 calibration audio files on a machine with 125 GB of memory.

The reason is that the whisper `decoder_with_past` model has a KV input for each decoder block. There are 32 blocks, so 64 inputs in total. All these inputs accept tensors of at least 12 x 512 dimensions. Also, since a single audio input generates tens of decoder inputs, the factor is increased even more.

Solution
For that particular use case, input tensors overlap between neighboring samples, so I added a tensor cache to the `InferRequestWrapper` object. This significantly reduces memory overhead, but results in a slight performance overhead due to tensor hash computation (see the sketch below).

I disabled this option by default and now it's only used in tests. This is because of the reason above and the fact that it's not actually clear whether this has any benefits for the models supported for quantization by optimum.
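The general idea behind the cache is sketched below; this is an illustration of the deduplication-by-hashing approach, not the exact code added in this PR (class and function names here are hypothetical):

```python
# Hedged illustration of tensor caching: identical input tensors from neighboring
# samples are hashed and stored once, so cached samples share memory by reference.
import hashlib
import numpy as np

class TensorCache:
    def __init__(self):
        self._cache = {}

    def get_or_add(self, tensor: np.ndarray) -> np.ndarray:
        # Hashing raw bytes costs some time (the performance overhead mentioned above)
        # but lets equal tensors collapse into a single stored copy.
        key = hashlib.sha256(tensor.tobytes()).hexdigest()
        if key not in self._cache:
            self._cache[key] = np.copy(tensor)
        return self._cache[key]

cache = TensorCache()
data_cache = []

def collect(inputs: dict):
    # Cached samples reference shared tensors instead of holding duplicate copies.
    data_cache.append({name: cache.get_or_add(t) for name, t in inputs.items()})
```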
Please note that quantization of Seq2Seq models, for which this fix is added, is not yet supported; rather, `InferRequestWrapper` is imported elsewhere for manual quantization. Despite this, I still think that this functionality may become useful in the future, for example when we support quantization of Seq2Seq models directly through optimum.

I also added docs for the `InferRequestWrapper` class.

Related ticket
136705
Before submitting