
Add tensor caching option to InferRequestWrapper #632

Merged

Conversation

nikita-savelyevv
Collaborator

@nikita-savelyevv commented Mar 25, 2024

What does this PR do?

Problem
InferRequestWrapper object is imported and used in whisper notebook at openvino_notebooks.

After PR #577 was merged, the memory requirements for quantizing the whisper large-v2 model increased dramatically. This leads to OOM when running quantization on a decent-sized calibration dataset. For example, I get OOM on a dataset of 30 calibration audio files on a machine with 125 GB of RAM.

The reason is that the whisper decoder_with_past model has key and value inputs for each decoder block. There are 32 blocks, so 64 inputs in total. All of these inputs accept tensors with at least 12 x 512 elements. Moreover, since a single audio sample generates tens of decoder calls, the multiplication factor grows even further.
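As a rough back-of-the-envelope illustration using the numbers above (the calls-per-sample count and element size below are illustrative assumptions; the real tensors also carry batch and growing sequence-length dimensions, so actual usage is far higher):

```python
# Lower-bound estimate of memory needed to cache KV inputs (illustrative numbers only).
kv_inputs_per_call = 64      # 32 decoder blocks x (key + value)
elems_per_tensor = 12 * 512  # lower bound stated above; real tensors grow with sequence length
bytes_per_elem = 4           # assuming float32
calls_per_sample = 30        # assumption: "tens of decoder calls" per audio file
num_samples = 30             # calibration audio files

total_gib = (kv_inputs_per_call * elems_per_tensor * bytes_per_elem
             * calls_per_sample * num_samples) / 2**30
print(f"Lower bound: ~{total_gib:.2f} GiB of duplicated inputs")  # already >1 GiB before sequence growth
```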

Solution
For that particular use case, input tensors overlap between neighboring samples, so I added a tensor cache to the InferRequestWrapper object. This significantly reduces the memory overhead, but introduces a slight performance overhead due to tensor hash computation.
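A minimal sketch of the caching idea, not the exact code from the PR (the class and method names here are illustrative): hash each incoming tensor's bytes and reuse an already stored copy when it matches, so duplicated KV inputs are kept in memory only once.

```python
import hashlib

import numpy as np


class TensorCache:
    """Stores tensors deduplicated by a content hash."""

    def __init__(self):
        self._storage = {}  # digest -> stored ndarray

    def add(self, tensor: np.ndarray) -> np.ndarray:
        # Hashing is the source of the small runtime overhead mentioned above.
        digest = hashlib.sha256(tensor.tobytes()).hexdigest()
        cached = self._storage.get(digest)
        if cached is not None and np.array_equal(cached, tensor):
            return cached  # reuse the existing copy instead of storing a duplicate
        self._storage[digest] = tensor
        return tensor
```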

I disabled this option by default, and it is currently only used in tests. This is because of the overhead mentioned above, and because it is not actually clear whether caching benefits any of the models for which optimum supports quantization. Please note that quantization of the Seq2Seq models this fix targets is not yet supported; instead, InferRequestWrapper is imported elsewhere for manual quantization. Despite this, I still think this functionality may become useful in the future, for example once we support quantization of Seq2Seq models directly through optimum.

I also added docs for the InferRequestWrapper class.

Related ticket
136705

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@nikita-savelyevv
Collaborator Author

@AlexKoff88 please take a look

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -82,21 +82,62 @@ def batch_size(self):


class InferRequestWrapper:
def __init__(self, request, data_cache=None):
"""
Wrapper class for OV InferRequest or CompiledModel objects that collects inputs which they were called with to a list.
@AlexKoff88
Collaborator

We introduced the .prepare_inputs() method for decoder models some time ago; it can be used to collect inputs (and it actually is used). @nikita-savelyevv, can you please check whether we still need this workaround with patching InferRequest?

@nikita-savelyevv
Collaborator Author

@AlexKoff88 as I understand it, .prepare_inputs() is implemented only for OVModelForCausalLM, but that is not the only model type for which quantization is supported. For example, BERT is an instance of OVModelForSequenceClassification, which does not have .prepare_inputs() implemented, so we still have to rely on InferRequestWrapper.
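For context, a rough sketch of the patching workaround being discussed, e.g. for a model without .prepare_inputs() (the model and dataset variables are placeholders, and the import path is the one used by the whisper notebook; the constructor signature follows the diff above):

```python
from optimum.intel.openvino.quantization import InferRequestWrapper

calibration_data = []
# ov_model: placeholder for e.g. an OVModelForSequenceClassification instance;
# calibration_dataset: placeholder iterable of tokenized samples.
original_request = ov_model.request
ov_model.request = InferRequestWrapper(original_request, calibration_data)
try:
    for sample in calibration_dataset:
        ov_model(**sample)  # every call's inputs are appended to calibration_data
finally:
    ov_model.request = original_request  # restore the real request afterwards
```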

@echarlaix merged commit 10ac43e into huggingface:main Mar 28, 2024
10 checks passed
eaidova pushed a commit to openvinotoolkit/openvino_notebooks that referenced this pull request Mar 29, 2024
Changes:

- Use `InferRequestWrapper` with `apply_caching=True`, which lowers the memory requirements for quantization. Please see more details at huggingface/optimum-intel#632.
- Embed calibration data collection into a `try`/`finally` block. This helps unwrap the model back in case data collection fails. A condensed sketch of the resulting flow follows this list.
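A condensed sketch of what that notebook change looks like (variable names and the dataset handling are illustrative; `ov_model` stands for a whisper speech-to-text OV model with encoder and decoder_with_past submodels, and `apply_caching` follows the option added in this PR):

```python
from optimum.intel.openvino.quantization import InferRequestWrapper

encoder_data, decoder_data = [], []
original_encoder_request = ov_model.encoder.request
original_decoder_request = ov_model.decoder_with_past.request

# Wrap both submodels; apply_caching=True deduplicates the overlapping KV tensors.
ov_model.encoder.request = InferRequestWrapper(
    original_encoder_request, encoder_data, apply_caching=True
)
ov_model.decoder_with_past.request = InferRequestWrapper(
    original_decoder_request, decoder_data, apply_caching=True
)
try:
    for sample in calibration_dataset:  # placeholder iterable of preprocessed audio features
        ov_model.generate(sample["input_features"])
finally:
    # Unwrap even if collection fails so the model keeps working afterwards.
    ov_model.encoder.request = original_encoder_request
    ov_model.decoder_with_past.request = original_decoder_request
```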