Add tensor caching option to InferRequestWrapper #632
Conversation
@AlexKoff88 please take a look
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```
@@ -82,21 +82,62 @@ def batch_size(self):

class InferRequestWrapper:
    def __init__(self, request, data_cache=None):
        """
        Wrapper class for OV InferRequest or CompiledModel objects that collects inputs which they were called with to
```
We introduced the `.prepare_inputs()` method for decoder models some time ago, which can be used to collect inputs (and it is actually used). @nikita-savelyevv, can you please check if we still need this workaround with patching `InferRequest`?
@AlexKoff88 as I understand, `.prepare_inputs()` is implemented only for `OVModelForCausalLM`, but this is not the only model type for which quantization is supported. For example, BERT is an instance of `OVModelForSequenceClassification`, which does not have `.prepare_inputs()` implemented, so we still have to rely on `InferRequestWrapper`.
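To illustrate the point above, here is a minimal sketch of collecting calibration inputs for a model without `.prepare_inputs()` by wrapping its request object. The import path, the `model.request` attribute, and the example model ID are assumptions for illustration, not exact code from this PR:

```python
# Hedged sketch: collect calibration inputs for OVModelForSequenceClassification
# by wrapping its request object (attribute name and import path are assumptions).
from transformers import AutoTokenizer
from optimum.intel import OVModelForSequenceClassification
from optimum.intel.openvino.quantization import InferRequestWrapper  # assumed import path

model_id = "bert-base-uncased"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)

collected_inputs = []
model.request = InferRequestWrapper(model.request, collected_inputs)

for text in ["example sentence one", "example sentence two"]:
    model(**tokenizer(text, return_tensors="pt"))

# collected_inputs now holds the inputs each inference call was made with
```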
Changes:
- Use `InferRequestWrapper` with `apply_caching=True`, which lowers memory requirements for quantization. Please see more details at huggingface/optimum-intel#632.
- Embed calibration data collection into a `try - finally` block. This is helpful to unwrap the model back in case data collection fails.
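A rough sketch of what those two changes look like together, assuming the model exposes its decoder request as `ov_model.decoder_with_past.request` and that the wrapper's constructor accepts an `apply_caching` flag as described in this PR (the import path is also an assumption):

```python
# Hedged sketch of the notebook-side usage: wrap the decoder-with-past request
# with caching enabled, collect calibration data, and always unwrap afterwards.
from optimum.intel.openvino.quantization import InferRequestWrapper  # assumed import path

def collect_calibration_data(ov_model, calibration_dataset):
    """Collect decoder-with-past inputs while keeping memory use low via tensor caching."""
    calibration_data = []
    original_request = ov_model.decoder_with_past.request  # assumed attribute layout
    ov_model.decoder_with_past.request = InferRequestWrapper(
        original_request, calibration_data, apply_caching=True
    )
    try:
        for sample in calibration_dataset:
            ov_model.generate(sample["input_features"])
    finally:
        # Unwrap the model even if data collection fails part-way through
        ov_model.decoder_with_past.request = original_request
    return calibration_data
```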
What does this PR do?
Problem
The `InferRequestWrapper` object is imported and used in the whisper notebook at openvino_notebooks. After PR #577 was merged, the memory requirements for quantizing the whisper large-v2 model increased dramatically. This leads to OOM when running quantization on a decently sized calibration dataset. For example, I get OOM on a dataset of 30 calibration audio files on a machine with 125 GB of memory.

The reason is that the whisper `decoder_with_past` model has a KV input for each decoder block. There are 32 blocks, so 64 inputs in total. All these inputs accept tensors of at least 12 x 512 dimensions. Also, since a single audio input generates tens of decoder inputs, the factor is increased even more.

Solution
For that particular use case, input tensors overlap between neighboring samples, so I added a tensor cache to the `InferRequestWrapper` object. This significantly reduces memory overhead, but results in a slight performance overhead due to tensor hash computation (see the sketch below).

I disabled this option by default and now it's only used in tests. This is because of the reason above and the fact that it's not actually clear whether this has any benefits for the models supported for quantization by optimum.
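The general idea behind the cache is sketched below; this is an illustration of the deduplication-by-hashing approach, not the exact code added in this PR (class and function names here are hypothetical):

```python
# Hedged illustration of tensor caching: identical input tensors from neighboring
# samples are hashed and stored once, so cached samples share memory by reference.
import hashlib
import numpy as np

class TensorCache:
    def __init__(self):
        self._cache = {}

    def get_or_add(self, tensor: np.ndarray) -> np.ndarray:
        # Hashing raw bytes costs some time (the performance overhead mentioned above)
        # but lets equal tensors collapse into a single stored copy.
        key = hashlib.sha256(tensor.tobytes()).hexdigest()
        if key not in self._cache:
            self._cache[key] = np.copy(tensor)
        return self._cache[key]

cache = TensorCache()
data_cache = []

def collect(inputs: dict):
    # Cached samples reference shared tensors instead of holding duplicate copies.
    data_cache.append({name: cache.get_or_add(t) for name, t in inputs.items()})
```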
Please note that quantization of Seq2Seq models, for which this fix is added, is not yet supported; rather, `InferRequestWrapper` is imported elsewhere for manual quantization. Despite this, I still think that this functionality may become useful in the future, for example when we support quantization of Seq2Seq models directly through optimum.

I also added docs for the `InferRequestWrapper` class.

Related ticket
136705
Before submitting