[OV]: load and convert llms in original precision #778
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 02528c5 to 6a2e507
dtype is None
and framework == "pt"
and not do_gptq_patching
and task.startswith("text-generation")
why only for text-generation tasks?
In the future we will propagate this to other model types. text-generation models (especially LLMs) in most cases suffer from inefficient weight conversion: weights saved in float16/bfloat16 are implicitly converted to float32 when the from_pretrained method is called. The idea behind these changes is to allow loading the model as-is, which avoids that conversion (it reduces the memory footprint, e.g. ~13.5GB RAM instead of ~27GB for 7b models) and also speeds up conversion of some operations with weights, like linear or matmul, by converting them directly to ov without going through tracing.
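For illustration only (not the PR code): with transformers alone, the default from_pretrained call upcasts a bf16/fp16 checkpoint to float32, while requesting the original dtype keeps the stored precision. The model id below is a placeholder; substitute any checkpoint saved in 16-bit.

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder: any bf16/fp16 checkpoint

# Default: weights are implicitly converted to float32 on load (~2x the checkpoint size in RAM).
model_fp32 = AutoModelForCausalLM.from_pretrained(model_id)
print(next(model_fp32.parameters()).dtype)  # torch.float32

# Original precision: keep the dtype stored in the checkpoint, roughly halving RAM use.
model_orig = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
print(next(model_orig.parameters()).dtype)  # e.g. torch.bfloat16
```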
@@ -361,6 +391,7 @@ class StoreAttr(object):
     preprocessors=preprocessors,
     device=device,
     trust_remote_code=trust_remote_code,
+    patch_16bit_model=patch_16bit,
could we instead update and use ov_config.dtype instead of adding patch_16bit_model?
Layer patching is a time-consuming process, and we would like to avoid it in cases where it is not really necessary. We need it only for PyTorch model weights represented in fp16 or bf16 format, to avoid tracing issues on CPU (torchscript does not support some ops running on CPU in these formats). ov_config.dtype is currently used for the output model type and as a signal to compress weights to fp16 after conversion; mixing both meanings in one parameter may lead to some issues:
- inability to convert fp32 pytorch models to an fp16 openvino model (which is possible now)
- compressing model weights twice (having the model in fp16 on the torch side and then repeating the compression for openvino)
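As a rough sketch of why keeping the two settings separate matters (assumed OpenVINO/transformers API usage, not the PR implementation): the torch-side dtype only decides whether 16-bit patching is needed for tracing, while fp16 compression is applied to the exported IR at save time, so an fp32 PyTorch model can still produce an fp16 IR and already-16-bit weights are not compressed twice.

```python
import torch
import openvino as ov
from transformers import AutoModelForCausalLM

# fp32 checkpoint -> no 16-bit patching is needed for tracing on CPU.
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.config.return_dict = False  # plain tuple outputs trace more cleanly

# Trace/convert the PyTorch model to an OpenVINO model.
ov_model = ov.convert_model(model, example_input=torch.ones(1, 8, dtype=torch.long))

# fp16 compression is a property of the saved IR, applied independently
# of the torch dtype the model was traced in.
ov.save_model(ov_model, "gpt2.xml", compress_to_fp16=True)
```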
Looks great, thanks a lot @eaidova
@echarlaix thanks, we are still investigating the impact on model accuracy and quantization on our side. Could you please not merge these changes until we have the whole picture?
Force-pushed from d0e188f to d8ea169
Force-pushed from 8625e36 to 431f815
Force-pushed from a77b351 to afc5dc7
Force-pushed from afc5dc7 to 079810e
@IlyasMoutawwakil could you please merge?
What does this PR do?
Allow loading bfloat16 and float16 models in their original precision for conversion. This significantly reduces memory consumption and loading time during model conversion for large models.
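For instance (placeholder model id, a sketch rather than a test from the PR), the standard export entry point should benefit without any extra flags, assuming it goes through the updated conversion path:

```python
from optimum.intel import OVModelForCausalLM

# Placeholder model id; with this change a bf16/fp16 checkpoint is loaded as-is
# before conversion instead of being expanded to float32 first.
model = OVModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", export=True)
model.save_pretrained("llama2_7b_ov")
```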
Fixes # (issue)
Before submitting