[OV]: load and convert llms in original precision #778
Why only for text-generation tasks?
In the future we will propagate this to other model types. Text-generation models (especially LLMs) most often suffer from ineffective weight conversion: when we load weights saved in float16/bfloat16, they are implicitly converted to float32 by the `from_pretrained` method. The idea behind these changes is to load the model as-is and avoid that conversion, which reduces the memory footprint (e.g. instead of ~27 GB of RAM for a 7B model you need only ~13.5 GB), and also to speed up the conversion of some weight-bearing operations like linear or matmul by converting them directly to OpenVINO without going through tracing.
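For illustration, a minimal sketch of the loading behavior described above, using the standard `transformers` API (the model id is a placeholder, and this is not the PR's own code):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: any 7B fp16/bf16 checkpoint

# Default behavior: fp16/bf16 weights are implicitly upcast to float32
# (~27 GB of RAM for a 7B model).
model_fp32 = AutoModelForCausalLM.from_pretrained(model_id)

# Loading in the checkpoint's original dtype avoids the upcast (~13.5 GB).
model_16bit = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```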
Could we instead update and use `ov_config.dtype` instead of adding `patch_16bit_model`?
Layer patching is a time-consuming process, and we would like to avoid it in cases where it is not really necessary. We need it only for PyTorch model weights represented in fp16 or bf16 format, to avoid tracing issues on CPU (TorchScript does not support some ops running on CPU in these formats). `ov_config.dtype` is currently used for the output model type and as a signal to compress weights to fp16 after conversion; mixing the two concerns into a single parameter could therefore introduce issues.
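To make the gating concrete, a hypothetical sketch of when the patching described above would run. `patch_16bit_model` is the helper added by this PR (its internals are not shown and its import path is assumed); `maybe_patch_for_tracing` is an illustrative name, not the PR's API:

```python
import torch
from optimum.exporters.openvino.convert import patch_16bit_model  # import path assumed

def maybe_patch_for_tracing(model: torch.nn.Module) -> torch.nn.Module:
    """Patch only 16-bit models; fp32 models trace fine on CPU."""
    dtype = next(model.parameters()).dtype
    if dtype in (torch.float16, torch.bfloat16):
        # TorchScript lacks CPU support for some ops in these dtypes,
        # so the affected layers are patched before tracing/export.
        patch_16bit_model(model)
    return model
```

Keeping this check separate from `ov_config.dtype` means the (expensive) patching only happens when the loaded weights actually require it, while the OV output dtype remains an independent setting.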