
[OV]: load and convert llms in original precision #778

Merged
merged 9 commits into huggingface:main from ea/bf16_fp16_llms on Aug 19, 2024

Conversation

Collaborator

@eaidova eaidova commented Jun 24, 2024

What does this PR do?

Allow loading bfloat16 and float16 models in their original precision for conversion. This significantly reduces memory consumption and loading time during model conversion for large models.
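
For context, a minimal sketch (not code from this PR; the model id is just an example) of the user-facing export path whose loading behaviour this change affects:

```python
from optimum.intel import OVModelForCausalLM

# Export a text-generation model to OpenVINO IR. With this change, fp16/bf16
# checkpoints are loaded in their stored precision during conversion instead of
# being implicitly upcast to float32.
model = OVModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", export=True)
model.save_pretrained("llama-2-7b-openvino")
```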

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@eaidova eaidova changed the title from Ea/bf16 fp16 llms to [OV]: load and convert llms in original precision Jun 24, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@eaidova eaidova force-pushed the ea/bf16_fp16_llms branch 3 times, most recently from 02528c5 to 6a2e507 Compare June 25, 2024 15:52
@eaidova eaidova marked this pull request as ready for review June 25, 2024 16:17
dtype is None
and framework == "pt"
and not do_gptq_patching
and task.startswith("text-generation")
Collaborator


Why only for text-generation tasks?

Collaborator Author

@eaidova eaidova Jul 1, 2024


In the future we will propagate this to other model types. text-generation models (especially LLMs) in most cases suffer from inefficient weights conversion: when we load weights saved in float16/bfloat16, they are implicitly converted to float32 by the from_pretrained call. The idea behind these changes is to allow loading the model as-is to avoid that weights conversion (which reduces the memory footprint, e.g. instead of ~27GB RAM for a 7b model you need only ~13.5GB), and also to speed up conversion of some operations with weights, like linear or matmul, by converting them directly to OpenVINO without going through tracing.
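
As a rough illustration of the upcast being avoided (plain transformers, outside this PR; the model id is an example), torch_dtype="auto" keeps the checkpoint's stored precision:

```python
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"  # example fp16/bf16 checkpoint

# Default behaviour: weights stored in fp16/bf16 are upcast to float32 on load
# (~27 GB RAM for a 7B model).
model_fp32 = AutoModelForCausalLM.from_pretrained(model_id)
print(next(model_fp32.parameters()).dtype)  # torch.float32

# Loading in the original precision avoids the upcast (~13.5 GB for a 7B model).
model_native = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
print(next(model_native.parameters()).dtype)  # torch.float16 or torch.bfloat16
```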

@@ -361,6 +391,7 @@ class StoreAttr(object):
preprocessors=preprocessors,
device=device,
trust_remote_code=trust_remote_code,
patch_16bit_model=patch_16bit,
Collaborator


Could we update and use ov_config.dtype instead of adding patch_16bit_model?

Collaborator Author


Layers patching is a time-consuming process, and we would like to avoid doing it where it is not really necessary. We need it only for PyTorch model weights represented in fp16 or bf16 format, to avoid issues when tracing on CPU (TorchScript does not support some ops running on CPU in these formats); a rough illustration follows below. ov_config.dtype is currently used for the output model type and as a signal to compress weights to fp16 after conversion, so mixing the two in one parameter may lead to some issues:

  • inability to convert fp32 PyTorch models to fp16 OpenVINO models (which is possible now)
  • compressing model weights twice (having the model in fp16 on the torch side and then repeating the compression for OpenVINO)
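
As referenced above, a minimal sketch of why half-precision weights are patched to fp32 before tracing on CPU. The helper name here is hypothetical and this is illustrative only, not the implementation added in this PR:

```python
import torch


def patch_16bit_weights_for_tracing(model: torch.nn.Module) -> torch.nn.Module:
    """Hypothetical helper: upcast fp16/bf16 parameters to fp32 so that
    torch.jit.trace can run the forward pass on CPU, where some kernels
    (e.g. certain linear/matmul paths) may not support half precision."""
    has_16bit = any(p.dtype in (torch.float16, torch.bfloat16) for p in model.parameters())
    return model.float() if has_16bit else model


# Usage sketch: trace a tiny bf16 module after patching it to fp32.
toy = torch.nn.Linear(4, 4).to(torch.bfloat16)
example_input = torch.randn(1, 4)  # fp32 example input matches the patched model
traced = torch.jit.trace(patch_16bit_weights_for_tracing(toy), example_input)
```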

Collaborator

@echarlaix echarlaix left a comment


Looks great, thanks a lot @eaidova

Collaborator Author

eaidova commented Jul 1, 2024

Looks great, thanks a lot @eaidova

@echarlaix thanks, we are still investigating the impact on model accuracy and quantization on our side. Could you please not merge these changes until we have the whole picture?

@eaidova eaidova force-pushed the ea/bf16_fp16_llms branch from d0e188f to d8ea169 Compare July 19, 2024 07:32
@eaidova eaidova force-pushed the ea/bf16_fp16_llms branch 2 times, most recently from 8625e36 to 431f815 Compare August 1, 2024 14:31
@eaidova eaidova force-pushed the ea/bf16_fp16_llms branch 3 times, most recently from a77b351 to afc5dc7 Compare August 13, 2024 12:44
@eaidova eaidova force-pushed the ea/bf16_fp16_llms branch from afc5dc7 to 079810e Compare August 13, 2024 12:51
Collaborator Author

eaidova commented Aug 19, 2024

@IlyasMoutawwakil could you please merge?

@IlyasMoutawwakil IlyasMoutawwakil merged commit e9800ce into huggingface:main Aug 19, 2024
15 of 17 checks passed
@echarlaix echarlaix mentioned this pull request Oct 14, 2024