Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unify xpu and cpu backend and use paged attention #1009

Merged
merged 32 commits into from
Dec 5, 2024
Merged

Conversation

sywangyi
Copy link
Collaborator

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

sywangyi and others added 14 commits October 8, 2024 22:57
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* refine class IPEXPagedCache's update method

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* replace tensor on xpu to List to avoid memory copy

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* split IPEXPagedCache's update function into `update_for_prefill` and `update_for_decode`

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
* enable qkv

* split key value into 2 lists
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
#979)

* enable gpt2, falcon has core dump error in PagedAttention.single_query_cached_kv_attention

* enable new_decoder_arch falcon

* only keep 1 config

* rm autocast
* fix bug when run IPEXCausalModel forward directly; fix bug when using `save_pretrain`

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* add LinearGelu Op support for XPU

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* fix unit test error

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* adjust unit test case

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* fix bug

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
* skip assited decoding unit test for models using paged attention

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* XPU CI tests get almost all passed

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@sywangyi sywangyi changed the title Paged attn unify xpu and cpu backend and use paged attention Nov 22, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* fix ci config

* fix test versions

* fix ipex version

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@sywangyi sywangyi marked this pull request as draft November 22, 2024 01:34
* use python3.9 test

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@sywangyi sywangyi marked this pull request as ready for review November 22, 2024 03:00
jiqing-feng and others added 2 commits November 22, 2024 13:11
* change ipex transformers limited verison in setup
* fix inc tests

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@yao-matrix
Copy link

@IlyasMoutawwakil @echarlaix , pls help review, we can also have a meeting to review it if needed. Thx.

* fix bert and vit patch
* fix vit and bert save


Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@IlyasMoutawwakil
Copy link
Member

@yao-matrix reviewing right now

@jiqing-feng
Copy link
Collaborator

Hi @IlyasMoutawwakil , please also merge this PR #1024. Thanks!

@jiqing-feng
Copy link
Collaborator

Hi @IlyasMoutawwakil . I have replied and fixed your comments, please take the 2nd round review. Thanks~

* simplify forward and save pretrained since no jit support

* fix format

* rm warmup because no jit mode anymore

* simplify forward for causal lm model

* fix paged pkv  forward

* disable use_cache when just run forward

---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

if isinstance(model, torch.jit.RecursiveScriptModule):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TorchScript models will not be compatible anymore which is an important breaking change, we need to catch this to inform users

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also we need to update the documentation

For now, support is only enabled for CPUs and the original model will be exported via TorchScript. In the future `torch.compile` will be used and model exported via TorchScript will get deprecated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


return cls(model, config=config, model_save_dir=model_save_dir, **kwargs)
task = cls.export_feature
model = TasksManager.get_model_from_task(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use cls.auto_model_class ?

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
* nice code
* device type adjustment

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@changwangss changwangss mentioned this pull request Nov 27, 2024
3 tasks
* enable compile for non-generation tasks
* add no_grad in forward
* warmup compiled model
* disable compile not ready models
* set system level optimize for torch.compile
* fix typo
* add comments
* set torch minimum version for compiling

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@jiqing-feng
Copy link
Collaborator

Hi @echarlaix @IlyasMoutawwakil , please review the new changes. Thanks

)
return TSModelForCausalLM.from_pretrained(model_id, **kwargs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An instance of TSModelForCausalLM will be created for all IPEXModel (even for encoder models) which doesn't really make sense to me. Also it's not tested anywhere from what I see, I prefer to raise an error here instead of keeping support that we're not sure works / is compatible with the previous integration

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

tests/ipex/test_modeling.py Show resolved Hide resolved

return cls(model, config=config, model_save_dir=model_save_dir, **kwargs)
model = cls.auto_model_class.from_pretrained(model_id, **kwargs)
return cls(model, config=model.config, export=True, **kwargs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would export be needed ?

Suggested change
return cls(model, config=model.config, export=True, **kwargs)
return cls(model, config=model.config, **kwargs)

Copy link
Collaborator

@jiqing-feng jiqing-feng Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.


if isinstance(model, torch.jit.RecursiveScriptModule):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also we need to update the documentation

For now, support is only enabled for CPUs and the original model will be exported via TorchScript. In the future `torch.compile` will be used and model exported via TorchScript will get deprecated.

sywangyi and others added 2 commits December 2, 2024 10:11
* fix readme and push to hub support

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* rm export in tests

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* test with torch 2.5.*

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@jiqing-feng
Copy link
Collaborator

Hi @echarlaix @IlyasMoutawwakil . Please review the new changes.

You can load your model and apply IPEX optimizations (including weight prepacking and graph mode). For supported architectures like LLaMA, BERT and ViT, further optimizations will be applied by patching the model to use custom operators.
For now, support is only enabled for CPUs and the original model will be exported via TorchScript. In the future `torch.compile` will be used and model exported via TorchScript will get deprecated.
You can load your model and apply IPEX optimizations (apply torch.compile for non-generation tasks). For supported architectures like LLaMA, BERT and ViT, further optimizations will be applied by patching the model to use custom operators.
For now, support is enabled for Intel CPU/GPU. The TorchScript is deprecated.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For now, support is enabled for Intel CPU/GPU. The TorchScript is deprecated.
For now, support is enabled for Intel CPU/GPU. Previous models converted to TorchScript will be deprecated in v1.22.

tests/ipex/test_modeling.py Show resolved Hide resolved
Comment on lines 135 to 137
dtype = torch.float32
if IS_XPU:
dtype = torch.float16
Copy link
Collaborator

@echarlaix echarlaix Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dtype = torch.float32
if IS_XPU:
dtype = torch.float16
dtype = torch.float16 if IS_XPU_AVAILABLE else torch.float32



IS_XPU = is_torch_xpu_available(check_device=True)
Copy link
Collaborator

@echarlaix echarlaix Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
IS_XPU = is_torch_xpu_available(check_device=True)
IS_XPU_AVAILABLE = is_torch_xpu_available(check_device=True)

@@ -56,7 +59,6 @@ class PipelinesIntegrationTest(unittest.TestCase):
"gpt2",
"gpt_neo",
"gpt_neox",
"llama",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not keep it ?

Comment on lines 497 to 498
@unittest.skipIf(is_ipex_version("<", "2.3.0"), reason="Only ipex version > 2.3.0 supports ipex model patching")
def test_patched_model(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we keep this test ? (using a new model)

if IS_XPU:
dtype = torch.float16
# Test model forward do not need cache.
ipex_model = IPEXModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, use_cache=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to test default here :

Suggested change
ipex_model = IPEXModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, use_cache=False)
ipex_model = IPEXModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)

Comment on lines 221 to 224
"llama",
"llama2",
# "phi",
"distilgpt2",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove llama and distilgpt2 test ?

* fix tests
* fix typo
* add patched tests

* change forward to generate

* fix tests

* fix test model name


---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@jiqing-feng
Copy link
Collaborator

jiqing-feng commented Dec 3, 2024

Hi @echarlaix. I have fixed all your comments, please take a review.

For the change of using generate instead of forward in causal lm tests, we need to pass a cache_class (IPEXPagedCache) in forward just like StaticCache. Otherwise, the past_key_values.update will raise an error because past_key_values is None.

The only way to support forward without pask_key_value in the inputs when use_cache=True is to create a cache class in forward, but it's not reasonable because generate already created the cache class. I would like to hear your opinion. Thanks!!

@jiqing-feng
Copy link
Collaborator

Hi @echarlaix . We plan to merge this PR this year which means we need the review done before Xmas. Appreciate it if you could prioritize this PR. Thanks!

@IlyasMoutawwakil
Copy link
Member

For the change of using generate instead of forward in causal lm tests, we need to pass a cache_class (IPEXPagedCache) in forward just like StaticCache. Otherwise, the past_key_values.update will raise an error because past_key_values is None.

I'm pretty sure all calls to past_key_values.update in the forward are guarded with an if past_key_values is not None.

@jiqing-feng
Copy link
Collaborator

For the change of using generate instead of forward in causal lm tests, we need to pass a cache_class (IPEXPagedCache) in forward just like StaticCache. Otherwise, the past_key_values.update will raise an error because past_key_values is None.

I'm pretty sure all calls to past_key_values.update in the forward are guarded with an if past_key_values is not None.

Right, I will follow it to see if it could work in ipex patching models. Thanks.

@sywangyi
Copy link
Collaborator Author

sywangyi commented Dec 5, 2024

For the change of using generate instead of forward in causal lm tests, we need to pass a cache_class (IPEXPagedCache) in forward just like StaticCache. Otherwise, the past_key_values.update will raise an error because past_key_values is None.

I'm pretty sure all calls to past_key_values.update in the forward are guarded with an if past_key_values is not None.

because if past_key_values is None and use_cache = True. DynamicCache will be created in modeling. see
https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L863, as an example. we do not maintain dynamicCache logic in optimum-intel ipex modeling now. ipex modeling only support pagedCache.

* fix forward without pkv
* patch gpt2 block forward
* fix typo
* revert causal lm tests

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@jiqing-feng
Copy link
Collaborator

Hi @sywangyi . The use_cache is from generation_config, which means it's just a parameter for generation and should not block the model forward without past_key_values. We can only guarantee the inputs have a cache in generation if use_cache=True, but not for just calling forward.

Hi @IlyasMoutawwakil . I have fixed it by your comment, please check if there are any changes required before merging. Thanks!

@IlyasMoutawwakil IlyasMoutawwakil merged commit 41f0a46 into main Dec 5, 2024
37 of 45 checks passed
@IlyasMoutawwakil IlyasMoutawwakil deleted the paged_attn branch December 5, 2024 08:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants