-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
unify xpu and cpu backend and use paged attention #1009
Conversation
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* refine class IPEXPagedCache's update method Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> * replace tensor on xpu to List to avoid memory copy Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> * split IPEXPagedCache's update function into `update_for_prefill` and `update_for_decode` Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> --------- Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
* enable qkv * split key value into 2 lists
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
#979) * enable gpt2, falcon has core dump error in PagedAttention.single_query_cached_kv_attention * enable new_decoder_arch falcon * only keep 1 config * rm autocast
* fix bug when run IPEXCausalModel forward directly; fix bug when using `save_pretrain` Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> * add LinearGelu Op support for XPU Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> * fix unit test error Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> * adjust unit test case Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> * fix bug Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> --------- Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
* skip assited decoding unit test for models using paged attention Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> * XPU CI tests get almost all passed Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> --------- Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* fix ci config * fix test versions * fix ipex version Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* use python3.9 test Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* change ipex transformers limited verison in setup * fix inc tests Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@IlyasMoutawwakil @echarlaix , pls help review, we can also have a meeting to review it if needed. Thx. |
* fix bert and vit patch * fix vit and bert save Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@yao-matrix reviewing right now |
Hi @IlyasMoutawwakil , please also merge this PR #1024. Thanks! |
Hi @IlyasMoutawwakil . I have replied and fixed your comments, please take the 2nd round review. Thanks~ |
* simplify forward and save pretrained since no jit support * fix format * rm warmup because no jit mode anymore * simplify forward for causal lm model * fix paged pkv forward * disable use_cache when just run forward --------- Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
|
||
if isinstance(model, torch.jit.RecursiveScriptModule): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TorchScript models will not be compatible anymore which is an important breaking change, we need to catch this to inform users
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also we need to update the documentation
For now, support is only enabled for CPUs and the original model will be exported via TorchScript. In the future `torch.compile` will be used and model exported via TorchScript will get deprecated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
optimum/intel/ipex/modeling_base.py
Outdated
|
||
return cls(model, config=config, model_save_dir=model_save_dir, **kwargs) | ||
task = cls.export_feature | ||
model = TasksManager.get_model_from_task( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use cls.auto_model_class
?
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
* nice code * device type adjustment Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
* enable compile for non-generation tasks * add no_grad in forward * warmup compiled model * disable compile not ready models * set system level optimize for torch.compile * fix typo * add comments * set torch minimum version for compiling Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Hi @echarlaix @IlyasMoutawwakil , please review the new changes. Thanks |
optimum/intel/ipex/modeling_base.py
Outdated
) | ||
return TSModelForCausalLM.from_pretrained(model_id, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An instance of TSModelForCausalLM
will be created for all IPEXModel
(even for encoder models) which doesn't really make sense to me. Also it's not tested anywhere from what I see, I prefer to raise an error here instead of keeping support that we're not sure works / is compatible with the previous integration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
optimum/intel/ipex/modeling_base.py
Outdated
|
||
return cls(model, config=config, model_save_dir=model_save_dir, **kwargs) | ||
model = cls.auto_model_class.from_pretrained(model_id, **kwargs) | ||
return cls(model, config=model.config, export=True, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would export be needed ?
return cls(model, config=model.config, export=True, **kwargs) | |
return cls(model, config=model.config, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
|
||
if isinstance(model, torch.jit.RecursiveScriptModule): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also we need to update the documentation
For now, support is only enabled for CPUs and the original model will be exported via TorchScript. In the future `torch.compile` will be used and model exported via TorchScript will get deprecated. |
* fix readme and push to hub support Signed-off-by: jiqing-feng <jiqing.feng@intel.com> * rm export in tests Signed-off-by: jiqing-feng <jiqing.feng@intel.com> * test with torch 2.5.* Signed-off-by: jiqing-feng <jiqing.feng@intel.com> --------- Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Hi @echarlaix @IlyasMoutawwakil . Please review the new changes. |
docs/source/ipex/inference.mdx
Outdated
You can load your model and apply IPEX optimizations (including weight prepacking and graph mode). For supported architectures like LLaMA, BERT and ViT, further optimizations will be applied by patching the model to use custom operators. | ||
For now, support is only enabled for CPUs and the original model will be exported via TorchScript. In the future `torch.compile` will be used and model exported via TorchScript will get deprecated. | ||
You can load your model and apply IPEX optimizations (apply torch.compile for non-generation tasks). For supported architectures like LLaMA, BERT and ViT, further optimizations will be applied by patching the model to use custom operators. | ||
For now, support is enabled for Intel CPU/GPU. The TorchScript is deprecated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, support is enabled for Intel CPU/GPU. The TorchScript is deprecated. | |
For now, support is enabled for Intel CPU/GPU. Previous models converted to TorchScript will be deprecated in v1.22. |
tests/ipex/test_pipelines.py
Outdated
dtype = torch.float32 | ||
if IS_XPU: | ||
dtype = torch.float16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dtype = torch.float32 | |
if IS_XPU: | |
dtype = torch.float16 | |
dtype = torch.float16 if IS_XPU_AVAILABLE else torch.float32 |
tests/ipex/utils_tests.py
Outdated
|
||
|
||
IS_XPU = is_torch_xpu_available(check_device=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IS_XPU = is_torch_xpu_available(check_device=True) | |
IS_XPU_AVAILABLE = is_torch_xpu_available(check_device=True) |
tests/ipex/test_pipelines.py
Outdated
@@ -56,7 +59,6 @@ class PipelinesIntegrationTest(unittest.TestCase): | |||
"gpt2", | |||
"gpt_neo", | |||
"gpt_neox", | |||
"llama", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not keep it ?
tests/ipex/test_modeling.py
Outdated
@unittest.skipIf(is_ipex_version("<", "2.3.0"), reason="Only ipex version > 2.3.0 supports ipex model patching") | ||
def test_patched_model(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we keep this test ? (using a new model)
tests/ipex/test_modeling.py
Outdated
if IS_XPU: | ||
dtype = torch.float16 | ||
# Test model forward do not need cache. | ||
ipex_model = IPEXModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, use_cache=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to test default here :
ipex_model = IPEXModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, use_cache=False) | |
ipex_model = IPEXModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype) |
tests/ipex/test_modeling.py
Outdated
"llama", | ||
"llama2", | ||
# "phi", | ||
"distilgpt2", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why remove llama and distilgpt2 test ?
* fix tests * fix typo * add patched tests * change forward to generate * fix tests * fix test model name --------- Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Hi @echarlaix. I have fixed all your comments, please take a review. For the change of using generate instead of forward in causal lm tests, we need to pass a cache_class (IPEXPagedCache) in forward just like StaticCache. Otherwise, the past_key_values.update will raise an error because past_key_values is None. The only way to support forward without pask_key_value in the inputs when use_cache=True is to create a cache class in forward, but it's not reasonable because generate already created the cache class. I would like to hear your opinion. Thanks!! |
Hi @echarlaix . We plan to merge this PR this year which means we need the review done before Xmas. Appreciate it if you could prioritize this PR. Thanks! |
I'm pretty sure all calls to |
Right, I will follow it to see if it could work in ipex patching models. Thanks. |
because if past_key_values is None and use_cache = True. DynamicCache will be created in modeling. see |
* fix forward without pkv * patch gpt2 block forward * fix typo * revert causal lm tests Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Hi @sywangyi . The Hi @IlyasMoutawwakil . I have fixed it by your comment, please check if there are any changes required before merging. Thanks! |
What does this PR do?
Fixes # (issue)
Before submitting