Fix kernel cache miss and add RDNA configs #246
hyoon1 commented on Oct 25, 2024
- added Navi configurations (Related PR: add RDNA Config triton#640)
- resolved cache miss issue during flash attention calls by fixing max_seqlen_q/k to 0
@@ -795,8 +880,8 @@ def forward(
HQ=nheads_q,
HK=nheads_k,
ACTUAL_BLOCK_DMODEL=head_size,
MAX_SEQLENS_Q=max_seqlens_q,
MAX_SEQLENS_K=max_seqlens_k,
MAX_SEQLENS_Q=0,
What is the reason for zeroing the max sequence lengths?
The attention forward kernel below is called when we run the model with vLLM:
attn_fwd[grid](
However, MAX_SEQLENS_Q/K differ at every step, so each step produces a different cache key and triggers a fresh compilation of the Triton kernel, which degrades performance.
https://github.com/triton-lang/triton/blob/cf34004b8a67d290a962da166f5aa2fc66751326/python/triton/runtime/jit.py#L620
https://github.com/triton-lang/triton/blob/cf34004b8a67d290a962da166f5aa2fc66751326/python/triton/runtime/jit.py#L660
Currently, VARLEN is always set, and if you look at the kernel in vLLM, MAX_SEQLENS_Q/K are not used in that case.
def attn_fwd(
Therefore, as a workaround, we simply pass a fixed value for MAX_SEQLENS_Q/K when we call the kernel.
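To illustrate the cache-miss behavior described above, here is a minimal toy sketch. It is not Triton's implementation: `ToyJIT` and `count_compiles` are hypothetical names, and the sketch assumes that the values of compile-time constants such as MAX_SEQLENS_Q/K take part in the kernel cache key, as the comment describes.

```python
# Toy model of a JIT kernel cache. This is NOT Triton's actual code; it only
# assumes that compile-time constant values are part of the cache key.
class ToyJIT:
    def __init__(self, constexpr_names):
        self.constexpr_names = tuple(constexpr_names)
        self.cache = {}
        self.compile_count = 0

    def run(self, **kwargs):
        # Build the cache key from the values of the compile-time constants.
        key = tuple((name, kwargs[name]) for name in self.constexpr_names)
        if key not in self.cache:
            # Stands in for an expensive kernel compilation.
            self.compile_count += 1
            self.cache[key] = f"binary#{self.compile_count}"
        return self.cache[key]


def count_compiles(steps, pin_to_zero):
    """Run one toy launch per step and return how many compilations occurred."""
    jit = ToyJIT(["VARLEN", "MAX_SEQLENS_Q", "MAX_SEQLENS_K"])
    for max_q, max_k in steps:
        jit.run(
            VARLEN=True,
            MAX_SEQLENS_Q=0 if pin_to_zero else max_q,
            MAX_SEQLENS_K=0 if pin_to_zero else max_k,
        )
    return jit.compile_count


# Hypothetical per-step (max_seqlen_q, max_seqlen_k) values.
steps = [(17, 17), (1, 18), (1, 19), (1, 20)]
print(count_compiles(steps, pin_to_zero=False))  # one compile per distinct key
print(count_compiles(steps, pin_to_zero=True))   # a single compile, then cache hits
```

Pinning the value is only safe here because, per the comment above, the kernel does not read MAX_SEQLENS_Q/K when VARLEN is set.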
@@ -207,103 +209,186 @@ def _attn_fwd_inner(
return acc, l_i, m_i


@triton.autotune(
configs=[
def get_gfx_version():
This seems to be unused, right?
Right. Removed it.
).arch in ('gfx940', 'gfx941', 'gfx942', 'gfx90a', 'gfx908')


def is_rdna():
It is probably worth using the existing helper:
Line 1620 in 8f3bf8b
def is_navi() -> bool:
updated
return triton.runtime.driver.active.get_current_target().backend == "hip"


def is_cdna():
As far as I know, AMD has two hardware lines for vLLM: MI and Navi. So a "not Navi" check should work better for future generations of MI.
updated
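The difference between an explicit MI allowlist and a "not Navi" check can be sketched as follows. This is an illustration only: the helpers take the architecture string as an argument (unlike the `is_navi() -> bool` helper referenced above), the `gfx10`/`gfx11` prefix rule for Navi is an assumption, and `gfx9xx_future` is a made-up name for a future MI part.

```python
# Allowlist of MI (CDNA) architectures, copied from the diff under review.
CDNA_ARCHS = ("gfx940", "gfx941", "gfx942", "gfx90a", "gfx908")


def is_cdna_allowlist(arch: str) -> bool:
    """Allowlist check: must be edited for every new MI generation."""
    return arch in CDNA_ARCHS


def is_navi(arch: str) -> bool:
    """Assumption: Navi (RDNA) targets use gfx10xx / gfx11xx names, e.g. gfx1100."""
    return arch.startswith(("gfx10", "gfx11"))


def use_mi_path(arch: str) -> bool:
    """The reviewer's suggestion: treat everything that is not Navi as MI."""
    return not is_navi(arch)


for arch in ("gfx942", "gfx1100", "gfx9xx_future"):  # last one is hypothetical
    print(arch, is_cdna_allowlist(arch), use_mi_path(arch))
```

With the allowlist, the hypothetical future MI part is not recognized until the tuple is updated, whereas the "not Navi" check routes it to the MI path without any change.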
Per @gshtras, we need to merge into the develop branch instead of main for now. Please correct this.
return None


def is_hip():
All of this functionality is already implemented in a cross-architecture fashion in platform/rocm.py and its superclasses.
updated
@maleksan85 @gshtras Secondly, our team is using the v0.6.2+rocm release, and I understand that functions like is_navi() are not supported in that version. Implementing them would require significant modifications, so maintaining backward compatibility is also a concern. Given these considerations, I would greatly appreciate your advice on how to proceed with the modifications.
As for your last point, any changes made here will have no effect on previous tags, so v0.6.2+rocm will not be affected.
3f81ad2 to 4cc77c2
@@ -207,103 +209,149 @@ def _attn_fwd_inner(
return acc, l_i, m_i


@triton.autotune(
configs=[
def get_cdna_autotune_configs():
Are you sure these new commits will not decrease performance on MI? If so, which models did you test?
cc @gshtras
I have tested on Navi31. I assumed the other models had been tested by the Triton team, because they modified the configs for better performance: https://github.com/ROCm/triton/blob/db2ca015159c6592c30a6bfcd77b9cc540063a8e/python/perf-kernels/flash-attention.py#L334
Besides those autotune configs, I believe fixing MAX_SEQLENS_Q/K to 0 will improve performance on MI as well.
We have tested chatglm2-6b, qwen-14b-chat, baichuan2-13b, llama-2-70b-chat, glm-4-9b-chat, qwen1.5-72b-chat-gptq, etc. on Navi31. Without this change, triton-based FA2 shows no performance gain; with this change, it shows a 2-5% gain. (Debugging confirmed that the triton FA2 kernel cache was being missed.) We believe this should also have a positive impact on MI, especially during the early period when the triton kernel cache is still being built up.
We discussed in chat that the MI-related changes should be separated from this PR.
Restored the autotune configs for the MI series.
@hyoon1, could you please make this change applicable only to Navi? I will ask engineers in China to confirm the perf gain on Navi32 (although the cache-miss issue does not depend on which GPU is used). Thanks.
updated. MI will use original configs for autotune.
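A rough sketch of how the Navi-only split could be organized is shown below. Only `get_cdna_autotune_configs` is a name that appears in the diff; `get_rdna_autotune_configs`, `get_autotune_configs`, and the `on_navi` flag are hypothetical, and the config values are placeholders rather than the tuned values from this PR. Plain dicts stand in for `triton.Config` objects so the sketch runs without Triton installed.

```python
def get_cdna_autotune_configs():
    # Placeholder for the original, unchanged MI configs.
    return [{"BLOCK_M": 128, "BLOCK_N": 64, "num_warps": 4}]


def get_rdna_autotune_configs():
    # Placeholder for the newly added Navi configs (hypothetical helper name).
    return [{"BLOCK_M": 32, "BLOCK_N": 32, "num_warps": 2}]


def get_autotune_configs(on_navi: bool):
    """Pick the config list once, so MI keeps its original autotune behavior."""
    return get_rdna_autotune_configs() if on_navi else get_cdna_autotune_configs()


print(get_autotune_configs(on_navi=True))
print(get_autotune_configs(on_navi=False))
```

Keeping the MI list untouched behind the dispatch means any MI-specific tuning can be reviewed in a separate PR, as agreed in the discussion.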
Additional chatglm3-6b throughput test results on Navi 32 (16 GB), using triton, num-prompts 512, max-model-len 512:
original: input: 1234.33 toks/s, output: 921.11 toks/s; throughput: 5.31 requests/s, 2544.00 tokens/s
w/ update: input: 1386.34 toks/s, output: 1034.54 toks/s; throughput: 5.96 requests/s, 2856.15 tokens/s
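For reference, the relative improvement implied by these numbers can be computed with a few lines of Python. The figures are copied from the results above; the dictionary keys and the `relative_gain` helper are names chosen only for this sketch.

```python
# Throughput figures copied from the benchmark results above.
original = {
    "input_toks_per_s": 1234.33,
    "output_toks_per_s": 921.11,
    "requests_per_s": 5.31,
    "tokens_per_s": 2544.00,
}
updated = {
    "input_toks_per_s": 1386.34,
    "output_toks_per_s": 1034.54,
    "requests_per_s": 5.96,
    "tokens_per_s": 2856.15,
}


def relative_gain(before: float, after: float) -> float:
    """Return the percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


gains = {name: relative_gain(original[name], updated[name]) for name in original}
for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")
```

Each of the four metrics improves by a little over 12% in this run.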
Please try to avoid force pushes after the initial reviews. It makes it impossible to see the new changes.