
Custom mask slows down attention. #724

Open
qiyuxinlin opened this issue Jan 8, 2025 · 2 comments

@qiyuxinlin

I noticed that in a previous version you converted the float-type mask into a bit-packed array before use. How much time does that approach save? I benchmarked the bit-packed-array mask against the causal kernel and found that the masked version runs about twice as slow, which still seems like significant overhead.
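A minimal sketch of the comparison described above, assuming flashinfer's `single_prefill_with_kv_cache` accepts `causal` and `custom_mask` arguments as documented; shapes and sizes are illustrative:

```python
import torch
import flashinfer

qo_len, kv_len, num_heads, head_dim = 4096, 4096, 32, 128
q = torch.randn(qo_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")

# A causal-equivalent boolean mask, so both paths compute the same output;
# flashinfer bit-packs the boolean mask internally.
causal_mask = torch.tril(torch.ones(qo_len, kv_len, dtype=torch.bool, device="cuda"))

def bench(fn, iters=20):
    fn()  # warmup
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

t_causal = bench(lambda: flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True))
t_masked = bench(lambda: flashinfer.single_prefill_with_kv_cache(q, k, v, custom_mask=causal_mask))
print(f"causal kernel: {t_causal:.3f} ms, custom mask: {t_masked:.3f} ms")
```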

@yzh119
Collaborator

yzh119 commented Jan 8, 2025

Yes, the custom mask has significant overhead: the memory access pattern to the custom mask is not coalesced.

For long sequences, it's encouraged to use the sparse API instead:
https://docs.flashinfer.ai/api/sparse.html
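A hedged sketch of that path, based on the `BlockSparseAttentionWrapper` described in the linked docs; the BSR layout (`indptr`/`indices`), block size, and `plan`/`run` call signatures follow my reading of those docs and should be checked against the installed version:

```python
import torch
import flashinfer

num_heads, head_dim = 32, 128
M = N = 4096   # query / key-value lengths
R = C = 64     # block size: the mask is specified per (R x C) tile

# Block-sparse pattern in BSR form: block-row i attends to the block columns
# listed in indices[indptr[i]:indptr[i+1]]. Here, a block-granular causal
# (lower-triangular) pattern stands in for an arbitrary sparse mask.
rows = M // R
indices = torch.cat([torch.arange(i + 1) for i in range(rows)]).int().cuda()
indptr = torch.cat([
    torch.tensor([0]),
    torch.cumsum(torch.arange(1, rows + 1), dim=0),
]).int().cuda()

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.sparse.BlockSparseAttentionWrapper(workspace)
wrapper.plan(indptr, indices, M, N, R, C, num_heads, num_heads, head_dim)

q = torch.randn(M, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(N, num_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(N, num_heads, head_dim, dtype=torch.float16, device="cuda")
out = wrapper.run(q, k, v)
```

Because the sparsity is expressed per block rather than per element, the kernel can skip whole tiles and read the pattern with coalesced accesses, avoiding the per-token mask lookups that slow the custom-mask path down.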

@ZhongYingMatrix

Is there a better way to support tree attention (speculative decoding) than with a custom mask?
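For context, tree attention in speculative decoding is usually expressed as exactly such a custom mask: each draft token attends to the whole prompt plus its own ancestor chain in the draft tree. A minimal sketch of constructing that mask from parent pointers (the `parents` layout is an assumption for illustration):

```python
import torch

def build_tree_mask(parents: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Boolean attention mask of shape [num_draft, prompt_len + num_draft].

    parents[i] is the index of draft token i's parent within the draft tree,
    or -1 if its parent is the last verified token.
    """
    n = parents.numel()
    mask = torch.zeros(n, prompt_len + n, dtype=torch.bool)
    mask[:, :prompt_len] = True           # every draft token sees the full prompt
    for i in range(n):
        mask[i, prompt_len + i] = True    # a token attends to itself
        p = int(parents[i])
        while p >= 0:                     # walk up the ancestor chain
            mask[i, prompt_len + p] = True
            p = int(parents[p])
    return mask

# Example: a root with two children, each child spawning one grandchild.
# Draft indices: 0 (root), 1 and 2 (children of 0), 3 (child of 1), 4 (child of 2)
parents = torch.tensor([-1, 0, 0, 1, 2])
print(build_tree_mask(parents, prompt_len=3).int())
```

Such token-granular, irregular patterns are exactly the case the custom-mask path targets, which is why they don't map directly onto the block-sparse API above.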
