
Custom mask slows down attention. #724

Open
qiyuxinlin opened this issue Jan 8, 2025 · 2 comments

@qiyuxinlin

I noticed that in a previous version you converted the float-type mask into a bit-packed array before use. How much time does that approach save? I benchmarked the bit-packed-array mask against the causal kernel and found that the masked version runs about twice as slow, which still seems like significant overhead.
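A minimal sketch of the comparison described above, assuming flashinfer's `single_prefill_with_kv_cache` accepts `causal` and `custom_mask` arguments as documented; shapes and sizes are illustrative:

```python
import torch
import flashinfer

qo_len, kv_len, num_heads, head_dim = 4096, 4096, 32, 128
q = torch.randn(qo_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")

# A causal-equivalent boolean mask, so both paths compute the same output;
# flashinfer bit-packs the boolean mask internally.
causal_mask = torch.tril(torch.ones(qo_len, kv_len, dtype=torch.bool, device="cuda"))

def bench(fn, iters=20):
    fn()  # warmup
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

t_causal = bench(lambda: flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True))
t_masked = bench(lambda: flashinfer.single_prefill_with_kv_cache(q, k, v, custom_mask=causal_mask))
print(f"causal kernel: {t_causal:.3f} ms, custom mask: {t_masked:.3f} ms")
```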

@yzh119
Collaborator

yzh119 commented Jan 8, 2025

Yes, the custom mask has significant overhead: the memory access pattern to the custom mask is not coalesced.

For long sequences, it's encouraged to use the sparse API instead:
https://docs.flashinfer.ai/api/sparse.html
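A hedged sketch of that path, based on the `BlockSparseAttentionWrapper` described in the linked docs; the BSR layout (`indptr`/`indices`), block size, and `plan`/`run` call signatures follow my reading of those docs and should be checked against the installed version:

```python
import torch
import flashinfer

num_heads, head_dim = 32, 128
M = N = 4096   # query / key-value lengths
R = C = 64     # block size: the mask is specified per (R x C) tile

# Block-sparse pattern in BSR form: block-row i attends to the block columns
# listed in indices[indptr[i]:indptr[i+1]]. Here, a block-granular causal
# (lower-triangular) pattern stands in for an arbitrary sparse mask.
rows = M // R
indices = torch.cat([torch.arange(i + 1) for i in range(rows)]).int().cuda()
indptr = torch.cat([
    torch.tensor([0]),
    torch.cumsum(torch.arange(1, rows + 1), dim=0),
]).int().cuda()

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.sparse.BlockSparseAttentionWrapper(workspace)
wrapper.plan(indptr, indices, M, N, R, C, num_heads, num_heads, head_dim)

q = torch.randn(M, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(N, num_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(N, num_heads, head_dim, dtype=torch.float16, device="cuda")
out = wrapper.run(q, k, v)
```

Because the sparsity is expressed per block rather than per element, the kernel can skip whole tiles and read the pattern with coalesced accesses, avoiding the per-token mask lookups that slow the custom-mask path down.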

@ZhongYingMatrix

Is there a better way to support tree attention (speculative decoding) than with a custom mask?
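For context, tree attention in speculative decoding is usually expressed as exactly such a custom mask: each draft token attends to the whole prompt plus its own ancestor chain in the draft tree. A minimal sketch of constructing that mask from parent pointers (the `parents` layout is an assumption for illustration):

```python
import torch

def build_tree_mask(parents: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Boolean attention mask of shape [num_draft, prompt_len + num_draft].

    parents[i] is the index of draft token i's parent within the draft tree,
    or -1 if its parent is the last verified token.
    """
    n = parents.numel()
    mask = torch.zeros(n, prompt_len + n, dtype=torch.bool)
    mask[:, :prompt_len] = True           # every draft token sees the full prompt
    for i in range(n):
        mask[i, prompt_len + i] = True    # a token attends to itself
        p = int(parents[i])
        while p >= 0:                     # walk up the ancestor chain
            mask[i, prompt_len + p] = True
            p = int(parents[p])
    return mask

# Example: a root with two children, each child spawning one grandchild.
# Draft indices: 0 (root), 1 and 2 (children of 0), 3 (child of 1), 4 (child of 2)
parents = torch.tensor([-1, 0, 0, 1, 2])
print(build_tree_mask(parents, prompt_len=3).int())
```

Such token-granular, irregular patterns are exactly the case the custom-mask path targets, which is why they don't map directly onto the block-sparse API above.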
