[DeepseekR1] How does ragged prefill manage the kv_cache? #3849
Comments
Hi @AlvL1225, the kv cache in the flashinfer backend is provided by the forward batch. For DeepSeek V3/R1 models, the flashinfer MLA logic has been moved into its own MLA backend.
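For intuition, here is a minimal sketch of the usual pattern: the backend owns a preallocated pool of latent kv slots, and the forward batch only carries the slot indices for the tokens involved. The class and method names (`LatentKVPool`, `write`, `read`) are illustrative, not the actual sglang API.

```python
import torch

class LatentKVPool:
    """Toy token-to-kv pool: one row of [compressed_kv | k_pe] per cache slot.

    Purely illustrative of "kv cache handed to the backend as indices into a
    preallocated pool"; the names and memory layout are assumptions.
    """

    def __init__(self, num_slots: int, kv_lora_rank: int = 512, rope_dim: int = 64):
        self.kv_lora_rank = kv_lora_rank
        self.buffer = torch.zeros(num_slots, kv_lora_rank + rope_dim)

    def write(self, slot_ids: torch.Tensor, compressed_kv: torch.Tensor, k_pe: torch.Tensor):
        # Store the compressed (latent) kv and its rope part side by side.
        self.buffer[slot_ids] = torch.cat([compressed_kv, k_pe], dim=-1)

    def read(self, slot_ids: torch.Tensor):
        row = self.buffer[slot_ids]
        return row[:, : self.kv_lora_rank], row[:, self.kv_lora_rank :]

# A "forward batch" would then only need to carry slot indices, e.g.:
pool = LatentKVPool(num_slots=1024)
slots = torch.tensor([3, 4, 5])
pool.write(slots, torch.randn(3, 512), torch.randn(3, 64))
compressed_kv, k_pe = pool.read(slots)
```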
Hi @Fridge003, thank you for your reply! During chunked prefill we use multi-head attention (MHA) with 192 dimensions for queries and keys, and 128 dimensions for values and outputs. However, the cache is saved in compressed form as "compressed_kv" (512 dimensions for the non-rotary part plus 64 dimensions of rotary position embedding for the keys). When executing chunked prefill, do we decompress the latent "lora_kv" (the compressed_kv) every time, or is an "MHA kv cache" used directly at some point in the process?
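To make the decompression step concrete, here is a pure-PyTorch sketch of expanding the cached latent kv back into the 192/128-dimensional keys and values that the MHA (ragged) prefill path consumes. The weight names `w_kc`/`w_vc` and the exact split of the up-projection are assumptions for illustration, not the model's actual parameter names.

```python
import torch

# Dimensions taken from the DeepSeek V3/R1 config.
num_heads = 128
kv_lora_rank = 512        # compressed ("latent") kv width, no rope
qk_rope_head_dim = 64     # rope part of the key, shared across heads
qk_nope_head_dim = 128    # per-head non-rope key width
v_head_dim = 128          # per-head value width

# Assumed up-projection weights (in the real model these would come from the
# kv up-projection, split into a key part and a value part).
w_kc = torch.randn(num_heads, kv_lora_rank, qk_nope_head_dim)
w_vc = torch.randn(num_heads, kv_lora_rank, v_head_dim)

def decompress_kv(compressed_kv: torch.Tensor, k_pe: torch.Tensor):
    """Expand cached latent kv back into full MHA keys/values.

    compressed_kv: [num_tokens, kv_lora_rank]      (512-dim latent, no rope)
    k_pe:          [num_tokens, qk_rope_head_dim]  (64-dim rope part)
    returns k: [num_tokens, num_heads, 192], v: [num_tokens, num_heads, 128]
    """
    # Per-head non-rope key: latent @ w_kc -> [tokens, heads, 128]
    k_nope = torch.einsum("tl,hld->thd", compressed_kv, w_kc)
    # Per-head value: latent @ w_vc        -> [tokens, heads, 128]
    v = torch.einsum("tl,hld->thd", compressed_kv, w_vc)
    # The 64-dim rope key is shared by all heads; broadcast and concatenate
    # to get the 192-dim key used by the MHA (ragged) prefill path.
    k_pe = k_pe.unsqueeze(1).expand(-1, num_heads, -1)
    k = torch.cat([k_nope, k_pe], dim=-1)
    return k, v

compressed_kv = torch.randn(16, kv_lora_rank)
k_pe = torch.randn(16, qk_rope_head_dim)
k, v = decompress_kv(compressed_kv, k_pe)   # k: [16, 128, 192], v: [16, 128, 128]
```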
Hi @AlvL1225, the ragged part of the flashinfer MLA backend is still under development, so that code is unfinished. Please stay tuned for a related PR in the next two or three days. For the flashinfer MLA backend, paged prefilling (prefilling with a prefix) will go through the paged path rather than the ragged one.
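As a toy illustration of that distinction (not the flashinfer wrapper API): a prefill step without a cached prefix only attends within the new tokens, while prefilling with a prefix also attends to keys/values pulled from the cache.

```python
import torch

def prefill_attention(q_new, k_new, v_new, k_cache=None, v_cache=None):
    """Toy single-head dispatch mirroring the two prefill paths above.

    - No cached prefix  -> "ragged" case: attend only within the new tokens.
    - Cached prefix set -> "paged" case: new queries also attend to the
      prefix keys/values read from the cache.
    (Pure-PyTorch illustration; not the real wrapper API.)
    """
    if k_cache is None:
        k, v, prefix_len = k_new, v_new, 0
    else:
        k = torch.cat([k_cache, k_new], dim=0)
        v = torch.cat([v_cache, v_new], dim=0)
        prefix_len = k_cache.shape[0]

    n_new, n_all = q_new.shape[0], k.shape[0]
    scores = q_new @ k.T / q_new.shape[-1] ** 0.5
    # Causal mask: new token i may see the whole prefix plus new tokens <= i.
    causal = torch.arange(n_all) <= (torch.arange(n_new).unsqueeze(1) + prefix_len)
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q, k, v = torch.randn(4, 64), torch.randn(4, 64), torch.randn(4, 64)
out_no_prefix = prefill_attention(q, k, v)                                    # "ragged" case
out_with_prefix = prefill_attention(q, k, v, torch.randn(8, 64), torch.randn(8, 64))
```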
I'm investigating the chunked prefill method in DeepSeek V3/R1. The code shows that it uses self.prefill_wrapper_ragged.forward_return_lse for both prefill and chunked prefill operations. However, I haven't been able to locate where the KV cache is provided in the code. Could you help me identify this part of the implementation?
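For intuition on why the ragged call returns an LSE: one common way to handle chunked prefill is to run attention over the current chunk's own keys/values and attention against the cached (decompressed) prefix separately, then merge the two partial outputs using their log-sum-exp weights. Below is a self-contained sketch of that merge; the helper names are mine, not the library's.

```python
import torch

def naive_attention(q, k, v):
    """Single-head attention that also returns the per-query log-sum-exp,
    mimicking what a forward_return_lse-style API exposes."""
    scores = q @ k.T / q.shape[-1] ** 0.5          # [nq, nk]
    lse = torch.logsumexp(scores, dim=-1)          # [nq]
    out = torch.softmax(scores, dim=-1) @ v        # [nq, d_v]
    return out, lse

def merge_attention_states(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results over disjoint key sets.

    Each partial softmax is missing the other set's normalization mass;
    weighting by exp(lse) recovers the attention over the union of keys.
    """
    w_a = torch.exp(lse_a - torch.logaddexp(lse_a, lse_b)).unsqueeze(-1)
    return w_a * out_a + (1.0 - w_a) * out_b

# Tiny check: splitting the keys and merging matches full attention.
torch.manual_seed(0)
q = torch.randn(4, 64)
k, v = torch.randn(10, 64), torch.randn(10, 64)
out_full, _ = naive_attention(q, k, v)
out_a, lse_a = naive_attention(q, k[:6], v[:6])     # e.g. cached prefix
out_b, lse_b = naive_attention(q, k[6:], v[6:])     # e.g. current chunk (ragged)
merged = merge_attention_states(out_a, lse_a, out_b, lse_b)
assert torch.allclose(merged, out_full, atol=1e-5)
```

Attention libraries typically expose an equivalent on-GPU merge utility, which is why the ragged prefill call returns both the output and its log-sum-exp.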