fix: resolve prefix caching crashes with MTP speculative decoding #234
Draft
Fix GPU memory access fault caused by double conversion of block_tables in the cached prefill path. kv_indices_generate_triton applies block_ratio internally, but was receiving already-converted block_tables (via block_tables_converted), so indices were multiplied by block_ratio twice (e.g. block_id*256 instead of block_id*16) and exceeded the KV cache bounds.

Key changes:
- Use raw block_tables for kv_indices generation in aiter_mla prefill
- Route cached prefill through paged MLA attention (supports Q≠K) instead of flash_attn_varlen_func (requires Q==K)
- Track has_cached flag through AttentionMetaData for path selection
- Fix block_manager: hash table leak, can_allocate cache-hit accounting, can_append for multi-token decode, O(1) free block tracking
- Add CacheStats to scheduler for prefix cache hit rate monitoring
- Add comprehensive block_manager tests (119 passing)

Verified: gsm8k 1319 samples, 95.83% accuracy, 0 GPU faults.
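The double conversion described in this PR can be illustrated with a small pure-Python sketch. It is not the actual Triton kernel: `expand_block_table` and `kv_indices_sketch` are hypothetical stand-ins, and `BLOCK_RATIO = 16` is an assumption taken from the `block_id*16` vs `block_id*256` example in the description.

```python
# Hypothetical sketch of the double-conversion bug; not the real kernel code.
BLOCK_RATIO = 16  # assumed from the block_id*16 vs block_id*256 example


def expand_block_table(block_table, block_ratio):
    """Map each coarse block id to the fine-grained block ids it covers."""
    return [b * block_ratio + i for b in block_table for i in range(block_ratio)]


def kv_indices_sketch(block_table, block_ratio):
    """Stand-in for a kernel that applies block_ratio internally."""
    return expand_block_table(block_table, block_ratio)


num_coarse_blocks = 8
num_fine_blocks = num_coarse_blocks * BLOCK_RATIO  # size of the KV cache in fine blocks

raw_block_table = [3, 7]

# Correct: pass the raw table, the kernel scales it exactly once.
correct = kv_indices_sketch(raw_block_table, BLOCK_RATIO)

# Buggy: the table was already converted, so the kernel scales it a second time.
already_converted = expand_block_table(raw_block_table, BLOCK_RATIO)
double_converted = kv_indices_sketch(already_converted, BLOCK_RATIO)

in_bounds = all(i < num_fine_blocks for i in correct)
out_of_bounds = any(i >= num_fine_blocks for i in double_converted)
print(correct[0], double_converted[0], in_bounds, out_of_bounds)
```

With the raw table every index stays inside the cache, while the twice-scaled table starts at `3 * 256` and runs well past the end, which is the kind of out-of-bounds access that shows up as a GPU memory fault.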
from atom.model_engine.scheduler import ScheduledBatch

logger = logging.getLogger("atom")
from atom.model_ops.attention_mla import MLAModules
Contributor

logger = logging.getLogger("atom")
from atom.model_ops.attention_mla import MLAModules
from atom.utils import CpuGpuBuffer
Contributor

logger = logging.getLogger("atom")
from atom.model_ops.attention_mla import MLAModules
from atom.utils import CpuGpuBuffer
from atom.utils.block_convert import block_table_convert_triton
Contributor

Comment on lines +13 to +14
import json
import re
Contributor

sys.exit(1)

model = get_model_name(base_url)
print(f"=== Prefix Cache Accuracy Test ===")
Contributor