Fix block allocation for multi-token decode (speculative decoding) by brucechanglongxu · Pull Request #250 · ROCm/ATOM

brucechanglongxu · 2026-03-01T07:44:20Z

BlockManager.can_append and may_append assumed at most 1 new token per decode step. With speculative decoding (mtp_k > 0), the scheduler generates mtp_k + 1 tokens per step, which can cross multiple block boundaries. The old code under-allocated KV cache blocks.

The old can_append used a boolean expression — (len(seq) % block_size == 1) evaluates to True/False, so it checked for at most 1 free block regardless of how many tokens were about to be generated. Fixed to accept num_new_tokens and compute the exact block deficit.
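The difference between the two checks can be sketched as follows. This is a minimal illustration, not the actual ATOM code: the function names and the flat integer arguments (seq_len, num_allocated_blocks, num_free_blocks) are assumptions made for the example, and seq_len is taken to be the length before the new tokens are appended.

```python
def blocks_for(num_tokens: int, block_size: int) -> int:
    """Ceiling division: number of blocks required to hold num_tokens."""
    return (num_tokens + block_size - 1) // block_size


def can_append_old(seq_len: int, block_size: int, num_free_blocks: int) -> bool:
    # The comparison yields a bool, which Python treats as 0 or 1,
    # so this check never asks for more than one free block.
    return num_free_blocks >= (seq_len % block_size == 1)


def can_append_new(
    seq_len: int,
    num_allocated_blocks: int,
    block_size: int,
    num_free_blocks: int,
    num_new_tokens: int = 1,
) -> bool:
    # Blocks required after the upcoming tokens, minus blocks already held.
    deficit = blocks_for(seq_len + num_new_tokens, block_size) - num_allocated_blocks
    return num_free_blocks >= max(deficit, 0)


# block_size=4, 6 tokens held in 2 blocks, mtp_k=3 so 4 new tokens:
# 10 tokens need 3 blocks, i.e. one more than is allocated.
print(can_append_old(6, 4, num_free_blocks=0))                        # True
print(can_append_new(6, 2, 4, num_free_blocks=0, num_new_tokens=4))   # False
```

With no free blocks, the old check still reports that the append can proceed, while the deficit-based check refuses it.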

may_append had a similar issue: needed_blocks = (seq_len + block_size - 1) // block_size only accounted for the current sequence length, not the tokens about to be generated. The elif seq_len % block_size == 0 branch also only updated the hash but never allocated new blocks. Restructured into two phases: (1) register the hash if the last block just became full, (2) allocate blocks for ceil((seq_len + num_new_tokens) / block_size).
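A toy version of the two-phase structure is sketched below. ToySeq, ToyBlockManager, and the full_blocks registry are hypothetical names, and the "hash" is simplified to the token prefix itself used as a dictionary key; the real BlockManager's hashing and block bookkeeping are not shown in this description and will differ.

```python
from dataclasses import dataclass, field


@dataclass
class ToySeq:
    token_ids: list
    block_table: list = field(default_factory=list)


class ToyBlockManager:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_block_ids = list(range(num_blocks))
        # Simplified stand-in for hash registration: token prefix -> block id.
        self.full_blocks = {}

    def may_append(self, seq: ToySeq, num_new_tokens: int = 1) -> None:
        bs = self.block_size
        seq_len = len(seq.token_ids)

        # Phase 1: if the last block just became full, register it so that
        # later sequences sharing this prefix could reuse it.
        if seq_len and seq_len % bs == 0 and len(seq.block_table) >= seq_len // bs:
            self.full_blocks[tuple(seq.token_ids)] = seq.block_table[seq_len // bs - 1]

        # Phase 2: allocate up to ceil((seq_len + num_new_tokens) / block_size).
        # The caller is assumed to have checked can_append first; otherwise
        # pop() raises IndexError when the free list runs out.
        needed = (seq_len + num_new_tokens + bs - 1) // bs
        while len(seq.block_table) < needed:
            seq.block_table.append(self.free_block_ids.pop(0))


mgr = ToyBlockManager(num_blocks=8, block_size=4)
seq = ToySeq(token_ids=list(range(8)))
seq.block_table = [mgr.free_block_ids.pop(0), mgr.free_block_ids.pop(0)]

mgr.may_append(seq, num_new_tokens=4)  # 8 + 4 = 12 tokens -> 3 blocks
print(seq.block_table)                 # [0, 1, 2]
```

Note that the sequence sits exactly on a block boundary here, so both phases run in the same call: the full block is registered and a new block is allocated, which is the case the old elif branch missed.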

On the scheduler side, hoisted num_new_tokens = self.mtp_k + 1 before the can_append loop so the check uses the correct token count. Also initialized num_rejected = 0 before the speculative decoding branch in postprocess to fix an UnboundLocalError on the non-speculative path.
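The UnboundLocalError follows a common Python pattern, reproduced below in isolation. The function names and the formula used for num_rejected are hypothetical; only the control flow (a variable assigned inside the speculative branch and read after it) reflects what the description states.

```python
def postprocess_old(num_accepted: int, mtp_k: int) -> int:
    if mtp_k > 0:
        # Hypothetical formula, for illustration only.
        num_rejected = mtp_k - num_accepted
    # On the non-speculative path (mtp_k == 0) the name was never bound.
    return num_rejected


def postprocess_new(num_accepted: int, mtp_k: int) -> int:
    num_rejected = 0  # initialized before the speculative branch
    if mtp_k > 0:
        num_rejected = mtp_k - num_accepted
    return num_rejected


try:
    postprocess_old(num_accepted=0, mtp_k=0)
except UnboundLocalError as exc:
    print("old path fails:", type(exc).__name__)

print(postprocess_new(num_accepted=0, mtp_k=0))  # 0
```

The hoisting of num_new_tokens is the same idea applied in the other direction: the value is computed once, before the loop that needs it, so that every can_append call sees mtp_k + 1 rather than an implicit 1.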

Added 15 new GPU-free unit tests covering multi-token can_append/may_append scenarios (boundary conditions, exact-fit, insufficient blocks) and prefix caching during decode (hash registration, cache reuse by new sequences, multi-step prefix building). All 87 tests pass.
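The kinds of boundary cases listed above can be expressed as a small table-driven test. The sketch below is not taken from the PR's test files: the standalone can_append helper and its argument order are assumptions, and it uses unittest so that it runs without extra dependencies.

```python
import unittest


def can_append(seq_len, num_allocated_blocks, num_free_blocks, block_size, num_new_tokens):
    # Hypothetical standalone version of the deficit-based check.
    needed = (seq_len + num_new_tokens + block_size - 1) // block_size
    return num_free_blocks >= max(needed - num_allocated_blocks, 0)


class TestMultiTokenCanAppend(unittest.TestCase):
    # (label, seq_len, allocated, free, block_size, new_tokens, expected)
    CASES = [
        ("fits in current block", 5, 2, 0, 4, 3, True),    # 8 tokens -> 2 blocks
        ("crosses one boundary", 7, 2, 1, 4, 2, True),     # 9 tokens -> 3 blocks
        ("crosses two boundaries", 7, 2, 1, 4, 6, False),  # 13 tokens -> 4 blocks
        ("exact fit", 8, 2, 1, 4, 4, True),                # 12 tokens -> 3 blocks
        ("insufficient blocks", 8, 2, 0, 4, 1, False),     # 9 tokens -> 3 blocks
    ]

    def test_cases(self):
        for label, seq_len, allocated, free, bs, new, expected in self.CASES:
            with self.subTest(label):
                self.assertEqual(
                    can_append(seq_len, allocated, free, bs, new), expected
                )
```

Saved to a file, this can be run with python -m unittest; none of it touches a GPU.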

Test plan:

- python -m pytest tests/ --ignore=tests/test_utils.py --ignore=tests/test_envs.py: 87/87 pass
- Speculative decoding inference with mtp_k > 0 and prefix caching enabled needs GPU validation
- Standard (non-speculative) inference should be regression-free

can_append and may_append assumed at most 1 new token per decode step. With speculative decoding (mtp_k > 0), the scheduler generates mtp_k + 1 tokens per step, which can cross multiple block boundaries. The old code under-allocated blocks, leading to out-of-bounds KV cache writes.

can_append:
- Old: the boolean expression (len(seq) % block_size == 1) checked for 0 or 1 free blocks regardless of how many tokens are about to be generated.
- New: accepts num_new_tokens and computes the exact block deficit for the upcoming tokens.

may_append:
- Old: needed_blocks = ceil(seq_len / block_size) only accounted for the current sequence length, not the tokens about to be generated. Also, the elif branch at seq_len % block_size == 0 only updated the hash but did not allocate new blocks for upcoming tokens.
- New: two-phase approach: (1) register the hash if the last block just became full, (2) allocate blocks for ceil((seq_len + num_new_tokens) / block_size).

Scheduler:
- Hoisted num_new_tokens = mtp_k + 1 before the can_append loop so the check uses the correct token count.
- Initialized num_rejected = 0 before the speculative decoding branch in postprocess to fix an UnboundLocalError on the non-speculative path.

Tests:
- Added 8 new can_append tests covering block boundaries, multi-token allocation, exact-fit, and insufficient-block scenarios.
- Added 4 new may_append tests for multi-token allocation, boundary hash registration, and block_size=1 with multiple tokens.
- Added TestPrefixCachingDecode class (3 tests): hash registration during decode, cache reuse by new sequences, and multi-step prefix building across decode iterations.
- Fixed ScheduledBatchOutput constructor calls in test_scheduler.py to include num_rejected and num_bonus parameters.

ChuanLi1101 requested a review from valarLip March 2, 2026 03:24

brucechanglongxu wants to merge 1 commit into ROCm:main from brucechanglongxu:fix/block-manager-multi-token-allocation
