
Mtp draft fix #254

Merged
valarLip merged 8 commits into main from mtp_draft_fix
Mar 3, 2026

Conversation

@valarLip
Collaborator

@valarLip valarLip commented Mar 2, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@valarLip valarLip marked this pull request as ready for review March 2, 2026 16:41
Copilot AI review requested due to automatic review settings March 2, 2026 16:41
Contributor

Copilot AI left a comment


Pull request overview

This PR refines speculative decoding (MTP/EAGLE) execution and attention metadata handling, while also restructuring scheduler/model-runner output plumbing and MTP statistics reporting.

Changes:

  • Updates attention metadata preparation (slot mapping initialization, kv_indices generation/buffer sizing) to better support speculative decoding paths.
  • Refactors scheduler/model-runner output formats to use ordered req_ids + token_ids lists with O(1) req-id indexing.
  • Revises MTP stats logging behavior and routes stats printing through Scheduler.spec_stats.
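The ordered-output refactor in the second bullet can be sketched as follows. The class name `ScheduledBatchOutput` comes from the file summary below, but the field and method names here are assumptions for illustration, not the actual definition in `atom/model_engine/scheduler.py`:

```python
from dataclasses import dataclass, field

@dataclass
class ScheduledBatchOutput:
    """Hypothetical sketch: req_ids and token_ids are parallel ordered
    lists, with a lazily built dict giving O(1) req-id -> index lookup
    instead of an O(n) list.index() scan per request."""
    req_ids: list       # i-th entry: request id of the i-th request
    token_ids: list     # i-th entry: token ids produced for that request
    _index: dict = field(default_factory=dict, init=False)

    def get_idx(self, req_id) -> int:
        # Build the index once on first use; later lookups are O(1).
        if not self._index:
            self._index = {r: i for i, r in enumerate(self.req_ids)}
        return self._index[req_id]

out = ScheduledBatchOutput(req_ids=["a", "b"], token_ids=[[1, 2], [3]])
print(out.get_idx("b"))       # → 1
print(out.token_ids[out.get_idx("b")])  # → [3]
```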

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Summary per file:

  • atom/utils/forward_context.py: Removes unused fake_block_tables from AttentionMetaData.
  • atom/spec_decode/eagle.py: Adjusts speculative proposer position/index handling and updates attention metadata for MTP decode.
  • atom/models/deepseek_mtp.py: Disables masked embedding behavior and comments out support_torch_compile usage.
  • atom/model_ops/sampler.py: Introduces cached exponential tensor helper for sampling path.
  • atom/model_ops/embed_head.py: Minor import reordering.
  • atom/model_ops/attentions/backends.py: Initializes slot mapping with -1 for scheduled tokens; copies full scheduled range to GPU.
  • atom/model_ops/attentions/aiter_mla.py: Increases kv_indices buffer sizing and generates kv_indices via Triton; various decode/prefill path adjustments.
  • atom/model_ops/attentions/aiter_attention.py: Similar kv_indices buffer sizing and generation changes for persistent attention.
  • atom/model_ops/attention_mha.py: Removes prefill-time fake block table handling.
  • atom/model_engine/scheduler.py: Changes MTP stats logging cadence; refactors ScheduledBatchOutput structure and adds O(1) req-id lookup.
  • atom/model_engine/model_runner.py: Batch token-id postprocessing; adapts to new ScheduledBatchOutput API; removes old MTP stats APIs.
  • atom/model_engine/engine_core.py: Prints MTP stats via scheduler instead of runner RPC.
  • atom/model_engine/async_proc.py: Avoids enqueueing outputs when no output address is configured.


Copilot AI review requested due to automatic review settings March 3, 2026 09:07
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

atom/model_engine/scheduler.py:416

  • num_rejected is only assigned inside if is_deferred_out or self.use_spec:, but it's used later unconditionally when computing num_tokens = seq.num_tokens - self.mtp_k - num_rejected. In the non-speculative, non-deferred path this will raise UnboundLocalError (or reuse a stale value from a previous loop iteration). Initialize num_rejected = 0 per-sequence (or compute num_tokens differently) so the non-spec path is safe.
            if self.mtp_k > 0:
                # idx already resolved above via get_idx
                seq.spec_token_ids = draft_token_ids[idx]

            if seq.num_completion_tokens == 1 and seq.first_token_time == 0.0:
                seq.first_token_time = time.time()

            num_tokens = seq.num_tokens - self.mtp_k - num_rejected
            leave_reason = None

atom/model_engine/engine_core.py:301

  • print_mtp_statistics() now calls the private SpecStats._log() unconditionally when spec_stats exists. _log() divides by iv_steps, which will be 0 if no decode steps have been recorded yet, causing a ZeroDivisionError. Please add a guard (e.g., if spec_stats.total_draft_tokens > 0 / total_steps > 0) or expose a safe public logging method on SpecStats that handles the empty case.
    def print_mtp_statistics(self):
        if self.scheduler.spec_stats is not None:
            self.scheduler.spec_stats._log()
        else:
            logger.info(
                "\n[MTP Stats] No MTP statistics available (MTP not enabled or no tokens processed)\n"
            )
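One way to address this, following the comment's suggestion, is a public logging method on `SpecStats` that guards the division. The field names (`iv_steps`, `total_draft_tokens`) follow the comment above; the rest of the class body is an assumed sketch, not the real implementation:

```python
import logging

logger = logging.getLogger(__name__)

class SpecStats:
    """Sketch of the guarded-logging suggestion; fields are assumptions
    based on the review comment, not the actual SpecStats definition."""
    def __init__(self):
        self.iv_steps = 0
        self.total_draft_tokens = 0
        self.total_accepted_tokens = 0

    def _log(self):
        # Divides by iv_steps: unsafe if no decode steps were recorded.
        mean_accepted = self.total_accepted_tokens / self.iv_steps
        logger.info("[MTP Stats] mean accepted per step: %.2f", mean_accepted)

    def log_safe(self):
        # Public wrapper that handles the empty case before delegating.
        if self.iv_steps == 0:
            logger.info("[MTP Stats] no decode steps recorded yet")
            return
        self._log()
```

With this, `print_mtp_statistics` would call `spec_stats.log_safe()` instead of reaching into the private `_log()`.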


@valarLip valarLip merged commit 33e0aac into main Mar 3, 2026
18 checks passed
@valarLip valarLip deleted the mtp_draft_fix branch March 3, 2026 11:29
