rahul-tuli commented on Sep 24, 2025

Summary

This PR enables Eagle3 speculative decoding with a Llama3-based drafter and a Llama4 multimodal verifier, with configurable auxiliary hidden state layers.

Key Features

  • Compatibility: a Llama3 Eagle3 drafter can now be paired with Llama4 verifier models
  • Configurable auxiliary layers: Hidden state layer indices can be specified via eagle_aux_hidden_state_layer_ids in the speculator config, allowing non-default layer selection for optimal performance across different model architectures

Configuration

Auxiliary layer indices can be set in the Eagle3 draft model config:

{
  "eagle_aux_hidden_state_layer_ids": [1, 23, 44]
}

This enables using hidden states from non-default layers (e.g., layers 1, 23, 44 instead of the default 2, 23, 44) in cross-architecture scenarios where a different layer combination may work better.
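
For illustration, the selection logic described later in this PR (prefer config-specified layers, fall back to the verifier model's defaults) can be sketched roughly as follows; the function and attribute names here are illustrative assumptions, not the exact vLLM code:

# Hedged sketch: prefer layer ids from the draft model config, otherwise fall
# back to the verifier model's built-in defaults. Names are illustrative only.
def resolve_aux_hidden_state_layers(draft_hf_config, verifier_model) -> tuple[int, ...]:
    layer_ids = getattr(draft_hf_config, "eagle_aux_hidden_state_layer_ids", None)
    if layer_ids:  # configured via the speculator config shown above
        return tuple(layer_ids)
    # Not configured: use the verifier's default selection.
    return verifier_model.get_eagle3_aux_hidden_state_layers()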

Testing

Command:

python examples/offline_inference/spec_decode.py \
  --method "eagle3" \
  --tp 8 \
  --print-output \
  --model-dir "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16" \
  --eagle-dir "nm-testing/Llama4-Maverick-Eagle3-Speculators" \
  --dataset_name "hf" \
  --dataset_path "philschmid/mt-bench" \
  --num-spec-tokens 3
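
For reference, a roughly equivalent offline setup using vLLM's Python API is sketched below. It assumes the speculative_config dict interface of recent vLLM releases and reuses the model paths from the command above; treat it as illustrative rather than the exact invocation that produced these numbers.

from vllm import LLM, SamplingParams

# Illustrative sketch (assumes vLLM's speculative_config dict API).
llm = LLM(
    model="RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
    tensor_parallel_size=8,
    speculative_config={
        "method": "eagle3",
        "model": "nm-testing/Llama4-Maverick-Eagle3-Speculators",
        "num_speculative_tokens": 3,
    },
)
outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0, max_tokens=128),
)
print(outputs[0].outputs[0].text)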

Results:

  • Mean acceptance length: 2.53
  • Per-position acceptance rates: 0.71, 0.48, 0.34
  • Auxiliary layers used: [1, 23, 44] (configured via speculator config)
--------------------------------------------------
--------------------------------------------------
total_num_output_tokens: 227215
num_drafts: 90393
num_draft_tokens: 271179
num_accepted_tokens: 136677
mean acceptance length: 2.53
--------------------------------------------------
acceptance at token 0: 0.71
acceptance at token 1: 0.48
acceptance at token 2: 0.34

@rahul-tuli force-pushed the support-llama3-eagle3-head-with-llama4-verifier branch 10 times, most recently from cf02c8d to 1695608 on September 30, 2025 15:30
@rahul-tuli force-pushed the support-llama3-eagle3-head-with-llama4-verifier branch 2 times, most recently from 224ec40 to 1f6fd40 on October 3, 2025 08:40
rahul-tuli and others added 8 commits October 3, 2025 08:44
Support configuring eagle_aux_hidden_state_layer_ids and inference_type
in the Eagle3 speculator configuration. This allows users to specify
which verifier layers should output auxiliary hidden states for the
drafter to consume during speculative decoding.

Signed-off-by: rahul-tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Add documentation explaining that get_eagle3_aux_hidden_state_layers()
provides default layer selection and that the GPU model runner can
override this with values from speculative config for dynamic
configuration.

Signed-off-by: rahul-tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Add Eagle3 support to Llama4ForConditionalGeneration by implementing
set_aux_hidden_state_layers() and get_eagle3_aux_hidden_state_layers()
methods. Both methods delegate to the underlying Llama4ForCausalLM
language model, enabling Eagle3 speculative decoding with Llama4
multimodal verifier models.

This allows text-only Eagle3 drafters to work with Llama4 multimodal
verifiers by consuming auxiliary hidden states from specified layers.

Signed-off-by: rahul-tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
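
The delegation described in this commit can be sketched as follows; the class skeleton is simplified for illustration and is not vLLM's exact implementation, though the two method names match the commit message.

# Hedged sketch of the delegation pattern; simplified, not vLLM's exact class.
class Llama4ForConditionalGeneration:
    def __init__(self, language_model):
        # language_model is the wrapped Llama4ForCausalLM text backbone.
        self.language_model = language_model

    def set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None:
        # Forward the layer selection to the text backbone.
        self.language_model.set_aux_hidden_state_layers(layers)

    def get_eagle3_aux_hidden_state_layers(self) -> tuple[int, ...]:
        # Default layer ids also come from the text backbone.
        return self.language_model.get_eagle3_aux_hidden_state_layers()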
Implement custom get_input_embeddings() in Eagle3LlamaForCausalLM that
accepts multimodal parameters but only processes text embeddings. This
ensures the Llama3-based Eagle3 drafter correctly handles text inputs
while remaining compatible with multimodal verifier interfaces.

The drafter receives multimodal context through auxiliary hidden states
from the verifier rather than processing multimodal inputs directly.

Signed-off-by: rahul-tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
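
A simplified sketch of the behavior described here: the drafter's get_input_embeddings() tolerates multimodal keyword arguments for interface compatibility but only embeds token ids. The class and parameter names below are illustrative, not the actual Eagle3LlamaForCausalLM code.

import torch
from torch import nn

# Hedged sketch: accept multimodal kwargs for interface compatibility, but
# embed only the text tokens; multimodal context reaches the drafter via the
# verifier's auxiliary hidden states instead.
class Eagle3DrafterEmbeddingsSketch(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)

    def get_input_embeddings(self, input_ids: torch.Tensor,
                             multimodal_embeddings=None, **kwargs) -> torch.Tensor:
        # Multimodal inputs are intentionally ignored by the text-only drafter.
        return self.embed_tokens(input_ids)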
Implement _get_eagle3_aux_layers_from_config() helper method to extract
auxiliary layer IDs from the draft model's speculative config. The GPU
model runner now prefers config-specified layers over model defaults,
with fallback to model's get_eagle3_aux_hidden_state_layers() when not
configured.

Changes:
- Refactor auxiliary layer setup with early return pattern for errors
- Add config extraction with proper error handling
- Log only when using non-default layer configuration
- Enable dynamic layer configuration per deployment

Signed-off-by: rahul-tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
@rahul-tuli force-pushed the support-llama3-eagle3-head-with-llama4-verifier branch from 1f6fd40 to 5e93541 on October 3, 2025 08:45
huijjj and others added 5 commits October 3, 2025 08:56