rahul-tuli commented on Sep 24, 2025

Summary

This PR enables Eagle3 speculative decoding with a Llama3-based drafter and a Llama4 multimodal verifier, with configurable auxiliary hidden state layers.

Key Features

  • Compatibility: a Llama3 Eagle3 drafter can now be paired with Llama4 verifier models
  • Configurable auxiliary layers: Hidden state layer indices can be specified via eagle_aux_hidden_state_layer_ids in the speculator config, allowing non-default layer selection for optimal performance across different model architectures

Configuration

Auxiliary layer indices can be set in the Eagle3 draft model config:

{
  "eagle_aux_hidden_state_layer_ids": [1, 23, 44]
}

This enables using hidden states from non-default layers (e.g., layers 1, 23, 44 instead of the default 2, 23, 44) in cross-architecture scenarios where a different layer combination may work better.
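
For illustration, the selection logic described later in this PR (prefer config-specified layers, fall back to the verifier model's defaults) can be sketched roughly as follows; the function and attribute names here are illustrative assumptions, not the exact vLLM code:

# Hedged sketch: prefer layer ids from the draft model config, otherwise fall
# back to the verifier model's built-in defaults. Names are illustrative only.
def resolve_aux_hidden_state_layers(draft_hf_config, verifier_model) -> tuple[int, ...]:
    layer_ids = getattr(draft_hf_config, "eagle_aux_hidden_state_layer_ids", None)
    if layer_ids:  # configured via the speculator config shown above
        return tuple(layer_ids)
    # Not configured: use the verifier's default selection.
    return verifier_model.get_eagle3_aux_hidden_state_layers()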

Testing

Command:

python examples/offline_inference/spec_decode.py \
  --method "eagle3" \
  --tp 8 \
  --print-output \
  --model-dir "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16" \
  --eagle-dir "nm-testing/Llama4-Maverick-Eagle3-Speculators" \
  --dataset_name "hf" \
  --dataset_path "philschmid/mt-bench" \
  --num-spec-tokens 3
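
For reference, a roughly equivalent offline setup using vLLM's Python API is sketched below. It assumes the speculative_config dict interface of recent vLLM releases and reuses the model paths from the command above; treat it as illustrative rather than the exact invocation that produced these numbers.

from vllm import LLM, SamplingParams

# Illustrative sketch (assumes vLLM's speculative_config dict API).
llm = LLM(
    model="RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
    tensor_parallel_size=8,
    speculative_config={
        "method": "eagle3",
        "model": "nm-testing/Llama4-Maverick-Eagle3-Speculators",
        "num_speculative_tokens": 3,
    },
)
outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0, max_tokens=128),
)
print(outputs[0].outputs[0].text)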

Results:

  • Mean acceptance length: 2.53
  • Per-position acceptance rates: 0.71, 0.48, 0.34
  • Auxiliary layers used: [1, 23, 44] (configured via speculator config)
--------------------------------------------------
--------------------------------------------------
total_num_output_tokens: 227215
num_drafts: 90393
num_draft_tokens: 271179
num_accepted_tokens: 136677
mean acceptance length: 2.53
--------------------------------------------------
acceptance at token 0: 0.71
acceptance at token 1: 0.48
acceptance at token 2: 0.34

@rahul-tuli force-pushed the support-llama3-eagle3-head-with-llama4-verifier branch 10 times, most recently from cf02c8d to 1695608 on September 30, 2025 15:30
@rahul-tuli force-pushed the support-llama3-eagle3-head-with-llama4-verifier branch 2 times, most recently from 224ec40 to 1f6fd40 on October 3, 2025 08:40
rahul-tuli and others added 8 commits October 3, 2025 08:44
Support configuring eagle_aux_hidden_state_layer_ids and inference_type
in the Eagle3 speculator configuration. This allows users to specify
which verifier layers should output auxiliary hidden states for the
drafter to consume during speculative decoding.

Signed-off-by: rahul-tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Add documentation explaining that get_eagle3_aux_hidden_state_layers()
provides default layer selection and that the GPU model runner can
override this with values from speculative config for dynamic
configuration.

Signed-off-by: rahul-tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Add Eagle3 support to Llama4ForConditionalGeneration by implementing
set_aux_hidden_state_layers() and get_eagle3_aux_hidden_state_layers()
methods. Both methods delegate to the underlying Llama4ForCausalLM
language model, enabling Eagle3 speculative decoding with Llama4
multimodal verifier models.

This allows text-only Eagle3 drafters to work with Llama4 multimodal
verifiers by consuming auxiliary hidden states from specified layers.

Signed-off-by: rahul-tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
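
The delegation described in this commit can be sketched as follows; the class skeleton is simplified for illustration and is not vLLM's exact implementation, though the two method names match the commit message.

# Hedged sketch of the delegation pattern; simplified, not vLLM's exact class.
class Llama4ForConditionalGeneration:
    def __init__(self, language_model):
        # language_model is the wrapped Llama4ForCausalLM text backbone.
        self.language_model = language_model

    def set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None:
        # Forward the layer selection to the text backbone.
        self.language_model.set_aux_hidden_state_layers(layers)

    def get_eagle3_aux_hidden_state_layers(self) -> tuple[int, ...]:
        # Default layer ids also come from the text backbone.
        return self.language_model.get_eagle3_aux_hidden_state_layers()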
Implement custom get_input_embeddings() in Eagle3LlamaForCausalLM that
accepts multimodal parameters but only processes text embeddings. This
ensures the Llama3-based Eagle3 drafter correctly handles text inputs
while remaining compatible with multimodal verifier interfaces.

The drafter receives multimodal context through auxiliary hidden states
from the verifier rather than processing multimodal inputs directly.

Signed-off-by: rahul-tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
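
A simplified sketch of the behavior described here: the drafter's get_input_embeddings() tolerates multimodal keyword arguments for interface compatibility but only embeds token ids. The class and parameter names below are illustrative, not the actual Eagle3LlamaForCausalLM code.

import torch
from torch import nn

# Hedged sketch: accept multimodal kwargs for interface compatibility, but
# embed only the text tokens; multimodal context reaches the drafter via the
# verifier's auxiliary hidden states instead.
class Eagle3DrafterEmbeddingsSketch(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)

    def get_input_embeddings(self, input_ids: torch.Tensor,
                             multimodal_embeddings=None, **kwargs) -> torch.Tensor:
        # Multimodal inputs are intentionally ignored by the text-only drafter.
        return self.embed_tokens(input_ids)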
Implement _get_eagle3_aux_layers_from_config() helper method to extract
auxiliary layer IDs from the draft model's speculative config. The GPU
model runner now prefers config-specified layers over model defaults,
with fallback to model's get_eagle3_aux_hidden_state_layers() when not
configured.

Changes:
- Refactor auxiliary layer setup with early return pattern for errors
- Add config extraction with proper error handling
- Log only when using non-default layer configuration
- Enable dynamic layer configuration per deployment

Signed-off-by: rahul-tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
@rahul-tuli force-pushed the support-llama3-eagle3-head-with-llama4-verifier branch from 1f6fd40 to 5e93541 on October 3, 2025 08:45
huijjj and others added 5 commits October 3, 2025 08:56