@rahul-tuli commented Sep 26, 2025

```shell
VLLM_USE_V1=1 \
  CUDA_VISIBLE_DEVICES=6 \
  python examples/offline_inference/spec_decode.py \
    --method "eagle3" \
    --tp 1 \
    --print-output \
    --model-dir "Qwen/Qwen2.5-VL-7B-Instruct" \
    --eagle-dir "nm-testing/MOCK-UP-Eagle3ForQwen2.5VL7B" \
    --dataset_name "hf" \
    --dataset_path "philschmid/mt-bench" \
    --num-spec-tokens 3
```

Results:

```
--------------------------------------------------
total_num_output_tokens: 198111
num_drafts: 143816
num_draft_tokens: 431448
num_accepted_tokens: 53491
mean acceptance length: 1.37
--------------------------------------------------
acceptance at token 0: 0.31
acceptance at token 1: 0.06
acceptance at token 2: 0.01
```
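As a sanity check, the reported numbers are internally consistent: with `--num-spec-tokens 3` each draft proposes exactly three tokens, and the mean acceptance length is one verified token plus the average number of accepted draft tokens per draft.

```python
# Sanity-check the reported speculative-decoding stats.
num_drafts = 143816
num_draft_tokens = 431448
num_accepted_tokens = 53491

# Each draft proposes --num-spec-tokens tokens (3 here).
assert num_draft_tokens == num_drafts * 3

# Mean acceptance length = 1 (the verified token) + accepted draft tokens per draft.
mean_acceptance_length = 1 + num_accepted_tokens / num_drafts
print(round(mean_acceptance_length, 2))  # 1.37
```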

- Add SupportsEagle3 interface to Llama4ForConditionalGeneration and Llama4ForCausalLM
- Implement custom auxiliary hidden state layers (1, 23, 44) for Eagle3 speculative decoding
- Enable multimodal input handling in Eagle3LlamaForCausalLM with text-only inference mode
- Add proper dimension adaptation for auxiliary hidden states from multimodal verifiers
- Implement dynamic Eagle3 auxiliary layer configuration from speculators config
- Add GPU model runner method to read eagle_aux_hidden_state_layer_ids from draft config
- Update auxiliary layer configuration logic to use speculative config dynamically
- Simplify model implementations to provide fallback defaults

This is the first successful implementation of Eagle3 speculative decoding with
multimodal Llama4 models, supporting custom layer extraction and text-only
drafter processing while leveraging multimodal context from auxiliary hidden states.
The implementation now dynamically reads auxiliary layer configuration from the
draft model's speculative config, eliminating hardcoded layer IDs.
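The dynamic-configuration behavior described above can be sketched roughly as follows. All names here are illustrative stand-ins, not the actual vLLM classes: the idea is that the verifier prefers auxiliary layer ids supplied by the draft model's speculators config and falls back to model-specific defaults (1, 23, 44 for Llama4 in this PR) only when none are provided.

```python
# Illustrative sketch (hypothetical class/field names) of reading Eagle3
# auxiliary hidden-state layer ids from the draft config, with a fallback.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DraftModelConfig:
    # Mirrors the eagle_aux_hidden_state_layer_ids field the PR reads
    # from the speculators config; None means "use the model default".
    eagle_aux_hidden_state_layer_ids: Optional[list] = None

class Llama4VerifierSketch:
    """Stand-in for a Llama4 verifier's Eagle3 support (not the real class)."""

    def get_eagle3_aux_hidden_state_layers(
        self, draft_config: Optional[DraftModelConfig]
    ) -> tuple:
        # Prefer ids supplied dynamically by the draft model's config.
        if draft_config and draft_config.eagle_aux_hidden_state_layer_ids:
            return tuple(draft_config.eagle_aux_hidden_state_layer_ids)
        # Fallback default for Llama4 per this PR: early, middle, late layers.
        return (1, 23, 44)

model = Llama4VerifierSketch()
print(model.get_eagle3_aux_hidden_state_layers(None))  # (1, 23, 44)
cfg = DraftModelConfig(eagle_aux_hidden_state_layer_ids=[2, 20, 40])
print(model.get_eagle3_aux_hidden_state_layers(cfg))  # (2, 20, 40)
```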
- Fix aux_hidden_state_layers initialization syntax error in qwen2.py
- Add missing return statement in qwen2_5_vl.py get_eagle3_aux_hidden_state_layers
- Improve error handling with hasattr check instead of assert
- Clean up method delegation to use direct return from language_model
- Add fallback default auxiliary layers for Qwen2.5VL models

These fixes enable Eagle3 speculative decoding support for Qwen2.5VL models.
Successfully tested with Qwen2.5VL-7B + Eagle3 configuration.
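The delegation and error-handling fixes listed above amount to a pattern like the sketch below. The class names and layer ids are hypothetical; the point is the `hasattr` check replacing an assert, the direct return from `language_model`, and a fallback default when delegation is not possible.

```python
# Illustrative sketch (hypothetical names and layer ids) of the
# hasattr-based delegation with a fallback default, as described above.
class LanguageModelStub:
    """Stand-in for a language model that supports Eagle3 aux layers."""

    def get_eagle3_aux_hidden_state_layers(self) -> tuple:
        return (1, 13, 26)

class Qwen25VLSketch:
    """Stand-in for a Qwen2.5VL wrapper model (not the real class)."""

    def __init__(self, language_model=None):
        self.language_model = language_model

    def get_eagle3_aux_hidden_state_layers(self) -> tuple:
        lm = self.language_model
        # hasattr check instead of assert: degrade gracefully if the
        # wrapped model lacks Eagle3 support.
        if lm is not None and hasattr(lm, "get_eagle3_aux_hidden_state_layers"):
            # Direct return from the language model (delegation cleanup).
            return lm.get_eagle3_aux_hidden_state_layers()
        # Fallback default auxiliary layers (hypothetical values).
        return (2, 14, 26)

print(Qwen25VLSketch(LanguageModelStub()).get_eagle3_aux_hidden_state_layers())
print(Qwen25VLSketch().get_eagle3_aux_hidden_state_layers())
```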
@rahul-tuli force-pushed the support-llama3-eagle3-head-with-llama4-verifier branch 7 times, most recently from 1f6fd40 to 5e93541 on October 3, 2025 08:45