
[feat] add small vocab table for eagle's draft model. #3822

Open
wants to merge 3 commits into base: main

Conversation

@Zhou-sx Zhou-sx commented Feb 24, 2025

Motivation

In speculative decoding, while the draft model's decode layers are drastically reduced, the computational bottleneck shifts to its lm_head (whose cost is proportional to the vocabulary size). In our case, profiling shows that one draft-model decode step takes approximately 1.8 ms, of which the lm_head accounts for about 860 us, roughly half of the draft model's time. Since the target model's verification guarantees correctness, the draft model can safely use a truncated vocabulary that retains only high-frequency tokens. This optimization reduces the lm_head overhead and further accelerates the Eagle speculative decoding pipeline without compromising quality.
To further enhance efficiency, I propose dynamically updating the reduced vocabulary (token_map) during inference. This ensures the draft model adapts to shifting token distributions in real-time workloads.

Modifications

Add --speculative-token-map and --speculative-token-map-num-dynamic-tokens to ServerArgs.
I designed a small vocabulary table that supports both offline configuration and online updates, which can be used to reduce the computational load of the lm_head in the LlamaEagle model.
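
For reviewers skimming the idea, here is a minimal sketch (not the PR code; all names and sizes are illustrative) of how a truncated lm_head serves the draft model while drafted tokens are still reported as full-vocabulary ids, so the target model can verify them unchanged:

import torch

# Toy sizes for illustration; the real Llama-3 vocabulary has 128,256 tokens.
vocab_size, hidden_size, num_hot = 1000, 64, 200

# Full draft-model lm_head weight: [vocab_size, hidden_size]
full_lm_head_weight = torch.randn(vocab_size, hidden_size)

# Hot-token ids (full-vocab ids), e.g. loaded from the hot_token_ids.pt of Step 2
hot_token_ids = torch.randperm(vocab_size)[:num_hot]

# Truncated lm_head: keep only the rows of the high-frequency tokens
small_lm_head_weight = full_lm_head_weight[hot_token_ids]  # [num_hot, hidden_size]

def draft_topk(hidden_states: torch.Tensor, k: int = 4):
    # The projection now scales with num_hot instead of vocab_size
    small_logits = hidden_states @ small_lm_head_weight.t()  # [batch, num_hot]
    scores, idx = small_logits.topk(k, dim=-1)
    # Map positions in the small table back to full-vocabulary token ids
    return scores, hot_token_ids[idx]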

Checklist

How to use

Step 1:
Launch an sglang server, run it on the target dataset, and save the outputs.

# Example
python3 {sglang}/benchmark/mtbench/bench_sglang.py

Step 2:
Users can obtain generate_hot_token_ids.py from this GitHub Gist (https://gist.github.com/Zhou-sx/71a9196d2f324c93f79016579fdf57da) as a reference implementation. The script extracts the top-k high-frequency tokens from the saved output files and saves them into hot_token_ids.pt.
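The gist is the reference implementation; as a rough illustration of what such a script does (the file layout, field names, and 32k size below are assumptions), it counts token frequencies over the saved outputs, keeps the top-k ids, and saves them with torch.save:

import glob
import json
from collections import Counter

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
counter = Counter()

# Assumed layout: each saved file is JSON lines with a "text" field per record
for path in glob.glob("outputs/*.jsonl"):
    with open(path) as f:
        for line in f:
            token_ids = tokenizer(json.loads(line)["text"])["input_ids"]
            counter.update(token_ids)

top_k = 32000  # size of the small vocabulary
hot_token_ids = torch.tensor([tok for tok, _ in counter.most_common(top_k)], dtype=torch.long)
torch.save(hot_token_ids, "hot_token_ids.pt")
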
Step 3:
Launch the server. It is recommended to first try the offline small vocabulary; if the performance is not satisfactory, then try setting --speculative-token-map-num-dynamic-tokens.

# Use small vocab table for draft model
python3 -m sglang.launch_server \
--model-path "meta-llama/Llama-3.1-8B-Instruct" \
--speculative-algorithm "EAGLE" \
--speculative-draft-model-path "lmzheng/sglang-EAGLE-LLaMA3-Instruct-8B" \
--speculative-num-steps 3 \
--speculative-eagle-topk 4 \
--speculative-num-draft-tokens 16 \
--cuda-graph-max-bs 8 \
--dtype "bfloat16" \
--speculative-token-map {hot_token_ids.pt} \
--speculative-token-map-num-dynamic-tokens 256

Performance

  1. Environment
    Big model: meta-llama/Llama-3.1-8B-Instruct
    Draft model: lmzheng/sglang-EAGLE-LLaMA3-Instruct-8B
    Device: 1 x H20
  2. Detailed cost breakdown for each part of Eagle
    I analyzed a segment from the profile.
Stage                Part      Baseline time (us)   After optimization time (us)
Draft model decode   lm_head   482.01               62.40
                     others    493.00               421.03
                     total     975.01               483.43
Draft model extend   -         3305.40              2433.08
Target model         -         14233.15             13720.46
others               -         1473.88              2594.46
total                -         20962.45             19717.86

After adopting a smaller vocabulary for the compact model, the inference speed of the lm_head improved to 7.72× the original speed. Additionally, the time consumption of draft decode phase was reduced by 50.42%, and the extend phase saw a 26.30% reduction in processing time.
Overall, the overhead introduced by our approach is less than the time saved through small-model speculation, making this a promising attempt at inference acceleration.
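
As a quick cross-check against the profile table above, the headline figures follow directly from the measured times (a derivation, not new data):

$$\frac{482.01}{62.40} \approx 7.72\times, \qquad \frac{975.01 - 483.43}{975.01} \approx 50.42\%$$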

  3. End-to-End Test:

  • mtbench:
    base: 111.51 tokens/s
    after optimization: 119.27 tokens/s
  • private dataset (A100):
    base: 84.49 tokens/s
    after optimization: 100.24 tokens/s

@Zhou-sx Zhou-sx marked this pull request as ready for review February 25, 2025 02:44
@Zhou-sx Zhou-sx changed the title [Eagle] small vocab table for draft model. [feat] add small vocab table for eagle's draft model. Feb 26, 2025
@zhaochenyang20
Collaborator

Great PR. I will ask our speculative decoding team to review this!

@Zhou-sx
Author

Zhou-sx commented Feb 26, 2025

Great PR. I will ask our speculative decoding team to review this!

Thanks!

@zhaochenyang20
Collaborator

Great PR. I will ask our speculative decoding team to review this!

Thanks!

I've asked Weilin, the author of the paper you implemented. He will take a look.

@Zhou-sx
Author

Zhou-sx commented Feb 26, 2025

Great PR. I will ask our speculative decoding team to review this!

Thanks!

I've asked Weilin, the author of the paper you implemented. He will take a look.

ok. But I don't know which paper you mentioned. Can you provide a link?

@Achazwl
Contributor

Achazwl commented Feb 26, 2025

Great PR. I will ask our speculative decoding team to review this!

Thanks!

I've asked Weilin, the author of the paper you implemented. He will take a look.

ok. But I don't know which paper you mentioned. Can you provide a link?

FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling https://arxiv.org/abs/2502.14856. I think we share the same idea.

@Zhou-sx
Author

Zhou-sx commented Feb 26, 2025

Great PR. I will ask our speculative decoding team to review this!

Thanks!

I've asked Weilin, the author of the paper you implemented. He will take a look.

ok. But I don't know which paper you mentioned. Can you provide a link?

FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling https://arxiv.org/abs/2502.14856. I think we share the same idea.

Wow! I will look at this.

@zhaochenyang20
Collaborator

@Achazwl Hey. We are reworking our EAGLE code these days. @Zhou-sx, we will put this on hold for a while, but the code looks nice to us. Wait for our update. Thanks!

# Update lm_head
self.hot_token_pool.add_token(batch.input_ids)
self.hot_token_ids = self.hot_token_pool.get_hot_token_ids()
self.model_runner.model.lm_head.weight = self.target_worker_lm_haed.data[self.hot_token_ids]
Contributor

This operation is time-consuming when running at each iteration.

@Achazwl
Contributor

Achazwl commented Feb 26, 2025

I use the top 32k frequency tokens from FR-Spec and use your code for experiments.
Model: Llama-3-8B-Instruct
Data: SpecBench (a benchmark including 6 tasks: Conversation, Translation, RAG, Summarization, QA, Math).
Device: 1 x A800

Speed performance

  1. eagle_original: The vanilla EAGLE-2.
  2. eagle_static: Use a smaller lm_head based on static frequency in FR-Spec. (by turning off the dynamic logic in your code)
  3. eagle_dynamic: Use our static frequency of FR-Spec as initialization and use @Zhou-sx's extra dynamic frequency update logic based on the input.
eagle_original vs baseline
============================== Task:  overall ==============================
Tokens per second:  152.14912293271973
Tokens per second for the baseline:  83.68608980929442
Speedup ratio:  1.8180933447773733

eagle_static vs baseline
============================== Task:  overall ==============================
Tokens per second:  168.124455586193
Tokens per second for the baseline:  83.68608980929442
Speedup ratio:  2.008989259377735

eagle_dynamic vs baseline
============================== Task:  overall ==============================
Tokens per second:  162.4540419109253
Tokens per second for the baseline:  83.68608980929442
Speedup ratio:  1.9412311207409607

It seems that the dynamic update logic has some negative effects: it requires extra time to maintain the frequency statistics and to re-copy the lm_head's weights.

Correctness

I checked the generated outputs of eagle_original, eagle_static and eagle_dynamic; they are the same.
The correctness of the code is verified.

@Achazwl
Contributor

Achazwl commented Feb 26, 2025

@Zhou-sx, I think your dynamic logic would benefit certain scenarios. Would you consider formalizing your dynamic frequency updating logic as a configurable server argument? This would allow users to toggle it on/off based on their needs while inviting the team to contribute different dynamic strategies in the future.

@Achazwl
Contributor

Achazwl commented Feb 26, 2025


The top 32k frequency tokens are uploaded at link.

@Zhou-sx
Author

Zhou-sx commented Feb 26, 2025


@Zhou-sx, I think your dynamic logic would benefit certain scenarios. Would you consider formalizing your dynamic frequency updating logic as a configurable server argument? This would allow users to toggle it on/off based on their needs while inviting the team to contribute different dynamic strategies in the future.

Good idea. I agree with what you said. I will test the performance your way after a while.

@Zhou-sx
Author

Zhou-sx commented Feb 26, 2025

I did not add dynamic updates at first, but in some test cases the performance was worse than the baseline. In fact, when speculative_token_map_num_dynamic_tokens is set to 0, dynamic updating is disabled.
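
For clarity, here is a minimal sketch of that gating (illustrative only; the actual attribute and argument names in the PR may differ): when the dynamic-token count is 0, the hot-token pool update and the lm_head re-copy are skipped entirely, so the offline small vocabulary behaves as a purely static table.

import torch

def maybe_update_draft_lm_head(
    num_dynamic_tokens: int,               # --speculative-token-map-num-dynamic-tokens
    hot_token_pool,                        # frequency tracker kept by the draft worker
    input_ids: torch.Tensor,               # tokens seen in the current batch
    target_lm_head_weight: torch.Tensor,   # [vocab_size, hidden_size]
    draft_lm_head: torch.nn.Linear,
) -> None:
    # 0 dynamic tokens == static small vocabulary: skip all per-iteration work
    if num_dynamic_tokens == 0:
        return
    hot_token_pool.add_token(input_ids)
    hot_token_ids = hot_token_pool.get_hot_token_ids()
    # Re-copy only the rows of the current hot tokens; this is the per-iteration
    # cost flagged in the review comment above.
    draft_lm_head.weight.data = target_lm_head_weight[hot_token_ids]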
