
[feat] add small vocab table for eagle's draft model. #3822

Open
wants to merge 3 commits into base: main

Conversation

@Zhou-sx Zhou-sx commented Feb 24, 2025

Motivation

In speculative decoding, while the draft model's decode layers are drastically reduced, the computational bottleneck shifts to its lm_head (whose cost is proportional to the vocabulary size). In our case, profiling shows that one draft-model decode step takes approximately 1.8 ms, of which the lm_head accounts for about 860 us, roughly half of the draft model's time. Since the target model's verification guarantees correctness, the draft model can safely use a truncated vocabulary that retains only high-frequency tokens. This optimization reduces the lm_head overhead and further accelerates the Eagle speculative decoding pipeline without compromising quality.
To further enhance efficiency, I propose dynamically updating the reduced vocabulary (token_map) during inference. This ensures the draft model adapts to shifting token distributions in real-time workloads.

Modifications

Add --speculative-token-map and --speculative-token-map-num-dynamic-tokens to ServerArgs.
I designed a small vocabulary table that supports both offline configuration and online updates, which can be used to reduce the computational load of the lm_head in the LlamaEagle model.
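
For reviewers skimming the idea, here is a minimal sketch (not the PR code; all names and sizes are illustrative) of how a truncated lm_head serves the draft model while drafted tokens are still reported as full-vocabulary ids, so the target model can verify them unchanged:

import torch

# Toy sizes for illustration; the real Llama-3 vocabulary has 128,256 tokens.
vocab_size, hidden_size, num_hot = 1000, 64, 200

# Full draft-model lm_head weight: [vocab_size, hidden_size]
full_lm_head_weight = torch.randn(vocab_size, hidden_size)

# Hot-token ids (full-vocab ids), e.g. loaded from the hot_token_ids.pt of Step 2
hot_token_ids = torch.randperm(vocab_size)[:num_hot]

# Truncated lm_head: keep only the rows of the high-frequency tokens
small_lm_head_weight = full_lm_head_weight[hot_token_ids]  # [num_hot, hidden_size]

def draft_topk(hidden_states: torch.Tensor, k: int = 4):
    # The projection now scales with num_hot instead of vocab_size
    small_logits = hidden_states @ small_lm_head_weight.t()  # [batch, num_hot]
    scores, idx = small_logits.topk(k, dim=-1)
    # Map positions in the small table back to full-vocabulary token ids
    return scores, hot_token_ids[idx]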

Checklist

How to use

Step 1:
Launch an sglang server, run it on the target dataset, and save the outputs.

# Example
python3 {sglang}/benchmark/mtbench/bench_sglang.py

Step 2:
Users can obtain generate_hot_token_ids.py from this GitHub Gist (https://gist.github.com/Zhou-sx/71a9196d2f324c93f79016579fdf57da) as a reference implementation. The script extracts the top-k high-frequency tokens from the saved output files and saves them into hot_token_ids.pt.
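The gist is the reference implementation; as a rough illustration of what such a script does (the file layout, field names, and 32k size below are assumptions), it counts token frequencies over the saved outputs, keeps the top-k ids, and saves them with torch.save:

import glob
import json
from collections import Counter

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
counter = Counter()

# Assumed layout: each saved file is JSON lines with a "text" field per record
for path in glob.glob("outputs/*.jsonl"):
    with open(path) as f:
        for line in f:
            token_ids = tokenizer(json.loads(line)["text"])["input_ids"]
            counter.update(token_ids)

top_k = 32000  # size of the small vocabulary
hot_token_ids = torch.tensor([tok for tok, _ in counter.most_common(top_k)], dtype=torch.long)
torch.save(hot_token_ids, "hot_token_ids.pt")
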
Step 3:
Launch the server. It is recommended to first try the offline small vocabulary; if the performance is not satisfactory, then try setting --speculative-token-map-num-dynamic-tokens.

# Use small vocab table for draft model
python3 -m sglang.launch_server \
--model-path "meta-llama/Llama-3.1-8B-Instruct" \
--speculative-algorithm "EAGLE" \
--speculative-draft-model-path "lmzheng/sglang-EAGLE-LLaMA3-Instruct-8B" \
--speculative-num-steps 3 \
--speculative-eagle-topk 4 \
--speculative-num-draft-tokens 16 \
--cuda-graph-max-bs 8 \
--dtype "bfloat16" \
--speculative-token-map {hot_token_ids.pt} \
--speculative-token-map-num-dynamic-tokens 256

Performance

  1. Environment
    Big model: meta-llama/Llama-3.1-8B-Instruct
    Draft model: lmzheng/sglang-EAGLE-LLaMA3-Instruct-8B
    Device: 1 x H20
  2. Detailed cost breakdown for each part of Eagle
    I analyzed a segment from the profile.
Stage                Part      Baseline time (us)   After optimization time (us)
Draft model decode   lm_head   482.01               62.40
                     others    493.00               421.03
                     total     975.01               483.43
Draft model extend   -         3305.40              2433.08
Target model         -         14233.15             13720.46
others               -         1473.88              2594.46
total                -         20962.45             19717.86

After adopting a smaller vocabulary for the compact model, the inference speed of the lm_head improved to 7.72× the original speed. Additionally, the time consumption of draft decode phase was reduced by 50.42%, and the extend phase saw a 26.30% reduction in processing time.
Overall, the overhead introduced by our approach is less than the time saved through small-model speculation, making this a promising attempt at inference acceleration.
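
As a quick cross-check against the profile table above, the headline figures follow directly from the measured times (a derivation, not new data):

$$\frac{482.01}{62.40} \approx 7.72\times, \qquad \frac{975.01 - 483.43}{975.01} \approx 50.42\%$$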

  3. End-to-End Test:

  • mtbench:
    base: 111.51 tokens/s
    after optimization: 119.27 tokens/s
  • private dataset (A100):
    base: 84.49 tokens/s
    after optimization: 100.24 tokens/s

@Zhou-sx Zhou-sx marked this pull request as ready for review February 25, 2025 02:44
@Zhou-sx Zhou-sx changed the title [Eagle] small vocab table for draft model. [feat] add small vocab table for eagle's draft model. Feb 26, 2025
@zhaochenyang20
Collaborator

Great PR. I will ask our speculative decoding team to review this!

@Zhou-sx
Author

Zhou-sx commented Feb 26, 2025

Great PR. I will ask our speculative decoding team to review this!

Thanks!

@zhaochenyang20
Collaborator

Great PR. I will ask our speculative decoding team to review this!

Thanks!

I've asked Weilin, the author of the paper you implemented. He will take a look.

@Zhou-sx
Author

Zhou-sx commented Feb 26, 2025

Great PR. I will ask our speculative decoding team to review this!

Thanks!

I've asked Weilin, the author of the paper you implemented. He will take a look.

ok. But I don't know which paper you mentioned. Can you provide a link?

@Achazwl
Contributor

Achazwl commented Feb 26, 2025

Great PR. I will ask our speculative decoding team to review this!

Thanks!

I've asked Weilin, the author of the paper you implemented. He will take a look.

ok. But I don't know which paper you mentioned. Can you provide a link?

FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling https://arxiv.org/abs/2502.14856. I think we share the same idea.

@Zhou-sx
Author

Zhou-sx commented Feb 26, 2025

Great PR. I will ask our speculative decoding team to review this!

Thanks!

I've asked Weilin, the author of the paper you implemented. He will take a look.

ok. But I don't know which paper you mentioned. Can you provide a link?

FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling https://arxiv.org/abs/2502.14856. I think we share the same idea.

Wow! I will look at this.

@zhaochenyang20
Collaborator

@Achazwl Hey. We are reworking our EAGLE code these days. @Zhou-sx, we will put this on hold for a while, but the code looks nice to us. Wait for our update. Thanks!

# Update lm_head
self.hot_token_pool.add_token(batch.input_ids)
self.hot_token_ids = self.hot_token_pool.get_hot_token_ids()
self.model_runner.model.lm_head.weight = self.target_worker_lm_haed.data[self.hot_token_ids]
Contributor

This operation is time-consuming when running at each iteration.

@Achazwl
Contributor

Achazwl commented Feb 26, 2025

I use the top 32k frequency tokens from FR-Spec and use your code for experiments.
Model: Llama-3-8B-Instruct
Data: SpecBench (a benchmark including 6 tasks: Conversation, Translation, RAG, Summarization, QA, Math).
Device: 1 x A800

Speed performance

  1. eagle_original: The vanilla EAGLE-2.
  2. eagle_static: Use a smaller lm_head based on static frequency in FR-Spec. (by turning off the dynamic logic in your code)
  3. eagle_dynamic: Use our static frequency of FR-Spec as initialization and use @Zhou-sx's extra dynamic frequency update logic based on the input.
eagle_original vs baseline
============================== Task:  overall ==============================
Tokens per second:  152.14912293271973
Tokens per second for the baseline:  83.68608980929442
Speedup ratio:  1.8180933447773733

eagle_static vs baseline
============================== Task:  overall ==============================
Tokens per second:  168.124455586193
Tokens per second for the baseline:  83.68608980929442
Speedup ratio:  2.008989259377735

eagle_dynamic vs baseline
============================== Task:  overall ==============================
Tokens per second:  162.4540419109253
Tokens per second for the baseline:  83.68608980929442
Speedup ratio:  1.9412311207409607

It seems that the dynamic update logic has some negative effects: it requires extra time to maintain the frequency statistics and to re-copy the lm_head's weights.

Correctness

I checked the generated outputs of eagle_original, eagle_static and eagle_dynamic; they are the same.
The correctness of the code is verified.

@Achazwl
Contributor

Achazwl commented Feb 26, 2025

@Zhou-sx, I think your dynamic logic would benefit certain scenarios. Would you consider formalizing your dynamic frequency updating logic as a configurable server argument? This would allow users to toggle it on/off based on their needs while inviting the team to contribute different dynamic strategies in the future.

@Achazwl
Contributor

Achazwl commented Feb 26, 2025


The top 32k frequency tokens are uploaded at link.

@Zhou-sx
Author

Zhou-sx commented Feb 26, 2025


@Zhou-sx, I think your dynamic logic would benefit certain scenarios. Would you consider formalizing your dynamic frequency updating logic as a configurable server argument? This would allow users to toggle it on/off based on their needs while inviting the team to contribute different dynamic strategies in the future.

Good idea. I agree with what you said. I will test the performance your way after a while.

@Zhou-sx
Author

Zhou-sx commented Feb 26, 2025

I did not add dynamic updates at first, but in some test cases the performance was worse than the baseline. In fact, when speculative_token_map_num_dynamic_tokens is set to 0, dynamic updating is disabled.
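
For clarity, here is a minimal sketch of that gating (illustrative only; the actual attribute and argument names in the PR may differ): when the dynamic-token count is 0, the hot-token pool update and the lm_head re-copy are skipped entirely, so the offline small vocabulary behaves as a purely static table.

import torch

def maybe_update_draft_lm_head(
    num_dynamic_tokens: int,               # --speculative-token-map-num-dynamic-tokens
    hot_token_pool,                        # frequency tracker kept by the draft worker
    input_ids: torch.Tensor,               # tokens seen in the current batch
    target_lm_head_weight: torch.Tensor,   # [vocab_size, hidden_size]
    draft_lm_head: torch.nn.Linear,
) -> None:
    # 0 dynamic tokens == static small vocabulary: skip all per-iteration work
    if num_dynamic_tokens == 0:
        return
    hot_token_pool.add_token(input_ids)
    hot_token_ids = hot_token_pool.get_hot_token_ids()
    # Re-copy only the rows of the current hot tokens; this is the per-iteration
    # cost flagged in the review comment above.
    draft_lm_head.weight.data = target_lm_head_weight[hot_token_ids]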
