[feat] add small vocab table for eagle's draft model. #3822
base: main
Conversation
Great PR. I will ask our speculative decoding team to review this!
Thanks!
I've asked Weilin, the author of the paper you implemented. He will take a look.
OK. But I don't know which paper you mentioned. Can you provide a link?
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling, https://arxiv.org/abs/2502.14856. I think we share the same idea.
Wow! I will look at this.
# Update lm_head
self.hot_token_pool.add_token(batch.input_ids)
self.hot_token_ids = self.hot_token_pool.get_hot_token_ids()
self.model_runner.model.lm_head.weight = self.target_worker_lm_haed.data[self.hot_token_ids]
This operation is time-consuming when running at each iteration.
I used the top 32k frequency tokens from FR-Spec and ran experiments with your code.
Speed: it seems that the dynamic update logic has some negative effects. It requires extra time to maintain the frequencies and needs to recopy the lm_head's weight.
Correctness: I checked the generated outputs of eagle_original, eagle_static, and eagle_dynamic; they are the same.
@Zhou-sx, I think your dynamic logic would benefit certain scenarios. Would you consider formalizing your dynamic frequency updating logic as a configurable server argument? This would allow users to toggle it on/off based on their needs while inviting the team to contribute different dynamic strategies in the future.
The top 32k frequency tokens are uploaded at link.
Good idea. I agree with what you said. I will test the performance your way in a while.
I did not add dynamic updates at first, but in some test cases the performance was worse than the baseline. In fact, when speculative_token_map_num_dynamic_tokens is set to 0, the dynamic update is disabled.
Motivation
In speculative decoding, while the draft model's decode layers are drastically reduced, the computational bottleneck shifts to its lm_head (whose cost is proportional to vocabulary size). In our case, the profile shows that one draft-model decode takes approximately 1.8 ms, of which the lm_head occupies 860 us, about half of the draft model's time. By leveraging the target model's verification to ensure correctness, the draft model can safely use a truncated vocabulary that retains only high-frequency tokens. This optimization reduces the lm_head overhead, further accelerating the EAGLE speculative decoding pipeline without compromising quality.
To further enhance efficiency, I propose dynamically updating the reduced vocabulary (token_map) during inference. This ensures the draft model adapts to shifting token distributions in real-time workloads.
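To make the idea concrete, here is a minimal sketch of how a truncated draft vocabulary reduces the lm_head cost. The function and variable names are illustrative only and are not taken from this PR's diff.

```python
import torch

# Minimal sketch of the truncated-vocabulary idea (illustrative names, not the PR's code).
# hot_token_ids: 1-D LongTensor with the K high-frequency token ids kept by the draft model.
def draft_sample_small_vocab(hidden, full_lm_head_weight, hot_token_ids):
    # Slice the full lm_head weight [V, H] down to [K, H]; the projection cost
    # now scales with K instead of the full vocabulary size V.
    small_weight = full_lm_head_weight[hot_token_ids]      # [K, H]
    logits_small = hidden @ small_weight.t()               # [B, K]
    # Draft tokens are chosen in the reduced index space and mapped back to
    # full-vocabulary ids before the target model verifies them.
    local_ids = logits_small.argmax(dim=-1)                # [B]
    return hot_token_ids[local_ids]                        # [B], full-vocab ids
```

Because the target model verifies every drafted token over the full vocabulary, truncating the draft vocabulary affects only the acceptance rate, not output correctness.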
Modifications
Add speculative_token_map and speculative_token_map_num_dynamic_tokens to ServerArgs.
I designed a small vocabulary that supports offline construction and online updates, which can be used to reduce the computational load of the lm_head in the Llama EAGLE draft model.
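For orientation, below is a rough sketch of what such a hot-token pool could look like. Only the add_token/get_hot_token_ids interface appears in the diff above; the internals here are illustrative guesses, not the PR's actual implementation.

```python
from collections import Counter

import torch

# Hedged sketch: offline seed vocabulary plus optional online frequency updates.
class HotTokenPool:
    def __init__(self, static_hot_token_ids: torch.Tensor, num_dynamic_tokens: int):
        self.static_ids = static_hot_token_ids.tolist()   # offline top-k (e.g. hot_token_ids.pt)
        self.num_dynamic_tokens = num_dynamic_tokens       # 0 disables online updates
        self.freq = Counter()

    def add_token(self, input_ids: torch.Tensor):
        # Track frequencies of tokens actually seen at runtime.
        if self.num_dynamic_tokens > 0:
            self.freq.update(input_ids.flatten().tolist())

    def get_hot_token_ids(self) -> torch.Tensor:
        if self.num_dynamic_tokens == 0:
            return torch.tensor(self.static_ids, dtype=torch.long)
        dynamic = [t for t, _ in self.freq.most_common(self.num_dynamic_tokens)]
        merged = list(dict.fromkeys(self.static_ids + dynamic))  # dedupe, keep order
        return torch.tensor(merged, dtype=torch.long)
```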
Checklist
How to use
Step 1:
Launch an sglang server, run it on the target dataset, and save the outputs.
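As an illustration of this step, here is a minimal client that sends prompts to a running sglang server via its native /generate endpoint and saves the generations for Step 2; prompts.txt and outputs.jsonl are placeholder paths.

```python
import json

import requests

# Hedged sketch: collect generations from a running sglang server so they can be
# mined for token frequencies in Step 2.
with open("prompts.txt") as f, open("outputs.jsonl", "w") as out:
    for prompt in f:
        resp = requests.post(
            "http://127.0.0.1:30000/generate",
            json={"text": prompt.strip(),
                  "sampling_params": {"max_new_tokens": 256, "temperature": 0.0}},
        )
        out.write(json.dumps({"output": resp.json()["text"]}) + "\n")
```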
Step 2:
Users can obtain generate_hot_token_ids.py from this GitHub Gist (https://gist.github.com/Zhou-sx/71a9196d2f324c93f79016579fdf57da) as a reference implementation. The script extracts the top-k high-frequency tokens from the saved output files and saves them to hot_token_ids.pt.
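The gist is the reference implementation; the sketch below only illustrates the kind of processing involved (the tokenizer choice, file names, and the 32k cutoff are placeholders).

```python
import json
from collections import Counter

import torch
from transformers import AutoTokenizer

# Hedged sketch of Step 2: count token frequencies in the saved outputs and keep the top-k ids.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
counter = Counter()
with open("outputs.jsonl") as f:
    for line in f:
        counter.update(tokenizer.encode(json.loads(line)["output"]))

top_k = 32 * 1024  # e.g. a 32k draft vocabulary, as in the FR-Spec experiment above
hot_token_ids = torch.tensor([tok for tok, _ in counter.most_common(top_k)], dtype=torch.long)
torch.save(hot_token_ids, "hot_token_ids.pt")
```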
Step 3:
Launch the server. It is recommended to first try an offline small vocabulary; if the performance is not satisfactory, then try setting speculative_token_map_num_dynamic_tokens.
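For example, a launch command along the following lines; the existing flags are standard sglang EAGLE options, while the spelling of the dynamic-token flag is inferred from the ServerArgs name above and may differ in the final PR.

```
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path lmzheng/sglang-EAGLE-LLaMA3-Instruct-8B \
  --speculative-token-map hot_token_ids.pt \
  --speculative-token-map-num-dynamic-tokens 0
```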
Performance
Target model: meta-llama/Llama-3.1-8B-Instruct
Draft model: lmzheng/sglang-EAGLE-LLaMA3-Instruct-8B
Device: 1x H20
I analyzed a segment from the profile.
After adopting a smaller vocabulary for the draft model, the inference speed of the lm_head improved to 7.72x the original speed. Additionally, the time consumption of the draft decode phase was reduced by 50.42%, and the extend phase saw a 26.30% reduction in processing time.
Overall, the overhead introduced by our approach is less than the time saved through small-model speculation, making this a promising attempt at inference acceleration.
End-to-end test:
base: 111.51 tokens/s, after optimization: 119.27 tokens/s
base: 84.49 tokens/s, after optimization: 100.24 tokens/s