This is a Python package that accelerates Large Language Model (LLM) inference via Speculative Decoding (SD), with a particular focus on Beam Search.
Requirements: `transformers>4.41,<4.45`

```bash
git clone xxx
cd AtSpeed
pip3 install -e .
```
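If needed, the required `transformers` range can also be installed directly (standard pip specifier syntax; the pin is taken verbatim from the requirement above):

```bash
pip install "transformers>4.41,<4.45"
```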
Only one line of code is needed after the import!

```python
from beamsd import replace_beam_search_with_TreeAttn

model = replace_beam_search_with_TreeAttn(model)
```

Then you can use `model.generate` as usual:

```python
outputs = model.generate(**inputs, max_new_tokens=32, num_beams=5)
```
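Below is a minimal end-to-end sketch. The checkpoint name and prompt are illustrative only; any Hugging Face causal LM should work the same way.

```python
# Minimal end-to-end sketch; the checkpoint and prompt are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from beamsd import replace_beam_search_with_TreeAttn

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16
).cuda()

# Patch beam search once; everything else is standard transformers usage.
model = replace_beam_search_with_TreeAttn(model)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, num_beams=5)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```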
Alternatively, the tree-attention beam search can be called as a standalone function:

```python
from atspeed.beamsd import beam_search_by_TreeAttn

outputs = beam_search_by_TreeAttn(model, inputs, max_new_tokens=32, beam_size=5)
```
It is recommended to set the generation parameters in `model.generation_config` instead of passing them directly to `beam_search_by_SD`:
```python
target_model.generation_config.update(**{
    "max_new_tokens": max_new_tokens,
    "num_beams": beam_size,
    "num_return_sequences": beam_size,
})
draft_model.generation_config.update(**{
    "max_new_tokens": gamma,
    "num_beams": draft_beam_size,
    "num_return_sequences": draft_beam_size,
})
```

Then run beam search with speculative decoding:

```python
from atspeed.beamsd import beam_search_by_SD

outputs = beam_search_by_SD(target_model, draft_model, inputs)
```
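Putting it all together, here is a sketch of a full target/draft setup. The checkpoints and hyperparameter values are illustrative (chosen to mirror the experiment below), not requirements of the API.

```python
# Sketch of a full target/draft setup; checkpoints and values are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from atspeed.beamsd import beam_search_by_SD

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
target_model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16
).cuda()
draft_model = AutoModelForCausalLM.from_pretrained(
    "JackFram/llama-68m", torch_dtype=torch.float16
).cuda()

beam_size, draft_beam_size, gamma, max_new_tokens = 5, 40, 3, 4

target_model.generation_config.update(
    max_new_tokens=max_new_tokens,
    num_beams=beam_size,
    num_return_sequences=beam_size,
)
draft_model.generation_config.update(
    max_new_tokens=gamma,
    num_beams=draft_beam_size,
    num_return_sequences=draft_beam_size,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(target_model.device)
outputs = beam_search_by_SD(target_model, draft_model, inputs)
```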
For per-module timing, use the `4timing` variant:

```python
from atspeed.beamsd4timing import beam_search_by_SD_4timing

outputs = beam_search_by_SD_4timing(target_model, draft_model, inputs)
```

`beam_search_by_SD_4timing` provides precise timing for each module, and may therefore take longer in total execution time than `beam_search_by_SD` due to its use of `torch.cuda.synchronize`.
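For context, the pattern below shows why synchronization is needed for accurate GPU timing. This is a generic CUDA timing sketch, not AtSpeed code; the matmul is a stand-in for whatever module is being measured.

```python
# Generic GPU timing sketch (not AtSpeed code). CUDA kernels launch
# asynchronously, so timing without a sync only measures launch overhead.
import time
import torch

x = torch.randn(4096, 4096, device="cuda")

torch.cuda.synchronize()   # flush pending GPU work before starting the clock
start = time.perf_counter()
y = x @ x                  # stand-in for the module being timed
torch.cuda.synchronize()   # wait until the kernel has actually finished
print(f"elapsed: {time.perf_counter() - start:.4f} s")
```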
For more details, please refer to `demo.ipynb` or the source code.
The experiment is conducted on the Beauty dataset with an NVIDIA RTX A5000 GPU: `target_model` is LLaMA-7B and `draft_model` is LLaMA-68M, with `gamma=3`, `max_new_tokens=4`, `draft_beam_size=40`, and `target_beam_size` in {1, 3, 5, 10, 20}.
Most of the code in this repository was developed for or derived from the paper below. Please cite it if you find the repository helpful.
```bibtex
@article{lin2024efficient,
  title={Efficient Inference for Large Language Model-based Generative Recommendation},
  author={Lin, Xinyu and Yang, Chaoqun and Wang, Wenjie and Li, Yongqi and Du, Cunxiao and Feng, Fuli and Ng, See-Kiong and Chua, Tat-Seng},
  journal={arXiv preprint arXiv:2410.05165},
  year={2024}
}
```
We are also planning to add more of our research to this repository, such as the top-K alignment between the draft model and the target model.