Releases · ModelTC/lightllm
v1.0.1
Highlights
- DeepSeek-R1 Multi-Node H100 Deployment Support
- FlashInfer Integration
- XGrammar Integration
What's Changed
- Benchclient by @shihaobai in #740
- fix pause reqs by @shihaobai in #741
- add RETURN_LIST for tgi_api by @shihaobai in #742
- fix: fix a precision bug in the context_flashattention by @blueswhen in #743
- Improve the accuracy of deepseekv3 by @hiworldwzj in #744
- deepseekv3 bmm noquant and fix moe gemm bug. by @hiworldwzj in #745
- Add Xgrammar Support by @flyinglandlord in #701
- fuse fp8 quant in kv copying and add flashinfer decode mla operator in the attention module by @blueswhen in #737
- fix: add flashinfer-python in the requirements.txt by @blueswhen in #749
- Fix tokens2 by @SangChengC in #748
- Fix Unit-test in PR: Add xgrammar by @flyinglandlord in #750
- add support for multinode tp by @shihaobai in #751
Full Changelog: v1.0.0...v1.0.1
LightLLM v1.0.0 Release!
New Features
- Cross-Process Request Object:
  - Retained and optimized the previous three-process architecture design.
  - Introduced a request object that can be accessed across processes, significantly reducing inter-process communication overhead (an illustrative sketch appears after this list).
- Folding of scheduling and model inference:
  - Folded the scheduling step into the model inference loop, significantly reducing communication overhead between the scheduler and modelrpc (see the scheduling-fold sketch after this list).
- CacheTensorManager:
  - A new class that manages the allocation and release of Torch tensors within the framework.
  - Maximizes tensor sharing across layers at runtime and enhances memory sharing between different CUDA graphs.
  - On an 8x80GB H100 machine running the DeepSeek-V2 model, LightLLM can run 200 CUDA graphs concurrently without running out of memory (OOM) (see the cache-manager sketch after this list).
- PD-Disaggregation Prototype:
  - Dynamic registration of prefill (P) and decode (D) nodes (see the registry sketch after this list).
- Fastest DeepSeek-R1 performance on H200
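
To make the cross-process request object concrete, here is a minimal sketch (not LightLLM's actual implementation) of a request whose mutable fields live in shared memory via `multiprocessing.Value`, so another process can update them in place instead of exchanging serialized copies per token. All class, field, and function names below are hypothetical.

```python
"""Illustrative sketch of a request object with shared, cross-process state.
This is a simplified assumption, not LightLLM's real request class."""

import multiprocessing as mp

RUNNING, FINISHED = 0, 1


class SharedRequest:
    """A request whose mutable fields live in shared memory, so any process
    holding a reference updates them in place without message passing."""

    def __init__(self, request_id: int, prompt_len: int):
        self.request_id = request_id
        self.prompt_len = prompt_len
        # 'q' = signed 64-bit integer backed by a shared-memory segment.
        self.status = mp.Value("q", RUNNING)
        self.generated_len = mp.Value("q", 0)

    def record_new_token(self) -> None:
        with self.generated_len.get_lock():
            self.generated_len.value += 1

    def finish(self) -> None:
        self.status.value = FINISHED

    def is_finished(self) -> bool:
        return self.status.value == FINISHED


def model_worker(req: SharedRequest) -> None:
    # A stand-in "model" process: it mutates the same shared state that the
    # server/scheduler processes read, with no per-token message passing.
    for _ in range(5):
        req.record_new_token()
    req.finish()


if __name__ == "__main__":
    req = SharedRequest(request_id=1, prompt_len=16)
    p = mp.Process(target=model_worker, args=(req,))
    p.start()
    p.join()
    print(req.is_finished(), req.generated_len.value)  # True 5
```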
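
The scheduling fold can be pictured with the toy engine below: rather than a scheduler process shipping a batch to a model-RPC process every iteration, the admission decision is made inline in the same `step()` loop, removing one round of inter-process communication per step. This is a conceptual sketch with made-up classes, not the real scheduler.

```python
"""Toy illustration of folding scheduling into the inference loop.
All classes here are hypothetical stand-ins, not LightLLM's code."""

from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: int
    remaining_tokens: int


@dataclass
class FoldedEngine:
    """Scheduler + inference folded into a single step() call."""

    waiting: deque = field(default_factory=deque)
    running: list = field(default_factory=list)
    max_batch_size: int = 8

    def add_request(self, req: Request) -> None:
        self.waiting.append(req)

    def _schedule(self) -> None:
        # Inline scheduling decision: admit waiting requests while there is room.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())

    def step(self) -> list:
        # One folded iteration: schedule, then run a decode step on the batch.
        self._schedule()
        finished = []
        for req in self.running:
            req.remaining_tokens -= 1  # stand-in for one decode forward pass
            if req.remaining_tokens == 0:
                finished.append(req.request_id)
        self.running = [r for r in self.running if r.remaining_tokens > 0]
        return finished


if __name__ == "__main__":
    engine = FoldedEngine()
    for i in range(3):
        engine.add_request(Request(request_id=i, remaining_tokens=i + 1))
    done = []
    while engine.running or engine.waiting:
        done.extend(engine.step())
    print(done)  # [0, 1, 2]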
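
The sketch below shows the general idea behind a cache tensor manager: buffers are pooled by (shape, dtype, device) and handed back out on the next allocation with the same key, so intermediate tensors can be shared instead of re-allocated. The class and method names are simplified assumptions, not the real CacheTensorManager API.

```python
"""Illustrative tensor-pooling sketch; a simplified assumption, not the
actual CacheTensorManager implementation."""

from collections import defaultdict

import torch


class CacheTensorManager:
    """Hands out cached tensors and takes them back for later reuse."""

    def __init__(self) -> None:
        self._pool = defaultdict(list)  # (shape, dtype, device) -> [tensors]

    def alloc(self, shape, dtype, device="cpu") -> torch.Tensor:
        key = (tuple(shape), dtype, torch.device(device))
        if self._pool[key]:
            return self._pool[key].pop()  # reuse an existing buffer
        return torch.empty(shape, dtype=dtype, device=device)

    def free(self, tensor: torch.Tensor) -> None:
        # Return the buffer to the pool; the next alloc with the same
        # (shape, dtype, device) key gets this tensor back.
        key = (tuple(tensor.shape), tensor.dtype, tensor.device)
        self._pool[key].append(tensor)


if __name__ == "__main__":
    manager = CacheTensorManager()
    hidden = manager.alloc((4, 8), torch.float32)
    manager.free(hidden)
    reused = manager.alloc((4, 8), torch.float32)
    print(reused.data_ptr() == hidden.data_ptr())  # True: same storage reused
```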
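
Dynamic registration of prefill (P) and decode (D) nodes can be illustrated with a small registry that tracks node roles, endpoints, and heartbeats, then routes to a live node of the requested role. Everything here (roles, endpoints, timeout, method names) is a made-up example, not LightLLM's actual PD-disaggregation protocol.

```python
"""Illustrative P/D node registry sketch; names and behavior are assumptions."""

import time
from dataclasses import dataclass, field


@dataclass
class NodeInfo:
    role: str                 # "prefill" or "decode"
    endpoint: str             # e.g. "10.0.0.2:8000"
    last_heartbeat: float = field(default_factory=time.monotonic)


class PDRegistry:
    """Tracks live prefill/decode nodes and routes requests by role."""

    def __init__(self, heartbeat_timeout: float = 10.0) -> None:
        self._nodes = {}            # node_id -> NodeInfo
        self._timeout = heartbeat_timeout
        self._rr = 0                # round-robin cursor

    def register(self, node_id: str, role: str, endpoint: str) -> None:
        self._nodes[node_id] = NodeInfo(role=role, endpoint=endpoint)

    def heartbeat(self, node_id: str) -> None:
        self._nodes[node_id].last_heartbeat = time.monotonic()

    def _alive(self, info: NodeInfo) -> bool:
        return time.monotonic() - info.last_heartbeat < self._timeout

    def pick(self, role: str) -> str:
        candidates = [n.endpoint for n in self._nodes.values()
                      if n.role == role and self._alive(n)]
        if not candidates:
            raise RuntimeError(f"no live {role} node registered")
        self._rr += 1
        return candidates[self._rr % len(candidates)]


if __name__ == "__main__":
    registry = PDRegistry()
    registry.register("p0", "prefill", "10.0.0.2:8000")
    registry.register("d0", "decode", "10.0.0.3:8000")
    registry.register("d1", "decode", "10.0.0.4:8000")
    print(registry.pick("prefill"))
    print(registry.pick("decode"))
```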
For more details, stay tuned to our blog at https://www.light-ai.top/lightllm-blog/. Thanks go to outstanding projects such as vllm, sglang, and trtllm; LightLLM also leverages some of the high-performance quantization kernels from vllm. We hope to collaborate with them in driving the growth of the open-source community.