
Commit 80bc075

kaiyux and Shixiaowei02 authored

Update TensorRT-LLM Release branch (#745)

* Update TensorRT-LLM

---------

Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

1 parent a8018c1 commit 80bc075

File tree

19 files changed: +450 −169 lines

CHANGELOG.md

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
# Change Log

## Versions 0.6.0 / 0.6.1

* Models
  * ChatGLM3
  * InternLM (contributed by @wangruohui)
  * Mistral 7B (developed in collaboration with Mistral.AI)
  * MQA/GQA support for the MPT (and GPT) models (contributed by @bheilbrun)
  * Qwen (contributed by @Tlntin and @zhaohb)
  * Replit Code V-1.5 3B (external contribution)
  * T5, mT5, Flan-T5 (Python runtime only)

* Features
  * Add runtime statistics related to active requests and KV cache utilization from the batch manager (see the [batch manager](docs/source/batch_manager.md) documentation)
  * Add a `sequence_length` tensor to support proper lengths in beam search (when beam width > 1 - see [tensorrt_llm/batch_manager/GptManager.h](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
  * BF16 support for encoder-decoder models (Python runtime - see [examples/enc_dec](examples/enc_dec/README.md))
  * Improvements to memory utilization (CPU and GPU - including fixes for memory leaks)
  * Improved error reporting and memory consumption
  * Improved support for stop and bad words
  * INT8 SmoothQuant and INT8 KV cache support for the Baichuan models (see [examples/baichuan](examples/baichuan/README.md))
  * INT4 AWQ tensor parallelism support and INT8 KV cache + AWQ/weight-only support for the GPT-J model (see [examples/gptj](examples/gptj/README.md))
  * INT4 AWQ support for the Falcon models (see [examples/falcon](examples/falcon/README.md))
  * LoRA support for the GPT model (functional preview only - limited to the Python runtime, QKV-only support, not optimized for runtime performance; see the [Run LoRA with the Nemo checkpoint](examples/gpt/README.md#Run-LoRA-with-the-Nemo-checkpoint) section in the GPT example)
  * Multi-GPU support for encoder-decoder models (Python runtime - see [examples/enc_dec](examples/enc_dec/README.md))
  * New heuristic for launching the multi-block masked MHA kernel (similar to FlashDecoding - see [decoderMaskedMultiheadAttentionLaunch.h](cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionLaunch.h))
  * Prompt-tuning support for the GPT and LLaMA models (see the [Prompt-tuning](examples/gpt/README.md#Prompt-tuning) section in the GPT example)
  * Performance optimizations in various CUDA kernels
  * Possibility to exclude input tokens from the output (see `excludeInputInOutput` in [`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
  * Python binding for the C++ runtime (GptSession - see [`pybind`](cpp/tensorrt_llm/pybind))
  * Support for different micro-batch sizes in the context and generation phases with pipeline parallelism (see `GptSession::Config::ctxMicroBatchSize` and `GptSession::Config::genMicroBatchSize` in [tensorrt_llm/runtime/gptSession.h](cpp/include/tensorrt_llm/runtime/gptSession.h))
  * Support for "remove input padding" for encoder-decoder models (see [examples/enc_dec](examples/enc_dec/README.md))
  * Support for context and generation logits (see `mComputeContextLogits` and `mComputeGenerationLogits` in [tensorrt_llm/runtime/gptModelConfig.h](cpp/include/tensorrt_llm/runtime/gptModelConfig.h))
  * Support for `logProbs` and `cumLogProbs` (see `"output_log_probs"` and `"cum_log_probs"` in [`GptManager`](cpp/include/tensorrt_llm/batch_manager/GptManager.h))
  * Update to CUTLASS 3.x

* Bug fixes
  * Fix ChatGLM2 issues #93 and #138
  * Fix the tensor names error "RuntimeError: Tensor names (`host_max_kv_cache_length`) in engine are not the same as expected in the main branch" #369
  * Fix the weights split issue in BLOOM when `world_size = 2` ("array split does not result in an equal division") #374
  * Fix a SmoothQuant multi-GPU failure when tensor parallelism is 2 #267
  * Fix a crash in GenerationSession when the stream keyword argument is not None #202
  * Fix a typo when calling the PyNVML API #410
  * Fix bugs related to improper management of the `end_id` for various models [C++ and Python]
  * Fix memory leaks [C++ code and Python models]
  * Fix the std::bad_alloc error when running gptManagerBenchmark ("gptManagerBenchmark std::bad_alloc error") #66
  * Fix a bug in pipeline parallelism when beam width > 1
  * Fix a bug with LLaMA GPTQ due to improper support of GQA
  * Fix issue #88
  * Fix an issue with the Hugging Face Transformers version #16
  * Fix a broken link in the Windows README.md #30 - by @yuanlehome
  * Fix a typo in batchScheduler.h #56 - by @eltociear
  * Fix a typo #58 - by @RichardScottOZ
  * Fix multi-block MMHA ("Difference between `max_batch_size` in the engine builder and `max_num_sequences` in TrtGptModelOptionalParams?") #65
  * Fix the log message to be more accurate about the KV cache #224
  * Fix the Windows release wheel installation ("Failed to install the release wheel for Windows using pip") #261
  * Fix missing torch dependencies ("[BUG] The batch_manage.a choice error in --cpp-only when torch's cxx_abi version is different with gcc") #151
  * Fix a linking error when compiling google-test & benchmarks #277
  * Fix the logits dtype for Baichuan and ChatGLM (segmentation fault caused by the lack of bfloat16) #335
  * Minor bug fixes

## Version 0.5.0

* TensorRT-LLM v0.5.0 is the first public release.
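
The MQA/GQA support called out above (for the MPT and GPT models) refers to attention variants in which several query heads share one key/value head, which shrinks the KV cache. The sketch below only illustrates that head-grouping idea in plain NumPy, with made-up shapes and with causal masking omitted; it is not TensorRT-LLM's kernel or API.

```python
# Minimal sketch of grouped-query attention (GQA); shapes are hypothetical.
import numpy as np

def gqa_attention(q, k, v):
    """q: [num_q_heads, seq, dim]; k, v: [num_kv_heads, seq, dim].
    Each group of num_q_heads // num_kv_heads query heads shares one KV head."""
    num_q_heads, seq_len, head_dim = q.shape
    num_kv_heads = k.shape[0]
    group_size = num_q_heads // num_kv_heads
    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group_size                              # query head -> shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(head_dim)       # (causal masking omitted)
        scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        out[h] = probs @ v[kv]
    return out

# Example: 8 query heads sharing 2 KV heads (group size 4).
q = np.random.randn(8, 16, 64).astype(np.float32)
k = np.random.randn(2, 16, 64).astype(np.float32)
v = np.random.randn(2, 16, 64).astype(np.float32)
print(gqa_attention(q, k, v).shape)  # (8, 16, 64)
```

Setting `num_kv_heads = 1` gives MQA, and `num_kv_heads = num_q_heads` recovers standard multi-head attention.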

README.md

Lines changed: 41 additions & 101 deletions
@@ -8,7 +8,7 @@ TensorRT-LLM
 [![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
 [![cuda](https://img.shields.io/badge/cuda-12.2-green)](https://developer.nvidia.com/cuda-downloads)
 [![trt](https://img.shields.io/badge/TRT-9.2-green)](https://developer.nvidia.com/tensorrt)
-[![version](https://img.shields.io/badge/release-0.7.0-green)](./setup.py)
+[![version](https://img.shields.io/badge/release-0.7.1-green)](./setup.py)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)

 [Architecture](./docs/source/architecture.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/performance.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](./examples/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentation](./docs/source/)
@@ -108,16 +108,16 @@ concepts used in TensorRT-LLM, we recommend you to read the following

 ## Installation

-*For Windows installation, see [`Windows`](windows/README.md).*
-
-TensorRT-LLM must be built from source, instructions can be found
+The documentation for installing TensorRT-LLM can be found
 [here](./docs/source/installation.md). An image of a Docker container with
 TensorRT-LLM and its Triton Inference Server Backend will be made available
 soon.

 The remaining commands in that document must be executed from the TensorRT-LLM
 container.

+*For Windows installation, see [`Windows`](windows/README.md).*
+
 ## Quick Start

 To create a TensorRT engine for an existing model, there are 3 steps:
@@ -379,103 +379,43 @@ For example: `mpirun -n 1 python3 examples/gpt/build.py ...`

 ### Change Log

-#### Version 0.6.1
-
-[... the "Version 0.6.1" and "Version 0.5.0" entries removed here are identical to the CHANGELOG.md content added above ...]
-
-#### Version 0.5.0
-
-* TensorRT-LLM v0.5.0 is the first public release.
+#### Versions 0.7.0 / 0.7.1
+
+* Models
+  - BART and mBART support in encoder-decoder models
+  - FairSeq Neural Machine Translation (NMT) family
+  - Mixtral-8x7B model
+  - Support weight loading for the Hugging Face Mixtral model
+  - OpenAI Whisper
+  - Mixture-of-Experts support
+  - MPT - INT4 AWQ / SmoothQuant support
+  - Baichuan FP8 quantization support
+* Features
+  - [Preview] Speculative decoding
+  - Add Python bindings for `GptManager`
+  - Add a Python class `ModelRunnerCpp` that wraps the C++ `gptSession`
+  - System prompt caching
+  - Enable split-k for weight-only CUTLASS kernels
+  - FP8 KV cache support for the XQA kernel
+  - New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
+  - Support `StoppingCriteria` and `LogitsProcessor` in the Python generate API (thanks to the contribution from @zhang-ge-hao)
+  - fMHA support for chunked attention and paged KV cache
+* Bug fixes
+  - Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
+  - Fix LLaMA with LoRA error #637
+  - Fix LLaMA GPTQ failure #580
+  - Fix Python binding for InferenceRequest issue #528
+  - Fix CodeLlama SQ accuracy issue #453
+* Performance
+  - MMHA optimization for MQA and GQA
+  - LoRA optimization: CUTLASS grouped GEMM
+  - Optimize Hopper warp-specialized kernels
+  - Optimize AllReduce for parallel attention on Falcon and GPT-J
+  - Enable split-k for weight-only CUTLASS kernels when SM >= 75
+* Documentation
+  - Add [documentation for the new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)
+
+#### For earlier change log entries, see [CHANGELOG.md](./CHANGELOG.md).
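
The `StoppingCriteria` / `LogitsProcessor` support added above follows the familiar callback pattern from Hugging Face `transformers`: user code runs at each decoding step to either end generation early or rewrite the next-token logits. The classes and signatures below are hypothetical stand-ins that only illustrate that pattern; consult the TensorRT-LLM Python API for the exact interfaces.

```python
# Illustration only: generation-step callbacks in the Hugging Face style.
# The actual TensorRT-LLM class names and signatures may differ.
import torch

class StopOnTokenCriteria:
    """Stop generation as soon as every sequence has just emitted a chosen token."""
    def __init__(self, stop_token_id: int):
        self.stop_token_id = stop_token_id

    def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> bool:
        # input_ids: [batch, seq_len] tokens generated so far.
        return bool((input_ids[:, -1] == self.stop_token_id).all())

class BanTokenProcessor:
    """Set the logit of a banned token to -inf so it is never sampled."""
    def __init__(self, banned_token_id: int):
        self.banned_token_id = banned_token_id

    def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        # scores: [batch, vocab_size] logits for the next token.
        scores[:, self.banned_token_id] = float("-inf")
        return scores
```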

 ### Known Issues

benchmarks/python/allowed_configs.py

Lines changed: 7 additions & 0 deletions
@@ -232,6 +232,7 @@ class ModelConfig:
                 builder_opt=None,
                 pre_norm=False,
                 do_layer_norm_before=False,
+                use_custom_all_reduce=False,
             )),
     "opt_2.7b":
         ModelConfig(name="opt_2.7b",
@@ -250,6 +251,7 @@ class ModelConfig:
                 builder_opt=None,
                 pre_norm=False,
                 do_layer_norm_before=True,
+                use_custom_all_reduce=False,
             )),
     "opt_6.7b":
         ModelConfig(name="opt_6.7b",
@@ -268,6 +270,7 @@ class ModelConfig:
                 builder_opt=None,
                 pre_norm=False,
                 do_layer_norm_before=True,
+                use_custom_all_reduce=False,
             )),
     "opt_66b":
         ModelConfig(name="opt_66b",
@@ -286,6 +289,7 @@ class ModelConfig:
                 builder_opt=None,
                 pre_norm=True,
                 do_layer_norm_before=True,
+                use_custom_all_reduce=False,
             )),
     "llama_7b":
         ModelConfig(name="llama_7b",
@@ -512,6 +516,7 @@ class ModelConfig:
                 max_output_len=200,
                 builder_opt=None,
                 remove_input_padding=False,
+                use_custom_all_reduce=False,
             )),
     "bloom_560m":
         ModelConfig(name="bloom_560m",
@@ -528,6 +533,7 @@ class ModelConfig:
                 max_input_len=1024,
                 max_output_len=1024,
                 builder_opt=None,
+                use_custom_all_reduce=False,
             )),
     "bloom_176b":
         ModelConfig(name="bloom_176b",
@@ -544,6 +550,7 @@ class ModelConfig:
                 max_input_len=1024,
                 max_output_len=1024,
                 builder_opt=None,
+                use_custom_all_reduce=False,
             )),
     "bert_base":
         ModelConfig(name="bert_base",
Lines changed: 1 addition & 1 deletion

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:717c7aac842fe8d8cc52e07740d6a158889ab1ae07d02e6575e1eb3e640848c1
+oid sha256:c98f8854a1d8967775c94bb96a5a37dca190f1fa808b3f846db870c30cce2bfd
 size 1801434

Lines changed: 1 addition & 1 deletion

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:2d9f22df17f665e526b1db997e6de4dfa2ca5e22c1a13cb125fc02e07389e43f
+oid sha256:ed514ea9c0634d4fc95a0a53e7719f72ec8e2b0a596d1e8a60516652f66b8ca2
 size 1819282

Lines changed: 3 additions & 3 deletions

@@ -1,3 +1,3 @@
-516ff2db1e17536e92150b0c05200589 libtensorrt_llm_batch_manager_static.a
-428a500536705184a1aad8aaf5c9c0ca libtensorrt_llm_batch_manager_static.pre_cxx11.a
-33b6139e3bb108df093aab3a6de38a87f1f1e2dd commit
+ffe001b0bf9ee66b3e3696423d6d09a2 libtensorrt_llm_batch_manager_static.a
+3657ea3400959a64be77c12d8598dd72 libtensorrt_llm_batch_manager_static.pre_cxx11.a
+9a775b3dbb20444f130f13f90e675cc971fe7e15 commit

Lines changed: 1 addition & 1 deletion

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4b4c8f559dddb001f8355a0162423af385e6803376d2cb4f9b9c37f7840659e0
+oid sha256:542ccb1497c91d82048eb9bec07527317c702e9c7466923d8b61e12374e087fb
 size 1722062

Lines changed: 1 addition & 1 deletion

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b156f4fbdafcb12ae7c39be35da04b10fc42cd911f15f03f892c2d118ec3825a
+oid sha256:51a4dc2d8e2b7624976fb5f8370b8f44e1c25f038bd3915e1c31eb63c60b7c22
 size 1715766

Lines changed: 2 additions & 2 deletions

@@ -1,2 +1,2 @@
-0403e89a23fd77aed43cac0ecd8136cf libtensorrt_llm_batch_manager_static.a
-9fa2a1c18860eaf226a6ce61a8e3ed5d libtensorrt_llm_batch_manager_static.pre_cxx11.a
+bb69bf376c5f955c327e867049639d78 libtensorrt_llm_batch_manager_static.a
+14b107676c74ce17bfc8ce950b36a984 libtensorrt_llm_batch_manager_static.pre_cxx11.a
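
The three-line `version`/`oid`/`size` blobs above are Git LFS pointer files: the actual binaries (the prebuilt `libtensorrt_llm_batch_manager_static` archives whose MD5 sums also change in this commit) live in LFS storage, and each pointer records only the SHA-256 and byte size of the tracked file. The file names of these pointer entries are not shown in this rendering of the diff. A small sketch of how such a pointer is derived from a file (the path is hypothetical):

```python
# Recompute a Git LFS pointer file from a local binary.
import hashlib
import os

def lfs_pointer(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
            digest.update(chunk)
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{digest.hexdigest()}\n"
        f"size {os.path.getsize(path)}\n"
    )

# print(lfs_pointer("libtensorrt_llm_batch_manager_static.a"))  # hypothetical local path
```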

cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp

Lines changed: 2 additions & 1 deletion
@@ -121,7 +121,8 @@ class FusedMHARunnerV2::mhaImpl
         if (mLaunchParams.useKernelWithoutAlibi)
         {
             // The kernel adopts the log2f optimization.
-            set_alpha(params.scale_bmm1, scale_bmm1 * float(M_LOG2E), DATA_TYPE_FP32);
+            constexpr float kLog2e = 1.4426950408889634074; // log_2(e) = M_LOG2E
+            set_alpha(params.scale_bmm1, scale_bmm1 * float(kLog2e), DATA_TYPE_FP32);
         }
         else
         {
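
The change above replaces `M_LOG2E`, which is a POSIX math-header extension rather than a standard C++ constant, with an explicitly defined `kLog2e` of the same value. The constant matters because exp(x) = 2^(x · log2 e), so folding log2 e into the softmax scale lets the fused MHA kernel evaluate the cheaper exp2 path instead of exp. The Python check below only verifies that identity and is independent of the kernel code.

```python
# Check the identity behind the log2f optimization:
# exp(s * x) == 2 ** (s * log2(e) * x), so scaling the attention scores by
# log2(e) up front lets a kernel use exp2 instead of exp.
import math

LOG2E = 1.4426950408889634074  # log_2(e), the value hard-coded as kLog2e

scale = 0.125  # e.g. 1/sqrt(head_dim) softmax scaling
for x in (-3.0, 0.0, 1.7, 42.0):
    via_exp = math.exp(scale * x)
    via_exp2 = 2.0 ** (scale * LOG2E * x)
    assert math.isclose(via_exp, via_exp2, rel_tol=1e-12), (x, via_exp, via_exp2)
print("exp(s*x) matches 2**(s*log2(e)*x) for all test points")
```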
