(`host_max_kv_cache_length`) in engine are not the same as expected in the main branch" #369
* Fix weights split issue in BLOOM when `world_size = 2` ("array split does not result in an equal division") #374
* Fix SmoothQuant multi-GPU failure when tensor parallelism is 2 #267
* Fix a crash in GenerationSession if the `stream` keyword argument is not `None` #202
* Fix a typo when calling the PyNVML API #410
* Fix bugs related to the improper management of the `end_id` for various models [C++ and Python]
* Fix memory leaks [C++ code and Python models]
* Fix the std::bad_alloc error when running gptManagerBenchmark #66
* Fix a bug in pipeline parallelism when beam-width > 1
* Fix a bug with Llama GPTQ due to improper support of GQA
* Fix issue #88
* Fix an issue with the Huggingface Transformers version #16
* Fix link jump in windows readme.md #30 - by @yuanlehome
* Fix typo in batchScheduler.h #56 - by @eltociear
* Fix typo #58 - by @RichardScottOZ
* Fix Multi-block MMHA: difference between `max_batch_size` in the engine builder and `max_num_sequences` in `TrtGptModelOptionalParams` #65
* Fix the log message to be more accurate on KV cache #224
* Fix Windows release wheel installation: failed to install the release wheel for Windows using pip #261
* Fix missing torch dependencies: wrong `batch_manager.a` chosen with `--cpp-only` when torch's cxx_abi version differs from gcc's #151
* Fix linking error when compiling google-test and benchmarks #277
* Fix logits dtype for Baichuan and ChatGLM: segmentation fault caused by the lack of bfloat16 #335
* Minor bug fixes

#### Version 0.5.0

* TensorRT-LLM v0.5.0 is the first public release.

#### Versions 0.7.0 / 0.7.1

* Models
  - BART and mBART support in encoder-decoder models
  - FairSeq Neural Machine Translation (NMT) family
  - Mixtral-8x7B model
    - Support weight loading for HuggingFace Mixtral model
  - OpenAI Whisper
  - Mixture of Experts support
  - MPT - Int4 AWQ / SmoothQuant support
  - Baichuan FP8 quantization support
* Features
  - [Preview] Speculative decoding
  - Add Python binding for `GptManager`
  - Add a Python class `ModelRunnerCpp` that wraps C++ `gptSession` (see the first sketch after this list)
  - System prompt caching
  - Enable split-k for weight-only cutlass kernels
  - FP8 KV cache support for XQA kernel
  - New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
  - Support `StoppingCriteria` and `LogitsProcessor` in the Python generate API (thanks to the contribution from @zhang-ge-hao; see the second sketch after this list)
  - fMHA support for chunked attention and paged KV cache
* Bug fixes
  - Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
  - Fix LLaMa with LoRA error #637
  - Fix LLaMA GPTQ failure #580
  - Fix Python binding for InferenceRequest issue #528
  - Fix CodeLlama SQ accuracy issue #453
* Performance
  - MMHA optimization for MQA and GQA
  - LoRA optimization: cutlass grouped gemm
  - Optimize Hopper warp specialized kernels
  - Optimize AllReduce for parallel attention on Falcon and GPT-J
  - Enable split-k for weight-only cutlass kernel when SM >= 75
* Documentation
  - Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)
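
The `ModelRunnerCpp` class listed under Features gives Python code a thin wrapper over the C++ `gptSession`. Below is a minimal sketch of how such a runner is typically driven, assuming an engine already produced by `trtllm-build`; the tokenizer, engine path, and exact keyword set are placeholders and may differ between releases.

```python
# Sketch only: assumes a TensorRT-LLM engine exists in ./engine and that the
# runtime API matches the example scripts shipped with this release.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunnerCpp

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder tokenizer
runner = ModelRunnerCpp.from_dir(engine_dir="./engine")  # hypothetical path

input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids.int()
with torch.no_grad():
    # generate() takes a list of 1-D token tensors and returns output ids
    # shaped [batch, beam, seq_len].
    output_ids = runner.generate(
        batch_input_ids=[input_ids[0]],
        max_new_tokens=32,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))
```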
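
The `StoppingCriteria` / `LogitsProcessor` support follows the `transformers` interfaces, so existing HF-style hooks can be reused. A small sketch: the two classes below are standard `transformers` subclasses, while the `generate()` keywords in the usage comment are assumptions about what the Python runner accepts in this release.

```python
# Sketch only: StoppingCriteria/LogitsProcessor are real transformers APIs;
# the runner.generate() keyword names in the comment below are assumed.
import torch
from transformers import (LogitsProcessor, LogitsProcessorList,
                          StoppingCriteria, StoppingCriteriaList)

class StopOnToken(StoppingCriteria):
    """Stop as soon as the most recently sampled token equals `stop_id`."""
    def __init__(self, stop_id: int):
        self.stop_id = stop_id

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor, **kwargs) -> bool:
        return bool(input_ids[0, -1].item() == self.stop_id)

class BanToken(LogitsProcessor):
    """Mask one token id out of the distribution at every decoding step."""
    def __init__(self, banned_id: int):
        self.banned_id = banned_id

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        scores[:, self.banned_id] = float("-inf")
        return scores

# Assumed usage with the Python runner (keyword names are assumptions):
# outputs = runner.generate(
#     batch_input_ids,
#     max_new_tokens=64,
#     stopping_criteria=StoppingCriteriaList([StopOnToken(13)]),
#     logits_processor=LogitsProcessorList([BanToken(42)]),
# )
```
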
#### For the full change history, please see [CHANGELOG.md](./CHANGELOG.md).