
Releases: vllm-project/vllm

v0.6.1

11 Sep 21:44
3fd2b0d

Highlights

Model Support

  • Added support for Pixtral (mistralai/Pixtral-12B-2409). (#8377, #8168)
  • Added support for Llava-Next-Video (#7559), Qwen-VL (#8029), Qwen2-VL (#7905)
  • Multi-input support for LLaVA (#8238) and InternVL2 models (#8201); see the sketch after this list
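
A minimal sketch of multi-image offline inference, assuming the multi-modal API described in #8238/#8201. The model name, the image files, and the `limit_mm_per_prompt` argument are illustrative assumptions, not something these notes specify.

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Assumption: this LLaVA checkpoint accepts multiple images per prompt after
# #8238; limit_mm_per_prompt caps how many images a single prompt may carry.
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    limit_mm_per_prompt={"image": 2},
)

prompt = "USER: <image>\n<image>\nWhat differs between the two images? ASSISTANT:"
images = [Image.open("cat.jpg"), Image.open("dog.jpg")]  # placeholder files

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": images}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```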

Performance Enhancements

  • Memory optimization for awq_gemm and awq_dequantize, 2x throughput (#8248)

Production Engine

  • Support loading and unloading LoRA adapters in the API server (#6566); see the example after this list
  • Add progress reporting to batch runner (#8060)
  • Add support for NVIDIA ModelOpt static scaling checkpoints. (#6112)
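
A hedged sketch of exercising the new LoRA load/unload endpoints from #6566. The endpoint paths, the JSON fields, and the requirement to enable runtime LoRA updating on the server are assumptions drawn from the PR description; check the docs for your exact version.

```python
import requests

BASE = "http://localhost:8000"  # an already-running `vllm serve` instance

# Assumption: the server exposes /v1/load_lora_adapter and /v1/unload_lora_adapter
# and expects lora_name/lora_path fields, per the #6566 description.
requests.post(f"{BASE}/v1/load_lora_adapter",
              json={"lora_name": "sql_adapter", "lora_path": "/adapters/sql"})

# The freshly loaded adapter can then be targeted like any served model name.
resp = requests.post(f"{BASE}/v1/completions",
                     json={"model": "sql_adapter", "prompt": "SELECT", "max_tokens": 16})
print(resp.json()["choices"][0]["text"])

requests.post(f"{BASE}/v1/unload_lora_adapter", json={"lora_name": "sql_adapter"})
```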

Others

  • Updated the Docker image to use Python 3.12 for a small performance bump (#8133)
  • Added CODE_OF_CONDUCT.md (#8161)

What's Changed

  • [Doc] [Misc] Create CODE_OF_CONDUCT.md by @mmcelaney in #8161
  • [bugfix] Upgrade minimum OpenAI version by @SolitaryThinker in #8169
  • [Misc] Clean up RoPE forward_native by @WoosukKwon in #8076
  • [ci] Mark LoRA test as soft-fail by @khluu in #8160
  • [Core/Bugfix] Add query dtype as per FlashInfer API requirements. by @elfiegg in #8173
  • [Doc] Add multi-image input example and update supported models by @DarkLight1337 in #8181
  • Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) by @Manikandan-Thangaraj-ZS0321 in #7860
  • [MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) by @alex-jw-brooks in #8029
  • Move verify_marlin_supported to GPTQMarlinLinearMethod by @mgoin in #8165
  • [Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM by @sroy745 in #7962
  • [Core] Support load and unload LoRA in api server by @Jeffwan in #6566
  • [BugFix] Fix Granite model configuration by @njhill in #8216
  • [Frontend] Add --logprobs argument to benchmark_serving.py by @afeldman-nm in #8191
  • [Misc] Use ray[adag] dependency instead of cuda by @ruisearch42 in #7938
  • [CI/Build] Increasing timeout for multiproc worker tests by @alexeykondrat in #8203
  • [Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput by @rasmith in #8248
  • [Misc] Remove SqueezeLLM by @dsikka in #8220
  • [Model] Allow loading from original Mistral format by @patrickvonplaten in #8168
  • [misc] [doc] [frontend] LLM torch profiler support by @SolitaryThinker in #7943
  • [Bugfix] Fix Hermes tool call chat template bug by @K-Mistele in #8256
  • [Model] Multi-input support for LLaVA and fix embedding inputs for multi-image models by @DarkLight1337 in #8238
  • Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) by @wschin in #8241
  • [tpu][misc] fix typo by @youkaichao in #8260
  • [Bugfix] Fix broken OpenAI tensorizer test by @DarkLight1337 in #8258
  • [Model][VLM] Support multi-images inputs for InternVL2 models by @Isotr0py in #8201
  • [Model][VLM] Decouple weight loading logic for Paligemma by @Isotr0py in #8269
  • ppc64le: Dockerfile fixed, and a script for buildkite by @sumitd2 in #8026
  • [CI/Build] Use python 3.12 in cuda image by @joerunde in #8133
  • [Bugfix] Fix async postprocessor in case of preemption by @alexm-neuralmagic in #8267
  • [Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility by @K-Mistele in #8272
  • [Frontend] Add progress reporting to run_batch.py by @alugowski in #8060
  • [Bugfix] Correct adapter usage for cohere and jamba by @vladislavkruglikov in #8292
  • [Misc] GPTQ Activation Ordering by @kylesayrs in #8135
  • [Misc] Fused MoE Marlin support for GPTQ by @dsikka in #8217
  • Add NVIDIA Meetup slides, announce AMD meetup, and add contact info by @simon-mo in #8319
  • [Bugfix] Fix missing post_layernorm in CLIP by @DarkLight1337 in #8155
  • [CI/Build] enable ccache/scccache for HIP builds by @dtrifiro in #8327
  • [Frontend] Clean up type annotations for mistral tokenizer by @DarkLight1337 in #8314
  • [CI/Build] Enabling kernels tests for AMD, ignoring some of then that fail by @alexeykondrat in #8130
  • Fix ppc64le buildkite job by @sumitd2 in #8309
  • [Spec Decode] Move ops.advance_step to flash attn advance_step by @kevin314 in #8224
  • [Misc] remove peft as dependency for prompt models by @prashantgupta24 in #8162
  • [MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled by @comaniac in #8342
  • [Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture by @alexm-neuralmagic in #8340
  • [Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers by @SolitaryThinker in #8172
  • [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag by @tlrmchlsmth in #8043
  • [Misc] Skip loading extra bias for Qwen2-MOE GPTQ models by @jeejeelee in #8329
  • [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel by @Isotr0py in #8299
  • [Hardware][NV] Add support for ModelOpt static scaling checkpoints. by @pavanimajety in #6112
  • [model] Support for Llava-Next-Video model by @TKONIY in #7559
  • [Frontend] Create ErrorResponse instead of raising exceptions in run_batch by @pooyadavoodi in #8347
  • [Model][VLM] Add Qwen2-VL model support by @fyabc in #7905
  • [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend by @bigPYJ1151 in #7257
  • [CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation by @alexeykondrat in #8373
  • [Bugfix] Add missing attributes in mistral tokenizer by @DarkLight1337 in #8364
  • [Kernel][Misc] Add meta functions for ops to prevent graph breaks by @bnellnm in #6917
  • [Misc] Move device options to a single place by @akx in #8322
  • [Speculative Decoding] Test refactor by @LiuXiaoxuanPKU in #8317
  • Pixtral by @patrickvonplaten in #8377
  • Bump version to v0.6.1 by @simon-mo in #8379

New Contributors

Full Changelog: v0.6.0...v0.6.1

v0.6.0

04 Sep 23:35
32e7db2

Highlights

Performance Update

  • We are excited to announce a faster vLLM delivering 2x more throughput compared to v0.5.3. The default parameters should already achieve a good speedup, but we also recommend trying out multi-step scheduling. You can do so by setting --num-scheduler-steps 8 in the engine arguments (see the sketch after this list). Please note that it still has some limitations and is being actively hardened; see #7528 for known issues.
    • Multi-step scheduler now supports LLMEngine and log_probs (#7789, #7652)
    • Asynchronous output processor overlaps the construction of output data structures with GPU work, delivering a 12% throughput increase. (#7049, #7911, #7921, #8050)
    • Use the FlashInfer backend for FP8 KV cache (#7798, #7985) and for rejection sampling in speculative decoding (#7244)
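
A minimal offline sketch of turning on multi-step scheduling. It assumes the `num_scheduler_steps` keyword on the LLM constructor mirrors the --num-scheduler-steps CLI flag; the model name is only an example.

```python
from vllm import LLM, SamplingParams

# Assumption: num_scheduler_steps mirrors the --num-scheduler-steps 8
# engine argument mentioned above.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", num_scheduler_steps=8)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```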

Model Support

  • Support bitsandbytes 8-bit and FP4 quantized models (#7445)
  • New LLMs: Exaone (#7819), Granite (#7436), Phi-3.5-MoE (#7729)
  • A new tokenizer mode for Mistral models using the native mistral-common package (#7739); see the sketch after this list
  • Multi-modality:
    • multi-image input support for LLaVA-Next (#7230), Phi-3-vision models (#7783)
    • Ultravox support for multiple audio chunks (#7963)
    • TP support for ViTs (#7186)
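
A sketch of the new Mistral tokenizer mode from #7739, assuming `tokenizer_mode="mistral"` is accepted by the LLM constructor; the model name is illustrative.

```python
from vllm import LLM, SamplingParams

# Assumption: tokenizer_mode="mistral" selects the native mistral-common
# tokenizer added in #7739 instead of the Hugging Face tokenizer.
llm = LLM(model="mistralai/Mistral-Nemo-Instruct-2407", tokenizer_mode="mistral")

out = llm.generate(["Write a haiku about GPUs."], SamplingParams(max_tokens=48))
print(out[0].outputs[0].text)
```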

Hardware Support

  • NVIDIA GPU: extend cuda graph size for H200 (#7894)
  • AMD: Triton implementations of awq_dequantize and awq_gemm to support AWQ (#7386)
  • Intel GPU: pipeline parallel support (#7810)
  • Neuron: context lengths and token generation buckets (#7885, #8062)
  • TPU: single and multi-host TPUs on GKE (#7613), Async output processing (#8011)

Production Features

  • OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models! (#5649) See the example after this list.
  • Add json_schema support from OpenAI protocol (#7654)
  • Enable chunked prefill and prefix caching together (#7753, #8120)
  • Multimodal support in offline chat (#8098), and multiple multi-modal items in the OpenAI frontend (#8049)
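
A hedged sketch of calling the new tools API through the OpenAI-compatible server. It assumes the server was started with tool calling enabled for a Hermes-style model (the exact server-side flags are documented separately, not here) and uses only the standard openai client interface; the tool itself is a hypothetical example.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Assumption: the served model supports Hermes/Mistral-style tool calls (#5649).
resp = client.chat.completions.create(
    model="NousResearch/Hermes-2-Pro-Llama-3-8B",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)
```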

Misc

  • Support benchmarking async engine in benchmark_throughput.py (#7964)
  • Progress in integration with torch.compile: avoid Dynamo guard evaluation overhead (#7898), skip compile for profiling (#7796)

What's Changed

  • [Core] Add multi-step support to LLMEngine by @alexm-neuralmagic in #7789
  • [Bugfix] Fix run_batch logger by @pooyadavoodi in #7640
  • [Frontend] Publish Prometheus metrics in run_batch API by @pooyadavoodi in #7641
  • [Frontend] add json_schema support from OpenAI protocol by @rockwotj in #7654
  • [misc][core] lazy import outlines by @youkaichao in #7831
  • [ci][test] exclude model download time in server start time by @youkaichao in #7834
  • [ci][test] fix RemoteOpenAIServer by @youkaichao in #7838
  • [Bugfix] Fix Phi-3v crash when input images are of certain sizes by @zifeitong in #7840
  • [Model][VLM] Support multi-images inputs for Phi-3-vision models by @Isotr0py in #7783
  • [Misc] Remove snapshot_download usage in InternVL2 test by @Isotr0py in #7835
  • [misc][cuda] improve pynvml warning by @youkaichao in #7852
  • [Spec Decoding] Streamline batch expansion tensor manipulation by @njhill in #7851
  • [Bugfix]: Use float32 for base64 embedding by @HollowMan6 in #7855
  • [CI/Build] Avoid downloading all HF files in RemoteOpenAIServer by @DarkLight1337 in #7836
  • [Performance][BlockManagerV2] Mark prefix cache block as computed after schedule by @comaniac in #7822
  • [Misc] Update qqq to use vLLMParameters by @dsikka in #7805
  • [Misc] Update gptq_marlin_24 to use vLLMParameters by @dsikka in #7762
  • [misc] fix custom allreduce p2p cache file generation by @youkaichao in #7853
  • [Bugfix] neuron: enable tensor parallelism by @omrishiv in #7562
  • [Misc] Update compressed tensors lifecycle to remove prefix from create_weights by @dsikka in #7825
  • [Core] Asynchronous Output Processor by @megha95 in #7049
  • [Tests] Disable retries and use context manager for openai client by @njhill in #7565
  • [core][torch.compile] not compile for profiling by @youkaichao in #7796
  • Revert #7509 by @comaniac in #7887
  • [Model] Add Mistral Tokenization to improve robustness and chat encoding by @patrickvonplaten in #7739
  • [CI/Build][VLM] Cleanup multiple images inputs model test by @Isotr0py in #7897
  • [Hardware][Intel GPU] Add intel GPU pipeline parallel support. by @jikunshang in #7810
  • [CI/Build][ROCm] Enabling tensorizer tests for ROCm by @alexeykondrat in #7237
  • [Bugfix] Fix phi3v incorrect image_idx when using async engine by @Isotr0py in #7916
  • [cuda][misc] error on empty CUDA_VISIBLE_DEVICES by @youkaichao in #7924
  • [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel by @dsikka in #7766
  • [benchmark] Update TGI version by @philschmid in #7917
  • [Model] Add multi-image input support for LLaVA-Next offline inference by @zifeitong in #7230
  • [mypy] Enable mypy type checking for vllm/core by @jberkhahn in #7229
  • [Core][VLM] Stack multimodal tensors to represent multiple images within each prompt by @petersalas in #7902
  • [hardware][rocm] allow rocm to override default env var by @youkaichao in #7926
  • [Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. by @bnellnm in #7886
  • [mypy][CI/Build] Fix mypy errors by @DarkLight1337 in #7929
  • [Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) by @alexm-neuralmagic in #7911
  • [Performance] Enable chunked prefill and prefix caching together by @comaniac in #7753
  • [ci][test] fix pp test failure by @youkaichao in #7945
  • [Doc] fix the autoAWQ example by @stas00 in #7937
  • [Bugfix][VLM] Fix incompatibility between #7902 and #7230 by @DarkLight1337 in #7948
  • [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. by @pavanimajety in #7798
  • [Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ by @rasmith in #7386
  • [TPU] Upgrade PyTorch XLA nightly by @WoosukKwon in #7967
  • [Doc] fix 404 link by @stas00 in #7966
  • [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM by @mzusman in #7651
  • [Bugfix] Make torch registration of punica ops optional by @bnellnm in #7970
  • [torch.compile] avoid Dynamo guard evaluation overhead by @youkaichao in #7898
  • Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test by @mgoin in #7961
  • [Frontend] Minor optimizations to zmq decoupled front-end by @njhill in #7957
  • [torch.compile] remove reset by @youkaichao in #7975
  • [VLM][Core] Fix exceptions on ragged NestedTensors by @petersalas in #7974
  • Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." by @youkaichao in #7982
  • [Bugfix] Unify rank computation across regular decoding and speculative decoding by @jmkuebler in #7899
  • [Core] Combine async postprocessor and multi-step by @alexm-neuralmagic in #7921
  • [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto by @pavanimajety in #7985
  • extend cuda graph size for H200 by @kushanam in #7894
  • [Bugfix] Fix incorrect vocal embedding shards for GGUF model in tensor parallelism by @Isotr0py in #7954
  • [misc] update tpu int8 to use new vLLM Parameters by @dsikka in #7973
  • [Neuron] Adding support for context-lenght, token-gen buckets. by @hbikki in #7885
  • support bitsandbytes 8-bit and FP4 quantized models by @chenqianfzh in #7445
  • Add more percentiles and latencies by @...

v0.5.5

23 Aug 18:37
09c7792

Highlights

Performance Update

  • We introduced a new mode that schedules multiple GPU steps in advance, reducing CPU overhead (#7000, #7387, #7452, #7703). Initial results show a 20% improvement in QPS for a single GPU running 8B and 30B models. You can set --num-scheduler-steps 8 as a parameter to the API server (via vllm serve) or AsyncLLMEngine (see the sketch after this list). We are working on expanding coverage to the LLM class and aim to turn it on by default.
  • Various enhancements:
    • Use the FlashInfer sampling kernel when available, leading to a 7% decoding throughput speedup (#7137)
    • Reduce Python allocations, leading to a 24% throughput speedup (#7162, #7364)
    • Improvements to the zeromq-based decoupled frontend (#7570, #7716, #7484)
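
A hedged sketch of enabling the new mode on the asynchronous engine. It assumes AsyncEngineArgs accepts a `num_scheduler_steps` field mirroring the --num-scheduler-steps flag; the model name is only an example.

```python
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Assumption: num_scheduler_steps mirrors the --num-scheduler-steps 8 flag
# described above.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="meta-llama/Meta-Llama-3-8B-Instruct",
                    num_scheduler_steps=8))


async def main() -> None:
    params = SamplingParams(temperature=0.0, max_tokens=32)
    final = None
    async for out in engine.generate("The capital of France is", params,
                                     request_id="req-1"):
        final = out  # the generator yields cumulative outputs
    print(final.outputs[0].text)


asyncio.run(main())
```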

Model Support

  • Support Jamba 1.5 (#7415, #7601, #6739)
  • Support for the first audio model UltravoxModel (#7615, #7446)
  • Improvements to vision models:
    • Support image embeddings as input (#6613)
    • Support SigLIP encoder and alternative decoders for LLaVA models (#7153)
  • Support loading GGUF models (#5191) with tensor parallelism (#7520); see the sketch after this list
  • Progress on encoder-decoder models: support for serving encoder/decoder models (#7258) and architecture for cross-attention (#4942)
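
A sketch of loading a GGUF checkpoint (#5191) with tensor parallelism (#7520). Passing a local .gguf file as `model` with `tokenizer` pointing at the original repo is assumed from the PR descriptions; the paths are placeholders.

```python
from vllm import LLM, SamplingParams

# Assumption: a single-file GGUF checkpoint can be passed as `model`, with
# `tokenizer` pointing at the original HF repo for the chat template.
llm = LLM(
    model="/models/tinyllama-1.1b-chat-q4_k_m.gguf",  # placeholder path
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    tensor_parallel_size=2,
)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```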

Hardware Support

  • AMD: Add FP8 linear layer for ROCm (#7210)
  • Enhancements to TPU support: load-time W8A16 quantization (#7005), optimized RoPE (#7635), and multi-host inference support (#7457).
  • Intel: various refactoring for worker, executor, and model runner (#7686, #7712)

Others

  • Optimize prefix caching performance (#7193)
  • Speculative decoding
    • Use target model max length as default for draft model (#7706)
    • EAGLE Implementation with Top-1 proposer (#6830)
  • Entrypoints
    • A new chat method in the LLM class (#5049); see the example after this list
    • Support embeddings in the run_batch API (#7132)
    • Support prompt_logprobs in Chat Completion (#7453)
  • Quantizations
    • Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)
    • Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174)
  • torch.compile: register custom ops for kernels (#7591, #7594, #7536)
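
A small sketch of the new LLM.chat entry point (#5049). It assumes chat() accepts OpenAI-style message dicts plus a SamplingParams object; the model name is an example.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# Assumption: chat() applies the model's chat template to OpenAI-style
# messages before generating, per #5049.
messages = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user", "content": "Name three prime numbers."},
]
out = llm.chat(messages, SamplingParams(temperature=0.2, max_tokens=32))
print(out[0].outputs[0].text)
```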

What's Changed

  • [ci][frontend] deduplicate tests by @youkaichao in #7101
  • [Doc] [SpecDecode] Update MLPSpeculator documentation by @tdoublep in #7100
  • [Bugfix] Specify device when loading LoRA and embedding tensors by @jischein in #7129
  • [MISC] Use non-blocking transfer in prepare_input by @comaniac in #7172
  • [Core] Support loading GGUF model by @Isotr0py in #5191
  • [Build] Add initial conditional testing spec by @simon-mo in #6841
  • [LoRA] Relax LoRA condition by @jeejeelee in #7146
  • [Model] Support SigLIP encoder and alternative decoders for LLaVA models by @DarkLight1337 in #7153
  • [BugFix] Fix DeepSeek remote code by @dsikka in #7178
  • [ BugFix ] Fix ZMQ when VLLM_PORT is set by @robertgshaw2-neuralmagic in #7205
  • [Bugfix] add gguf dependency by @kpapis in #7198
  • [SpecDecode] [Minor] Fix spec decode sampler tests by @LiuXiaoxuanPKU in #7183
  • [Kernel] Add per-tensor and per-token AZP epilogues by @ProExpertProg in #5941
  • [Core] Optimize evictor-v2 performance by @xiaobochen123 in #7193
  • [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) by @afeldman-nm in #4942
  • [Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading by @mgoin in #7225
  • [BugFix] Overhaul async request cancellation by @njhill in #7111
  • [Doc] Mock new dependencies for documentation by @ywang96 in #7245
  • [BUGFIX]: top_k is expected to be an integer. by @Atllkks10 in #7227
  • [Frontend] Gracefully handle missing chat template and fix CI failure by @DarkLight1337 in #7238
  • [distributed][misc] add specialized method for cuda platform by @youkaichao in #7249
  • [Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 by @dsikka in #5874
  • [ BugFix ] Move zmq frontend to IPC instead of TCP by @robertgshaw2-neuralmagic in #7222
  • Fixes typo in function name by @rafvasq in #7275
  • [Bugfix] Fix input processor for InternVL2 model by @Isotr0py in #7164
  • [OpenVINO] migrate to latest dependencies versions by @ilya-lavrenov in #7251
  • [Doc] add online speculative decoding example by @stas00 in #7243
  • [BugFix] Fix frontend multiprocessing hang by @maxdebayser in #7217
  • [Bugfix][FP8] Fix dynamic FP8 Marlin quantization by @mgoin in #7219
  • [ci] Make building wheels per commit optional by @khluu in #7278
  • [Bugfix] Fix gptq failure on T4s by @LucasWilkinson in #7264
  • [FrontEnd] Make merge_async_iterators is_cancelled arg optional by @njhill in #7282
  • [Doc] Update supported_hardware.rst by @mgoin in #7276
  • [Kernel] Fix Flashinfer Correctness by @LiuXiaoxuanPKU in #7284
  • [Misc] Fix typos in scheduler.py by @ruisearch42 in #7285
  • [Frontend] remove max_num_batched_tokens limit for lora by @NiuBlibing in #7288
  • [Bugfix] Fix LoRA with PP by @andoorve in #7292
  • [Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 by @jeejeelee in #7273
  • [Bugfix][Kernel] Increased atol to fix failing tests by @ProExpertProg in #7305
  • [Frontend] Kill the server on engine death by @joerunde in #6594
  • [Bugfix][fast] Fix the get_num_blocks_touched logic by @zachzzc in #6849
  • [Doc] Put collect_env issue output in a block by @mgoin in #7310
  • [CI/Build] Dockerfile.cpu improvements by @dtrifiro in #7298
  • [Bugfix] Fix new Llama3.1 GGUF model loading by @Isotr0py in #7269
  • [Misc] Temporarily resolve the error of BitAndBytes by @jeejeelee in #7308
  • Add Skywork AI as Sponsor by @simon-mo in #7314
  • [TPU] Add Load-time W8A16 quantization for TPU Backend by @lsy323 in #7005
  • [Core] Support serving encoder/decoder models by @DarkLight1337 in #7258
  • [TPU] Fix dockerfile.tpu by @WoosukKwon in #7331
  • [Performance] Optimize e2e overheads: Reduce python allocations by @alexm-neuralmagic in #7162
  • [Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary by @tjohnson31415 in #7218
  • [Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace by @SolitaryThinker in #6971
  • [Core] Streamline stream termination in AsyncLLMEngine by @njhill in #7336
  • [Model][Jamba] Mamba cache single buffer by @mzusman in #6739
  • [VLM][Doc] Add stop_token_ids to InternVL example by @Isotr0py in #7354
  • [Performance] e2e overheads reduction: Small followup diff by @alexm-neuralmagic in #7364
  • [Bugfix] Fix reinit procedure in ModelInputForGPUBuilder by @alexm-neuralmagic in #7360
  • [Frontend] Support embeddings in the run_batch API by @pooyadavoodi in #7132
  • [Bugfix] Fix ITL recording in serving benchmark by @ywang96 in #7372
  • [Core] Add span metrics for model_forward, scheduler and sampler time by @sfc-gh-mkeralapura in #7089
  • [Bugfix] Fix PerTensorScaleParameter weight loading for fused models by @dsikka in #7376
  • [Misc] Add numpy implementation of compute_slot_mapping by @Yard1 in #7377
  • [Core] Fix edge case in chunked prefill + block manager v2 by @cadedaniel in #7380
  • [Bugfix] Fix phi3v batch inference when images have different aspect ratio by @Isotr0py in #7392
  • [TPU] Use mark_dynamic to reduce compilation time by @WoosukKwon in #7340
  • Updating LM Format Enforcer version to v0.10.6 by @noamgat in https:/...

v0.5.4

05 Aug 22:38
4db5176

Highlights

Model Support

  • Enhanced pipeline parallelism support for DeepSeek v2 (#6519), Qwen (#6974), Qwen2 (#6924), and Nemotron (#6863)
  • Enhanced vision language model support for InternVL2 (#6514, #7067), BLIP-2 (#5920), MiniCPM-V (#4087, #7122).
  • Added H2O Danube3-4b (#6451)
  • Added Nemotron models (Nemotron-3, Nemotron-4, Minitron) (#6611)

Hardware Support

  • TPU enhancements: collective communication, TP for async engine, faster compile time (#6891, #6933, #6856, #6813, #5871)
  • Intel CPU: enable multiprocessing and tensor parallelism (#6125)

Performance

We are progressing along our quest to quickly improve performance. Each of the following PRs contributed some improvements, and we anticipate more enhancements in the next release.

  • Separated the OpenAI server's HTTP request handling from the model inference loop with zeromq. This brought a 20% speedup in time to first token and a 2x speedup in inter-token latency. (#6883)
  • Used Python's native array data structure to speed up padding. This brings a 15% throughput improvement in large-batch-size scenarios. (#6779)
  • Reduced unnecessary compute when logprobs=None, cutting the latency of computing log probs from ~30ms to ~5ms in large-batch-size scenarios. (#6532)
  • Optimized the get_seqs function, bringing a 2% throughput improvement. (#7051)

Production Features

  • Enhancements to speculative decoding: FlashInfer in DraftModelRunner (#6926), observability (#6963), and benchmarks (#6964)
  • Refactor the punica kernel based on Triton (#5036)
  • Support for guided decoding for offline LLM (#6878)

Quantization

  • Support W4A8 quantization for vllm (#5218)
  • Tuned FP8 and INT8 Kernels for Ada Lovelace and SM75 T4 (#6677, #6996, #6848)
  • Support reading bitsandbytes pre-quantized models (#5753); see the sketch after this list
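
A hedged sketch of loading a bitsandbytes pre-quantized checkpoint (#5753). The quantization and load_format values, the example repo, and the enforce_eager workaround (see #6846 in the change list below) are assumptions based on the feature descriptions.

```python
from vllm import LLM, SamplingParams

# Assumption: quantization="bitsandbytes" with load_format="bitsandbytes"
# reads a bnb pre-quantized checkpoint (#5753); eager mode is forced as a
# temporary workaround noted in #6846.
llm = LLM(
    model="unsloth/llama-3-8b-bnb-4bit",  # example pre-quantized repo
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    enforce_eager=True,
)

out = llm.generate(["Explain KV caching in one sentence."],
                   SamplingParams(max_tokens=48))
print(out[0].outputs[0].text)
```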

What's Changed

  • [Docs] Announce llama3.1 support by @WoosukKwon in #6688
  • [doc][distributed] fix doc argument order by @youkaichao in #6691
  • [Bugfix] Fix a log error in chunked prefill by @WoosukKwon in #6694
  • [BugFix] Fix RoPE error in Llama 3.1 by @WoosukKwon in #6693
  • Bump version to 0.5.3.post1 by @simon-mo in #6696
  • [Misc] Add ignored layers for fp8 quantization by @mgoin in #6657
  • [Frontend] Add Usage data in each chunk for chat_serving. #6540 by @yecohn in #6652
  • [Model] Pipeline Parallel Support for DeepSeek v2 by @tjohnson31415 in #6519
  • Bump transformers version for Llama 3.1 hotfix and patch Chameleon by @ywang96 in #6690
  • [build] relax wheel size limit by @youkaichao in #6704
  • [CI] Add smoke test for non-uniform AutoFP8 quantization by @mgoin in #6702
  • [Bugfix] StatLoggers: cache spec decode metrics when they get collected. by @tdoublep in #6645
  • [bitsandbytes]: support read bnb pre-quantized model by @thesues in #5753
  • [Bugfix] fix flashinfer cudagraph capture for PP by @SolitaryThinker in #6708
  • [SpecDecoding] Update MLPSpeculator CI tests to use smaller model by @njhill in #6714
  • [Bugfix] Fix token padding for chameleon by @ywang96 in #6724
  • [Docs][ROCm] Detailed instructions to build from source by @WoosukKwon in #6680
  • [Build/CI] Update run-amd-test.sh. Enable Docker Hub login. by @Alexei-V-Ivanov-AMD in #6711
  • [Bugfix]fix modelscope compatible issue by @liuyhwangyh in #6730
  • Adding f-string to validation error which is missing by @luizanao in #6748
  • [Bugfix] Fix speculative decode seeded test by @njhill in #6743
  • [Bugfix] Miscalculated latency lead to time_to_first_token_seconds inaccurate. by @AllenDou in #6686
  • [Frontend] split run_server into build_server and run_server by @dtrifiro in #6740
  • [Kernels] Add fp8 support to reshape_and_cache_flash by @Yard1 in #6667
  • [Core] Tweaks to model runner/input builder developer APIs by @Yard1 in #6712
  • [Bugfix] Bump transformers to 4.43.2 by @mgoin in #6752
  • [Doc][AMD][ROCm]Added tips to refer to mi300x tuning guide for mi300x users by @hongxiayang in #6754
  • [core][distributed] fix zmq hang by @youkaichao in #6759
  • [Frontend] Represent tokens with identifiable strings by @ezliu in #6626
  • [Model] Adding support for MiniCPM-V by @HwwwwwwwH in #4087
  • [Bugfix] Fix decode tokens w. CUDA graph by @comaniac in #6757
  • [Bugfix] Fix awq_marlin and gptq_marlin flags by @alexm-neuralmagic in #6745
  • [Bugfix] Fix encoding_format in examples/openai_embedding_client.py by @CatherineSue in #6755
  • [Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V by @HwwwwwwwH in #6787
  • [ Misc ] fp8-marlin channelwise via compressed-tensors by @robertgshaw2-neuralmagic in #6524
  • [Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints by @mgoin in #6761
  • [Bugfix] Add synchronize to prevent possible data race by @tlrmchlsmth in #6788
  • [Doc] Add documentations for nightly benchmarks by @KuntaiDu in #6412
  • [Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors by @LucasWilkinson in #6798
  • [doc][distributed] improve multinode serving doc by @youkaichao in #6804
  • [Docs] Publish 5th meetup slides by @WoosukKwon in #6799
  • [Core] Fix ray forward_dag error mssg by @rkooo567 in #6792
  • [ci][distributed] fix flaky tests by @youkaichao in #6806
  • [ci] Mark tensorizer test as soft fail and separate it from grouped test in fast check by @khluu in #6810
  • Fix ReplicatedLinear weight loading by @qingquansong in #6793
  • [Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor. by @eaplatanios in #6770
  • [Core] Use array to speedup padding by @peng1999 in #6779
  • [doc][debugging] add known issues for hangs by @youkaichao in #6816
  • [Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) by @mgoin in #6611
  • [Bugfix][Kernel] Promote another index to int64_t by @tlrmchlsmth in #6838
  • [Build/CI][ROCm] Minor simplification to Dockerfile.rocm by @WoosukKwon in #6811
  • [Misc][TPU] Support TPU in initialize_ray_cluster by @WoosukKwon in #6812
  • [Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation by @bigPYJ1151 in #6125
  • [Doc] Add Nemotron to supported model docs by @mgoin in #6843
  • [Doc] Update SkyPilot doc for wrong indents and instructions for update service by @Michaelvll in #4283
  • Update README.md by @gurpreet-dhami in #6847
  • enforce eager mode with bnb quantization temporarily by @chenqianfzh in #6846
  • [TPU] Support collective communications in XLA devices by @WoosukKwon in #6813
  • [Frontend] Factor out code for running uvicorn by @DarkLight1337 in #6828
  • [Bug Fix] Illegal memory access, FP8 Llama 3.1 405b by @LucasWilkinson in #6852
  • [Bugfix]: Fix Tensorizer test failures by @sangstar in #6835
  • [ROCm] Upgrade PyTorch nightly version by @WoosukKwon in #6845
  • [Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron by @omrishiv in #6844
  • [Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba by @tomeras91 in #6784
  • [Model] H2O Danube3-4b by @g-eoj in #6451
  • [Hardware][TPU] Implement tensor parallelism with Ray by @WoosukKwon in #5871
  • [Doc] Add missing mock import to docs conf.py by @hmellor in #6834
  • [Bugfix] Use torch.set_num_threads() to configure parallelism in multiproc_gpu_executor by @tjohnson31415 in https://github.com...

v0.5.3.post1

23 Jul 17:09
38c4b7e

Highlights

  • We fixed a configuration incompatibility between vLLM (which was tested against a pre-released version of the weights) and the published Meta Llama 3.1 weights (#6693)

What's Changed

Full Changelog: v0.5.3...v0.5.3.post1

v0.5.3

23 Jul 07:01
bb2fc08

Highlights

Model Support

  • vLLM now supports Meta Llama 3.1! Please check out our blog here for initial details on running the model.
    • Please check out this thread for any known issues related to the model.
    • The model runs on a single 8xH100 or 8xA100 node using FP8 quantization (#6606, #6547, #6487, #6593, #6511, #6515, #6552); see the sketch after this list.
    • The BF16 version of the model should run on multiple nodes using pipeline parallelism (docs). If you have a fast network interconnect, you might want to consider full tensor parallelism as well. (#6599, #6598, #6529, #6569)
    • To support long context, a new RoPE extension method has been added, and chunked prefill has been turned on by default for the Meta Llama 3.1 series of models. (#6666, #6553, #6673)
  • Support Mistral-Nemo (#6548)
  • Support Chameleon (#6633, #5770)
  • Pipeline parallel support for Mixtral (#6516)
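
A hedged sketch of running the FP8 Llama 3.1 checkpoint on a single 8-GPU node. The checkpoint name and max_model_len value are illustrative, and the FP8 quantization is assumed to be picked up automatically from the checkpoint config.

```python
from vllm import LLM, SamplingParams

# Assumption: the FP8 checkpoint advertises its quantization in its config,
# so only tensor parallelism needs to be set for a single 8xH100/8xA100 node.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    tensor_parallel_size=8,
    max_model_len=8192,  # illustrative: trim the context window to fit memory
)

out = llm.generate(["Summarize the theory of relativity."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```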

Hardware Support

Performance Enhancements

  • Add AWQ support to the Marlin kernel. This brings significant (1.5-2x) perf improvements to existing AWQ models! (#6612)
  • Progress towards refactoring for SPMD worker execution. (#6032)
  • Progress in improving prepare inputs procedure. (#6164, #6338, #6596)
  • Memory optimization for pipeline parallelism. (#6455)

Production Engine

  • Correctness testing for pipeline parallel and CPU offloading (#6410, #6549)
  • Support dynamically loading LoRA adapters from Hugging Face (#6234)
  • Pipeline Parallel using stdlib multiprocessing module (#6130)

Others

  • A CPU offloading implementation: you can now use --cpu-offload-gb to control how much CPU RAM to "extend" GPU memory with (#6496); see the sketch after this list.
  • The new vllm CLI is now ready for testing. It comes with three subcommands: serve, complete, and chat. Feedback and improvements are greatly welcomed! (#6431)
  • The wheels now build on Ubuntu 20.04 instead of 22.04. (#6517)
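
A minimal sketch of the new CPU offloading knob. The `cpu_offload_gb` keyword is assumed to be the Python-API counterpart of the --cpu-offload-gb flag; the model name is an example.

```python
from vllm import LLM, SamplingParams

# Assumption: cpu_offload_gb mirrors --cpu-offload-gb and lends 8 GiB of CPU
# RAM to hold weights that do not fit in GPU memory (#6496).
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", cpu_offload_gb=8)

out = llm.generate(["vLLM is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```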

What's Changed

  • [Docs] Add Google Cloud to sponsor list by @WoosukKwon in #6450
  • [Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod by @WoosukKwon in #6289
  • [CI/Build][TPU] Add TPU CI test by @WoosukKwon in #6277
  • Pin sphinx-argparse version by @khluu in #6453
  • [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug by @mzusman in #6425
  • [Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests by @g-eoj in #6419
  • [Docs] Announce 5th meetup by @WoosukKwon in #6458
  • [CI/Build] vLLM cache directory for images by @DarkLight1337 in #6444
  • [Frontend] Support for chat completions input in the tokenize endpoint by @sasha0552 in #5923
  • [Misc] Fix typos in spec. decode metrics logging. by @tdoublep in #6470
  • [Core] Use numpy to speed up padded token processing by @peng1999 in #6442
  • [CI/Build] Remove "boardwalk" image asset by @DarkLight1337 in #6460
  • [doc][misc] remind users to cancel debugging environment variables after debugging by @youkaichao in #6481
  • [Hardware][TPU] Support MoE with Pallas GMM kernel by @WoosukKwon in #6457
  • [Doc] Fix the lora adapter path in server startup script by @Jeffwan in #6230
  • [Misc] Log spec decode metrics by @comaniac in #6454
  • [Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale by @mgoin in #6081
  • [ci][distributed] add pipeline parallel correctness test by @youkaichao in #6410
  • [misc][distributed] improve tests by @youkaichao in #6488
  • [misc][distributed] add seed to dummy weights by @youkaichao in #6491
  • [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization by @wushidonguc in #6455
  • [ROCm] Cleanup Dockerfile and remove outdated patch by @hongxiayang in #6482
  • [Misc][Speculative decoding] Typos and typing fixes by @ShangmingCai in #6467
  • [Doc][CI/Build] Update docs and tests to use vllm serve by @DarkLight1337 in #6431
  • [Bugfix] Fix for multinode crash on 4 PP by @andoorve in #6495
  • [TPU] Remove multi-modal args in TPU backend by @WoosukKwon in #6504
  • [Misc] Use torch.Tensor for type annotation by @WoosukKwon in #6505
  • [Core] Refactor _prepare_model_input_tensors - take 2 by @comaniac in #6164
  • [DOC] - Add docker image to Cerebrium Integration by @milo157 in #6510
  • [Bugfix] Fix Ray Metrics API usage by @Yard1 in #6354
  • [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step by @alexm-neuralmagic in #6338
  • [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel by @varun-sundar-rabindranath in #6511
  • [Model] Pipeline parallel support for Mixtral by @comaniac in #6516
  • [ Kernel ] Fp8 Channelwise Weight Support by @robertgshaw2-neuralmagic in #6487
  • [core][model] yet another cpu offload implementation by @youkaichao in #6496
  • [BugFix] Avoid secondary error in ShmRingBuffer destructor by @njhill in #6530
  • [Core] Introduce SPMD worker execution using Ray accelerated DAG by @ruisearch42 in #6032
  • [Misc] Minor patch for draft model runner by @comaniac in #6523
  • [BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs by @njhill in #6227
  • [Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash by @noamgat in #6501
  • [TPU] Refactor TPU worker & model runner by @WoosukKwon in #6506
  • [ Misc ] Improve Min Capability Checking in compressed-tensors by @robertgshaw2-neuralmagic in #6522
  • [ci] Reword Github bot comment by @khluu in #6534
  • [Model] Support Mistral-Nemo by @mgoin in #6548
  • Fix PR comment bot by @khluu in #6554
  • [ci][test] add correctness test for cpu offloading by @youkaichao in #6549
  • [Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm by @tlrmchlsmth in #6552
  • [CI/Build] Build on Ubuntu 20.04 instead of 22.04 by @tlrmchlsmth in #6517
  • Add support for a rope extension method by @simon-mo in #6553
  • [Core] Multiprocessing Pipeline Parallel support by @njhill in #6130
  • [Bugfix] Make spec. decode respect per-request seed. by @tdoublep in #6034
  • [ Misc ] non-uniform quantization via compressed-tensors for Llama by @robertgshaw2-neuralmagic in #6515
  • [Bugfix][Frontend] Fix missing /metrics endpoint by @DarkLight1337 in #6463
  • [BUGFIX] Raise an error for no draft token case when draft_tp>1 by @wooyeonlee0 in #6369
  • [Model] RowParallelLinear: pass bias to quant_method.apply by @tdoublep in #6327
  • [Bugfix][Frontend] remove duplicate init logger by @dtrifiro in #6581
  • [Misc] Small perf improvements by @Yard1 in #6520
  • [Docs] Update docs for wheel location by @simon-mo in #6580
  • [Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection by @tdoublep in #6578
  • [bugfix][distributed] fix multi-node bug for shared memory by @youkaichao in #6597
  • [ Kernel ] Enable Dynamic Per Token fp8 by @robertgshaw2-neuralmagic in #6547
  • [Docs] Update PP docs by @andoorve in #6598
  • [build] add ib so that multi-node support with infiniband can be supported out-of-the-box by @youkaichao in #6599
  • [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub by @varun-sundar-rabindranath in #6593
  • [Core] Allow specifying custom Executor by @Yard1 in #6557
  • [Bugfix][Core]: Guard for KeyErrors that...

v0.5.2

15 Jul 18:01
4cf256a

Major Changes

  • ❗Planned breaking change❗: we plan to remove beam search (see more in #6226) in the next few releases. This release comes with a warning when beam search is enabled for a request (see the example after this list). Please voice your concerns in the RFC if you have a valid use case for beam search in vLLM.
  • The release has moved to a Python-version-agnostic wheel (#6394). A single wheel can be installed across all Python versions that vLLM supports.
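
For reference, a small sketch of the kind of request that now triggers the deprecation warning. It assumes use_beam_search was still a SamplingParams field at the time of this release; the model name is illustrative.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

# Requests that enable beam search now emit a deprecation warning (#6226).
params = SamplingParams(use_beam_search=True, best_of=4,
                        temperature=0.0, max_tokens=16)
out = llm.generate(["The future of AI is"], params)
print(out[0].outputs[0].text)
```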

Highlights

Model Support

Hardware

  • AMD: unify CUDA_VISIBLE_DEVICES usage (#6352)

Performance

  • ZeroMQ fallback for broadcasting large objects (#6183)
  • Simplify code to support pipeline parallel (#6406)
  • Turn off CUTLASS scaled_mm for Ada Lovelace (#6384)
  • Use CUTLASS kernels for the FP8 layers with Bias (#6270)

Features

  • Enabling bonus token in speculative decoding for KV cache based models (#5765)
  • Medusa Implementation with Top-1 proposer (#4978)
  • An experimental vLLM CLI for serving and querying an OpenAI-compatible server (#5090)

Others

  • Add support for multi-node on CI (#5955)
  • Benchmark: add H100 suite (#6047)
  • [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362)
  • Build some nightly wheels (#6380)

What's Changed

  • Update wheel builds to strip debug by @simon-mo in #6161
  • Fix release wheel build env var by @simon-mo in #6162
  • Move release wheel env var to Dockerfile instead by @simon-mo in #6163
  • [Doc] Reorganize Supported Models by Type by @ywang96 in #6167
  • [Doc] Move guide for multimodal model and other improvements by @DarkLight1337 in #6168
  • [Model] Add PaliGemma by @ywang96 in #5189
  • add benchmark for fix length input and output by @haichuan1221 in #5857
  • [ Misc ] Support Fp8 via llm-compressor by @robertgshaw2-neuralmagic in #6110
  • [misc][frontend] log all available endpoints by @youkaichao in #6195
  • do not exclude object field in CompletionStreamResponse by @kczimm in #6196
  • [Bugfix] FIx benchmark args for randomly sampled dataset by @haichuan1221 in #5947
  • [Kernel] reloading fused_moe config on the last chunk by @avshalomman in #6210
  • [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) by @afeldman-nm in #4888
  • [Bugfix] use diskcache in outlines _get_guide #5436 by @ericperfect in #6203
  • [Bugfix] Mamba cache Cuda Graph padding by @tomeras91 in #6214
  • Add FlashInfer to default Dockerfile by @simon-mo in #6172
  • [hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability by @youkaichao in #6216
  • [core][distributed] fix ray worker rank assignment by @youkaichao in #6235
  • [Bugfix][TPU] Add missing None to model input by @WoosukKwon in #6245
  • [Bugfix][TPU] Fix outlines installation in TPU Dockerfile by @WoosukKwon in #6256
  • Add support for multi-node on CI by @khluu in #5955
  • [CORE] Adding support for insertion of soft-tuned prompts by @SwapnilDreams100 in #4645
  • [Docs] Docs update for Pipeline Parallel by @andoorve in #6222
  • [Bugfix]fix and needs_scalar_to_array logic check by @qibaoyuan in #6238
  • [Speculative Decoding] Medusa Implementation with Top-1 proposer by @abhigoyal1997 in #4978
  • [core][distributed] add zmq fallback for broadcasting large objects by @youkaichao in #6183
  • [Bugfix][TPU] Add prompt adapter methods to TPUExecutor by @WoosukKwon in #6279
  • [Doc] Guide for adding multi-modal plugins by @DarkLight1337 in #6205
  • [Bugfix] Support 2D input shape in MoE layer by @WoosukKwon in #6287
  • [Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. by @tdoublep in #6303
  • [CI/Build] Enable mypy typing for remaining folders by @bmuskalla in #6268
  • [Bugfix] OpenVINOExecutor abstractmethod error by @park12sj in #6296
  • [Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models by @sroy745 in #5765
  • [Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor by @WoosukKwon in #6313
  • [Doc] Remove comments incorrectly copied from another project by @daquexian in #6286
  • [Doc] Update description of vLLM support for CPUs by @DamonFool in #6003
  • [BugFix]: set outlines pkg version by @xiangyang-95 in #6262
  • [Bugfix] Fix snapshot download in serving benchmark by @ywang96 in #6318
  • [Misc] refactor(config): clean up unused code by @aniaan in #6320
  • [BugFix]: fix engine timeout due to request abort by @pushan01 in #6255
  • [Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. by @tdoublep in #6326
  • [BugFix] get_and_reset only when scheduler outputs are not empty by @mzusman in #6266
  • [ Misc ] Refactor Marlin Python Utilities by @robertgshaw2-neuralmagic in #6082
  • Benchmark: add H100 suite by @simon-mo in #6047
  • [bug fix] Fix llava next feature size calculation. by @xwjiang2010 in #6339
  • [doc] update pipeline parallel in readme by @youkaichao in #6347
  • [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy by @KuntaiDu in #5362
  • [ BugFix ] Prompt Logprobs Detokenization by @robertgshaw2-neuralmagic in #6223
  • [Misc] Remove flashinfer warning, add flashinfer tests to CI by @LiuXiaoxuanPKU in #6351
  • [distributed][misc] keep consistent with how pytorch finds libcudart.so by @youkaichao in #6346
  • [Bugfix] Fix usage stats logging exception warning with OpenVINO by @helena-intel in #6349
  • [Model][Phi3-Small] Remove scipy from blocksparse_attention by @mgoin in #6343
  • [CI/Build] (2/2) Switching AMD CI to store images in Docker Hub by @adityagoel14 in #6350
  • [ROCm][AMD][Bugfix] unify CUDA_VISIBLE_DEVICES usage in vllm to get device count and fixed navi3x by @hongxiayang in #6352
  • [ Misc ] Remove separate bias add by @robertgshaw2-neuralmagic in #6353
  • [Misc][Bugfix] Update transformers for tokenizer issue by @ywang96 in #6364
  • [ Misc ] Support Models With Bias in compressed-tensors integration by @robertgshaw2-neuralmagic in #6356
  • [Bugfix] Fix dtype mismatch in PaliGemma by @DarkLight1337 in #6367
  • [Build/CI] Checking/Waiting for the GPU's clean state by @Alexei-V-Ivanov-AMD in #6379
  • [Misc] add fixture to guided processor tests by @kevinbu233 in #6341
  • [ci] Add grouped tests & mark tests to run by default for fastcheck pipeline by @khluu in #6365
  • [ci] Add GHA workflows to enable full CI run by @khluu in #6381
  • [MISC] Upgrade dependency to PyTorch 2.3.1 by @comaniac in #5327
  • Build some nightly wheels by default by @simon-mo in #6380
  • Fix release-pipeline.yaml by @simon-mo in #6388
  • Fix interpolation in release pipeline by @simon-mo in #6389
  • Fix release pipeline's -e flag by @simon-mo in #6390
  • [Bugfix] Fix illegal memory access in FP8 MoE kernel by @comaniac in #6382
  • [Misc] Add generated git commit hash as vllm.__commit__ by @mgoin in #6386
  • Fix release pipeline's dir permission by @simon-mo in #6391
  • [Bugfix][TPU] Fix megacore setting...

v0.5.1

05 Jul 19:47
79d406e

Highlights

  • vLLM now has pipeline parallelism! (#4412, #5408, #6115, #6120) You can now run the API server with --pipeline-parallel-size. This feature is in an early stage; please let us know your feedback (see the client sketch below).
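
A hedged client-side sketch against an API server launched with pipeline parallelism (for example, with --pipeline-parallel-size 2). Nothing changes on the client side, so this just uses the standard openai package; the served model name is a placeholder.

```python
from openai import OpenAI

# The server was started with --pipeline-parallel-size (see the highlight
# above); from the client's side it behaves like any OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder served model
    prompt="Pipeline parallelism splits a model across",
    max_tokens=32,
)
print(resp.choices[0].text)
```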

Model Support

  • Support Gemma 2 (#5908, #6051). Please note that, for correctness, Gemma 2 should run with the FlashInfer backend, which supports logits soft cap. The wheels for FlashInfer can be downloaded here
  • Support Jamba (#4115). This is vLLM's first state space model!
  • Support Deepseek-V2 (#4650). Please note that MLA (Multi-head Latent Attention) is not implemented and we are looking for contributions!
  • Vision language models: added support for Phi3-Vision, dynamic image size, and a registry for processing model inputs (#4986, #5276, #5214)
    • Notably, this includes a breaking change: all VLM-specific arguments have been removed from the engine APIs, so you no longer need to set them globally via the CLI. You now only need to pass <image> into the prompt instead of using complicated prompt formatting. See more here
    • There is also a new guide on adding VLMs! We would love your contributions for new models!

Hardware Support

Production Service

  • Support for sharded tensorized models (#4990)
  • Continuous streaming of OpenAI response token stats (#5742)

Performance

  • Enhancement in distributed communication via shared memory (#5399)
  • Latency enhancement in block manager (#5584)
  • Enhancements to compressed-tensors supporting Marlin, W4A16 (#5435, #5385)
  • Faster FP8 quantize kernel (#5396), FP8 on Ampere (#5975)
  • Option to use FlashInfer for prefill, decode, and CUDA Graph for decode (#4628)
  • Speculative Decoding
  • Draft Model Runner (#5799)

Development Productivity

  • Post merge benchmark is now available at perf.vllm.ai!
  • Addition of A100 in CI environment (#5658)
  • Step towards nightly wheel publication (#5610)

What's Changed


v0.5.0.post1

14 Jun 02:43
50eed24

Highlights

  • Add initial TPU integration (#5292)
  • Fix crashes when using FlashAttention backend (#5478)
  • Fix issues when using num_devices < num_available_devices (#5473)

What's Changed

New Contributors

Full Changelog: v0.5.0...v0.5.0.post1

v0.5.0

11 Jun 18:16
8f89d72

Highlights

Production Features

Hardware Support

  • Improvements to the Intel CPU CI (#4113, #5241)
  • Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047)

Others

  • Debugging tips documentation (#5409, #5430)
  • Dynamic Per-Token Activation Quantization (#5037)
  • Customizable RoPE theta (#5197)
  • Enable passing multiple LoRA adapters at once to generate() (#5300)
  • OpenAI tools support named functions (#5032)
  • Support stream_options for the OpenAI protocol (#5319, #5135); see the example after this list
  • Update Outlines Integration from FSM to Guide (#4109)
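
A sketch of the new stream_options support using the standard openai client; the served model name is a placeholder.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# stream_options={"include_usage": True} asks the server to append a final
# chunk carrying token usage stats for the streamed request (#5319, #5135).
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder served model
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    elif chunk.usage:
        print("\nusage:", chunk.usage)
```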

What's Changed

  • [CI/Build] CMakeLists: build all extensions' cmake targets at the same time by @dtrifiro in #5034
  • [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU by @tlrmchlsmth in #5137
  • [Kernel] Update Cutlass fp8 configs by @varun-sundar-rabindranath in #5144
  • [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py by @dashanji in #5151
  • [Bugfix] Fix call to init_logger in openai server by @NadavShmayo in #4765
  • [Feature][Kernel] Support bitsandbytes quantization and QLoRA by @chenqianfzh in #4776
  • [Bugfix] Remove deprecated @abstractproperty by @zhuohan123 in #5174
  • [Bugfix]: Fix issues related to prefix caching example (#5177) by @Delviet in #5180
  • [BugFix] Prevent LLM.encode for non-generation Models by @robertgshaw2-neuralmagic in #5184
  • Update test_ignore_eos by @simon-mo in #4898
  • [Frontend][OpenAI] Support for returning max_model_len on /v1/models response by @Avinash-Raj in #4643
  • [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer by @divakar-amd in #4927
  • [Misc] Simplify code and fix type annotations in conftest.py by @DarkLight1337 in #5118
  • [Core] Support image processor by @DarkLight1337 in #4197
  • [Core] Remove unnecessary copies in flash attn backend by @Yard1 in #5138
  • [Kernel] Pass a device pointer into the quantize kernel for the scales by @tlrmchlsmth in #5159
  • [CI/BUILD] enable intel queue for longer CPU tests by @zhouyuan in #4113
  • [Misc]: Implement CPU/GPU swapping in BlockManagerV2 by @Kaiyang-Chen in #3834
  • New CI template on AWS stack by @khluu in #5110
  • [FRONTEND] OpenAI tools support named functions by @br3no in #5032
  • [Bugfix] Support prompt_logprobs==0 by @toslunar in #5217
  • [Bugfix] Add warmup for prefix caching example by @zhuohan123 in #5235
  • [Kernel] Enhance MoE benchmarking & tuning script by @WoosukKwon in #4921
  • [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend by @afeldman-nm in #5210
  • [Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor by @zifeitong in #5229
  • [CI/Build] Add inputs tests by @DarkLight1337 in #5215
  • [Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend by @DamonFool in #5249
  • [Kernel] Add back batch size 1536 and 3072 to MoE tuning by @WoosukKwon in #5242
  • [CI/Build] Simplify model loading for HfRunner by @DarkLight1337 in #5251
  • [CI/Build] Reducing CPU CI execution time by @bigPYJ1151 in #5241
  • [CI] mark AMD test as softfail to prevent blockage by @simon-mo in #5256
  • [Misc] Add transformers version to collect_env.py by @mgoin in #5259
  • [Misc] update collect env by @youkaichao in #5261
  • [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True by @zifeitong in #5226
  • [Misc] Add CustomOp interface for device portability by @WoosukKwon in #5255
  • [Misc] Fix docstring of get_attn_backend by @WoosukKwon in #5271
  • [Frontend] OpenAI API server: Add add_special_tokens to ChatCompletionRequest (default False) by @tomeras91 in #5278
  • [CI] Add nightly benchmarks by @simon-mo in #5260
  • [misc] benchmark_serving.py -- add ITL results and tweak TPOT results by @tlrmchlsmth in #5263
  • [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size by @tlrmchlsmth in #5157
  • [Model] Correct Mixtral FP8 checkpoint loading by @comaniac in #5231
  • [BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM by @DriverSong in #5207
  • [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 by @pcmoritz in #5238
  • [Docs] Add Sequoia as sponsors by @simon-mo in #5287
  • [Speculative Decoding] Add ProposerWorkerBase abstract class by @njhill in #5252
  • [BugFix] Fix log message about default max model length by @njhill in #5284
  • [Bugfix] Make EngineArgs use named arguments for config construction by @mgoin in #5285
  • [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. by @wuisawesome in #5290
  • [Misc] Skip for logits_scale == 1.0 by @WoosukKwon in #5291
  • [Docs] Add Ray Summit CFP by @simon-mo in #5295
  • [CI] Disable flash_attn backend for spec decode by @simon-mo in #5286
  • [Frontend][Core] Update Outlines Integration from FSM to Guide by @br3no in #4109
  • [CI/Build] Update vision tests by @DarkLight1337 in #5307
  • Bugfix: fix broken of download models from modelscope by @liuyhwangyh in #5233
  • [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 by @pcmoritz in #5294
  • [Frontend] enable passing multiple LoRA adapters at once to generate() by @mgoldey in #5300
  • [Core] Avoid copying prompt/output tokens if no penalties are used by @Yard1 in #5289
  • [Core] Change LoRA embedding sharding to support loading methods by @Yard1 in #5038
  • [Misc] Missing error message for custom ops import by @DamonFool in #5282
  • [Feature][Frontend]: Add support for stream_options in ChatCompletionRequest by @Etelis in #5135
  • [Misc][Utils] allow get_open_port to be called for multiple times by @youkaichao in #5333
  • [Kernel] Switch fp8 layers to use the CUTLASS kernels by @tlrmchlsmth in #5183
  • Remove Ray health check by @Yard1 in #4693
  • Addition of lacked ignored_seq_groups in _schedule_chunked_prefill by @JamesLim-sy in #5296
  • [Kernel] Dynamic Per-Token Activation Quantization by @dsikka in #5037
  • [Frontend] Add OpenAI Vision API Support by @ywang96 in #5237
  • [Misc] Remove unused cuda_utils.h in CPU backend by @DamonFool in #5345
  • fix DbrxFusedNormAttention missing cache_config by @Calvinnncy97 in https://github.com/vllm-...