[Quantization] support VPTQ #3879

Draft · wants to merge 1 commit into main
Conversation

@wejoncy wejoncy commented Feb 26, 2025

Motivation

@YangWang92

This PR adds support for Extreme Low-bit Vector Post-Training Quantization (VPTQ) to SGLang.

More about VPTQ can be found at https://github.com/microsoft/vptq

VPTQ on arXiv

VPTQ now supports:

  • CUDA backend
  • ROCm backend
  • Tensor parallelism

Modifications

Add VPTQ quantization support to SGLang.
VPTQ-quantized models from the Hugging Face Hub are supported; a usage sketch is shown below.
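
For context, here is a minimal usage sketch (not part of this PR's diff) of how a VPTQ-quantized checkpoint from the Hugging Face Hub might be served through SGLang's offline engine. The checkpoint id, the `vptq` value for the `quantization` argument, and the specific generate parameters are assumptions for illustration; the exact names are defined by the PR diff.

```python
# Minimal sketch, assuming this PR registers VPTQ under quantization="vptq".
# The checkpoint id below is a hypothetical example of a VPTQ-quantized model
# published on the Hugging Face Hub.
import sglang as sgl

if __name__ == "__main__":
    llm = sgl.Engine(
        model_path="VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-256-woft",
        quantization="vptq",  # assumed method name added by this PR
        tp_size=2,            # tensor parallelism, listed above as supported
    )
    outputs = llm.generate(
        ["The capital of France is"],
        {"max_new_tokens": 32, "temperature": 0.0},
    )
    print(outputs[0]["text"])
    llm.shutdown()
```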

Checklist

  • Support the VPTQ kernel