We are excited to announce the official release of **vLLM-Omni**, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.

Since its inception, vLLM has focused on high-throughput, memory-efficient serving for Large Language Models (LLMs). However, the landscape of generative AI is shifting rapidly. Models are no longer just about text-in, text-out. Today's state-of-the-art models reason across text, images, audio, and video, and they generate heterogeneous outputs using diverse architectures.

**vLLM-Omni** is the first open source framework for omni-modality model serving, extending vLLM’s legendary performance to the world of multi-modal and non-autoregressive inference.

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/omni-modality-model-architecture.png" alt="omni-modality model architecture" width="80%">
</p>

## **Why vLLM-Omni?**

vLLM-Omni addresses three critical shifts in model architecture.

## **Inside the Architecture**
vLLM-Omni is not just a wrapper; it is a re-imagining of how vLLM handles data flow. It introduces a fully disaggregated pipeline that allows for dynamic resource allocation across different stages of generation. As shown above, the architecture unifies distinct phases; a toy sketch of how these stages overlap follows the list:
* **LLM Core:** Leverages vLLM's PagedAttention for the autoregressive reasoning stage.
* **Modality Generators:** High-performance serving for DiT and other decoding heads to produce rich media outputs.
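
To make the disaggregated-pipeline idea concrete, here is a toy sketch of pipelined stage execution. It is purely illustrative and not vLLM-Omni code: the stage functions, timings, and queue size are invented, and the real engine schedules and batches stages itself. The point is only that while the modality generator renders output for one request, the LLM core can already be reasoning about the next.

```python
# Toy illustration of pipelined, disaggregated stages -- not vLLM-Omni code.
# The stage functions, timings, and queue size are invented for the example.
import queue
import threading
import time


def llm_core(requests, out_q):
    """Stage 1: stand-in for autoregressive reasoning on the LLM core."""
    for req in requests:
        time.sleep(0.2)                  # pretend to decode a plan
        out_q.put(f"plan for {req}")
    out_q.put(None)                      # end-of-stream marker


def modality_generator(in_q):
    """Stage 2: stand-in for a DiT head turning plans into rich media."""
    while (plan := in_q.get()) is not None:
        time.sleep(0.2)                  # pretend to run diffusion steps
        print(f"generated media from '{plan}'")


reqs = [f"request-{i}" for i in range(4)]
q = queue.Queue(maxsize=2)               # small buffer between the stages

start = time.time()
producer = threading.Thread(target=llm_core, args=(reqs, q))
consumer = threading.Thread(target=modality_generator, args=(q,))
producer.start()
consumer.start()
producer.join()
consumer.join()

# ~1.0s with overlap vs ~1.6s if the two stages ran strictly back to back.
print(f"elapsed: {time.time() - start:.1f}s")
```
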
### **Key Features**

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-user-interface.png" alt="vllm-omni user interface" width="80%">
</p>

* **Simplicity:** If you know how to use vLLM, you know how to use vLLM-Omni. We maintain seamless integration with Hugging Face models and offer an OpenAI-compatible API server (see the client sketch after this list).
* **Flexibility:** With the OmniStage abstraction, we provide a simple and straightforward way to support various omni-modality models, including Qwen-Omni, Qwen-Image, and other state-of-the-art models.
* **Performance:** We utilize pipelined stage execution to overlap computation for high throughput, ensuring that while one stage is processing, others aren't idle.
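
To make the Simplicity point concrete, here is a hedged client sketch. It assumes an OpenAI-compatible vLLM-Omni server is already running locally on port 8000; the model name and image URL are placeholders rather than official values, and the message format follows vLLM's usual OpenAI-compatible multimodal chat conventions.

```python
# Hedged sketch: assumes an OpenAI-compatible vLLM-Omni server is already
# running at localhost:8000 and serving an omni-modality chat model.
# The model name and image URL are placeholders, not official values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni",  # placeholder: use the model ID you launched
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample.jpg"}},
                {"type": "text",
                 "text": "Describe this image in one sentence."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```
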
We benchmarked vLLM-Omni against Hugging Face Transformers to demonstrate the efficiency gains in omni-modal serving.
*Note: Benchmarks were run on \[Insert Hardware Specs\] using \[Insert Model Name\].*
## **Future Roadmap**

vLLM-Omni is evolving rapidly. Our roadmap is focused on expanding model support:
* **Full disaggregation:** Based on the OmniStage abstraction, we expect to support full disaggregation (encoder/prefill/decode/generation) across different inference stages to improve throughput and reduce latency.
* **Hardware Support:** Following the hardware plugin system, we plan to expand support for various hardware backends to ensure vLLM-Omni runs efficiently everywhere.

Contributions and collaborations from the open source community are welcome.

## **Getting Started**

Getting started with vLLM-Omni is straightforward. The initial vLLM-Omni v0.11.0rc release is built on top of vLLM v0.11.0.
Check out our [Installation Doc](https://vllm-omni.readthedocs.io/en/latest/getting_started/installation/) for details.

### **Serving Omni-Modality Models**

Check out our [examples directory](https://github.com/vllm-project/vllm-omni/tree/main/examples) for specific scripts to launch image, audio, and video generation workflows. vLLM-Omni also provides Gradio support to improve the user experience; below is a demo of serving Qwen-Image:

<p align="center">
<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-gradio-serving-demo.png" alt="vllm-omni serving qwen-image with gradio" width="80%">
</p>

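
For programmatic access to the same deployment, a request might look roughly like the sketch below. This is an assumption, not the documented vLLM-Omni interface: it presumes the server exposes an OpenAI-style images endpoint and registers the model as `Qwen/Qwen-Image`; check the example scripts above for the actual entry points.

```python
# Hypothetical sketch: assumes the local vLLM-Omni deployment exposes an
# OpenAI-style /v1/images/generations endpoint for Qwen-Image. The base_url,
# model name, and parameters below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

result = client.images.generate(
    model="Qwen/Qwen-Image",  # placeholder model identifier
    prompt="A watercolor painting of a lighthouse at sunrise",
    size="1024x1024",
    n=1,
)

image = result.data[0]
print(image.url or "image returned inline as base64")
```
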