
Commit 5e39f91 (1 parent: 391612e)

update figures

Signed-off-by: hsliu <liuhongsheng4@huawei.com>

File tree: 5 files changed, +24 −47 lines

_posts/2025-11-30-vllm-omni.md

Lines changed: 24 additions & 47 deletions
@@ -2,13 +2,18 @@

We are excited to announce the official release of **vLLM-Omni**, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.

-Since its inception, vLLM has focused on high-throughput, memory-efficient serving for Large Language Models (LLMs). However, the landscape of generative AI is shifting rapidly. Models are no longer just about text-in, text-out. Today's state-of-the-art models reason across text, images, audio, and video, and they generate heterogeneous outputs using diverse architectures.
+<p align="center">
+<img src="/assets/figures/2025-11-30-vllm-omni/omni-modality-log-text-dark.png" alt="vllm-omni logo" width="80%">
+</p>

-**vLLM-Omni** answers this call, extending vLLM’s legendary performance to the world of multi-modal and non-autoregressive inference.

-\<p align="center"\>
-\<img src="/assets/figures/vllm-omni-logo-text-dark.png" alt="vLLM Omni Logo" width="60%"\>
-\</p\>
+Since its inception, vLLM has focused on high-throughput, memory-efficient serving for Large Language Models (LLMs). However, the landscape of generative AI is shifting rapidly. Models are no longer just about text-in, text-out. Today's state-of-the-art models reason across text, images, audio, and video, and they generate heterogeneous outputs using diverse architectures.
+
+**vLLM-Omni** is the first open source framework to support omni-modality model serving that extends vLLM’s legendary performance to the world of multi-modal and non-autoregressive inference.
+
+<p align="center">
+<img src="/assets/figures/2025-11-30-vllm-omni/omni-modality-model-architecture.png" alt="omni-modality model architecture" width="80%">
+</p>


## **Why vLLM-Omni?**

@@ -22,42 +27,26 @@ vLLM-Omni addresses three critical shifts in model architecture:

## **Inside the Architecture**

-vLLM-Omni is not just a wrapper; it is a re-imagining of how vLLM handles data flow. It introduces a fully disaggregated pipeline that allows for dynamic resource allocation across different stages of generation.
-
-\<p align="center"\>
-\<img src="/assets/figures/omni-modality-model-architecture.png" alt="Omni-modality model architecture" width="80%"\>
-\</p\>
-As shown above, the architecture unifies distinct phases:
+vLLM-Omni is not just a wrapper; it is a re-imagining of how vLLM handles data flow. It introduces a fully disaggregated pipeline that allows for dynamic resource allocation across different stages of generation. As shown above, the architecture unifies distinct phases:

* **Modality Encoders:** Efficiently processing inputs (ViT, T5, etc.)
* **LLM Core:** leveraging vLLM's PagedAttention for the autoregressive reasoning stage.
* **Modality Generators:** High-performance serving for DiT and other decoding heads to produce rich media outputs.

### **Key Features**

-* **Simplicity:** If you know how to use vLLM, you know how to use vLLM-Omni. We maintain seamless integration with Hugging Face models and offer an OpenAI-compatible API server.
-
-# todo @liuhongsheng, add the vLLM-Omni architecture
-
-
-* **Flexibility:** With the OmniStage abstraction, we provide a simple and straightforward way to support various Omni-Modality models including Qwen-Omni, Qwen-Image, SD models.
+<p align="center">
+<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-user-interface.png" alt="vllm-omni user interface" width="80%">
+</p>

+* **Simplicity:** If you know how to use vLLM, you know how to use vLLM-Omni. We maintain seamless integration with Hugging Face models and offer an OpenAI-compatible API server.

-* **Performance:** We utilize pipelined stage execution to overlap computation, ensuring that while one stage is processing, others aren't idle.
-
-# todo @zhoutaichang, please add a figure to illustrate the pipelined stage execution.
+* **Flexibility:** With the OmniStage abstraction, we provide a simple and straightforward way to support various omni-modality models including Qwen-Omni, Qwen-Image, and other state-of-the-art models.

-## **Performance**
+* **Performance:** We utilize pipelined stage execution to overlap computation for high throughput performance, ensuring that while one stage is processing, others aren't idle.
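The Simplicity bullet above mentions an OpenAI-compatible API server. A minimal sketch of exercising such an endpoint, assuming a server is already running on localhost:8000 and using a placeholder model name (both assumptions, not taken from this commit):

```bash
# Minimal smoke test against an OpenAI-compatible endpoint.
# Port, path, and model name are placeholders; adjust to your deployment.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
        "messages": [{"role": "user", "content": "Describe what vLLM-Omni does in one sentence."}]
      }'
```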

We benchmarked vLLM-Omni against Hugging Face Transformers to demonstrate the efficiency gains in omni-modal serving.

-| Metric | vLLM-Omni | HF Transformers | Improvement |
-| :---- | :---- | :---- | :---- |
-| **Throughput** (req/s) | **TBD** | TBD | **TBD x** |
-| **Latency** (TTFT ms) | **TBD** | TBD | **TBD x** |
-| **GPU Memory** (GB) | **TBD** | TBD | **TBD %** |
-
-*Note: Benchmarks were run on \[Insert Hardware Specs\] using \[Insert Model Name\].*

## **Future Roadmap**

@@ -69,34 +58,22 @@ vLLM-Omni is evolving rapidly. Our roadmap is focused on expanding model support
* **Full disaggregation:** Based on the OmniStage abstraction, we expect to support full disaggregation (encoder/prefill/decode/generation) across different inference stages in order to improve throughput and reduce latency.
* **Hardware Support:** Following the hardware plugin system, we plan to expand our support for various hardware backends to ensure vLLM-Omni runs efficiently everywhere.

-Contributions and collabrations from the open source community are welcome.

## **Getting Started**

-Getting started with vLLM-Omni is straightforward. The initial release is built on top of vLLM v0.11.0.
+Getting started with vLLM-Omni is straightforward. The initial vllm-omni v0.11.0rc release is built on top of vLLM v0.11.0.

### **Installation**

-First, set up your environment:
-
-\# Create a virtual environment
-uv venv \--python 3.12 \--seed
-source .venv/bin/activate
-
-\# Install the base vLLM
-uv pip install vllm==0.11.0 \--torch-backend=auto
-
-Next, install the vLLM-Omni extension:
-
-git clone \[https://github.com/vllm-project/vllm-omni.git\](https://github.com/vllm-project/vllm-omni.git)
-cd vllm\_omni
-uv pip install \-e .
+Check out our [Installation Doc](https://vllm-omni.readthedocs.io/en/latest/getting_started/installation/) for details.
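With the backslash escaping stripped and the clone directory name matched to the repository, the commands removed above amount to roughly the following sequence; the Installation Doc linked above remains the authoritative reference:

```bash
# Create a virtual environment
uv venv --python 3.12 --seed
source .venv/bin/activate

# Install the base vLLM
uv pip install vllm==0.11.0 --torch-backend=auto

# Install the vLLM-Omni extension from source
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv pip install -e .
```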

-### **Running the Qwen3-Omni model**
+### **Serving the omni-modality models**

-@huayongxiang, add the gradio example for Qwen3-Omni model inference
+Check out our [examples directory](https://github.com/vllm-project/vllm-omni/tree/main/examples) for specific scripts to launch image, audio, and video generation workflows. vLLM-Omni also provides the gradio support to improve user experience, below is a demo example for serving Qwen-Image:

-Check out our [examples directory](https://www.google.com/search?q=https://github.com/vllm-project/vllm-omni/tree/main/examples) for specific scripts to launch image, audio, and video generation workflows.
+<p align="center">
+<img src="/assets/figures/2025-11-30-vllm-omni/vllm-omni-gradio-serving-demo.png" alt="vllm-omni serving qwen-image with gradio" width="80%">
+</p>

## **Join the Community**

4 figure files changed (binary): 51.5 KB, 1.06 MB, 39.8 KB, 280 KB
