In this post, we’ll explore how vLLM leverages the transformers backend to combine **flexibility**
with **efficiency**, enabling you to deploy state-of-the-art models faster and smarter.

## Updates

This section collects all the updates that have taken place since the blog post was first released (11th April 2025).

### Support for Vision Language Models (21st July 2025)

vLLM with the transformers backend now supports **Vision Language Models**. When a user adds
`model_impl="transformers"`, the correct class for text-only or multimodal inference is deduced and loaded.

Here is how one can serve a multimodal model using the transformers backend:
```bash
vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf \
  --model_impl transformers
```
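
Once the server is up, you can sanity-check that the model was registered. Here is a minimal sketch
that queries the `/v1/models` route of vLLM's OpenAI-compatible server, assuming it runs on the
default port 8000:
```python
# List the models the running vLLM server exposes; the served model id should appear.
import requests

resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])
```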

To consume the model, one can use the `openai` API like so:
```python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "http://images.cocodataset.org/val2017/000000039769.jpg",
                },
            },
        ],
    }],
)
print("Chat response:", chat_response)
```
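
The same endpoint also supports streaming. Here is a minimal sketch that reuses the `client` from
above; the text-only message is purely illustrative:
```python
# Stream tokens as they are generated instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    messages=[{"role": "user", "content": "Describe two cats resting on a couch in one sentence."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```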

You can also directly initialize the vLLM engine using the `LLM` API. Here is the same model being
served using the `LLM` API.

```python
from vllm import LLM, SamplingParams
from PIL import Image
import requests
from transformers import AutoProcessor

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
hf_processor = AutoProcessor.from_pretrained(model_id)  # required to dynamically update the chat template

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "dummy_image.jpg"},
            {"type": "text", "text": "What is the content of this image?"},
        ],
    },
]
prompt = hf_processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)

# initialize the VLM with `model_impl="transformers"`
vlm = LLM(
    model=model_id,
    model_impl="transformers",
)

outputs = vlm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    },
    sampling_params=SamplingParams(max_tokens=100),
)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

# OUTPUTS:
# In the tranquil setting of this image, two feline companions are enjoying a peaceful slumber on a
# cozy pink couch. The couch, adorned with a plush red fabric across the seating area, serves as their perfect resting place.
#
# On the left side of the couch, a gray tabby cat is curled up at rest, its body relaxed in a display
# of feline serenity. One paw playfully stretches out, perhaps in mid-jump or simply exploring its surroundings.
```
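
Since `LLM.generate` also accepts a list of prompts, several multimodal requests can be batched in a
single call. A minimal sketch, reusing `prompt` and `image` from above:
```python
# Pass a list of prompt dicts; vLLM schedules and batches them together.
batch = [
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    {"prompt": prompt, "multi_modal_data": {"image": image}},
]
outputs = vlm.generate(batch, sampling_params=SamplingParams(max_tokens=50))
for o in outputs:
    print(o.outputs[0].text)
```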

## Transformers and vLLM: Inference in Action

Let’s start with a simple text generation task using the `meta-llama/Llama-3.2-1B` model to see how