
Conversation

@JJJYmmm (Contributor) commented Nov 7, 2025

What does this PR do?

Fix Qwen3-VL input expansion during generation with video inputs (i.e. beam search or num_return_sequences > 1).

Related issues: QwenLM/Qwen3-VL#1769, QwenLM/Qwen3-VL#1621

@zucchini-nlp
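
For context, generate() expands batched inputs once per beam / returned sequence along dim 0, but video features such as pixel_values_videos are flattened over patches rather than batched per sample, so they cannot be expanded the same way. A minimal sketch of the distinction, with illustrative shapes and tensor names (not the exact code this PR touches):

import torch

num_return_sequences = 2

# Text inputs are batched (batch, seq_len): rows can simply be repeated.
input_ids = torch.randint(0, 1000, (1, 8))
input_ids = input_ids.repeat_interleave(num_return_sequences, dim=0)  # (2, 8)

# Video patches are flattened across the batch (total_patches, feat_dim),
# so repeat_interleave on dim 0 would interleave patches, not whole videos.
video_grid_thw = torch.tensor([[2, 16, 16]])     # (num_videos, [t, h, w])
patches_per_video = video_grid_thw.prod(dim=-1)  # tensor([512])
pixel_values_videos = torch.randn(int(patches_per_video.sum()), 1176)

# Expand per whole video: split into per-video chunks and tile each chunk.
chunks = torch.split(pixel_values_videos, patches_per_video.tolist(), dim=0)
pixel_values_videos = torch.cat(
    [chunk for chunk in chunks for _ in range(num_return_sequences)], dim=0
)
video_grid_thw = video_grid_thw.repeat_interleave(num_return_sequences, dim=0)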

Test script:

from transformers import AutoModelForImageTextToText, AutoProcessor

ckpt_path = "Qwen/Qwen3-VL-30B-A3B-Instruct"

# default: Load the model on the available device(s)
model = AutoModelForImageTextToText.from_pretrained(
    ckpt_path, dtype="bfloat16", device_map="auto",
    attn_implementation="flash_attention_2"
)

processor = AutoProcessor.from_pretrained(ckpt_path)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
num_return_sequences = 2
generated_ids = model.generate(**inputs, max_new_tokens=128, num_return_sequences=num_return_sequences)
generated_ids_trimmed = [
    out_ids[len(in_ids):]
    # repeat each prompt row per returned sequence so prompt lengths line up
    for in_ids, out_ids in zip(
        inputs.input_ids.repeat_interleave(num_return_sequences, dim=0), generated_ids
    )
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

🫡 Tests pass with:

RUN_SLOW=True pytest tests/models/qwen3_vl_moe/test_modeling_qwen3_vl_moe.py::Qwen3VLMoeIntegrationTest
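
Presumably the dense-model integration test covers the same path (the test file path mirrors the MoE one; the class name is assumed by analogy):

RUN_SLOW=True pytest tests/models/qwen3_vl/test_modeling_qwen3_vl.py::Qwen3VLIntegrationTest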

github-actions bot commented Nov 7, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen3_vl, qwen3_vl_moe

Comment on lines +474 to +476
inputs = self.processor.apply_chat_template(
    self.message3, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(torch_device)
Member

I guess we need to do either beam search or sampling with num_return_sequences > 1 to trigger the needed behavior, no?

Contributor Author (@JJJYmmm)

Yes, the test uses num_beams=2 and num_return_sequences=2:

output = model.generate(**inputs, max_new_tokens=30, do_sample=False, num_beams=2, num_return_sequences=2)
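
As a quick sanity check (reusing model and inputs from the script above; the assertion is illustrative), the output batch should contain num_return_sequences rows per input row:

# Beam search expands the batch by num_beams internally; num_return_sequences
# finished beams per input are returned, so the output batch doubles here.
assert output.shape[0] == inputs.input_ids.shape[0] * 2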

@zucchini-nlp (Member)

run-slow: qwen3_vl_moe

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions bot commented Nov 7, 2025

This comment contains run-slow, running the specified jobs:

models: ["models/qwen3_vl_moe"]
quantizations: []

github-actions bot commented Nov 7, 2025

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉!
