fix qwen3vl video processor temporal padding frames #42083
Conversation
@zucchini-nlp please check this~
zucchini-nlp left a comment
Thanks a lot for the PR! Can you also update the Qwen2VL video processor with the same changes?
```python
T = stacked_videos.shape[1]
if pad := -T % temporal_patch_size:
    repeats = stacked_videos[:, -1:].expand(-1, pad, -1, -1, -1)
    stacked_videos = torch.cat((stacked_videos, repeats), dim=1)
B, T, C, H, W = stacked_videos.shape
num_frames, height, width = T, H, W
```
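The padding logic in the hunk above can be sketched in isolation. A minimal NumPy reproduction (NumPy stands in for torch here, and `temporal_patch_size=2` is just an illustrative value, not the processor's configured default):

```python
import numpy as np

temporal_patch_size = 2  # illustrative value; the real one comes from the processor config

def pad_temporal(videos: np.ndarray) -> np.ndarray:
    """Pad the time axis of a (B, T, C, H, W) batch by repeating the last
    frame until T is divisible by temporal_patch_size."""
    T = videos.shape[1]
    pad = -T % temporal_patch_size  # frames needed to reach the next multiple
    if pad:
        repeats = np.repeat(videos[:, -1:], pad, axis=1)  # copy the last frame `pad` times
        videos = np.concatenate((videos, repeats), axis=1)
    return videos

# a single-frame "video": T=1 is padded to T=2
print(pad_temporal(np.zeros((1, 1, 3, 4, 4))).shape)  # (1, 2, 3, 4, 4)
```

Note that `-T % temporal_patch_size` is zero exactly when `T` is already a multiple of the patch size, so already-aligned clips are left untouched.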
I don't think this is needed if we are expanding it later, just before patchifying.
If resize is enabled and num_frames < temporal_patch_size, the resize check will throw an error at this line: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3_vl/video_processing_qwen3_vl.py#L44.
So I think the padding is needed here for cases such as a video that has only one frame.
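To make the ordering argument concrete, here is a hypothetical paraphrase of the kind of guard the linked line performs (this is a simplification for illustration, not the actual transformers code): a raw one-frame video fails a minimum-frames validation unless it is padded first.

```python
def check_num_frames(num_frames: int, temporal_patch_size: int = 2) -> None:
    # hypothetical paraphrase of the resize-time validation: clips shorter
    # than the temporal patch size are rejected
    if num_frames < temporal_patch_size:
        raise ValueError(
            f"num_frames ({num_frames}) must be >= temporal_patch_size ({temporal_patch_size})"
        )

check_num_frames(2)  # fine once 1 frame has been padded to 2
try:
    check_num_frames(1)  # a raw one-frame video is rejected
except ValueError as err:
    print(f"rejected: {err}")
```

This is why padding before the resize path matters, rather than only just before patchifying.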
```python
T = patches.shape[1]
if pad := -T % temporal_patch_size:
    repeats = patches[:, -1:].expand(-1, pad, -1, -1, -1)
    patches = torch.cat((patches, repeats), dim=1)
```
Great, I see that Qwen2VL's video processor has the same issue. I thought it was fixed, but apparently there was a regression. Can you update it as well?
done~
I did not change smart_resize.
The smart_resize in the qwen2vl and qwen3vl video processors is not the same — is that correct? @JJJYmmm
```python
def test_image_input(self):
    for video_processing_class in self.video_processor_list:
        video_processor_dict = self.video_processor_dict.copy()
        video_processor_dict["size"] = {"longest_edge": 40960, "shortest_edge": 4096}
        video_processor_dict["do_sample_frames"] = False
        video_processor_dict["temporal_patch_size"] = 3
        video_processing = video_processing_class(**video_processor_dict)

        n, w, h = 1, 64, 64
        video_inputs = [(np.random.randint(0, 256, (h, w, 3), dtype=np.uint8)) for _ in range(n)]

        video_processed = video_processing(video_inputs, return_tensors="pt")
        encoded_videos = video_processed[self.input_name]
        self.assertEqual(list(encoded_videos.shape), [16, 2304])

        video_grid_thw = video_processed["video_grid_thw"]
        self.assertEqual(video_grid_thw.tolist(), [[1, 4, 4]])
```
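The expected numbers in this test can be reproduced by hand. Assuming a spatial patch size of 16 (an inference from the asserted values, not stated in the test itself), a 64x64 single-frame input padded to 3 frames yields a 1x4x4 patch grid, with each patch flattened to channels * temporal_patch_size * patch_size^2 values:

```python
channels, temporal_patch_size, patch_size = 3, 3, 16  # patch_size 16 is inferred, not from the test
height = width = 64

grid_t = 3 // temporal_patch_size   # 1 frame padded to 3 frames, grouped into 1 temporal patch
grid_h = height // patch_size       # 64 / 16 = 4
grid_w = width // patch_size        # 64 / 16 = 4

num_patches = grid_t * grid_h * grid_w                        # 16
patch_dim = channels * temporal_patch_size * patch_size ** 2  # 3 * 3 * 256 = 2304
print([grid_t, grid_h, grid_w], num_patches, patch_dim)  # [1, 4, 4] 16 2304
```

This matches both assertions: `encoded_videos` has shape `[16, 2304]` and `video_grid_thw` is `[[1, 4, 4]]`.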
Kind of the same test as test_videos_PIL I guess, so it is redundant. I think the one below for temporal patch size is enough.
This test case covers a video with just one frame; test_videos_pil does not cover the exactly-one-frame case. I updated the test function name.
```python
def test_num_frames_equal_temporal_patch_size_plus_two(self):
    for video_processing_class in self.video_processor_list:
        video_processor_dict = self.video_processor_dict.copy()
        video_processor_dict["size"] = {"longest_edge": 40960, "shortest_edge": 4096}
```
Just for my understanding, do we need to change the size? It should not affect the final results, and keeping it small ensures the tests run fast.
Reduced the size to 32 * 32; any smaller and smart_resize would change it too.
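A simplified sketch of why 32 * 32 is the floor here. This paraphrases only the rounding step of smart_resize under the assumption that the snapping factor is 32; the real helper also enforces min/max pixel budgets, so treat this as an illustration rather than the actual implementation:

```python
def round_to_factor(height: int, width: int, factor: int = 32) -> tuple[int, int]:
    # simplified paraphrase of smart_resize's rounding: snap each side to the
    # nearest multiple of `factor`, never going below one full factor
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    return h_bar, w_bar

print(round_to_factor(32, 32))  # (32, 32) -- already a multiple, kept as-is
print(round_to_factor(20, 20))  # (32, 32) -- bumped up, so the test would see a changed size
```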
[For maintainers] Suggested jobs to run (before merge): run-slow: qwen2_vl, qwen3_vl
What does this PR do?
Fixes the qwen3vl video processor so it can handle these two cases:
Fixes QwenLM/Qwen3-VL#1689