fix qwen3vl video processor temporal padding frames #42083
Conversation
@zucchini-nlp please check this~
zucchini-nlp left a comment
Thanks a lot for the PR! Can you also update the Qwen2VL video processor with the same changes?
```python
T = stacked_videos.shape[1]
if pad := -T % temporal_patch_size:
    repeats = stacked_videos[:, -1:].expand(-1, pad, -1, -1, -1)
    stacked_videos = torch.cat((stacked_videos, repeats), dim=1)
B, T, C, H, W = stacked_videos.shape
num_frames, height, width = T, H, W
```
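The padding logic in the hunk above can be sketched in isolation. A minimal NumPy reproduction (NumPy stands in for torch here, and `temporal_patch_size=2` is just an illustrative value, not the processor's configured default):

```python
import numpy as np

temporal_patch_size = 2  # illustrative value; the real one comes from the processor config

def pad_temporal(videos: np.ndarray) -> np.ndarray:
    """Pad the time axis of a (B, T, C, H, W) batch by repeating the last
    frame until T is divisible by temporal_patch_size."""
    T = videos.shape[1]
    pad = -T % temporal_patch_size  # frames needed to reach the next multiple
    if pad:
        repeats = np.repeat(videos[:, -1:], pad, axis=1)  # copy the last frame `pad` times
        videos = np.concatenate((videos, repeats), axis=1)
    return videos

# a single-frame "video": T=1 is padded to T=2
print(pad_temporal(np.zeros((1, 1, 3, 4, 4))).shape)  # (1, 2, 3, 4, 4)
```

Note that `-T % temporal_patch_size` is zero exactly when `T` is already a multiple of the patch size, so already-aligned clips are left untouched.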
I don't think this is needed if we are expanding it later, just before patchifying.
If resize is enabled and num_frames < temporal_patch_size, the resize check will throw an error at this line: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3_vl/video_processing_qwen3_vl.py#L44.
So I think the padding is needed here for cases such as a video that has only one frame.
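To make the ordering argument concrete, here is a hypothetical paraphrase of the kind of guard the linked line performs (this is a simplification for illustration, not the actual transformers code): a raw one-frame video fails a minimum-frames validation unless it is padded first.

```python
def check_num_frames(num_frames: int, temporal_patch_size: int = 2) -> None:
    # hypothetical paraphrase of the resize-time validation: clips shorter
    # than the temporal patch size are rejected
    if num_frames < temporal_patch_size:
        raise ValueError(
            f"num_frames ({num_frames}) must be >= temporal_patch_size ({temporal_patch_size})"
        )

check_num_frames(2)  # fine once 1 frame has been padded to 2
try:
    check_num_frames(1)  # a raw one-frame video is rejected
except ValueError as err:
    print(f"rejected: {err}")
```

This is why padding before the resize path matters, rather than only just before patchifying.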
```python
T = patches.shape[1]
if pad := -T % temporal_patch_size:
    repeats = patches[:, -1:].expand(-1, pad, -1, -1, -1)
    patches = torch.cat((patches, repeats), dim=1)
```
Great, I see that Qwen2VL's video processor has the same issue. I thought it was fixed, but apparently there was a regression. Can you update it as well?
done~
I did not change smart_resize.
The smart_resize in the qwen2vl and qwen3vl video processors is not the same — is that correct? @JJJYmmm
```python
def test_image_input(self):
    for video_processing_class in self.video_processor_list:
        video_processor_dict = self.video_processor_dict.copy()
        video_processor_dict["size"] = {"longest_edge": 40960, "shortest_edge": 4096}
        video_processor_dict["do_sample_frames"] = False
        video_processor_dict["temporal_patch_size"] = 3
        video_processing = video_processing_class(**video_processor_dict)

        n, w, h = 1, 64, 64
        video_inputs = [(np.random.randint(0, 256, (h, w, 3), dtype=np.uint8)) for _ in range(n)]

        video_processed = video_processing(video_inputs, return_tensors="pt")
        encoded_videos = video_processed[self.input_name]
        self.assertEqual(list(encoded_videos.shape), [16, 2304])

        video_grid_thw = video_processed["video_grid_thw"]
        self.assertEqual(video_grid_thw.tolist(), [[1, 4, 4]])
```
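The expected numbers in this test can be reproduced by hand. Assuming a spatial patch size of 16 (an inference from the asserted values, not stated in the test itself), a 64x64 single-frame input padded to 3 frames yields a 1x4x4 patch grid, with each patch flattened to channels * temporal_patch_size * patch_size^2 values:

```python
channels, temporal_patch_size, patch_size = 3, 3, 16  # patch_size 16 is inferred, not from the test
height = width = 64

grid_t = 3 // temporal_patch_size   # 1 frame padded to 3 frames, grouped into 1 temporal patch
grid_h = height // patch_size       # 64 / 16 = 4
grid_w = width // patch_size        # 64 / 16 = 4

num_patches = grid_t * grid_h * grid_w                        # 16
patch_dim = channels * temporal_patch_size * patch_size ** 2  # 3 * 3 * 256 = 2304
print([grid_t, grid_h, grid_w], num_patches, patch_dim)  # [1, 4, 4] 16 2304
```

This matches both assertions: `encoded_videos` has shape `[16, 2304]` and `video_grid_thw` is `[[1, 4, 4]]`.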
Kind of the same test as test_videos_PIL I guess, so it is redundant. I think the one below for temporal patch size is enough.
This test case covers a video with just one frame; test_videos_pil does not cover the exactly-one-frame case. I updated the test function name.
```python
def test_num_frames_equal_temporal_patch_size_plus_two(self):
    for video_processing_class in self.video_processor_list:
        video_processor_dict = self.video_processor_dict.copy()
        video_processor_dict["size"] = {"longest_edge": 40960, "shortest_edge": 4096}
```
Just for my understanding, do we need to change the size? It should not affect the final results, and keeping it small ensures the tests run fast.
Reduced the size to 32 * 32; any smaller and smart_resize would change it too.
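A simplified sketch of why 32 * 32 is the floor here. This paraphrases only the rounding step of smart_resize under the assumption that the snapping factor is 32; the real helper also enforces min/max pixel budgets, so treat this as an illustration rather than the actual implementation:

```python
def round_to_factor(height: int, width: int, factor: int = 32) -> tuple[int, int]:
    # simplified paraphrase of smart_resize's rounding: snap each side to the
    # nearest multiple of `factor`, never going below one full factor
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    return h_bar, w_bar

print(round_to_factor(32, 32))  # (32, 32) -- already a multiple, kept as-is
print(round_to_factor(20, 20))  # (32, 32) -- bumped up, so the test would see a changed size
```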
[For maintainers] Suggested jobs to run (before merge): run-slow: qwen2_vl, qwen3_vl
What does this PR do?
Fixes the qwen3vl video processor so it can handle these two cases:
Fixes QwenLM/Qwen3-VL#1689