[SFT] add support for unified conversion logic for both images and videos #4264

kashif · 2025-10-13T13:57:46Z

What does this PR do?

Unified conversion logic for both images and videos
Parses <image> and <video> placeholder tags

HuggingFaceDocBuilderDev · 2025-10-13T14:01:16Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec · 2025-10-15T16:02:36Z

Nice!

I'm a bit hesitant about one point: adding this would introduce a new code path for preparing multimodal data, which would reduce the overall coherence of the codebase.

Currently, for VLMs, we support the following formats:

Unstructured messages with images
(the image or images are added before the first user message)

messages = [{"role": "user", "content": "What is it?"},
            {"role": "assistant", "content": "It's a park"}]
image = <PIL.Image>  # or images = [<PIL.Image>, <PIL.Image>]

Structured messages

messages = [{"role": "user", "content": [{"type": "text", "text": "What is it?"}, {"type": "image"}]},
            {"role": "assistant", "content": "It's a park"}]
image = <PIL.Image>
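
To make the relationship between the two supported formats concrete, here is a minimal sketch of how unstructured messages could be rewritten into the structured form, with the image placeholders placed before the text of the first user message. The helper name `add_image_placeholders` is hypothetical and is not TRL's implementation; it assumes every message content is either a plain string or already a list of content parts.

```python
from copy import deepcopy


def add_image_placeholders(messages, num_images):
    """Hypothetical helper: return structured messages with `num_images`
    image placeholders placed before the text of the first user message."""
    structured = []
    images_inserted = False
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            # Plain string content becomes a single text part.
            content = [{"type": "text", "text": content}]
        else:
            # Already a list of parts: copy so the input is not mutated.
            content = deepcopy(content)
        if message["role"] == "user" and not images_inserted:
            placeholders = [{"type": "image"} for _ in range(num_images)]
            content = placeholders + content
            images_inserted = True
        structured.append({"role": message["role"], "content": content})
    return structured
```

The sketch does not check whether a structured input already contains image placeholders, so it is only meant for the unstructured case described above.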

Here, the proposal would be to add support for a third format:

Unstructured messages with an inline image tag

messages = [{"role": "user", "content": "<image> What is it?"},
            {"role": "assistant", "content": "It's a park"}]
image = <PIL.Image>

It doesn’t seem to me that this third format is really necessary. I’d simply recommend that users with datasets in this form preprocess them to match one of the two supported formats.
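
For users who do have data in the inline-tag form, the preprocessing could look roughly like the sketch below. It splits each string content on `<image>` and `<video>` tags and emits structured content parts in the same order. The function name `inline_tags_to_structured` is made up for illustration, and stripping whitespace around the text pieces is an assumption about what the dataset needs.

```python
import re

# The capturing group makes re.split keep the tags in the result list.
TAG_PATTERN = re.compile(r"(<image>|<video>)")


def inline_tags_to_structured(messages):
    """Hypothetical preprocessing step: turn inline <image>/<video> tags
    into structured content parts, preserving their order."""
    converted = []
    for message in messages:
        content = message["content"]
        if not isinstance(content, str):
            # Already structured: leave the message as it is.
            converted.append(message)
            continue
        parts = []
        for piece in TAG_PATTERN.split(content):
            if piece == "<image>":
                parts.append({"type": "image"})
            elif piece == "<video>":
                parts.append({"type": "video"})
            elif piece.strip():
                # Assumption: surrounding whitespace is not meaningful.
                parts.append({"type": "text", "text": piece.strip()})
        converted.append({"role": message["role"], "content": parts})
    return converted
```

Applied to the example above, the user message becomes an image part followed by the text part "What is it?". On a Hugging Face dataset this kind of function would typically be applied per example with `Dataset.map`, assuming the conversation lives in a `messages` column.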

qgallouedec · 2025-10-15T16:07:11Z

trl/trainer/sft_trainer.py

-                prepare_multimodal_messages(example["prompt"] + example["completion"], len(example["images"]))
+                num_images = len(example.get("images", []))
+                num_videos = len(example.get("videos", []))
+                # Only prepare multimodal messages for images; videos use native <video> tags


videos use native

This doesn't make sense to me. For _collate_language_modeling we need to add {"type": "video"}, but not here?

add support for unified conversion logic for both images and videos

e5492cb

kashif requested a review from albertvillanova October 13, 2025 14:21

kashif added 2 commits October 13, 2025 16:29

use processor to truncate if max_length is set

fe4602e

Merge branch 'main' into sft-video

7e9c6e4

qgallouedec reviewed Oct 15, 2025


kashif added 3 commits October 15, 2025 18:35

remove formatting to user side

04cf031

remove unused import

9263a16

helper for structured data

043b223
