kashif
Collaborator

@kashif kashif commented Oct 13, 2025

What does this PR do?

  • Unifies the conversion logic for images and videos
  • Parses <image> and <video> placeholder tags

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec
Member

Nice!

I'm a bit hesitant about one point: adding this would introduce new logic for how we prepare multimodal data, which would reduce the overall coherence of the codebase.

Currently, for VLMs, we support the following formats:

  1. Unstructured messages with images
    (the image or images are added before the first user message)

    messages = [{"role": "user", "content": "What is it?"},
                {"role": "assistant", "content": "It's a park"}]
    image = <PIL.Image>  # or images = [<PIL.Image>, <PIL.Image>]

  2. Structured messages

    messages = [{"role": "user", "content": [{"type": "text", "text": "What is it?"}, {"type": "image"}]},
                {"role": "assistant", "content": "It's a park"}]
    image = <PIL.Image>
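
For reference, the first format can be mapped onto the second mechanically. Below is a minimal sketch of that mapping, assuming a hypothetical helper `to_structured` (an illustration only, not the actual `prepare_multimodal_messages` implementation) that puts the image placeholders in front of the text of the first user message:

```python
from copy import deepcopy


def to_structured(messages, num_images):
    """Hypothetical helper: turn string contents into typed content parts.

    Image placeholders are inserted before the text of the first user message.
    """
    messages = deepcopy(messages)  # leave the caller's data untouched
    placed = False
    for message in messages:
        is_first_user = message["role"] == "user" and not placed
        if is_first_user:
            placed = True
        if isinstance(message["content"], str):
            parts = [{"type": "text", "text": message["content"]}]
            if is_first_user:
                # One placeholder per image, ahead of the text part.
                parts = [{"type": "image"} for _ in range(num_images)] + parts
            message["content"] = parts
    return messages


messages = [{"role": "user", "content": "What is it?"},
            {"role": "assistant", "content": "It's a park"}]
structured = to_structured(messages, num_images=1)
```

Messages whose content is already a list are left as they are, so structured inputs pass through unchanged.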

Here, the proposal would be to add support for a third format:

  3. Unstructured messages with an inline image tag

    messages = [{"role": "user", "content": "<image> What is it?"},
                {"role": "assistant", "content": "It's a park"}]
    image = <PIL.Image>

It doesn’t seem to me that this third format is really necessary. I’d simply recommend that users with datasets in this form preprocess them to match one of the two supported formats.
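
Such a preprocessing step could look roughly like the sketch below. The helper name `inline_tags_to_structured` and the whitespace handling are assumptions made for illustration, not existing library code; it rewrites inline `<image>` and `<video>` tags into typed content parts:

```python
import re

# Matches the inline placeholders discussed in this PR.
TAG_PATTERN = re.compile(r"<(image|video)>")


def inline_tags_to_structured(messages):
    """Hypothetical preprocessing step: split "<image> text" into typed parts."""
    converted = []
    for message in messages:
        content = message["content"]
        if not isinstance(content, str) or not TAG_PATTERN.search(content):
            converted.append(message)  # nothing to rewrite
            continue
        parts, cursor = [], 0
        for match in TAG_PATTERN.finditer(content):
            text = content[cursor:match.start()].strip()
            if text:
                parts.append({"type": "text", "text": text})
            parts.append({"type": match.group(1)})  # "image" or "video"
            cursor = match.end()
        tail = content[cursor:].strip()
        if tail:
            parts.append({"type": "text", "text": tail})
        converted.append({**message, "content": parts})
    return converted


messages = [{"role": "user", "content": "<image> What is it?"},
            {"role": "assistant", "content": "It's a park"}]
structured = inline_tags_to_structured(messages)
```

Messages without a tag are passed through untouched, and the position of each tag relative to the surrounding text is preserved.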

prepare_multimodal_messages(example["prompt"] + example["completion"], len(example["images"]))
num_images = len(example.get("images", []))
num_videos = len(example.get("videos", []))
# Only prepare multimodal messages for images; videos use native <video> tags

videos use native

This doesn't make sense to me. For _collate_language_modeling we need to add {"type": "video"}, but not here?

3 participants