Hello, thank you for sharing this interesting work!
However, I noticed a potential issue in the code and wanted to open an issue for your attention.
During fine-tuning, I added the following to the trainer:
```python
class MMTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, *args, **kwargs):
        # Debug: decode only the unmasked tokens (label != -100) to inspect
        # which part of the sequence actually contributes to the loss.
        tokenizer = self.tokenizer
        input_ids = inputs["input_ids"][0]
        labels = inputs["labels"][0]
        mask = labels != -100
        decoded = tokenizer.decode(input_ids[mask], skip_special_tokens=False)
        print("Decoded text: ", decoded)
        return super().compute_loss(model, inputs, return_outputs, *args, **kwargs)
```
With this, I found that for some samples the assistant tokens are not properly masked: the decoded text includes parts of the prompt rather than only the answer.
Instead, I tried modifying the function like this:
```python
def convert_mm_data_to_model_format(processor, sample):
    conversation = [{
        "role": "user",
        "content": [{"type": "text", "text": sample["question"]}],
    }]
    if sample.get("image", None) is not None:
        conversation[0]["content"].insert(0, {"type": "image"})
    # Tokenize the prompt alone (with the generation prompt appended) to
    # measure exactly how many tokens belong to the question part.
    prompt_text = processor.apply_chat_template(conversation, add_generation_prompt=True)
    prompt_inputs = processor(
        text=prompt_text,
        images=[sample.get("image")] if sample.get("image") is not None else None,
        return_tensors="pt",
        padding=False,
        truncation=False,
    )
    num_question_tokens = prompt_inputs.input_ids.shape[1]
    # Now append the assistant turn and render the full conversation.
    conversation.append({
        "role": "assistant",
        "content": [{"type": "text", "text": sample["answer"]}],
    })
    full_text = processor.apply_chat_template(conversation, add_generation_prompt=False)
    return full_text, num_question_tokens
```
After this change, the prompt tokens appear to be masked correctly for all samples.
Hope this helps; it might be worth checking this and updating the code accordingly!
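For reference, here is a minimal sketch of how the returned `num_question_tokens` can then be used to mask the prompt portion of the labels with `-100` (the helper name `build_labels` and the toy token ids are my own, not from the repo):

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss in Hugging Face models


def build_labels(input_ids, num_question_tokens):
    """Copy input_ids into labels, masking the first num_question_tokens entries."""
    labels = list(input_ids)
    labels[:num_question_tokens] = [IGNORE_INDEX] * num_question_tokens
    return labels


# Toy example: 4 prompt tokens followed by 3 answer tokens.
input_ids = [101, 7, 8, 9, 42, 43, 102]
labels = build_labels(input_ids, num_question_tokens=4)
# Only the answer tokens keep their ids, so only they contribute to the loss.
```

With labels built this way, the debug decode in `compute_loss` above should print only the answer text.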