Best practice to train on multiple datasets with different prompts #2945
Hello! Good question - this isn't clearly mentioned in the documentation anywhere. The simplest approach is to prepend the prompts to your texts while loading the dataset, before it reaches the training pipeline. Best of luck!
Thank you for the quick response! After tracing the code, I noticed that the encode function is never called in the training pipeline; the forward function of nn.Sequential is what actually gets called in the loss function. So I will take your suggestion and implement the prompt logic while loading the dataset.
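As an illustration (not from the thread), one way that prompt logic could be implemented while loading each dataset. The dataset name, column names, and prompt strings below are only examples:

```python
from datasets import load_dataset

# Hypothetical task -> prompt mapping; replace with your models' actual prompts.
prompts = {
    "nli": "Represent the sentence for classifying its relation: ",
    "retrieval": "Represent the query for retrieving supporting documents: ",
}

def prepend_prompt(example, prompt):
    # Hypothetical column names; adjust to your dataset's schema.
    example["anchor"] = prompt + example["anchor"]
    example["positive"] = prompt + example["positive"]
    return example

nli = load_dataset("sentence-transformers/all-nli", "triplet", split="train")
nli = nli.map(prepend_prompt, fn_kwargs={"prompt": prompts["nli"]})
```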
A quick follow-up question on this: how do I exclude the prompts when computing the mean embedding in the above scenario?
Hmm, I hadn't considered that yet. For training, I think the easiest will be to write a custom Pooling module, e.g.:

```python
from __future__ import annotations

import json
import os
from typing import Any

import torch
from torch import Tensor, nn


class PoolingExcludingPrompts(nn.Module):
    """
    A pooling layer that computes the mean sentence embedding from a sequence of token embeddings,
    excluding the prompt tokens.
    """

    def __init__(self, word_embedding_dimension: int) -> None:
        super().__init__()
        self.word_embedding_dimension = word_embedding_dimension

    def forward(self, features: dict[str, Tensor]) -> dict[str, Tensor]:
        token_embeddings = features["token_embeddings"]
        attention_mask = (
            features["attention_mask"]
            if "attention_mask" in features
            else torch.ones(token_embeddings.shape[:-1], device=token_embeddings.device, dtype=torch.int64)
        )

        # Detect your model's prompt(s) and remove them from the attention_mask
        ...

        # Mean pooling over the remaining (non-prompt) tokens
        input_mask_expanded = (
            attention_mask.unsqueeze(-1).expand(token_embeddings.size()).to(token_embeddings.dtype)
        )
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)

        # If tokens are weighted (by WordWeights layer), feature 'token_weights_sum' will be present
        if "token_weights_sum" in features:
            sum_mask = features["token_weights_sum"].unsqueeze(-1).expand(sum_embeddings.size())
        else:
            sum_mask = input_mask_expanded.sum(1)
        sum_mask = torch.clamp(sum_mask, min=1e-9)

        features["sentence_embedding"] = sum_embeddings / sum_mask
        return features

    def get_sentence_embedding_dimension(self) -> int:
        return self.word_embedding_dimension

    def get_config_dict(self) -> dict[str, Any]:
        return {"word_embedding_dimension": self.word_embedding_dimension}

    def save(self, output_path: str) -> None:
        with open(os.path.join(output_path, "config.json"), "w") as fOut:
            json.dump(self.get_config_dict(), fOut, indent=2)

    @staticmethod
    def load(input_path: str) -> "PoolingExcludingPrompts":
        with open(os.path.join(input_path, "config.json")) as fIn:
            config = json.load(fIn)
        return PoolingExcludingPrompts(**config)
```

And then after the model is trained, you should be able to use the "normal" Pooling with `include_prompt=False`. Otherwise, you can also keep your custom Pooling in the final trained model, but then your users will have to load the model with `trust_remote_code=True`.
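To make that concrete, a rough sketch of wiring the custom module into a model and, after training, swapping it for the built-in Pooling. The base model name and module file name are placeholders, and this assumes a sentence-transformers version whose `models.Pooling` supports `include_prompt`:

```python
from sentence_transformers import SentenceTransformer, models

# Assumes the class above is saved in pooling_excluding_prompts.py (hypothetical file name).
from pooling_excluding_prompts import PoolingExcludingPrompts

transformer = models.Transformer("intfloat/e5-base-v2")  # placeholder base model
pooling = PoolingExcludingPrompts(transformer.get_word_embedding_dimension())
model = SentenceTransformer(modules=[transformer, pooling])

# ... train as usual ...

# After training, rebuild the model with the standard Pooling so users don't need
# the custom module; include_prompt=False keeps prompt tokens out of the mean.
standard_pooling = models.Pooling(
    transformer.get_word_embedding_dimension(),
    pooling_mode="mean",
    include_prompt=False,
)
final_model = SentenceTransformer(modules=[model[0], standard_pooling])
final_model.save("my-prompted-model")
```

The rebuilt model only uses built-in modules, so it can be loaded without any custom code.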
Thank you! I will try out the custom pooling module you provided.
Never mind, I figured it out. For anyone curious about the solution: the evaluators call model.encode, so setting a default prompt on the model will automatically apply the instruction.
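For anyone who lands here later, a minimal sketch of that workaround; the model path and prompt text are placeholders:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "path/to/my-prompted-model",  # placeholder
    prompts={"query": "Represent the query for retrieving supporting documents: "},
    default_prompt_name="query",
)

# Evaluators call model.encode internally, so the default prompt is applied automatically.
embeddings = model.encode(["How do prompts interact with mean pooling?"])
```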
Apologies, I missed your last question! Yes indeed, and some evaluators don't yet support a prompt/prompt_name argument. #2951 should improve that.
@ShengYun-Peng @tomaarsen, I just created #2964, which adds prompts to the trainer and masks them accordingly. Let me know what you think!
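For readers arriving later: if that PR (or follow-up work) is in your installed version, the prompts may be configurable on the training arguments directly instead of being baked into the dataset. A hedged sketch, assuming a recent sentence-transformers release where `SentenceTransformerTrainingArguments` accepts a `prompts` argument; the output directory, dataset keys, and prompt strings are placeholders:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="models/prompted-model",  # placeholder
    # Hypothetical mapping from training dataset name to its prompt; the keys
    # would have to match the dataset dictionary passed to the trainer.
    prompts={
        "nli": "Represent the sentence for classifying its relation: ",
        "retrieval": "Represent the query for retrieving supporting documents: ",
    },
)
```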
Thanks for the great work supporting the text embedding community!
I plan to train Instructor and other LLM-based encoder models on multiple datasets. Since all of these models rely on different prompts for different embedding tasks, I'm curious what the best way is to prepend the prompts to the training datasets.