Best practice to train on multiple datasets with different prompts #2945

Closed
ShengYun-Peng opened this issue Sep 19, 2024 · 7 comments

@ShengYun-Peng

Thanks for the great work on facilitating the text embedding community!

I plan to train Instructor and other LLM-based encoder models on multiple datasets. Since all of these models rely on different prompts for different embedding tasks, I'm curious what the best way is to prepend the prompt to the training dataset:

  1. use dataset.map to prepend different prompts for different datasets
  2. change model.default_prompt_name in each batch according to the task
@tomaarsen
Collaborator

Hello!

Good question - this isn't clearly mentioned in the documentation anywhere.

The default_prompt_name, prompt_name and prompt options only affect the final inference of a trained model via model.encode.
So, if you want to train a model that can "understand" certain prompts that you want your users to apply, then you should use dataset.map to add those prompts to your training dataset.
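
For example, a rough sketch with datasets.map (the dataset names, column name, and prompt strings below are placeholders, not anything specific to this issue):

from datasets import load_dataset

# Placeholder task prompts; use whichever instructions your model should learn
prompts = {
    "retrieval": "Represent this sentence for searching relevant passages: ",
    "sts": "Represent this sentence for semantic similarity: ",
}

def prepend_prompt(example, prompt):
    # Prepend the task prompt to the text column used for training
    example["anchor"] = prompt + example["anchor"]
    return example

# Placeholder dataset and column names; adapt them to your own training data
retrieval_dataset = load_dataset("your/retrieval-dataset", split="train")
retrieval_dataset = retrieval_dataset.map(prepend_prompt, fn_kwargs={"prompt": prompts["retrieval"]})

sts_dataset = load_dataset("your/sts-dataset", split="train")
sts_dataset = sts_dataset.map(prepend_prompt, fn_kwargs={"prompt": prompts["sts"]})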

Best of luck!

  • Tom Aarsen

@ShengYun-Peng
Author

Thank you for the quick response! After tracing the code, I noticed that the encode function is not called at all in the training pipeline; the forward function of nn.Sequential is what actually gets called in the loss function. I will therefore take your suggestion and implement the prompt logic while loading the dataset.

@ShengYun-Peng
Author

A quick follow-up question on this: how do I exclude the prompt tokens when computing the mean embedding in the above scenario?

@tomaarsen
Collaborator

Hmm, I hadn't considered that yet. With model.encode you can exclude the prompt by setting include_prompt to False in the Pooling module; this section will then trigger if someone passes a prompt:
https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/models/Pooling.py#L140-L141
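
For reference, a minimal inference-time sketch (assuming the Pooling module is the second module in the model; the model path and prompt string are placeholders):

from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Pooling

model = SentenceTransformer("path/to/your-trained-model")

# Exclude the prompt tokens from mean pooling during model.encode
pooling_module = model[1]
assert isinstance(pooling_module, Pooling)
pooling_module.include_prompt = False

embeddings = model.encode(
    ["What is the capital of France?"],
    prompt="Represent this query for retrieval: ",
)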

For training, I think the easiest approach will be to write a custom Pooling module, e.g.:

from __future__ import annotations

import json
import os
from typing import Any

import torch
from torch import Tensor, nn


class PoolingExcludingPrompts(nn.Module):
    """
    A pooling layer that computes the mean sentence embedding from a sequence of token embeddings,
    excluding the prompt tokens.
    """
    def __init__(self, word_embedding_dimension: int) -> None:
        super().__init__()
        self.word_embedding_dimension = word_embedding_dimension

    def forward(self, features: dict[str, Tensor]) -> dict[str, Tensor]:
        token_embeddings = features["token_embeddings"]
        attention_mask = (
            features["attention_mask"]
            if "attention_mask" in features
            else torch.ones(token_embeddings.shape[:-1], device=token_embeddings.device, dtype=torch.int64)
        )

        # Detect your model's prompt(s) and remove them from the attention_mask
        ...

        input_mask_expanded = (
            attention_mask.unsqueeze(-1).expand(token_embeddings.size()).to(token_embeddings.dtype)
        )
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)

        # If tokens are weighted (by WordWeights layer), feature 'token_weights_sum' will be present
        if "token_weights_sum" in features:
            sum_mask = features["token_weights_sum"].unsqueeze(-1).expand(sum_embeddings.size())
        else:
            sum_mask = input_mask_expanded.sum(1)

        sum_mask = torch.clamp(sum_mask, min=1e-9)

        features["sentence_embedding"] = sum_embeddings / sum_mask
        return features

    def get_sentence_embedding_dimension(self) -> int:
        return self.word_embedding_dimension

    def get_config_dict(self) -> dict[str, Any]:
        return {"word_embedding_dimension": self.word_embedding_dimension}

    def save(self, output_path) -> None:
        with open(os.path.join(output_path, "config.json"), "w") as fOut:
            json.dump(self.get_config_dict(), fOut, indent=2)

    @staticmethod
    def load(input_path) -> "PoolingExcludingPrompts":
        with open(os.path.join(input_path, "config.json")) as fIn:
            config = json.load(fIn)

        return PoolingExcludingPrompts(**config)
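
One possible way to fill in that masking step, assuming every input starts with the same prompt and you know its tokenized length, is a small helper along these lines (an untested sketch, not part of the library):

from torch import Tensor  # already imported in the module above

def mask_prompt_tokens(attention_mask: Tensor, prompt_length: int) -> Tensor:
    # Zero out the first `prompt_length` positions so prompt tokens are excluded from pooling.
    # Depending on the tokenizer, this may also mask special tokens such as a leading CLS/BOS token.
    attention_mask = attention_mask.clone()
    attention_mask[:, :prompt_length] = 0
    return attention_mask

The prompt_length would typically come from tokenizing the prompt once with your model's tokenizer and counting its tokens, minus any trailing special tokens.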

And then after the model is trained, you should be able to use the "normal" Pooling with include_prompt=False. I would double-check that the same tokens are ignored in both cases so you don't get a discrepancy.

Otherwise, you can also keep your custom Pooling in the final trained model, but then your users will have to use trust_remote_code=True. Read more about creating Custom Modules here.

  • Tom Aarsen

@ShengYun-Peng
Author


Thank you! I will try out the customized pooling method you provided.

Another clarification question: The "prompt logic" discussed above is also not supported by all evaluators, right?

Never mind, I figured it out. For whoever is curious about the solution: evaluators call model.encode, so setting a default prompt on the model will automatically apply the instruction.
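
Roughly like this (the prompt name and text are just placeholders):

from sentence_transformers import SentenceTransformer

# Evaluators call model.encode internally, so they pick up the default prompt automatically
model = SentenceTransformer(
    "path/to/your-trained-model",
    prompts={"retrieval": "Represent this query for retrieval: "},
    default_prompt_name="retrieval",
)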

@tomaarsen
Collaborator

Apologies, I missed your last question! Yes indeed, and some evaluators don't yet support a prompt/prompt_name argument. #2951 should improve that.

  • Tom Aarsen

@ArthurCamara
Contributor

@ShengYun-Peng @tomaarsen, I just created #2964, which adds prompts to the trainer and masks them accordingly. Let me know what you think!
