Conversation

martinscooper (Collaborator)

This PR moves judges in prepare/metrics/llm_as_judge/direct/llama_3_3_70b_instruct_adherence_completeness.py to prepare/metrics/llm_as_judge/llm_as_judge.py so that:

  • judges and the underlying inference engine are created using the same inference engine/judge parameter set (for example: temperature = 0)
  • all new LLM judge approaches are created in the same file (it is a bit cumbersome to have to run multiple files when making changes to the artifacts)
  • it also moves the criteria definitions adherence_with_format and answer_completeness to llm_as_judge_constants.py
  • it uses the criteria's catalog name instead of the object
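As a rough illustration of the last point, referencing a criterion by its catalog name rather than passing the object directly might look like this (the catalog paths, fields, and helper below are hypothetical, not the real unitxt catalog API):

```python
# Hypothetical sketch: a judge can reference a criterion either by its
# catalog name (a string) or by passing the criterion definition inline.
# Catalog paths and fields are illustrative only.

CRITERIA_CATALOG = {
    "metrics.llm_as_judge.direct.criteria.adherence_with_format": {
        "name": "adherence_with_format",
        "description": "Does the response adhere to the requested format?",
    },
    "metrics.llm_as_judge.direct.criteria.answer_completeness": {
        "name": "answer_completeness",
        "description": "Does the response answer the question completely?",
    },
}

def resolve_criterion(ref):
    """Accept a catalog name (str) or an inline criterion definition (dict)."""
    if isinstance(ref, str):
        return CRITERIA_CATALOG[ref]
    return ref
```

Registering judges against the string reference keeps each criterion defined in exactly one place while any number of judges point at it by name.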

@lilacheden the default value of the adherence metric's instructions context field seems a bit too specific.

"context_fields": {
    "question": "question",
    "instructions": "metadata/template/instruction",
}

Do you think we could simplify it?

@elronbandel I tried setting those context field values using the square bracket notation, but it says the reference is malformed. Could you remind me whether dictionaries are supported there?
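For context on the malformed-reference question, here is a minimal sketch of why a flat key=value override parser struggles with dictionary values (this parser is an assumption for illustration, not unitxt's actual bracket-notation parser):

```python
# Naive comma/equals parsing of bracket overrides like "a=1,b=2".
# A dictionary value contains its own commas and colons, so a flat
# split mangles it -- which would surface as a "malformed" reference.

def parse_overrides(spec):
    """Parse 'a=1,b=2' style overrides into a flat dict of strings."""
    out = {}
    for pair in spec.split(","):
        key, _, value = pair.partition("=")
        out[key.strip()] = value.strip()
    return out
```

With `parse_overrides("context_fields={'q': 'question', 'i': 'instruction'}")`, the comma inside the dict splits the value in two, so the parsed result is garbage; list values without commas-in-values survive, nested dicts do not.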

@lilacheden (Member)


Hi @martinscooper,

  1. What do you mean by "too specific"? The judge requires the instructions of the original prompt, and this is where they can be found (at least for the relevant task). I don't know of any general way to get that.

  2. It makes sense to always create the judges from the catalog. From now on I will always use the registered LLM judge for creation and just override the criteria/context fields/any other desired attributes instead of creating a new judge.

  3. However, I'm not sure all LLM judges and criteria should be prepared and stored together - maybe it's better to have a public catalog for everyone with the suggested criteria, and a private catalog (with separate preparation scripts) where each user (like myself) can create their own esoteric criteria and judges, just as they can be created by users on the fly.
    It can help, for example, if someone wants to use a criterion similar to one in the public catalog but described differently for their own use case.

How does that sound to you?
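The public/private catalog idea - reusing a suggested criterion but re-describing it for one's own use case - could be sketched like this (a hypothetical illustration using stdlib dataclasses, not an actual unitxt mechanism):

```python
# Hypothetical sketch: a user takes a public-catalog criterion and
# overrides only the description for their private catalog, leaving
# the shared definition untouched.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Criterion:
    name: str
    description: str

# The "public catalog" entry everyone shares:
public = Criterion(
    name="answer_completeness",
    description="Does the response fully answer the question?",
)

# A user's re-described copy for their own use case:
custom = replace(
    public,
    description="Does the response cover every sub-question asked?",
)
```

Because `replace` returns a new frozen instance, the public entry stays intact while the private catalog carries the user-specific wording.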

@martinscooper (Collaborator, Author)


@lilacheden

  1. By "too specific" I mean that the instructions entry of the context fields should have a simpler field name by default. I would set context_fields as a plain list, which users probably wouldn't have to change:
{
  ...,
  "context_fields": ["question", "instruction"],
  ...,
}

Then, if a user needs a more specific source from which a context field should be taken, they could specify it manually for their use case.

  2. Agreed.

  3. Sounds good. It's true that it isn't that important to have all registered judges in the same file, so I would move the judges back to their original file. Do you agree on calling get_evaluator_metadata() so that the params stay consistent across all the judges?
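The list-form context_fields default and the idea of building every judge from one shared parameter set could be sketched together like this (all names, defaults, and helpers are hypothetical - not the real unitxt or get_evaluator_metadata() API):

```python
# Hypothetical sketch, not the real unitxt API.

# One shared parameter set so settings like temperature = 0 are
# identical across every registered judge. Values are illustrative.
SHARED_JUDGE_PARAMS = {"temperature": 0, "max_tokens": 1024}

def normalize_context_fields(context_fields):
    """Accept the proposed list form or the explicit dict form.

    A list means each context field is read from the input field of the
    same name; a dict maps a context name to a specific source path
    (e.g. "metadata/template/instruction").
    """
    if isinstance(context_fields, dict):
        return context_fields
    return {name: name for name in context_fields}

def make_judge(model_name, criterion, context_fields, **overrides):
    """Build a judge config from the shared parameter set, letting
    callers override individual attributes."""
    return {
        "model": model_name,
        "criterion": criterion,
        "context_fields": normalize_context_fields(context_fields),
        **SHARED_JUDGE_PARAMS,
        **overrides,
    }
```

Under this sketch, `make_judge("llama-3-3-70b-instruct", "adherence_with_format", ["question", "instruction"])` gets the simple list default, while a user who needs the specific source path still passes the dict form explicitly.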

data = {
    "test": [
        {
            "question": "How is the weather?",
Member


Why was this example changed? Is this intentional?

Collaborator Author


No, I will remove it

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>
@martinscooper (Collaborator, Author)

@yoavkatz @elronbandel I applied the fix.

