Conversation

martinscooper (Collaborator)

This PR moves judges in prepare/metrics/llm_as_judge/direct/llama_3_3_70b_instruct_adherence_completeness.py to prepare/metrics/llm_as_judge/llm_as_judge.py so that:

  • judges and the underlying inference engine are created using the same inference engine/judge parameter set (for example: temperature = 0)
  • all new LLM judge approaches are created in the same file (it is a bit cumbersome to have to run multiple files when making changes to the artifacts)
  • it also moves the criteria definitions adherence_with_format and answer_completeness to llm_as_judge_constants.py
  • it uses the criteria's catalog name instead of the object
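As a rough illustration of the last point, referencing a criterion by its catalog name rather than passing the object directly might look like this (the catalog paths, fields, and helper below are hypothetical, not the real unitxt catalog API):

```python
# Hypothetical sketch: a judge can reference a criterion either by its
# catalog name (a string) or by passing the criterion definition inline.
# Catalog paths and fields are illustrative only.

CRITERIA_CATALOG = {
    "metrics.llm_as_judge.direct.criteria.adherence_with_format": {
        "name": "adherence_with_format",
        "description": "Does the response adhere to the requested format?",
    },
    "metrics.llm_as_judge.direct.criteria.answer_completeness": {
        "name": "answer_completeness",
        "description": "Does the response answer the question completely?",
    },
}

def resolve_criterion(ref):
    """Accept a catalog name (str) or an inline criterion definition (dict)."""
    if isinstance(ref, str):
        return CRITERIA_CATALOG[ref]
    return ref
```

Registering judges against the string reference keeps each criterion defined in exactly one place while any number of judges point at it by name.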

@lilacheden the default value of the adherence metric's instructions context field seems a bit too specific.

"context_fields": {
    "question": "question",
    "instructions": "metadata/template/instruction",
}

Do you think we could simplify it?

@elronbandel I tried setting those context field values using the square bracket notation, but it says the reference is malformed. Could you remind me whether dictionaries are supported there?
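For context on the malformed-reference question, here is a minimal sketch of why a flat key=value override parser struggles with dictionary values (this parser is an assumption for illustration, not unitxt's actual bracket-notation parser):

```python
# Naive comma/equals parsing of bracket overrides like "a=1,b=2".
# A dictionary value contains its own commas and colons, so a flat
# split mangles it -- which would surface as a "malformed" reference.

def parse_overrides(spec):
    """Parse 'a=1,b=2' style overrides into a flat dict of strings."""
    out = {}
    for pair in spec.split(","):
        key, _, value = pair.partition("=")
        out[key.strip()] = value.strip()
    return out
```

With `parse_overrides("context_fields={'q': 'question', 'i': 'instruction'}")`, the comma inside the dict splits the value in two, so the parsed result is garbage; list values without commas-in-values survive, nested dicts do not.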

@lilacheden (Member)


Hi @martinscooper,

  1. What do you mean by "too specific"? The judge requires the instructions of the original prompt, and this is where they can be found (at least for the relevant task). I don't know of any general way to get that.

  2. It makes sense to always create the judges from the catalog. From now on I will always use the registered LLM judge for creation and just override the criteria/context fields/any other desired attributes instead of creating a new judge.

  3. However, I'm not sure all LLM judges and criteria should be prepared and stored together - maybe it's better to have a public catalog for everyone with the suggested criteria, and a private catalog (with separate preparation scripts) where each user (like myself) can create their own esoteric criteria and judges, just as they can be created by users on the fly.
    It can help, for example, if someone wants to use a criterion similar to one in the public catalog but described differently for their own use case.

How does that sound to you?
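The public/private catalog idea - reusing a suggested criterion but re-describing it for one's own use case - could be sketched like this (a hypothetical illustration using stdlib dataclasses, not an actual unitxt mechanism):

```python
# Hypothetical sketch: a user takes a public-catalog criterion and
# overrides only the description for their private catalog, leaving
# the shared definition untouched.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Criterion:
    name: str
    description: str

# The "public catalog" entry everyone shares:
public = Criterion(
    name="answer_completeness",
    description="Does the response fully answer the question?",
)

# A user's re-described copy for their own use case:
custom = replace(
    public,
    description="Does the response cover every sub-question asked?",
)
```

Because `replace` returns a new frozen instance, the public entry stays intact while the private catalog carries the user-specific wording.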

@martinscooper (Collaborator, Author)


@lilacheden

  1. By "too specific" I mean that the instructions entry of the context fields should have a simpler field name by default. I would set context_fields as a plain list, which users probably wouldn't have to change:
{
  ...,
  "context_fields": ["question", "instruction"],
  ...,
}

Then, if a user needs a more specific source from which a context field should be taken, they could specify it manually for their use case.

  2. Agreed.

  3. Sounds good. It's true that it isn't that important to have all registered judges in the same file, so I would move the judges back to their original file. Do you agree on calling get_evaluator_metadata() so that the params stay consistent across all the judges?
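The list-form context_fields default and the idea of building every judge from one shared parameter set could be sketched together like this (all names, defaults, and helpers are hypothetical - not the real unitxt or get_evaluator_metadata() API):

```python
# Hypothetical sketch, not the real unitxt API.

# One shared parameter set so settings like temperature = 0 are
# identical across every registered judge. Values are illustrative.
SHARED_JUDGE_PARAMS = {"temperature": 0, "max_tokens": 1024}

def normalize_context_fields(context_fields):
    """Accept the proposed list form or the explicit dict form.

    A list means each context field is read from the input field of the
    same name; a dict maps a context name to a specific source path
    (e.g. "metadata/template/instruction").
    """
    if isinstance(context_fields, dict):
        return context_fields
    return {name: name for name in context_fields}

def make_judge(model_name, criterion, context_fields, **overrides):
    """Build a judge config from the shared parameter set, letting
    callers override individual attributes."""
    return {
        "model": model_name,
        "criterion": criterion,
        "context_fields": normalize_context_fields(context_fields),
        **SHARED_JUDGE_PARAMS,
        **overrides,
    }
```

Under this sketch, `make_judge("llama-3-3-70b-instruct", "adherence_with_format", ["question", "instruction"])` gets the simple list default, while a user who needs the specific source path still passes the dict form explicitly.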

data = {
    "test": [
        {
            "question": "How is the weather?",
Member


Why was this example changed? Is this intentional?

Collaborator Author


No, I will remove it

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>
@martinscooper (Collaborator, Author)

@yoavkatz @elronbandel I applied the fix.

