LLM-Judge evaluation

The main use cases of this package are:

  • evaluating one model easily against another on Alpaca-Eval, Arena-Hard, m-Arena-Hard, or other benchmarks
  • easily swapping the judge model among self-hosted options (vLLM) or remote options (Together, OpenAI)

For generation and LLM judging, any model available in LangChain should be usable in theory. So far, LlamaCpp, vLLM, Together, and OpenAI have been tested, which should already cover a wide range of use cases (see the model-specification examples below).

Installation instructions.

git clone https://github.com/geoalgo/llm-judge-eval
cd llm-judge-eval
uv sync 
uv sync --extra vllm   # if you need vllm
export LLM_JUDGE_EVAL_DATA=~/llm-judge-eval-data/  # where data is downloaded
python -c "from llmjudgeeval.utils import download_all; download_all()"  # if you need to download all datasets at once

Evaluate a model.

To evaluate a model, run something like this:

python llmjudgeeval/generate_and_evaluate.py \
--dataset alpaca-eval \
--model_A Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
--model_B gpt4_1106_preview \
--judge_model Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
--n_instructions 10 

This will compare the completions of Together/meta-llama/Llama-3.3-70B-Instruct-Turbo with those of gpt4_1106_preview, judged by Together/meta-llama/Llama-3.3-70B-Instruct-Turbo. Completions for model_A are generated if they are not already present and are then compared against the gpt4_1106_preview completions available in alpaca-eval.
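
Since the judge is just another LangChain-backed model, you can swap it without changing the rest of the command. As a sketch, using an OpenAI judge instead (assuming the corresponding API key is exported):

python llmjudgeeval/generate_and_evaluate.py \
--dataset alpaca-eval \
--model_A Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
--model_B gpt4_1106_preview \
--judge_model ChatOpenAI/gpt-5-nano \
--n_instructions 10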

To choose a model, you first need to pass the LangChain backend (LlamaCpp, ChatOpenAI, VLLM, Together, ...), followed by the model name. Here are examples with different providers:

  • Together/meta-llama/Llama-3.3-70B-Instruct-Turbo
  • ChatOpenAI/gpt-5-nano
  • LlamaCpp/jwiggerthale_Llama-3.2-3B-Q8_0-GGUF_llama-3.2-3b-q8_0.gguf
  • VLLM/MultiSynt/nemotron-cc-german-9b

For example, to compare two self-hosted models with a vLLM judge:

python llmjudgeeval/generate_and_evaluate.py \
--dataset alpaca-eval \
--model_A VLLM/MultiSynt/nemotron-cc-german-9b \
--model_B VLLM/MultiSynt/nemotron-cc-german-tower72b \
--judge_model VLLM/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8 \
--n_instructions 10 

Make sure that you have the LangChain extra dependencies installed for the providers you use.
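
As a sketch, the non-vLLM providers used in the examples above can typically be installed with the standard LangChain provider packages (assumed package names; check this project's pyproject.toml for the extras it actually defines):

uv pip install langchain-openai langchain-together llama-cpp-python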

Supported datasets.

The following datasets are supported:

  • "alpaca-eval"
  • "arena-hard"
  • "m-arena-hard"
  • "m-arena-hard-{lang}" where lang is one of the language supported by m-Arena-Hard (e.g. "ar", "cs", "de") or "EU" to run on all EU languages
  • {lang}-contexts where lang is one of "finnish", "french", "german", "spanish", "swedish"
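
For instance, to run the German split of m-Arena-Hard with the same CLI as above (reusing the model names from the earlier examples):

python llmjudgeeval/generate_and_evaluate.py \
--dataset m-arena-hard-de \
--model_A VLLM/MultiSynt/nemotron-cc-german-9b \
--model_B VLLM/MultiSynt/nemotron-cc-german-tower72b \
--judge_model VLLM/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8 \
--n_instructions 10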

If you are running in a Slurm setup without internet access on compute nodes, you may want to pre-download all datasets locally by running:

python -c "from llmjudgeeval.utils import download_all; download_all()"

The datasets will be downloaded under $LLM_JUDGE_EVAL_DATA if that environment variable is set, and under ~/llm-judge-eval-data/ otherwise.

TODOs:

  • support evaluation with input swap
  • handle errors
  • CI [high/large]
  • implement CI judge option
  • implement domain filter in CI (maybe pass a regexp by column?)
  • report cost?

Done:

  • support alpaca-eval
  • support arena-hard
  • test together judge
  • local env variable to set paths
  • tqdm callback with batch
  • support loading local completions
  • support dumping outputs [medium/small]
  • test LlamaCpp [medium/small]
  • test openai judge [medium/small]
  • test vLLM judge [medium/small]
  • CLI launcher [medium/large]
  • put contexts in HF dataset [high/small]
  • mAH: instruction loader [DONE]
  • mAH: generate instructions for two models [DONE]
  • mAH: make comparison [DONE]
  • mAH: support using all languages at once [high/medium]
  • unit-test
  • add details to example to generate and evaluate completions
  • installation instructions
