
JudgeAgent

This is the source code of the paper "JudgeAgent: Knowledge-wise and Dynamic LLM Evaluation with Agent-as-Interviewer".

How to Run

1. Download Datasets

In our experiments, we use MedQA, MultiHop-RAG, and QuALITY.

Please download the datasets you want to use and put each dataset folder under ./data
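
The folder name under ./data should match the value passed to --data in the commands below. For example, a typical layout (illustrative only; it depends on how you unpack the downloads) looks like:

./data/
├── MedQA/
├── MultiHopRAG/
└── QuALITY/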

2. Process Data

Run the following command to process the data into the format JudgeAgent needs:

python process_data.py --data [MedQA | MultiHopRAG | QuALITY]
  • --data is the name of the dataset folder.

After running this command, a new folder ./processed_data/[data_name] is created under the project directory, containing the processed chunks (corpus_chunks.jsonl) and questions (questions.json).
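
For example, processing MedQA produces a layout like:

./processed_data/MedQA/
├── corpus_chunks.jsonl   # processed corpus chunks
└── questions.json        # processed questions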

3. Preprocess

JudgeAgent constructs context graphs, extracts entities from questions, and obtains embeddings before the formal evaluation process.

Before running these scripts, you must modify MODEL_PARAMS in ./JudgeAgent/llm_params.py by adding your own LLM information, including the API key, base_url, model_name, etc.
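
The exact structure is defined in ./JudgeAgent/llm_params.py itself; as a rough illustration only (the key names below are hypothetical, so follow the keys already present in the file), an entry might look like:

MODEL_PARAMS = {
    "gpt": {
        "api_key": "sk-...",                       # your API key
        "base_url": "https://api.openai.com/v1",   # or your provider's endpoint
        "model_name": "gpt-4o",                    # the model to call
    },
}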

Please run the scripts in the following order.

step1. construct graphs and extract entities

python construct_graph.py --data [MedQA | ...] --model gpt
python get_question_entity.py --data [MedQA | ...] --model gpt
  • --data is the name of the dataset folder.
  • --model is the LLM used to extract entities from the text.

You can run these two scripts at the same time.

step2. obtain embeddings

python get_embeddings.py --data [MedQA | ...] --model qwen3 --dim 1024
  • --model is the text embedding model. In our experiments, we use Qwen's embedding service to obtain the embeddings (see the sketch after this list).
  • --dim is the dimension of the embeddings.
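
As an illustration of how such embeddings can be obtained, here is a minimal sketch using Qwen's OpenAI-compatible endpoint and the text-embedding-v3 model (both are assumptions for the example, not necessarily what get_embeddings.py does):

from openai import OpenAI

# Assumption: Qwen's embedding service via the DashScope OpenAI-compatible endpoint.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

resp = client.embeddings.create(
    model="text-embedding-v3",   # assumed embedding model name
    input=["an entity mention", "another entity mention"],
    dimensions=1024,             # matches --dim 1024
)
vectors = [item.embedding for item in resp.data]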

step3. match the extracted entities with the entities on the graph

python get_similar_entity_on_graph.py --data [MedQA | ...]

This script matches the extracted entities with the entities on the context graph using the similarity computed from the embeddings obtained in step 2.
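
A minimal sketch of the matching idea, assuming the embeddings are loaded as NumPy arrays (the function and variable names are illustrative, not the script's actual API):

import numpy as np

def match_entities(question_embs, graph_embs, graph_entity_names, top_k=1):
    # Normalize rows so that the dot product equals cosine similarity.
    q = question_embs / np.linalg.norm(question_embs, axis=1, keepdims=True)
    g = graph_embs / np.linalg.norm(graph_embs, axis=1, keepdims=True)
    sims = q @ g.T  # (num_question_entities, num_graph_entities)
    best = np.argsort(-sims, axis=1)[:, :top_k]
    return [[graph_entity_names[j] for j in row] for row in best]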

4. Evaluation

For efficiency, JudgeAgent synthesizes questions before evaluating the target LLM.

Please run the scripts in the following order.

step1. synthesize questions

python synthetise_questions.py \
--data [MedQA |...] \
--model gpt \
--bs 3 \
--sample_hop 2 \
--max_extend_round 3 \
--no_graph \ # Optional
--fix_difficulty ["" | easy | medium | hard]
  • --bs is the batch size in the Benchmark Grading stage.
  • --sample_hop is the number of hops when sampling knowledge paths on the context graph.
  • --max_extend_round is the maximum number of extension rounds in the Interactive Extension stage.

The following two arguments are for the ablation studies.

  • --no_graph: When this argument is added, JudgeAgent synthesizes questions from texts randomly sampled from corpus_chunks.jsonl instead of from paths sampled on the context graph.
  • --fix_difficulty: When this argument is not the empty string "", JudgeAgent synthesizes questions only at the given difficulty, corresponding to the "w/o difficulty-adaptive" ablation study. When it is "", JudgeAgent synthesizes questions for each difficulty level before evaluation, so the difficulty can be chosen dynamically during evaluation (see the sketch after this list).
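
Conceptually, --fix_difficulty only controls which difficulty levels get questions synthesized for them. The sketch below restates the semantics described above and is not the repository's actual code:

DIFFICULTY_LEVELS = ["easy", "medium", "hard"]

def levels_to_synthesize(fix_difficulty: str):
    # Empty string: synthesize questions for every level, so the difficulty
    # can be chosen dynamically during evaluation.
    if fix_difficulty == "":
        return DIFFICULTY_LEVELS
    # Otherwise: only the fixed level (the "w/o difficulty-adaptive" ablation).
    return [fix_difficulty]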

step2. evaluate the target and get feedback

python evaluate_target.py \
--data [MedQA |...] \
--target XXX \
--model gpt \
--bs 3 \
--sample_hop 2 \
--max_extend_round 3 \
# following are ablation settings
--eval_all_rounds \ # Optional
--no_graph \ # Optional
--no_extension \ # Optional
--fix_difficulty ["" | easy | medium | hard]
  • --target is the target LLM to be evaluated, and --model is the LLM used in JudgeAgent for evaluation.

The following arguments are for the ablation studies on the Interactive Extension stage.

  • --eval_all_rounds: When this argument is added, JudgeAgent evaluates the target's performance up to the current round, provides suggestions, and validates their effectiveness after each round of extension.
  • --no_extension: When this argument is added, JudgeAgent skips the Interactive Extension stage and only evaluates the target's performance on the base questions.
