This is the official implementation of EffiEval, a training-free benchmarking framework for large language models (LLMs). EffiEval selects compact, representative subsets of evaluation data, preserving fairness and generalizability while maintaining strong ranking consistency with full-dataset evaluation. It is scalable and flexible, allowing users to trade off evaluation efficiency against reliability. This work builds upon *Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric*.
Clone the repository and create the environment:

```bash
git clone https://github.com/ALEX-nlp/EffiEval.git
cd EffiEval
conda create -n effieval python=3.11 -y
conda activate effieval
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```
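Optionally, you can sanity-check that the CUDA build of PyTorch installed correctly before proceeding:

```python
# Quick optional sanity check for the CUDA wheel.
import torch

print(torch.__version__)          # expected: 2.5.1+cu121
print(torch.cuda.is_available())  # should print True on a GPU machine
```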
EffiEval enables efficient benchmarking by first evaluating an indicator model on a given dataset. To do this, both the model and dataset need to be prepared. The code structure is organized as follows:
```
EffiEval
├── data
│   ├── gsm8k
│   │   └── test.json
│   ├── ...
│   └── mmlu
│       └── test.json
├── get_performance.py   # 2.1 Prepare indicator model responses
├── get_neuron.py        # 2.2 Compute neurons of the indicator model
├── selection.py         # 2.3 Subset selection
└── utils
    ├── dataset.py       # dataset configuration
    ├── model.py         # model configuration
    └── utils_neuron.py
```
In `utils/dataset.py`, the following should be implemented:

- `load_local_dataset(task_name: str) -> list[dict[str, str]]`
- `get_input_sample(task_name: str, sample: dict[str, str]) -> tuple[str, str]`
- The corresponding evaluation function, registered in `EVALUATION_FUNC`

Several examples are provided in the file.
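As a rough illustration, a new dataset could be wired up along these lines. This is a hypothetical sketch: the task name `my_task`, the file layout, and the exact-match scoring are assumptions, so consult the existing entries in `utils/dataset.py` for the actual conventions.

```python
import json

def load_local_dataset(task_name: str) -> list[dict[str, str]]:
    # Hypothetical layout: each task keeps its samples in data/<task>/test.json.
    if task_name == "my_task":
        with open(f"data/{task_name}/test.json") as fp:
            return json.load(fp)
    raise ValueError(f"unknown task: {task_name}")

def get_input_sample(task_name: str, sample: dict[str, str]) -> tuple[str, str]:
    # Return (prompt, reference answer) for one sample; field names are assumptions.
    if task_name == "my_task":
        return sample["question"], sample["answer"]
    raise ValueError(f"unknown task: {task_name}")

def evaluate_my_task(response: str, reference: str) -> bool:
    # Toy exact-match scorer; real tasks typically extract the answer first.
    return response.strip() == reference.strip()

# Register the scorer so responses for "my_task" can be graded.
EVALUATION_FUNC = {"my_task": evaluate_my_task}
```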
In `utils/model.py`, implement the following if necessary:

- `format_tokens`
- `get_model_output`

Then, register the model name in `MODEL_PATHS` (e.g., `"qwen2.5": "Qwen/Qwen2.5-7B-Instruct"`). Several examples are also provided in the file.
Note: when the model is served through an online API, the `MODEL_PATHS` entry should look like `"gpt-4o-2024-11-20": None`. In this case, the preparation steps above can be skipped. The `OPENAI_KEY` can be configured in a `.env` file.
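Putting the two cases together, the `MODEL_PATHS` registration might look roughly like this (a sketch; the surrounding helpers in `utils/model.py` are omitted):

```python
# In utils/model.py: map a short model name either to a Hugging Face
# model path (local inference) or to None (online API inference).
MODEL_PATHS = {
    "qwen2.5": "Qwen/Qwen2.5-7B-Instruct",  # local model; uses format_tokens / get_model_output
    "gpt-4o-2024-11-20": None,              # API model; local preparation is skipped
}
```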
Example usage:

```python
if __name__ == '__main__':
    get_performance("qwen2.5", "gsm8k")
```
The evaluation results will be saved in the `./response` directory. The model name (e.g., `"qwen2.5"`) and dataset name (e.g., `"gsm8k"`) should match the entries in `MODEL_PATHS` and `load_local_dataset`.
Once the outputs of the indicator model are available, the neurons activated by each sample can be computed from them. This functionality is implemented in `get_neuron.py`:

```python
if __name__ == '__main__':
    get_neuron("qwen2.5", "gsm8k")
```
The activated neurons of the indicator model on the dataset will be saved in `./neurons` by default.
In `selection.py`, load the activated neurons first:

```python
# (indicator_model, topk, dataset_name)
neuron_config = NeuronConfig("qwen2.5", 0.001, "gsm8k")
# np.ndarray with shape [num_sample, num_neuron]
matrix = neuron_config.get_matrix()
```
This matrix can then be used to solve the Maximum Coverage Problem (MCP):

```python
indices, coverage = greedy_maximum_coverage(matrix, k=100)
```
- `indices`: `np.ndarray`, indices of the selected samples
- `coverage`: `int`, number of covered activated neurons
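For reference, greedy maximum coverage over a binary activation matrix can be sketched as follows. This is a from-scratch illustration of the algorithm, not the repository's implementation; use the actual `greedy_maximum_coverage` from `selection.py`.

```python
import numpy as np

def greedy_max_coverage_sketch(matrix: np.ndarray, k: int) -> tuple[np.ndarray, int]:
    """Greedy maximum coverage over a binary [num_sample, num_neuron] matrix."""
    active = matrix.astype(bool)
    covered = np.zeros(active.shape[1], dtype=bool)
    selected: list[int] = []
    for _ in range(k):
        # Marginal gain: how many not-yet-covered neurons each sample would add.
        gains = (active & ~covered).sum(axis=1)
        gains[selected] = -1          # never pick the same sample twice
        best = int(gains.argmax())
        if gains[best] <= 0:          # no remaining sample adds new coverage
            break
        selected.append(best)
        covered |= active[best]
    return np.array(selected), int(covered.sum())
```

The greedy strategy is the standard heuristic for MCP and carries the classic (1 − 1/e) approximation guarantee.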
Save the subset to disk:

```python
import json

dataset = load_local_dataset("gsm8k")
subset = [dataset[idx] for idx in indices]
with open("subset.json", "w") as fp:
    json.dump(subset, fp)
```
You can verify the subset using `verify_selection` in `selection.py`. For example, after evaluating several models (registered in `MODEL_PATHS`) through `get_performance.py`, run:
```python
verify_selection(
    models=list(MODEL_PATHS.keys()),
    task="gsm8k",
    k=100,
    neuron_config=NeuronConfig("qwen2.5", 0.001, "gsm8k"),
)
```
This will print the rank correlations (`r_S`, `r_K`) and the mean absolute error (MAE) between the performance of the models on the full dataset and on the selected subset.
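If you want to reproduce these statistics yourself, they can be computed with SciPy roughly as follows. This assumes `full` and `subset` are per-model accuracy arrays and that `r_S`/`r_K` denote Spearman/Kendall rank correlations; whether `verify_selection` uses exactly these conventions is an assumption.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

def consistency_metrics(full: np.ndarray, subset: np.ndarray) -> dict[str, float]:
    # full/subset: accuracy of each model on the full dataset vs. the subset.
    r_s, _ = spearmanr(full, subset)   # Spearman rank correlation
    r_k, _ = kendalltau(full, subset)  # Kendall rank correlation
    mae = float(np.abs(full - subset).mean())
    return {"r_S": r_s, "r_K": r_k, "MAE": mae}
```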
If you find this work helpful, please consider citing:
```bibtex
@misc{cao2025effievalefficientgeneralizable,
      title={EffiEval: Efficient and Generalizable Model Evaluation via Capability Coverage Maximization},
      author={Yixin Cao and Jiahao Ying and Yubo Ma and Yugang Jiang and Yaoning Wang},
      year={2025},
      eprint={2508.09662},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.09662},
}
```