
Commit

Merge branch 'main' of github.com:modelscope/evalscope into release/0.6

wangxingjun778 committed Nov 22, 2024
2 parents d289ece + b2abfe7 · commit 45cf89b
Showing 56 changed files with 900 additions and 1,172 deletions.
26 changes: 13 additions & 13 deletions .pre-commit-config.yaml
@@ -8,19 +8,19 @@ repos:
thirdparty/|
examples/
)$
# - repo: https://github.com/PyCQA/isort.git
# rev: 4.3.21
# hooks:
# - id: isort
# - repo: https://github.com/pre-commit/mirrors-yapf.git
# rev: v0.30.0
# hooks:
# - id: yapf
# exclude: |
# (?x)^(
# thirdparty/|
# examples/
# )$
- repo: https://github.com/PyCQA/isort.git
rev: 4.3.21
hooks:
- id: isort
- repo: https://github.com/pre-commit/mirrors-yapf.git
rev: v0.30.0
hooks:
- id: yapf
exclude: |
(?x)^(
thirdparty/|
examples/
)$
- repo: https://github.com/pre-commit/pre-commit-hooks.git
rev: v3.1.0
hooks:
11 changes: 6 additions & 5 deletions README.md
@@ -17,6 +17,7 @@
<a href="https://evalscope.readthedocs.io/en/latest/">📖 Documents</a>
<p>

> ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!
## 📋 Table of Contents
- [Introduction](#introduction)
@@ -42,7 +43,7 @@ EvalScope is the official model evaluation and performance benchmarking framework
The architecture includes the following modules:
1. **Model Adapter**: The model adapter is used to convert the outputs of specific models into the format required by the framework, supporting both API call models and locally run models.
2. **Data Adapter**: The data adapter is responsible for converting and processing input data to meet various evaluation needs and formats.
3. **Evaluation Backend**:
- **Native**: EvalScope’s own **default evaluation framework**, supporting various evaluation modes, including single model evaluation, arena mode, baseline model comparison mode, etc.
- **OpenCompass**: Supports [OpenCompass](https://github.com/open-compass/opencompass) as the evaluation backend, providing advanced encapsulation and task simplification, allowing you to submit tasks for evaluation more easily.
- **VLMEvalKit**: Supports [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) as the evaluation backend, enabling easy initiation of multi-modal evaluation tasks, supporting various multi-modal models and datasets.
@@ -129,7 +130,7 @@ You can execute this command from any directory:
python -m evalscope.run \
--model qwen/Qwen2-0.5B-Instruct \
--template-type qwen \
--datasets arc
```
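
The same quick-start run can also be launched from Python. The following is a minimal sketch rather than an authoritative API reference: it assumes the `run_task` entry point in `evalscope.run` and reuses the task-config keys (`model`, `template_type`, `datasets`) that appear in the configuration example later in this commit; verify the exact import path and accepted keys against the EvalScope documentation.

```python
# Hypothetical programmatic equivalent of the CLI quick start above.
# Assumes run_task is importable from evalscope.run and accepts a plain dict config.
from evalscope.run import run_task

task_cfg = {
    'model': 'qwen/Qwen2-0.5B-Instruct',  # same model id as in the CLI example
    'template_type': 'qwen',              # chat template type, as with --template-type
    'datasets': ['arc'],                  # dataset names, as with --datasets arc
}

run_task(task_cfg=task_cfg)
```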

#### Install from source
@@ -236,13 +237,13 @@ EvalScope supports using third-party evaluation frameworks to initiate evaluation
EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset.html)

## Offline Evaluation
You can use a local dataset to evaluate the model without an internet connection.

Refer to: Offline Evaluation [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/offline_evaluation.html)


## Arena Mode
Arena mode evaluates multiple candidate models through pairwise battles; you can use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report.

Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)

@@ -270,4 +271,4 @@ Refer to : Model Serving Performance Evaluation [📖 User Guide](https://evalsc

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=modelscope/evalscope&type=Date)](https://star-history.com/#modelscope/evalscope&Date)
12 changes: 7 additions & 5 deletions README_zh.md
@@ -18,6 +18,8 @@
<p>


> ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!
## 📋 Table of Contents
- [Introduction](#简介)
- [News](#新闻)
@@ -46,7 +48,7 @@ EvalScope includes the following modules:

2. **Data Adapter**: The data adapter is responsible for converting and processing input data to meet various evaluation needs and formats.

3. **Evaluation Backend**:
- **Native**: EvalScope's own **default evaluation framework**, supporting various evaluation modes, including single model evaluation, arena mode, baseline model comparison mode, etc.
- **OpenCompass**: Supports [OpenCompass](https://github.com/open-compass/opencompass) as the evaluation backend, providing advanced encapsulation and task simplification, allowing you to submit tasks for evaluation more easily.
- **VLMEvalKit**: Supports [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) as the evaluation backend, enabling easy initiation of multi-modal evaluation tasks, supporting various multi-modal models and datasets.
@@ -138,7 +140,7 @@ pip install -e '.[all]'  # Install all backends (Native, OpenCompass,
python -m evalscope.run \
--model qwen/Qwen2-0.5B-Instruct \
--template-type qwen \
--datasets arc
```

#### Install from source
@@ -176,7 +178,7 @@ python evalscope/run.py \

**Example 2:**
```shell
python evalscope/run.py \
--model qwen/Qwen2-0.5B-Instruct \
--template-type qwen \
--generation-config do_sample=false,temperature=0.0 \
@@ -219,7 +221,7 @@ your_task_cfg = {
'dataset_args': {},
'dry_run': False,
'model': 'qwen/Qwen2-0.5B-Instruct',
'template_type': 'qwen',
'datasets': ['arc', 'hellaswag'],
'work_dir': DEFAULT_ROOT_CACHE_DIR,
'outputs': DEFAULT_ROOT_CACHE_DIR,
@@ -280,4 +282,4 @@ EvalScope supports custom dataset evaluation; for details, refer to the Custom Dataset

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=modelscope/evalscope&type=Date)](https://star-history.com/#modelscope/evalscope&Date)
14 changes: 13 additions & 1 deletion docs/en/user_guides/backend/opencompass_backend.md
@@ -15,7 +15,19 @@ There are two ways to download datasets. The automatic download method supports
You can view the dataset name list using the following code:
```python
from evalscope.backend.opencompass import OpenCompassBackendManager
print(f'All datasets from OpenCompass backend: {OpenCompassBackendManager.list_datasets()}')
# list datasets
OpenCompassBackendManager.list_datasets()
>>> ['summedits', 'humaneval', 'lambada',
'ARC_c', 'ARC_e', 'CB', 'C3', 'cluewsc', 'piqa',
'bustm', 'storycloze', 'lcsts', 'Xsum', 'winogrande',
'ocnli', 'AX_b', 'math', 'race', 'hellaswag',
'WSC', 'eprstmt', 'siqa', 'agieval', 'obqa',
'afqmc', 'GaokaoBench', 'triviaqa', 'CMRC',
'chid', 'gsm8k', 'ceval', 'COPA', 'ReCoRD',
'ocnli_fc', 'mbpp', 'csl', 'tnews', 'RTE',
'cmnli', 'AX_g', 'nq', 'cmb', 'BoolQ', 'strategyqa',
'mmlu', 'WiC', 'MultiRC', 'DRCD', 'cmmlu']
```
````
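
Because `list_datasets()` returns plain dataset names, one convenient use is a pre-flight check before assembling an evaluation task. The sketch below relies only on the `OpenCompassBackendManager.list_datasets()` call shown above; the names in `wanted` are illustrative examples taken from that list.

```python
from evalscope.backend.opencompass import OpenCompassBackendManager

# Dataset names we intend to evaluate (examples drawn from the list above).
wanted = {'gsm8k', 'mmlu', 'cmmlu'}

# Names the OpenCompass backend actually supports.
supported = set(OpenCompassBackendManager.list_datasets())

missing = wanted - supported
if missing:
    raise ValueError(f'Not supported by the OpenCompass backend: {sorted(missing)}')
print(f'All requested datasets are available: {sorted(wanted)}')
```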

16 changes: 14 additions & 2 deletions docs/zh/user_guides/backend/opencompass_backend.md
@@ -18,10 +18,22 @@ pip install evalscope[opencompass] -U
For detailed information about the datasets, refer to the [OpenCompass dataset list](../../get_started/supported_dataset.md#2-opencompass评测后端支持的数据集).
You can use the following method to view the list of supported dataset names:
```python
from evalscope.backend.opencompass import OpenCompassBackendManager
print(f'All datasets from OpenCompass backend: {OpenCompassBackendManager.list_datasets()}')
# list the supported dataset names
OpenCompassBackendManager.list_datasets()
>>> ['summedits', 'humaneval', 'lambada',
'ARC_c', 'ARC_e', 'CB', 'C3', 'cluewsc', 'piqa',
'bustm', 'storycloze', 'lcsts', 'Xsum', 'winogrande',
'ocnli', 'AX_b', 'math', 'race', 'hellaswag',
'WSC', 'eprstmt', 'siqa', 'agieval', 'obqa',
'afqmc', 'GaokaoBench', 'triviaqa', 'CMRC',
'chid', 'gsm8k', 'ceval', 'COPA', 'ReCoRD',
'ocnli_fc', 'mbpp', 'csl', 'tnews', 'RTE',
'cmnli', 'AX_g', 'nq', 'cmb', 'BoolQ', 'strategyqa',
'mmlu', 'WiC', 'MultiRC', 'DRCD', 'cmmlu']
```
````

1 change: 1 addition & 0 deletions evalscope/backend/opencompass/tasks/eval_datasets.py
@@ -50,6 +50,7 @@
from opencompass.configs.datasets.nq.nq_gen_c788f6 import nq_datasets
from opencompass.configs.datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
from opencompass.configs.datasets.cmb.cmb_gen_dfb5c4 import cmb_datasets
from opencompass.configs.datasets.cmmlu.cmmlu_gen_c13365 import cmmlu_datasets

# Note: to be supported
# from opencompass.configs.datasets.flores.flores_gen_806ede import flores_datasets