
Commit

Merge branch 'main' of github.com:modelscope/evalscope into release/0.6

wangxingjun778 committed Nov 22, 2024
2 parents d289ece + b2abfe7 · commit 45cf89b
Showing 56 changed files with 900 additions and 1,172 deletions.
26 changes: 13 additions & 13 deletions .pre-commit-config.yaml
@@ -8,19 +8,19 @@ repos:
thirdparty/|
examples/
)$
# - repo: https://github.com/PyCQA/isort.git
# rev: 4.3.21
# hooks:
# - id: isort
# - repo: https://github.com/pre-commit/mirrors-yapf.git
# rev: v0.30.0
# hooks:
# - id: yapf
# exclude: |
# (?x)^(
# thirdparty/|
# examples/
# )$
- repo: https://github.com/PyCQA/isort.git
rev: 4.3.21
hooks:
- id: isort
- repo: https://github.com/pre-commit/mirrors-yapf.git
rev: v0.30.0
hooks:
- id: yapf
exclude: |
(?x)^(
thirdparty/|
examples/
)$
- repo: https://github.com/pre-commit/pre-commit-hooks.git
rev: v3.1.0
hooks:
11 changes: 6 additions & 5 deletions README.md
@@ -17,6 +17,7 @@
<a href="https://evalscope.readthedocs.io/en/latest/">📖 Documents</a>
<p>

> ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!
## 📋 Table of Contents
- [Introduction](#introduction)
@@ -42,7 +43,7 @@ EvalScope is the official model evaluation and performance benchmarking framework
The architecture includes the following modules:
1. **Model Adapter**: The model adapter is used to convert the outputs of specific models into the format required by the framework, supporting both API call models and locally run models.
2. **Data Adapter**: The data adapter is responsible for converting and processing input data to meet various evaluation needs and formats.
3. **Evaluation Backend**:
- **Native**: EvalScope’s own **default evaluation framework**, supporting various evaluation modes, including single model evaluation, arena mode, baseline model comparison mode, etc.
- **OpenCompass**: Supports [OpenCompass](https://github.com/open-compass/opencompass) as the evaluation backend, providing advanced encapsulation and task simplification, allowing you to submit tasks for evaluation more easily.
- **VLMEvalKit**: Supports [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) as the evaluation backend, enabling easy initiation of multi-modal evaluation tasks, supporting various multi-modal models and datasets.
@@ -129,7 +130,7 @@ You can execute this command from any directory:
python -m evalscope.run \
--model qwen/Qwen2-0.5B-Instruct \
--template-type qwen \
--datasets arc
```
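
The same quick-start run can also be launched from Python. The following is a minimal sketch rather than an authoritative API reference: it assumes the `run_task` entry point in `evalscope.run` and reuses the task-config keys (`model`, `template_type`, `datasets`) that appear in the configuration example later in this commit; verify the exact import path and accepted keys against the EvalScope documentation.

```python
# Hypothetical programmatic equivalent of the CLI quick start above.
# Assumes run_task is importable from evalscope.run and accepts a plain dict config.
from evalscope.run import run_task

task_cfg = {
    'model': 'qwen/Qwen2-0.5B-Instruct',  # same model id as in the CLI example
    'template_type': 'qwen',              # chat template type, as with --template-type
    'datasets': ['arc'],                  # dataset names, as with --datasets arc
}

run_task(task_cfg=task_cfg)
```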

#### Install from source
@@ -236,13 +237,13 @@ EvalScope supports using third-party evaluation frameworks to initiate evaluation
EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset.html)

## Offline Evaluation
You can use a local dataset to evaluate the model without an internet connection.

Refer to: Offline Evaluation [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/offline_evaluation.html)


## Arena Mode
Arena mode evaluates multiple candidate models through pairwise battles; you can use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report.

Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)

@@ -270,4 +271,4 @@ Refer to : Model Serving Performance Evaluation [📖 User Guide](https://evalsc

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=modelscope/evalscope&type=Date)](https://star-history.com/#modelscope/evalscope&Date)
12 changes: 7 additions & 5 deletions README_zh.md
@@ -18,6 +18,8 @@
<p>


> ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!
## 📋 Table of Contents
- [Introduction](#简介)
- [News](#新闻)
@@ -46,7 +48,7 @@ EvalScope includes the following modules:

2. **Data Adapter**: The data adapter is responsible for converting and processing input data to meet various evaluation needs and formats.

3. **Evaluation Backend**:
- **Native**: EvalScope's own **default evaluation framework**, supporting various evaluation modes, including single model evaluation, arena mode, baseline model comparison mode, etc.
- **OpenCompass**: Supports [OpenCompass](https://github.com/open-compass/opencompass) as the evaluation backend, providing advanced encapsulation and task simplification, allowing you to submit tasks for evaluation more easily.
- **VLMEvalKit**: Supports [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) as the evaluation backend, enabling easy initiation of multi-modal evaluation tasks, supporting various multi-modal models and datasets.
@@ -138,7 +140,7 @@ pip install -e '.[all]'  # Install all backends (Native, OpenCompass,
python -m evalscope.run \
--model qwen/Qwen2-0.5B-Instruct \
--template-type qwen \
--datasets arc
```

#### Install from source
@@ -176,7 +178,7 @@ python evalscope/run.py \

**Example 2:**
```shell
python evalscope/run.py \
--model qwen/Qwen2-0.5B-Instruct \
--template-type qwen \
--generation-config do_sample=false,temperature=0.0 \
@@ -219,7 +221,7 @@ your_task_cfg = {
'dataset_args': {},
'dry_run': False,
'model': 'qwen/Qwen2-0.5B-Instruct',
'template_type': 'qwen',
'datasets': ['arc', 'hellaswag'],
'work_dir': DEFAULT_ROOT_CACHE_DIR,
'outputs': DEFAULT_ROOT_CACHE_DIR,
@@ -280,4 +282,4 @@ EvalScope supports custom dataset evaluation; for details, refer to the Custom Dataset

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=modelscope/evalscope&type=Date)](https://star-history.com/#modelscope/evalscope&Date)
14 changes: 13 additions & 1 deletion docs/en/user_guides/backend/opencompass_backend.md
@@ -15,7 +15,19 @@ There are two ways to download datasets. The automatic download method supports
You can view the dataset name list using the following code:
```python
from evalscope.backend.opencompass import OpenCompassBackendManager
print(f'All datasets from OpenCompass backend: {OpenCompassBackendManager.list_datasets()}')
# list datasets
OpenCompassBackendManager.list_datasets()
>>> ['summedits', 'humaneval', 'lambada',
'ARC_c', 'ARC_e', 'CB', 'C3', 'cluewsc', 'piqa',
'bustm', 'storycloze', 'lcsts', 'Xsum', 'winogrande',
'ocnli', 'AX_b', 'math', 'race', 'hellaswag',
'WSC', 'eprstmt', 'siqa', 'agieval', 'obqa',
'afqmc', 'GaokaoBench', 'triviaqa', 'CMRC',
'chid', 'gsm8k', 'ceval', 'COPA', 'ReCoRD',
'ocnli_fc', 'mbpp', 'csl', 'tnews', 'RTE',
'cmnli', 'AX_g', 'nq', 'cmb', 'BoolQ', 'strategyqa',
'mmlu', 'WiC', 'MultiRC', 'DRCD', 'cmmlu']
```
````
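
Because `list_datasets()` returns plain dataset names, one convenient use is a pre-flight check before assembling an evaluation task. The sketch below relies only on the `OpenCompassBackendManager.list_datasets()` call shown above; the names in `wanted` are illustrative examples taken from that list.

```python
from evalscope.backend.opencompass import OpenCompassBackendManager

# Dataset names we intend to evaluate (examples drawn from the list above).
wanted = {'gsm8k', 'mmlu', 'cmmlu'}

# Names the OpenCompass backend actually supports.
supported = set(OpenCompassBackendManager.list_datasets())

missing = wanted - supported
if missing:
    raise ValueError(f'Not supported by the OpenCompass backend: {sorted(missing)}')
print(f'All requested datasets are available: {sorted(wanted)}')
```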

16 changes: 14 additions & 2 deletions docs/zh/user_guides/backend/opencompass_backend.md
@@ -18,10 +18,22 @@ pip install evalscope[opencompass] -U
For detailed information about the datasets, refer to the [OpenCompass dataset list](../../get_started/supported_dataset.md#2-opencompass评测后端支持的数据集).
You can use the following method to view the list of supported dataset names:
```python
from evalscope.backend.opencompass import OpenCompassBackendManager
print(f'All datasets from OpenCompass backend: {OpenCompassBackendManager.list_datasets()}')
# list the supported dataset names
OpenCompassBackendManager.list_datasets()
>>> ['summedits', 'humaneval', 'lambada',
'ARC_c', 'ARC_e', 'CB', 'C3', 'cluewsc', 'piqa',
'bustm', 'storycloze', 'lcsts', 'Xsum', 'winogrande',
'ocnli', 'AX_b', 'math', 'race', 'hellaswag',
'WSC', 'eprstmt', 'siqa', 'agieval', 'obqa',
'afqmc', 'GaokaoBench', 'triviaqa', 'CMRC',
'chid', 'gsm8k', 'ceval', 'COPA', 'ReCoRD',
'ocnli_fc', 'mbpp', 'csl', 'tnews', 'RTE',
'cmnli', 'AX_g', 'nq', 'cmb', 'BoolQ', 'strategyqa',
'mmlu', 'WiC', 'MultiRC', 'DRCD', 'cmmlu']
```
````

1 change: 1 addition & 0 deletions evalscope/backend/opencompass/tasks/eval_datasets.py
@@ -50,6 +50,7 @@
from opencompass.configs.datasets.nq.nq_gen_c788f6 import nq_datasets
from opencompass.configs.datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
from opencompass.configs.datasets.cmb.cmb_gen_dfb5c4 import cmb_datasets
from opencompass.configs.datasets.cmmlu.cmmlu_gen_c13365 import cmmlu_datasets

# Note: to be supported
# from opencompass.configs.datasets.flores.flores_gen_806ede import flores_datasets