Releases: modelscope/evalscope

v0.9.0 release

03 Jan 09:11

What's Changed

#253

  • Support for specifying model service API URL for evaluation: Evaluation can be performed on both local and remote model services.
  • Support for custom schema for mixed data evaluation: Combine different datasets for a more comprehensive assessment of model capabilities with less data.
  • Add benchmark contribution guidelines: Users can add their own benchmarks to make the tool more powerful and beneficial for more people.
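
The mixed data evaluation idea can be illustrated with a minimal sketch. This is not EvalScope's schema API: the helper `mix_datasets`, the dataset names, and the weights are all hypothetical, and the code only shows how a weighted mixture can draw a small evaluation set from several datasets.

```python
import random


def mix_datasets(datasets, weights, total, seed=0):
    """Draw about `total` samples across datasets in proportion to `weights`.

    Hypothetical helper for illustration only; not part of EvalScope.
    """
    rng = random.Random(seed)
    weight_sum = sum(weights.values())
    mixed = []
    for name, samples in datasets.items():
        # Quota for this dataset, capped by the number of available samples.
        quota = min(len(samples), round(total * weights[name] / weight_sum))
        mixed.extend((name, sample) for sample in rng.sample(samples, quota))
    rng.shuffle(mixed)
    return mixed


# Hypothetical datasets, each a list of sample identifiers.
datasets = {
    "math": [f"math-{i}" for i in range(100)],
    "reasoning": [f"reasoning-{i}" for i in range(100)],
    "knowledge": [f"knowledge-{i}" for i in range(100)],
}
weights = {"math": 2, "reasoning": 1, "knowledge": 1}
mixed = mix_datasets(datasets, weights, total=20)
```

With these weights, half of the 20 drawn samples come from the `math` dataset and a quarter from each of the other two, so a single small run still touches every capability area.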

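Chinese

#253

  • Support specifying a model service API URL for evaluation: both local models and remote model services can be evaluated.
  • Support mixed data evaluation with a custom schema: mix different datasets to assess model capabilities more comprehensively with less data.
  • Add a benchmark contribution guide: users can add their own benchmarks, making the tool more powerful and useful to more people.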

Full Changelog: v0.8.2...v0.9.0

v0.8.2 release

26 Dec 12:08

What's Changed

New Contributors

Full Changelog: v0.8.1...v0.8.2

v0.8.1 release

17 Dec 12:06

What's Changed

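Chinese version

  • Unify the opencompass and vlmeval output directories, by @Yunnglin in #242
  • Model stress testing: add more metrics, by @Yunnglin in #245
  • Model stress testing: add the trust remote parameter, by @Yunnglin in #246
  • Add compatibility with ms-swift<3.0, by @Yunnglin in #249
  • Fix the humaneval issue in local evaluation, by @Yunnglin in #248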

Full Changelog: v0.8.0...v0.8.1

v0.8.0 release

14 Dec 17:30

Release Notes

  1. Optimize Native eval and remove template_type #231
  2. The evalscope perf command supports the --outputs-dir configuration. #232
  3. Support ragas 0.2.7 #234

Bug Fixes

  1. Fix longwriter docs #239
  2. Fix lint for longwriter #240
  3. Fix lint #237
  4. Unify perf output #238

Documentation Updates

  1. Fix longwriter docs #239
  2. Optimize Native eval and remove template_type #231

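Chinese Notes

Features

  1. Remove the template_type parameter from Native mode evaluation #231
  2. The perf module supports --outputs-dir #232
  3. Support the latest ragas 0.2.7 release #234

Bug Fixes

  1. Fix the longwriter code example and streamline the workflow #239
  2. Fix lint, including lint for longwriter #240 #237

Documentation Updates

  1. Update the longwriter docs #239
  2. Update the docs for Native evaluation mode #231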

v0.7.2 release

04 Dec 04:24

Release Note

  1. Remove pyarrow version requirement #225
  2. Optimize warning info #223

Chinese Notes

  1. Remove the pyarrow version requirement #225
  2. Improve warning messages #223

v0.7.1 release

28 Nov 18:30

Release Notes

  1. Add PMMEval benchmark #222

Chinese Notes

Features

  1. Add the PMMEval benchmark #222

v0.7.0 release

28 Nov 07:14

Release Notes

  1. Refactor the perf module to be more robust and easier to use. #178
  2. Add speed benchmarking in the perf module. #178
  3. Add multi-modal benchmark flickr8k in the perf module for speed benchmark. #211
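
As a rough illustration of what a speed benchmark measures, the sketch below times a callable over a batch of requests and reports average latency and throughput. It does not use the perf module: `benchmark` and `fake_model` are hypothetical stand-ins, and a real run would call a model service rather than a local function.

```python
import statistics
import time


def benchmark(fn, requests):
    """Time `fn` on each request; report average latency and throughput.

    Hypothetical helper for illustration only; not the perf module's API.
    """
    latencies = []
    start = time.perf_counter()
    for request in requests:
        t0 = time.perf_counter()
        fn(request)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "requests": len(latencies),
        "avg_latency_s": statistics.mean(latencies),
        "throughput_rps": len(latencies) / elapsed,
    }


def fake_model(prompt):
    # Stand-in workload in place of a real model call.
    return sum(range(1000))


result = benchmark(fake_model, [f"prompt-{i}" for i in range(5)])
```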

Bug Fixes

  1. Add a timeout for downloading punkt.zip #206
  2. Fix parallel for speed benchmarking in the perf module. #215

Documentation Updates

  1. Update VLM-Eval doc #209
  2. Update perf module doc #178 #211

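Chinese Notes

Features

  1. Refactor the perf module to be more robust and easier to use. #178
  2. Add speed benchmarking to the perf module. #178
  3. Add the multi-modal benchmark flickr8k to the perf module for speed benchmarking. #211

Bug Fixes

  1. Fix the timeout issue when downloading punkt.zip. #206
  2. Fix the parallelism issue in speed benchmarking in the perf module. #215

Documentation Updates

  1. Update the VLM-Eval docs. #209
  2. Update the perf module docs. #178 #211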

v0.6.1 release

22 Nov 06:34

Release Notes

  1. Add CMMLU benchmark #198
  2. Add publish workflow #186
  3. Adapt RAGAS v0.2.5 and update readme #205
  4. Adapt MTEB v1.19 #196

Bug Fixes

  1. Set the datasets version: datasets>=3.0.0, <=3.0.1 #184
  2. Set the pyarrow version to <=17.0.0 to avoid an installation issue on OSX. #187
  3. Add a timeout for downloading punkt.zip #206
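
To make the version pins above concrete, here is a minimal sketch of checking a version string against inclusive lower and upper bounds. The helpers `parse_version` and `satisfies` are hypothetical and handle only plain dotted numeric versions; installers apply the full version-specifier rules, which this simplification does not cover.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '3.0.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, lower=None, upper=None):
    """Return True if lower <= version <= upper (bounds are inclusive).

    Hypothetical helper for illustration; plain numeric versions only.
    """
    parsed = parse_version(version)
    if lower is not None and parsed < parse_version(lower):
        return False
    if upper is not None and parsed > parse_version(upper):
        return False
    return True


# datasets>=3.0.0, <=3.0.1
datasets_ok = satisfies("3.0.1", lower="3.0.0", upper="3.0.1")
datasets_too_new = satisfies("3.1.0", lower="3.0.0", upper="3.0.1")

# pyarrow<=17.0.0
pyarrow_ok = satisfies("17.0.0", upper="17.0.0")
pyarrow_too_new = satisfies("18.0.0", upper="17.0.0")
```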

Documentation Updates

  1. Update the docs listing all datasets supported by the OpenCompass backend #199
  2. Update RAGAS v0.2.5 docs #205

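Chinese Notes

Features

  1. Support the CMMLU benchmark #198
  2. Support the publish workflow #186
  3. Adapt to RAGAS v0.2.5 and update the docs #205
  4. Adapt to MTEB v1.19 #196

Bug Fixes

  1. Set the datasets version to fix a compatibility issue: datasets>=3.0.0, <=3.0.1 #184
  2. Set the pyarrow version to <=17.0.0 to fix an installation issue on OSX #187
  3. Add a timeout when downloading punkt.zip #206

Documentation Updates

  1. Update the docs listing the datasets supported when OpenCompass is used as the backend #199
  2. Update the RAGAS v0.2.5 docs #205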

Release v0.6.0

08 Nov 05:51

Release Notes

  1. Support multi-modal RAG evaluation #149
    • Add CLIP_Benchmark
    • Add end-to-end multi-modal RAG evaluation in Ragas
  2. Add compatibility with Ragas v0.2.3 #165 #171
  3. Support truncating input for CLIP models #163 #164
  4. Support saving knowledge graphs when generating datasets in Ragas #175

Bug Fixes

  1. Fix issue of abnormal metrics during CMTEB evaluation #157
  2. Fix issue of GenerationConfig being None #173
  3. Update datasets version constraints #184
  4. Add publish workflow #186

Documentation Updates

  1. Update VLMEvalKit documentation #166
  2. Update multi-modal RAG blog #172

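Chinese Notes

Features

  1. Add support for multi-modal RAG evaluation #149
    • Support CLIP_Benchmark
    • Support end-to-end multi-modal RAG evaluation in Ragas
  2. Add compatibility with Ragas v0.2.3 #165 #171
  3. Support truncating input for CLIP models #163 #164
  4. Support saving knowledge graphs when generating datasets in Ragas #175

Bug Fixes

  1. Fix abnormal metrics during CMTEB evaluation #157
  2. Fix the error raised when GenerationConfig is None #173
  3. Update the datasets version constraints #184
  4. Add the publish workflow #186

Documentation Updates

  1. Update the VLMEvalKit documentation #166
  2. Update the multi-modal RAG blog #172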

Release v0.5.5

15 Oct 02:57

Release Notes

  1. Added Dataset Support:

    • Enhanced multimodal evaluation capabilities, now supporting MMBench-Video, Video-MME, and MVBench video evaluations #146
    • Added cmb dataset #117
  2. Support for LongBench-write quality evaluation of long text generation #136

  3. Automatic downloading of punkt_tab.zip from nltk #140

  4. Support for RAG evaluation #127:

    • Support for embeddings/reranker evaluation: Integration of MTEB (Massive Text Embedding Benchmark) and CMTEB (Chinese Massive Text Embedding Benchmark), supporting tasks such as retrieval and reranking
    • Support for end-to-end RAG evaluation: Integration of the ragas framework, supporting automatic generation of evaluation datasets and evaluation based on judge models
  5. Documentation Updates:

  6. Updated dependencies: nltk>=3.9 and rouge-score>=0.1.0 #145, #143

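Chinese Notes

  1. Added dataset support:

    • Improved multimodal evaluation, now supporting MMBench-Video, Video-MME, and MVBench video evaluations #146
    • Added the cmb dataset #117
  2. Support for LongBench-write quality evaluation of long text generation #136

  3. Support for automatically downloading punkt_tab.zip from nltk #140

  4. Support for RAG evaluation: #127

    • Support for embeddings/reranker evaluation: integrates MTEB (Massive Text Embedding Benchmark) and CMTEB (Chinese Massive Text Embedding Benchmark), supporting tasks such as retrieval and reranking
    • Support for end-to-end RAG evaluation: integrates the ragas framework, supporting automatic generation of evaluation datasets and evaluation based on judge models
  5. Documentation updates

  6. Updated dependencies: nltk>=3.9 and rouge-score>=0.1.0 #145, #143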