
Commit 2ec3fcc

Merge branch 'main' into release/0.9
2 parents: b26103b + 3baaf24

File tree: 97 files changed, +3567 -2515 lines


README.md

Lines changed: 31 additions & 14 deletions
@@ -24,14 +24,16 @@
 > ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!
 
 ## 📋 Contents
-- [Introduction](#introduction)
-- [News](#News)
-- [Installation](#installation)
-- [Quick Start](#quick-start)
+- [Introduction](#-introduction)
+- [News](#-news)
+- [Installation](#️-installation)
+- [Quick Start](#-quick-start)
 - [Evaluation Backend](#evaluation-backend)
-- [Custom Dataset Evaluation](#custom-dataset-evaluation)
-- [Model Serving Performance Evaluation](#Model-Serving-Performance-Evaluation)
-- [Arena Mode](#arena-mode)
+- [Custom Dataset Evaluation](#️-custom-dataset-evaluation)
+- [Model Serving Performance Evaluation](#-model-serving-performance-evaluation)
+- [Arena Mode](#-arena-mode)
+- [Contribution](#️-contribution)
+- [Roadmap](#-roadmap)
 
 
 ## 📝 Introduction
@@ -72,11 +74,15 @@ Please scan the QR code below to join our community groups:
 
 
 ## 🎉 News
+- 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations; refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html). Support for custom mixed-dataset evaluations, allowing more comprehensive model evaluation with less data; refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).
 - 🔥 **[2024.12.13]** Model evaluation optimization: the `--template-type` parameter no longer needs to be passed, and evaluation can be started with `evalscope eval --args`. Refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html) for details.
 - 🔥 **[2024.11.26]** The model inference service performance evaluator has been completely refactored: it now supports starting a local inference service and the Speed Benchmark, and asynchronous call error handling has been improved. For details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).
 - 🔥 **[2024.10.31]** The best practice for evaluating multimodal RAG has been published; see the [📖 Blog](https://evalscope.readthedocs.io/zh-cn/latest/blog/RAG/multimodal_RAG.html#multimodal-rag) for details.
 - 🔥 **[2024.10.23]** Supports multimodal RAG evaluation, including assessment of image-text retrieval using [CLIP_Benchmark](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/clip_benchmark.html), and extends [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html) to support end-to-end multimodal metrics evaluation.
 - 🔥 **[2024.10.8]** Support for RAG evaluation, including independent evaluation of embedding models and rerankers using [MTEB/CMTEB](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation using [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html).
+
+<details><summary>More</summary>
+
 - 🔥 **[2024.09.18]** Our documentation now includes a blog module featuring technical research and discussion related to evaluation. We invite you to [📖 read it](https://evalscope.readthedocs.io/en/refact_readme/blog/index.html).
 - 🔥 **[2024.09.12]** Support for LongWriter evaluation, targeting 10,000+ word generation. You can use the [LongBench-Write](evalscope/third_party/longbench_write/README.md) benchmark to measure long-output quality as well as output length.
 - 🔥 **[2024.08.30]** Support for custom dataset evaluations, including text datasets and multimodal image-text datasets.
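
The 2024.12.13 entry above describes launching an evaluation directly from the CLI without `--template-type`. A minimal sketch of that flow follows; the model ID, dataset names, and `--limit` value are placeholder examples rather than values taken from this commit, and the linked basic-usage guide remains the authoritative reference for the flags:

```bash
# Install EvalScope, then start an evaluation straight from the CLI;
# per the 2024.12.13 change, no --template-type flag is required.
pip install evalscope

# Placeholder model and datasets, for illustration only.
evalscope eval \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --datasets gsm8k arc \
  --limit 5
```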
@@ -88,7 +94,7 @@ Please scan the QR code below to join our community groups:
 - 🔥 **[2024.06.13]** EvalScope seamlessly integrates with the fine-tuning framework SWIFT, providing full-chain support from LLM training to evaluation.
 - 🔥 **[2024.06.13]** Integrated the Agent evaluation dataset ToolBench.
 
-
+</details>
 
 ## 🛠️ Installation
 ### Method 1: Install Using pip
@@ -278,7 +284,7 @@ EvalScope supports using third-party evaluation frameworks to initiate evaluation
 - **ThirdParty**: Third-party evaluation tasks, such as [ToolBench](https://evalscope.readthedocs.io/en/latest/third_party/toolbench.html) and [LongBench-Write](https://evalscope.readthedocs.io/en/latest/third_party/longwriter.html).
 
 
-## Model Serving Performance Evaluation
+## 📈 Model Serving Performance Evaluation
 A stress testing tool focused on large language models, which can be customized to support various dataset formats and different API protocol formats.
 
 Reference: Performance Testing [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html)
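
The stress test described above is driven by the `evalscope perf` entry point. A minimal sketch, assuming a locally served OpenAI-compatible endpoint; the URL, model name, and parameter values below are illustrative placeholders, and flag spellings may differ across versions, so consult the linked stress-test guide for the real options:

```bash
# Hypothetical stress test against a local OpenAI-compatible endpoint.
# All values are placeholders; see the stress-test user guide.
evalscope perf \
  --url "http://127.0.0.1:8000/v1/chat/completions" \
  --model my-served-model \
  --api openai \
  --dataset openqa \
  --number 100 \
  --parallel 8
```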
@@ -303,19 +309,32 @@ Speed Benchmark Results:
 +---------------+-----------------+----------------+
 ```
 
-## Custom Dataset Evaluation
+## 🖊️ Custom Dataset Evaluation
 EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖 User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/index.html)
 
 
-## Arena Mode
+## 🏟️ Arena Mode
 Arena mode allows multiple candidate models to be evaluated through pairwise battles, and you can choose to use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report.
 
 Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)
 
+## 👷‍♂️ Contribution
 
+EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn), is continuously optimizing its benchmark evaluation features! We invite you to refer to the [Contribution Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html) to easily add your own evaluation benchmarks and share your contributions with the community. Let's work together to support the growth of EvalScope and make our tools even better! Join us now!
 
+<a href="https://github.com/modelscope/evalscope/graphs/contributors" target="_blank">
+<table>
+<tr>
+<th colspan="2">
+<br><img src="https://contrib.rocks/image?repo=modelscope/evalscope"><br><br>
+</th>
+</tr>
+</table>
+</a>
 
-## TO-DO List
+## 🔜 Roadmap
+- [ ] Support for better evaluation report visualization
+- [x] Support for mixed evaluations across multiple datasets
 - [x] RAG evaluation
 - [x] VLM evaluation
 - [x] Agents evaluation
@@ -326,8 +345,6 @@ Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)
 - [ ] GAIA
 - [ ] GPQA
 - [x] MBPP
-- [ ] Auto-reviewer
-- [ ] Qwen-max
 
 
 ## Star History

README_zh.md

Lines changed: 34 additions & 16 deletions
@@ -25,14 +25,15 @@
 > ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!
 
 ## 📋 Contents
-- [Introduction](#简介)
-- [News](#新闻)
-- [Installation](#环境准备)
-- [Quick Start](#快速开始)
-- [Using Other Evaluation Backends](#使用其他评测后端)
-- [Custom Dataset Evaluation](#自定义数据集评测)
-- [Arena Mode](#竞技场模式)
-- [Performance Evaluation Tool](#推理性能评测工具)
+- [Introduction](#-简介)
+- [News](#-新闻)
+- [Installation](#️-环境准备)
+- [Quick Start](#-快速开始)
+- [Other Evaluation Backends](#-其他评测后端)
+- [Custom Dataset Evaluation](#-自定义数据集评测)
+- [Arena Mode](#-竞技场模式)
+- [Performance Evaluation Tool](#-推理性能评测工具)
+- [Contribution](#️-贡献)
 
 
 
@@ -78,11 +79,14 @@ EvalScope is also suitable for a variety of evaluation scenarios, such as end-to-end RAG evaluation and arena mode
 
 
 ## 🎉 News
+- 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations; refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/zh-cn/latest/advanced_guides/add_benchmark.html). Support for custom mixed-dataset evaluations, enabling more comprehensive model evaluation with less data; refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/zh-cn/latest/advanced_guides/collection/index.html)
 - 🔥 **[2024.12.13]** Model evaluation optimization: the `--template-type` parameter no longer needs to be passed, and evaluation can be started with `evalscope eval --args`. See the [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/get_started/basic_usage.html)
 - 🔥 **[2024.11.26]** The model inference stress-testing tool has been completely refactored: it supports starting a local inference service and the Speed Benchmark, and asynchronous call error handling has been improved. See the [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test/index.html)
 - 🔥 **[2024.10.31]** The best practice for multimodal RAG evaluation has been published; see the [📖 Blog](https://evalscope.readthedocs.io/zh-cn/latest/blog/RAG/multimodal_RAG.html#multimodal-rag)
 - 🔥 **[2024.10.23]** Supports multimodal RAG evaluation, including assessment of image-text retrieval using [CLIP_Benchmark](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/rageval_backend/clip_benchmark.html), and extends [RAGAS](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/rageval_backend/ragas.html) to support end-to-end multimodal metrics evaluation.
 - 🔥 **[2024.10.8]** Support for RAG evaluation, including independent evaluation of embedding models and rerankers using [MTEB/CMTEB](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation using [RAGAS](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/rageval_backend/ragas.html).
+<details><summary>More</summary>
+
 - 🔥 **[2024.09.18]** Our documentation now includes a blog module featuring technical research and discussion related to evaluation; you are welcome to [📖 read it](https://evalscope.readthedocs.io/zh-cn/latest/blog/index.html)
 - 🔥 **[2024.09.12]** Support for LongWriter evaluation; you can use the [LongBench-Write](evalscope/third_party/longbench_write/README.md) benchmark to evaluate long-output quality as well as output length.
 - 🔥 **[2024.08.30]** Support for custom dataset evaluation, including text datasets and multimodal image-text datasets.
@@ -93,7 +97,7 @@ EvalScope is also suitable for a variety of evaluation scenarios, such as end-to-end RAG evaluation and arena mode
 - 🔥 **[2024.06.29]** Support for **OpenCompass** as a third-party evaluation framework: we provide a high-level wrapper for it that supports pip installation and simplifies evaluation task configuration.
 - 🔥 **[2024.06.13]** EvalScope seamlessly integrates with the fine-tuning framework SWIFT, providing full-chain support from LLM training to evaluation.
 - 🔥 **[2024.06.13]** Integrated the Agent evaluation dataset ToolBench.
-
+</details>
 
 ## 🛠️ Installation
 ### Method 1: Install Using pip
@@ -277,15 +281,15 @@ evalscope eval \
 Reference: [Full Parameter Description](https://evalscope.readthedocs.io/zh-cn/latest/get_started/parameters.html)
 
 
-## Other Evaluation Backends
+## 🧪 Other Evaluation Backends
 EvalScope supports launching evaluation tasks with third-party evaluation frameworks, which we call Evaluation Backends. Currently supported Evaluation Backends:
 - **Native**: EvalScope's own **default evaluation framework**, supporting multiple evaluation modes, including single-model evaluation, arena mode, and baseline-model comparison mode.
 - [OpenCompass](https://github.com/open-compass/opencompass): Launch OpenCompass evaluation tasks through EvalScope. Lightweight, easy to customize, and seamlessly integrated with the LLM fine-tuning framework [ms-swift](https://github.com/modelscope/swift): [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/opencompass_backend.html)
 - [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): Launch VLMEvalKit multimodal evaluation tasks through EvalScope, supporting a variety of multimodal models and datasets, and seamlessly integrated with the LLM fine-tuning framework [ms-swift](https://github.com/modelscope/swift): [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/vlmevalkit_backend.html)
 - **RAGEval**: Launch RAG evaluation tasks through EvalScope, supporting independent evaluation of embedding models and rerankers using [MTEB/CMTEB](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation using [RAGAS](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/rageval_backend/ragas.html): [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/rageval_backend/index.html)
 - **ThirdParty**: Third-party evaluation tasks, such as [ToolBench](https://evalscope.readthedocs.io/zh-cn/latest/third_party/toolbench.html) and [LongBench-Write](https://evalscope.readthedocs.io/zh-cn/latest/third_party/longwriter.html)
 
-## Inference Performance Evaluation Tool
+## 📈 Inference Performance Evaluation Tool
 A stress testing tool focused on large language models, which can be customized to support various dataset formats and different API protocol formats.
 
 Reference: Performance Testing [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test/index.html)
@@ -311,15 +315,30 @@ Speed Benchmark Results:
 ```
 
 
-## Custom Dataset Evaluation
+## 🖊️ Custom Dataset Evaluation
 EvalScope supports custom dataset evaluation. For details, refer to: Custom Dataset Evaluation [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/advanced_guides/custom_dataset/index.html)
 
 
-## Arena Mode
+## 🏟️ Arena Mode
 Arena mode allows multiple candidate models to be evaluated through pairwise battles, and you can choose to use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report. Reference: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/arena.html)
 
+## 👷‍♂️ Contribution
 
-## TO-DO List
+EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn), is continuously optimizing its benchmark evaluation features! We sincerely invite you to consult the [Contribution Guide](https://evalscope.readthedocs.io/zh-cn/latest/advanced_guides/add_benchmark.html) to easily add your own evaluation benchmarks and share your contributions with the community. Let's work together to support the growth of EvalScope and make our tools even better! Join us now!
+
+<a href="https://github.com/modelscope/evalscope/graphs/contributors" target="_blank">
+<table>
+<tr>
+<th colspan="2">
+<br><img src="https://contrib.rocks/image?repo=modelscope/evalscope"><br><br>
+</th>
+</tr>
+</table>
+</a>
+
+## 🔜 Roadmap
+- [ ] Support for better evaluation report visualization
+- [x] Support for mixed evaluations across multiple datasets
 - [x] RAG evaluation
 - [x] VLM evaluation
 - [x] Agents evaluation
@@ -330,8 +349,7 @@ EvalScope supports custom dataset evaluation. For details, refer to: Custom Dataset
 - [ ] GAIA
 - [ ] GPQA
 - [x] MBPP
-- [ ] Auto-reviewer
-- [ ] Qwen-max
+
 
 
 ## Star History
