
MMLU evaluation results differ significantly from the official scores #267

Open
Haruka1307 opened this issue Dec 30, 2024 · 5 comments

Comments

@Haruka1307

[screenshot of the leaderboard] As shown on your official site, the MMLU score for llama2-7b is almost 5 points below the official number. My own evaluation of llama3-8b-instruct only reached 0.6396, which also seems far off. What could be the reason?
@wangxingjun778
Collaborator

Generally speaking, evaluation frameworks such as OpenCompass and lm-evaluation-harness, as well as the official evaluation scripts, differ in prompt construction (including, when few-shot is used, the example-sampling logic and each model's ability to follow the few-shot format), model inference parameters, and answer-parsing logic. These differences ultimately lead to divergent scores. We generally recommend using a single framework when comparing models side by side.
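For concreteness, here is a minimal sketch of the few-shot prompt-construction step where frameworks tend to diverge, modeled on the original hendrycks/test format. The helper names are illustrative assumptions, not this project's actual API.

```python
# A minimal sketch of the standard few-shot MMLU prompt format (following
# the original hendrycks/test evaluation). Helper names are illustrative;
# frameworks diverge in exactly these details.

CHOICES = ["A", "B", "C", "D"]

def format_example(question, options, answer=None):
    # One question block; the gold answer is appended only for dev examples.
    text = question
    for label, option in zip(CHOICES, options):
        text += f"\n{label}. {option}"
    text += "\nAnswer:"
    if answer is not None:
        text += f" {answer}\n\n"
    return text

def build_prompt(subject, dev_examples, test_question, test_options, k=5):
    # Header + k few-shot examples + the test question. Frameworks vary in
    # the header wording, how the k examples are chosen (fixed vs. sampled),
    # the separators, and the value of k itself.
    prompt = (f"The following are multiple choice questions (with answers) "
              f"about {subject}.\n\n")
    for question, options, answer in dev_examples[:k]:
        prompt += format_example(question, options, answer)
    prompt += format_example(test_question, test_options)
    return prompt
```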

@Haruka1307
Author

I read your project's MMLU evaluation source code. The scoring logic picks the choice with the highest logit, which matches the original evaluation's approach, so this should be unrelated to sampling.
[screenshot of the scoring code]
The prompt also appears to match the official one:
[screenshots of the prompt format]
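As a concrete illustration of that scoring scheme, the rough sketch below (the model name and helper function are assumptions, not the project's code) compares the next-token logits of the four choice letters and takes the argmax, so sampling parameters never come into play:

```python
# A rough sketch of logit-argmax MMLU scoring: read the model's next-token
# logits at the "Answer:" position, keep only the four choice letters, and
# take the highest-scoring one. Greedy by construction; no sampling involved.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumption: any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def predict_choice(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    # Token ids of " A", " B", " C", " D"; compare only these four.
    choice_ids = [tokenizer(f" {c}", add_special_tokens=False).input_ids[-1]
                  for c in "ABCD"]
    return "ABCD"[int(logits[choice_ids].argmax())]
```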

@Haruka1307
Author

Could it be that the official evaluation uses micro_avg?
weighted_acc = np.mean(np.concatenate(all_cors))
Your final output is reported as WeightedAverageAccuracy, which might be where the difference comes from. Will you support specifying the evaluation metric as macro, micro, or weighted in the future? Thanks!
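To make the three aggregation schemes concrete, here is a small numeric sketch (the all_cors data is made up): a micro average weights every question equally, which is exactly what the quoted line computes despite the weighted_acc name, while a macro average weights every subject equally.

```python
# Micro vs. macro vs. size-weighted averaging over MMLU subjects.
# all_cors: per-subject arrays of 0/1 correctness, as in the official script.
import numpy as np

all_cors = [np.array([1, 0, 1, 1]),   # subject with 4 questions, acc 0.75
            np.array([0, 1])]         # subject with 2 questions, acc 0.50

# Micro: every question weighted equally. This is what the quoted line
# computes, even though the official script names it weighted_acc.
micro = np.mean(np.concatenate(all_cors))              # 4/6 ~= 0.667

# Macro: every subject weighted equally.
macro = np.mean([c.mean() for c in all_cors])          # (0.75 + 0.5)/2 = 0.625

# Weighted: subject accuracies weighted by question count; identical to
# micro when the weights are the subject sizes.
sizes = [len(c) for c in all_cors]
weighted = np.average([c.mean() for c in all_cors], weights=sizes)  # ~= 0.667

print(micro, macro, weighted)
```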

@Yunnglin
Collaborator

Yunnglin commented Jan 6, 2025

Our final computation is also essentially a micro average; we will support more evaluation metrics going forward.

@Haruka1307
Author

> Our final computation is also essentially a micro average; we will support more evaluation metrics going forward.

But the report shows weighted_acc as the final metric:
[screenshot of the report output]
