[Feat] v0.5 Release Pack #846
Conversation
* Add scibench task (full) and change medqa; run pre-commit (Co-authored-by: pbcong <[email protected]>)
* Add csbench; run pre-commit (Co-authored-by: pbcong <[email protected]>)
* [fix] Batch size in OpenAI-compatible endpoint (#835)
* [Feature] Add WenetSpeech Dataset
* Add first draft of the lmms-eval-0.5 doc; remove unnecessary parts of lmms-eval-0.5.md (Co-authored-by: b8zhong <[email protected]>)
…modal Expansion**, detailing significant new features including:

* A comprehensive **audio evaluation suite** (Step2 Audio Paralinguistic, VoiceBench, WenetSpeech).
* A production-ready **response caching system**.
* Integration of **five new models** (e.g., GPT-4o Audio Preview, Gemma-3).
* Numerous new benchmarks across vision, coding, and STEM domains.
* Support for the **Model Context Protocol (MCP)** and improvements to **Async OpenAI integration**.

…ltimodal Expansion** release, updating the `README.md` and refining the `v0.5` release notes with improved structure and reproducibility validation for the new benchmarks.
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
if unit_prob != unit:
    pred = cal_not(parse_not(pred))
    ans = cal_not((ans, unit))
if len(ans) > 1:
    ans = ans[0]
```
Preserve full numeric answers during scientific notation handling

In `scibench_process_results`, when a scientific-notation unit is detected, the code calls `cal_not` and then immediately does `if len(ans) > 1: ans = ans[0]`. Because `ans` is a string after conversion, `len(ans) > 1` is almost always true, so the value is truncated to a single character (e.g., `"1234.0"` becomes `"1"`). As a result, nearly every converted ground-truth answer is incorrect and evaluation fails even when predictions are accurate. The check should distinguish between a tuple and a string instead of slicing the string down to its first character.
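The suggested guard can be sketched as follows. This is a minimal illustration, not the repository's code: `first_if_sequence` is a hypothetical helper name, and the tuple-vs-string shapes of `cal_not`'s return value are assumptions drawn from the review above.

```python
def first_if_sequence(ans):
    """Unpack only genuine sequences of answers; never slice a string.

    Hypothetical helper illustrating the fix: the buggy code did
    `if len(ans) > 1: ans = ans[0]`, which turns "1234.0" into "1".
    """
    if isinstance(ans, (tuple, list)) and len(ans) > 1:
        return ans[0]
    return ans

# String answers survive intact; tuples still yield their first element.
print(first_if_sequence("1234.0"))
print(first_if_sequence(("1234.0", "unit")))
```

Gating on `isinstance(ans, (tuple, list))` keeps the original unpacking behavior for multi-element results while leaving already-converted string answers untouched.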
* …Eval v0.5 release notes, changing '†' to '+-'.
* Updated metrics and model integration details in the documentation.
* Corrected the model name 'GPT-4o Audio' to 'GPT-4o Audio Preview' in the announcement section.
Before you open a pull request, please check whether a similar issue already exists or has been closed before.
When you open a pull request, please be sure to include the following.
If you hit lint warnings, you can use the following scripts to reformat the code.
Thank you for your contributions!