[EVAL] MultiChallenge #1075
Conversation
cc: @NathanHB, let me know your feedback on this!
Looking very good, thanks for the PR! Only some nits that I think would make the definition better.
Tagging @kdesh0399 and @ekwinox117 in case you want to chime in 🤗
Thanks for the feedback! I have addressed your comments. Let me know if this is okay! P.S. I tried to run the evaluation task, but I ran out of credits 😅 Update: nvm, got it up and running using
Nice work, looks good to me!
NathanHB left a comment:
Only a few quick nits and we can merge!
Thank you for this, it's very helpful 🤗
Pull request overview
This PR integrates the MultiChallenge benchmark, a multi-turn conversational evaluation dataset designed to test LLMs' ability to handle complex conversations with human users. The implementation follows the lighteval framework's patterns for inspect-ai integration, using custom scorers and solvers to evaluate model responses against specific pass criteria with a judge LLM.
Key Changes
- Adds a new task configuration for the MultiChallenge benchmark with judge-based evaluation
- Implements custom scorer using GPT-4o as a judge model to evaluate responses against pass/fail criteria
- Creates a conversation solver to handle multi-turn dialogue context properly (a rough sketch of this scorer/solver pattern is shown below)
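For context, a minimal sketch of what a judge-based scorer and a conversation solver can look like with inspect-ai is shown below. This is not the PR's actual code: the `conversation` metadata layout, the judge prompt, and the default judge model name are illustrative assumptions.

```python
# Illustrative sketch only, not the code from this PR. The metadata layout
# ("conversation" as a list of {"role", "content"} turns) and the judge prompt
# are assumptions for demonstration.
from inspect_ai.model import ChatMessageAssistant, ChatMessageUser, get_model
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import Generate, TaskState, solver


@solver
def conversation_solver():
    """Replay the earlier turns of the dialogue before generating the final reply."""

    async def solve(state: TaskState, generate: Generate) -> TaskState:
        for turn in state.metadata.get("conversation", []):
            msg_cls = ChatMessageUser if turn["role"] == "user" else ChatMessageAssistant
            state.messages.append(msg_cls(content=turn["content"]))
        return await generate(state)

    return solve


@scorer(metrics=[accuracy()])
def judge_scorer(judge_model: str = "openai/gpt-4o"):
    """Ask a judge LLM whether the candidate response satisfies the pass criteria."""

    async def score(state: TaskState, target: Target) -> Score:
        judge = get_model(judge_model)
        prompt = (
            "You are grading a model response against pass criteria.\n"
            f"Pass criteria: {target.text}\n"
            f"Response: {state.output.completion}\n"
            "Reply with YES if the response meets the criteria, otherwise NO."
        )
        verdict = await judge.generate(prompt)
        passed = verdict.completion.strip().upper().startswith("YES")
        return Score(
            value=1.0 if passed else 0.0,
            answer=state.output.completion,
            explanation=verdict.completion,
        )

    return score
```

The actual task presumably uses the MultiChallenge-specific judge prompt and pass criteria rather than this simplified YES/NO check.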
Hey @NathanHB, I've made the changes. Let me know what you think! https://huggingface.co/spaces/akshathmangudi/multi_challenge-gpt

Overview
Resolves #1019
Current status: READY FOR REVIEW
The current PR integrates MultiChallenge, a difficult benchmark that tests the ability of models to handle multi-turn conversations with human users. The PR consists of a single file, multi_challenge.py, which defines a prompt function and loads the task as a configuration with LightEvalTaskConfig. The implementation was also tested with the command
and results have been uploaded to my space here.
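For reference, a task definition along the lines described above might look roughly like the sketch below. The dataset repo id, field names, splits, and metric wiring are placeholders guessed for illustration, and the LightEvalTaskConfig keyword names can vary between lighteval versions, so the actual multi_challenge.py may differ.

```python
# Rough sketch of the general lighteval community-task pattern; the repo id,
# dataset field names, and splits below are placeholders, not the PR's values.
from lighteval.tasks.lighteval_task import LightEvalTaskConfig
from lighteval.tasks.requests import Doc


def multi_challenge_prompt(line: dict, task_name: str = None) -> Doc:
    # Assumed dataset schema: a "conversation" field with the dialogue so far
    # and a "pass_criteria" field describing what a passing answer must do.
    return Doc(
        task_name=task_name,
        query=line["conversation"],
        choices=[""],
        gold_index=0,
        specific={"pass_criteria": line["pass_criteria"]},
    )


multi_challenge = LightEvalTaskConfig(
    name="multi_challenge",
    prompt_function=multi_challenge_prompt,
    suite=["community"],
    hf_repo="ScaleAI/MultiChallenge",  # placeholder dataset id
    hf_subset="default",
    evaluation_splits=["test"],
    generation_size=2048,
    metric=[],  # the judge-based metric/scorer would be registered here
    stop_sequence=[],
)

TASKS_TABLE = [multi_challenge]
```

TASKS_TABLE is the module-level list lighteval scans to discover tasks defined in a custom task file.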