Conversation

@akshathmangudi (Contributor) commented on Nov 21, 2025

Overview

Resolves #1019

Current status: READY FOR REVIEW

This PR integrates MultiChallenge, a challenging benchmark that tests a model's ability to handle multi-turn conversations with human users. It consists of a single file, multi_challenge.py, which defines a prompt function and registers the task as a LightEvalTaskConfig.
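For readers unfamiliar with this pattern, a lighteval custom task of this shape looks roughly like the following. This is a minimal sketch of the general structure, not the PR's actual code: the dataset repo, the `conversation` column name, and the empty metric list are placeholder assumptions.

```python
# Minimal sketch of a lighteval custom task definition.
# Illustrative only: the dataset repo, the "conversation" column,
# and the metric list below are assumptions, not the PR's code.
from lighteval.tasks.lighteval_task import LightEvalTaskConfig
from lighteval.tasks.requests import Doc


def multi_challenge_prompt(line: dict, task_name: str = None) -> Doc:
    # Map one dataset row to the Doc structure lighteval evaluates.
    return Doc(
        task_name=task_name,
        query=line["conversation"],  # hypothetical column name
        choices=[],
        gold_index=0,
    )


multi_challenge = LightEvalTaskConfig(
    name="multi_challenge",
    prompt_function=multi_challenge_prompt,
    suite=["community"],
    hf_repo="org/multi-challenge",  # placeholder dataset repo
    hf_subset="default",
    evaluation_splits=["test"],
    metric=[],  # the real task attaches a judge-based metric here
)

# lighteval discovers community tasks via this module-level table.
TASKS_TABLE = [multi_challenge]
```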

The implementation was also tested with the command

```bash
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" multi_challenge \
    --bundle-dir multi-challenge --repo-id akshathmangudi/multi-challenge-gpt4o \
    --max-samples 10 --public
```

and the results have been uploaded to my Hugging Face Space.

akshathmangudi marked this pull request as ready for review on November 22, 2025, 12:00
@akshathmangudi (Contributor, Author) commented

cc: @NathanHB, let me know your feedback on this!

@NathanHB (Member) left a comment

Looking very good, thanks for the PR!! Only some nits that I think would make the definition better.

Tagging @kdesh0399 and @ekwinox117 in case you want to chime in 🤗

@akshathmangudi (Contributor, Author) commented Nov 25, 2025

Thanks for the feedback!

I have addressed your comments. Let me know if this is okay!

P.S. I tried to run the evaluation task, but I ran out of credits 😅

Update: never mind, got it up and running using openai/gpt-4o. Linked the Space below :)
https://huggingface.co/spaces/akshathmangudi/multi_challenge-gpt

@kdesh0399 commented

Nice work, looks good to me!

@NathanHB (Member) left a comment

Only a few quick nits and we can merge!
Thank you for this, it's very helpful 🤗

Copilot AI review requested due to automatic review settings December 9, 2025 14:58
Copilot AI (Contributor) left a comment

Pull request overview

This PR integrates the MultiChallenge benchmark, a multi-turn conversational evaluation dataset designed to test LLMs' ability to handle complex conversations with human users. The implementation follows the lighteval framework's patterns for inspect-ai integration, using custom scorers and solvers to evaluate model responses against specific pass criteria using a judge LLM.

Key Changes

  • Adds a new task configuration for the MultiChallenge benchmark with judge-based evaluation
  • Implements a custom scorer that uses GPT-4o as a judge model to evaluate responses against pass/fail criteria
  • Creates a conversation solver to handle multi-turn dialogue context properly (both pieces are sketched below)
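
To make the scorer/solver description concrete, here is a rough sketch of what a judge-based scorer and a pass-through conversation solver can look like with inspect-ai. It illustrates the general pattern only; the JUDGE_TEMPLATE wording and the judge_scorer/conversation_solver names are assumptions, not the PR's code.

```python
# Rough sketch of an inspect-ai judge scorer and conversation solver.
# Illustrative of the general pattern, not the PR's actual code.
from inspect_ai.model import get_model
from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import Generate, TaskState, solver

# Hypothetical judge prompt; the real pass criteria come from the dataset.
JUDGE_TEMPLATE = (
    "Given the model's final response, does it satisfy this criterion?\n\n"
    "Criterion: {criterion}\n\nResponse: {response}\n\nAnswer YES or NO."
)


@scorer(metrics=[accuracy()])
def judge_scorer(judge_model: str = "openai/gpt-4o"):
    async def score(state: TaskState, target: Target) -> Score:
        judge = get_model(judge_model)
        prompt = JUDGE_TEMPLATE.format(
            criterion=target.text,             # pass/fail criterion
            response=state.output.completion,  # model's final answer
        )
        verdict = await judge.generate(prompt)
        passed = "YES" in verdict.completion.upper()
        return Score(value=CORRECT if passed else INCORRECT)

    return score


@solver
def conversation_solver():
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # state.messages already holds the prior multi-turn context;
        # generate() produces the assistant's reply to the last user turn.
        return await generate(state)

    return solve
```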


@akshathmangudi (Contributor, Author) commented Dec 9, 2025

Hey @NathanHB, I've made the changes. Let me know what you think!

https://huggingface.co/spaces/akshathmangudi/multi_challenge-gpt
