[EVAL] MultiChallenge #1075
Conversation
cc: @NathanHB, let me know your feedback on this!
Looking very good, thanks for the PR! Only some nits that I think would make the definition better.
Tagging @kdesh0399 and @ekwinox117 in case you want to chime in 🤗
Thanks for the feedback! I have addressed your comments. Let me know if this is okay! P.S. I tried to run the evaluation task, but I ran out of credits 😅 Update: nvm, got it up and running using
Nice work, looks good to me!
NathanHB left a comment:
Only a few quick nits and we can merge!
Thank you for this, it's very helpful 🤗
Pull request overview
This PR integrates the MultiChallenge benchmark, a multi-turn conversational evaluation dataset designed to test LLMs' ability to handle complex conversations with human users. The implementation follows the lighteval framework's patterns for inspect-ai integration, using custom scorers and solvers to evaluate model responses against specific pass criteria with a judge LLM.
Key Changes
- Adds a new task configuration for the MultiChallenge benchmark with judge-based evaluation
- Implements custom scorer using GPT-4o as a judge model to evaluate responses against pass/fail criteria
- Creates a conversation solver to handle multi-turn dialogue context properly (a rough sketch of this scorer/solver pattern is shown below)
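For context, a minimal sketch of what a judge-based scorer and a conversation solver can look like with inspect-ai is shown below. This is not the PR's actual code: the `conversation` metadata layout, the judge prompt, and the default judge model name are illustrative assumptions.

```python
# Illustrative sketch only, not the code from this PR. The metadata layout
# ("conversation" as a list of {"role", "content"} turns) and the judge prompt
# are assumptions for demonstration.
from inspect_ai.model import ChatMessageAssistant, ChatMessageUser, get_model
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import Generate, TaskState, solver


@solver
def conversation_solver():
    """Replay the earlier turns of the dialogue before generating the final reply."""

    async def solve(state: TaskState, generate: Generate) -> TaskState:
        for turn in state.metadata.get("conversation", []):
            msg_cls = ChatMessageUser if turn["role"] == "user" else ChatMessageAssistant
            state.messages.append(msg_cls(content=turn["content"]))
        return await generate(state)

    return solve


@scorer(metrics=[accuracy()])
def judge_scorer(judge_model: str = "openai/gpt-4o"):
    """Ask a judge LLM whether the candidate response satisfies the pass criteria."""

    async def score(state: TaskState, target: Target) -> Score:
        judge = get_model(judge_model)
        prompt = (
            "You are grading a model response against pass criteria.\n"
            f"Pass criteria: {target.text}\n"
            f"Response: {state.output.completion}\n"
            "Reply with YES if the response meets the criteria, otherwise NO."
        )
        verdict = await judge.generate(prompt)
        passed = verdict.completion.strip().upper().startswith("YES")
        return Score(
            value=1.0 if passed else 0.0,
            answer=state.output.completion,
            explanation=verdict.completion,
        )

    return score
```

The actual task presumably uses the MultiChallenge-specific judge prompt and pass criteria rather than this simplified YES/NO check.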
Hey @NathanHB, I've made the changes. Let me know what you think! https://huggingface.co/spaces/akshathmangudi/multi_challenge-gpt

Overview
Resolves #1019
Current status: READY FOR REVIEW
The current PR integrates MultiChallenge, a difficult benchmark that tests the ability of models to handle multi-turn conversations with human users. The PR consists of a single file, multi_challenge.py, which defines a prompt function and loads the task as a configuration with LightEvalTaskConfig. The implementation was also tested with the command
and results have been uploaded to my space here.
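For reference, a task definition along the lines described above might look roughly like the sketch below. The dataset repo id, field names, splits, and metric wiring are placeholders guessed for illustration, and the LightEvalTaskConfig keyword names can vary between lighteval versions, so the actual multi_challenge.py may differ.

```python
# Rough sketch of the general lighteval community-task pattern; the repo id,
# dataset field names, and splits below are placeholders, not the PR's values.
from lighteval.tasks.lighteval_task import LightEvalTaskConfig
from lighteval.tasks.requests import Doc


def multi_challenge_prompt(line: dict, task_name: str = None) -> Doc:
    # Assumed dataset schema: a "conversation" field with the dialogue so far
    # and a "pass_criteria" field describing what a passing answer must do.
    return Doc(
        task_name=task_name,
        query=line["conversation"],
        choices=[""],
        gold_index=0,
        specific={"pass_criteria": line["pass_criteria"]},
    )


multi_challenge = LightEvalTaskConfig(
    name="multi_challenge",
    prompt_function=multi_challenge_prompt,
    suite=["community"],
    hf_repo="ScaleAI/MultiChallenge",  # placeholder dataset id
    hf_subset="default",
    evaluation_splits=["test"],
    generation_size=2048,
    metric=[],  # the judge-based metric/scorer would be registered here
    stop_sequence=[],
)

TASKS_TABLE = [multi_challenge]
```

TASKS_TABLE is the module-level list lighteval scans to discover tasks defined in a custom task file.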