[EVAL] Long Horizon Execution #1074
base: main
Conversation
cc: @NathanHB

Looking good! Will run locally and review today or at the start of next week :)

I ran the benchmark on HF Inference: https://huggingface.co/spaces/akshathmangudi/lhe-gpt4o-single
…hteval into akshath/issue-1056-v2
NathanHB left a comment
Hey! Thanks for the hard work on this, I'm testing it locally right now. I have some small nits, but it's looking almost ready!
src/lighteval/tasks/tasks/long_horizon_execution/single_turn.py (outdated review thread, resolved)
NathanHB left a comment
Tested on single-turn; it works great with the few nits I added above. However, I can't seem to make multi-turn work. Can you ping me when it's ready?
@NathanHB it should be working now; I've created a link below that tests both single- and multi-turn.

Hey @akshathmangudi, that's amazing!!

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Sorry! It was private; I've made it public now :)

Great! Maybe I'm mistaken, but I only see the single-turn eval?

Hey @akshathmangudi, we are planning a release this week and would love for the tasks you started implementing to be in it. I was just wondering if you were planning on finishing those, or if I could take over? Thanks! 🤗

Hey @NathanHB! Sorry, I've been traveling all week. I'll have some space today and tomorrow; since a lot of the comments are nits and things I accidentally overlooked (sorry for that), I'll get them ready ASAP!

I've updated the space to include multi-turn evaluation: https://huggingface.co/spaces/akshathmangudi/lhe-gpt. Please let me know if any changes need to be made 🤗
Pull request overview
This PR implements the Long Horizon Execution benchmark for evaluating language models' ability to maintain state and perform cumulative operations over long sequences. The implementation follows the approach of the underlying research paper, with both single-turn (process all keys at once) and multi-turn (incremental key processing) evaluation modes.
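To make the task concrete, here is a minimal sketch of the ground-truth computation the benchmark is built around; the dictionary, keys, and helper name are illustrative assumptions, not code from the PR:

```python
# Toy version of the long-horizon task: the model must keep a running sum
# of dictionary values for the keys it is given. All names/values here are
# hypothetical, for illustration only.

KV_DICT = {"apple": 3, "banana": 7, "cherry": 2, "date": 5}
KEYS = ["apple", "cherry", "banana", "cherry"]

def expected_running_sums(keys, kv, k=1):
    """Ground-truth running sum after each turn of k keys."""
    sums, total = [], 0
    for i in range(0, len(keys), k):
        for key in keys[i : i + k]:
            total += kv[key]
        sums.append(total)
    return sums

# Single-turn mode asks for the final sum over all keys at once;
# multi-turn mode checks the model's answer after each group of k keys.
print(expected_running_sums(KEYS, KV_DICT, k=2))  # [5, 14]
```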
Key Changes
- Added a complete task implementation with support for 7 context sizes (1024–65536) and 3 turn complexities (K=1, 2, 10)
- Implemented custom answer-tag parsing scorers for extracting <answer>-formatted responses (see the parsing sketch after this list)
- Used a binary search optimization to fit the maximum number of items within prompt-length constraints
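A scorer of this kind presumably extracts the model's answer along these lines; this is a sketch only, and the PR's actual parsing code may differ in its regex and edge-case handling:

```python
import re

# Take the last <answer>...</answer> span, so earlier tags inside any
# chain-of-thought text do not confuse the scorer.
ANSWER_RE = re.compile(r"<answer>\s*(-?\d+)\s*</answer>", re.IGNORECASE)

def extract_answer(completion: str):
    """Return the integer in the last <answer> tag, or None if absent."""
    matches = ANSWER_RE.findall(completion)
    return int(matches[-1]) if matches else None

assert extract_answer("The total is <answer>42</answer>.") == 42
assert extract_answer("no tags here") is None
```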
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/lighteval/tasks/tasks/long_horizon_execution/constants.py | Defines prompt templates and configuration constants for context sizes and turn complexities |
| src/lighteval/tasks/tasks/long_horizon_execution/utils.py | Implements binary search logic (sketched below the table) and prompt-building functions for both single- and multi-turn modes |
| src/lighteval/tasks/tasks/long_horizon_execution/main.py | Provides the single-turn task implementation with a scorer and creates task configurations |
| src/lighteval/tasks/tasks/long_horizon_execution/multi_turn.py | Implements multi-turn evaluation with conversation state tracking and fractional accuracy scoring |
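The binary search described for utils.py presumably works along these lines: treat "the prompt fits the length budget" as a monotone predicate over the number of included items and search for the largest count that satisfies it. This is a sketch under assumed names; `build_prompt` and `count_tokens` stand in for whatever the PR actually uses:

```python
def max_items_within_budget(items, build_prompt, count_tokens, budget):
    """Largest n such that the prompt built from items[:n] fits the budget.

    Assumes prompt length grows monotonically with n, which makes the
    fit/doesn't-fit predicate binary-searchable.
    """
    lo, hi = 0, len(items)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward so the loop terminates
        if count_tokens(build_prompt(items[:mid])) <= budget:
            lo = mid       # mid items still fit; try to fit more
        else:
            hi = mid - 1   # too long; shrink
    return lo

# Toy usage with a whitespace "tokenizer" as a stand-in:
items = [f"key{i}: {i}" for i in range(1000)]
n = max_items_within_budget(
    items,
    build_prompt=lambda xs: "Dictionary:\n" + "\n".join(xs),
    count_tokens=lambda s: len(s.split()),
    budget=256,
)
```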
Comments suppressed due to low confidence (2)

src/lighteval/tasks/tasks/long_horizon_execution/utils.py:130
Surplus named argument for string format: an argument named 'num_keys' is provided, but it is not required by the format string

```
You are an AI assistant. I will provide you with a dictionary and then give you keys in groups of {k}.
Your task is to keep a running total (starting from 0) by adding the values associated with the keys I provide.
In each turn, I'll provide {k} keys (comma-separated).
Respond with the current running sum, enclosed in <answer></answer> tags.
Dictionary to maintain:
{dict_str}
Ready to start!
User: {keys_str}
Assistant:
```

```python
return PROMPT_TEMPLATE_MULTI_START.format(
    dict_str=dict_str, keys_str=keys_str, k=k, num_keys=len(first_turn_keys)
)
```

src/lighteval/tasks/tasks/long_horizon_execution/utils.py:194
Surplus named argument for string format: an argument named 'num_keys' is provided, but it is not required by the same format string.

```python
initial_prompt = PROMPT_TEMPLATE_MULTI_START.format(
    dict_str=dict_str, keys_str=first_turn_keys_str, k=k, num_keys=len(turn_chunks[0])
)
```
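For context on why the benchmark still ran despite these warnings: `str.format` silently ignores surplus keyword arguments, so the dead `num_keys=` argument is harmless at runtime, and the clean fix is simply to drop it (or to add a `{num_keys}` placeholder to the template if the count was meant to appear). A quick demonstration:

```python
template = "I will provide you with keys in groups of {k}."

# Surplus keyword arguments are silently ignored by str.format,
# which is why the extra num_keys= went unnoticed at runtime.
print(template.format(k=2, num_keys=10))  # "...groups of 2."

# A *missing* argument, by contrast, raises immediately:
try:
    template.format(num_keys=10)
except KeyError as err:
    print("KeyError:", err)  # KeyError: 'k'
```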
It seems there are a few valid nits that Copilot has raised; I'll be fixing them in a few hours.

Hey @NathanHB, I've addressed almost all the comments and verified that the benchmark runs. Let me know if there's anything else to address :)
I screwed up my previous git clone, so I had to redo the changes 😅
Description:
The approach is described in #1056.

Tasks:
- Task implementation in /tasks/tasks/long_horizon_execution.py
- Answers extracted via <answer> tags

STATUS: ready for review.
Current behavior:
When we run `lighteval tasks inspect long_horizon_execution`, the output is shown below: