Add GPQA for instruct models #534
Conversation
    repo_id=repo_id,
    path_or_fileobj=BytesIO(results_json.encode("utf-8")),
    path_in_repo=f"{result_file_base_name}.json",
    repo_type="dataset",
)
logger.info(f"Uploaded evaluation details to {url}")
Added this so it's faster to click through to the final dataset in the logs
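For reference, a minimal standalone sketch of what this upload-and-log step could look like with `huggingface_hub`. The repo id, file name, and the way `url` is derived are assumptions for illustration, not code from this diff:

```python
from io import BytesIO

from huggingface_hub import HfApi, hf_hub_url

# Hypothetical values; in the PR these come from the evaluation tracker's state.
repo_id = "my-org/details_my-model"
result_file_base_name = "results_2025-01-01T00-00-00"
results_json = '{"results": {}}'

api = HfApi()
api.upload_file(
    repo_id=repo_id,
    path_or_fileobj=BytesIO(results_json.encode("utf-8")),
    path_in_repo=f"{result_file_base_name}.json",
    repo_type="dataset",
)

# Build a clickable link to the uploaded file so it is easy to find in the logs.
url = hf_hub_url(repo_id, f"{result_file_base_name}.json", repo_type="dataset")
print(f"Uploaded evaluation details to {url}")
```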
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
evaluation_splits=["train"],
few_shots_split=None,
few_shots_select="random_sampling",
generation_size=32_000,
Is it possible to override this at runtime? I've set it to 32k to accommodate the new reasoning models, but let me know if you prefer a saner default!
I would be OK with you defining 2 versions, a base and a reasoning one
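To illustrate that suggestion, here is a rough sketch of two task variants that differ only in their generation budget, reusing the field names from the diff above. The prompt function is a placeholder, the metric list is left empty, and the exact `LightevalTaskConfig` constructor arguments may differ between lighteval versions:

```python
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def gpqa_prompt_stub(line: dict, task_name: str = "gpqa") -> Doc:
    # Placeholder prompt function; the real one is discussed further down the thread.
    return Doc(task_name=task_name, query=line["Question"], choices=["A", "B", "C", "D"], gold_index=0)


def make_gpqa_diamond(name: str, generation_size: int) -> LightevalTaskConfig:
    return LightevalTaskConfig(
        name=name,
        suite=["lighteval"],
        prompt_function=gpqa_prompt_stub,
        hf_repo="Idavidrein/gpqa",
        hf_subset="gpqa_diamond",
        evaluation_splits=["train"],
        few_shots_split=None,
        few_shots_select="random_sampling",
        generation_size=generation_size,
        metric=[],  # the PR's actual metric would go here
        stop_sequence=[],
    )


# A "base" variant with a saner default budget and a "reasoning" variant
# with a long budget for models that think step by step before answering.
gpqa_diamond = make_gpqa_diamond("gpqa:diamond", generation_size=2_048)
gpqa_diamond_reasoning = make_gpqa_diamond("gpqa:diamond_reasoning", generation_size=32_000)
```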
Why do you need to add it to extended? It would fit in the base evals imo
OK will move!
lgtm, but I would prefer to put it in the base metrics
choices = [line["Incorrect Answer 1"], line["Incorrect Answer 2"], line["Incorrect Answer 3"]]
choices.insert(gold_index, line["Correct Answer"])

instruction = "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering."
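Putting the excerpt together, a sketch of what the full prompt function might look like. The query layout, the letter choices, and the `Doc` fields are assumptions based on the lines above rather than the exact code in this PR:

```python
import random

from lighteval.tasks.requests import Doc


def gpqa_instruct(line: dict, task_name: str = "gpqa") -> Doc:
    # Shuffle where the correct answer lands so the model cannot exploit position bias.
    gold_index = random.randint(0, 3)
    choices = [line["Incorrect Answer 1"], line["Incorrect Answer 2"], line["Incorrect Answer 3"]]
    choices.insert(gold_index, line["Correct Answer"])

    instruction = (
        "Answer the following multiple choice question. The last line of your response "
        "should be of the following format: 'Answer: $LETTER' (without quotes) where "
        "LETTER is one of ABCD. Think step by step before answering."
    )
    options = "\n".join(f"{letter}) {choice}" for letter, choice in zip("ABCD", choices))
    query = f"{instruction}\n\n{line['Question']}\n\n{options}"

    return Doc(
        task_name=task_name,
        query=query,
        choices=["A", "B", "C", "D"],
        gold_index=gold_index,
        instruction=instruction,
    )
```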
@clefourrier @NathanHB how much would you like me to iterate on this prompt? I saw Llama 3 has a fairly detailed one I can test, but I'm unsure if it's just optimised for Llama: https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-Instruct-evals/viewer/Llama-3.1-8B-Instruct-evals__gpqa__details?row=0
I think if we can replicate DeepSeek's number with this one it's fine, but feel free to test!
This PR adds GPQA for instruct models, along with support for all 3 subsets: main, extended, diamond.
Usage:
Here are the reference scores from the DeepSeek-R1 paper:

Here's what I get with this implementation on the `diamond` subset and greedy decoding (looks OK within 1-2 std dev, plus DeepSeek estimated `pass@1` from 64 generations):