
Add GPQA for instruct models #534

Merged: 7 commits into main on Feb 5, 2025
Conversation

@lewtun (Member) commented on Feb 4, 2025:

This PR adds GPQA for instruct models, along with support for all 3 subsets: main, extended, diamond.

Usage:

```bash
MODEL_ARGS="pretrained=meta-llama/Llama-3.2-3B-Instruct,dtype=float16,tensor_parallel_size=1,max_model_length=32768,gpu_memory_utilisation=0.8"
lighteval vllm $MODEL_ARGS "lighteval|gpqa:main|0|0" --use-chat-template
```

Swap `gpqa:main` for `gpqa:extended` or `gpqa:diamond` to run the other subsets.

Here are the reference scores from the DeepSeek-R1 paper:

[Screenshot: reference scores from the DeepSeek-R1 paper]

Here's what I get with this implementation on the diamond subset with greedy decoding (the scores look OK, within 1-2 standard deviations of the reference; note that DeepSeek estimated pass@1 from 64 generations, so an exact match isn't expected):

| Model | Metric | Value | Stderr |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | extractive_match | 0.3434 | 0.0336 |
| DeepSeek-R1-Distill-Qwen-7B | extractive_match | 0.4545 | 0.0356 |
| DeepSeek-R1-Distill-Llama-8B | extractive_match | 0.5152 | 0.0354 |
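
For reference, `extractive_match` credits a response when the answer letter extracted from the final `Answer: $LETTER` line matches the gold choice. Here is a minimal illustration of that style of extraction (a sketch only, not lighteval's actual parser, which is considerably more robust):

```python
import re

def extract_answer_letter(response: str) -> str | None:
    """Pull the last 'Answer: X' letter (A-D) out of a model response."""
    matches = re.findall(r"Answer:\s*\(?([ABCD])\)?", response)
    return matches[-1] if matches else None  # take the final occurrence

# A step-by-step response ending in the required format
assert extract_answer_letter("The key insight is ...\nAnswer: C") == "C"
```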

```python
# (diff excerpt; the enclosing upload_file call is reconstructed here for context)
url = self.api.upload_file(
    repo_id=repo_id,
    path_or_fileobj=BytesIO(results_json.encode("utf-8")),
    path_in_repo=f"{result_file_base_name}.json",
    repo_type="dataset",
)
logger.info(f"Uploaded evaluation details to {url}")
```
lewtun (Member, Author): Added this so it's faster to click through to the final dataset in the logs.

@HuggingFaceDocBuilderDev (Collaborator):

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

```python
evaluation_splits=["train"],
few_shots_split=None,
few_shots_select="random_sampling",
generation_size=32_000,  # max new tokens; sized for long-reasoning models
```
lewtun (Member, Author): Is it possible to override this at runtime? I've set it to 32k to accommodate the new reasoning models, but let me know if you prefer a saner default!

Member: I would be OK with you defining two versions: a base one and a reasoning one.
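
A hypothetical sketch of how the two variants could share everything except the generation budget, reusing the fields from the diff above (the names and the 2,048-token base default are illustrative assumptions, not the PR's actual code):

```python
# Shared task settings, as in the diff above
COMMON = dict(
    evaluation_splits=["train"],
    few_shots_split=None,
    few_shots_select="random_sampling",
)

# Base variant: a conventional generation budget (assumed default)
GPQA_BASE = {**COMMON, "generation_size": 2_048}

# Reasoning variant: room for long chains of thought
GPQA_REASONING = {**COMMON, "generation_size": 32_000}
```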

Member: Why do you need to add it to extended? It would fit in the base evals, IMO.

lewtun (Member, Author): OK, will move!

@clefourrier (Member) left a review:

LGTM, but I would prefer to put it in the base metrics.

```python
choices = [line["Incorrect Answer 1"], line["Incorrect Answer 2"], line["Incorrect Answer 3"]]
choices.insert(gold_index, line["Correct Answer"])  # gold_index is set earlier in the prompt function

instruction = "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering."
```
lewtun (Member, Author): @clefourrier @NathanHB how much would you like me to iterate on this prompt? I saw Llama 3 has a fairly detailed one I can test, but I'm unsure if it's just optimised for Llama: https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-Instruct-evals/viewer/Llama-3.1-8B-Instruct-evals__gpqa__details?row=0

Member: I think if we can replicate DeepSeek's numbers with this one, it's fine, but feel free to test!
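
To make the prompt discussion concrete, here is a minimal sketch of how the pieces from the diff could fit together into a full prompt builder. The function name, the random gold placement, and the assembled query layout are assumptions for illustration, not the PR's exact code:

```python
import random

INSTRUCTION = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'Answer: $LETTER' (without "
    "quotes) where LETTER is one of ABCD. Think step by step before answering."
)

def build_gpqa_prompt(line: dict) -> tuple[str, int]:
    """Assemble an instruct-style GPQA query; returns (query, gold_index)."""
    # Hypothetical: place the correct answer at a random position among the distractors
    gold_index = random.randint(0, 3)
    choices = [line["Incorrect Answer 1"], line["Incorrect Answer 2"], line["Incorrect Answer 3"]]
    choices.insert(gold_index, line["Correct Answer"])
    lettered = "\n".join(f"{letter}) {choice}" for letter, choice in zip("ABCD", choices))
    return f"{INSTRUCTION}\n\n{line['Question']}\n\n{lettered}", gold_index
```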

@lewtun merged commit 1ce7331 into main on Feb 5, 2025 (4 checks passed).