
Add GPQA for instruct models #534

Merged: 7 commits into main on Feb 5, 2025
Conversation

@lewtun (Member) commented on Feb 4, 2025:

This PR adds GPQA for instruct models, along with support for all 3 subsets: main, extended, diamond.

Usage:

```bash
MODEL_ARGS="pretrained=meta-llama/Llama-3.2-3B-Instruct,dtype=float16,tensor_parallel_size=1,max_model_length=32768,gpu_memory_utilisation=0.8"
lighteval vllm $MODEL_ARGS "lighteval|gpqa:main|0|0" --use-chat-template
```

Swap `gpqa:main` for `gpqa:extended` or `gpqa:diamond` to run the other subsets.

Here are the reference scores from the DeepSeek-R1 paper:

[Screenshot: reference scores from the DeepSeek-R1 paper]

Here's what I get with this implementation on the diamond subset with greedy decoding (the scores look OK, within 1-2 standard deviations of the reference; note that DeepSeek estimated pass@1 from 64 generations, so an exact match isn't expected):

| Model | Metric | Value | Stderr |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | extractive_match | 0.3434 | 0.0336 |
| DeepSeek-R1-Distill-Qwen-7B | extractive_match | 0.4545 | 0.0356 |
| DeepSeek-R1-Distill-Llama-8B | extractive_match | 0.5152 | 0.0354 |
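
For reference, `extractive_match` credits a response when the answer letter extracted from the final `Answer: $LETTER` line matches the gold choice. Here is a minimal illustration of that style of extraction (a sketch only, not lighteval's actual parser, which is considerably more robust):

```python
import re

def extract_answer_letter(response: str) -> str | None:
    """Pull the last 'Answer: X' letter (A-D) out of a model response."""
    matches = re.findall(r"Answer:\s*\(?([ABCD])\)?", response)
    return matches[-1] if matches else None  # take the final occurrence

# A step-by-step response ending in the required format
assert extract_answer_letter("The key insight is ...\nAnswer: C") == "C"
```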

```python
# (diff excerpt; the enclosing upload_file call is reconstructed here for context)
url = self.api.upload_file(
    repo_id=repo_id,
    path_or_fileobj=BytesIO(results_json.encode("utf-8")),
    path_in_repo=f"{result_file_base_name}.json",
    repo_type="dataset",
)
logger.info(f"Uploaded evaluation details to {url}")
```
lewtun (Member, Author): Added this so it's faster to click through to the final dataset in the logs.

@HuggingFaceDocBuilderDev (Collaborator):

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

```python
evaluation_splits=["train"],
few_shots_split=None,
few_shots_select="random_sampling",
generation_size=32_000,  # max new tokens; sized for long-reasoning models
```
lewtun (Member, Author): Is it possible to override this at runtime? I've set it to 32k to accommodate the new reasoning models, but let me know if you prefer a saner default!

Member: I would be OK with you defining two versions: a base one and a reasoning one.
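
A hypothetical sketch of how the two variants could share everything except the generation budget, reusing the fields from the diff above (the names and the 2,048-token base default are illustrative assumptions, not the PR's actual code):

```python
# Shared task settings, as in the diff above
COMMON = dict(
    evaluation_splits=["train"],
    few_shots_split=None,
    few_shots_select="random_sampling",
)

# Base variant: a conventional generation budget (assumed default)
GPQA_BASE = {**COMMON, "generation_size": 2_048}

# Reasoning variant: room for long chains of thought
GPQA_REASONING = {**COMMON, "generation_size": 32_000}
```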

Member: Why do you need to add it to extended? It would fit in the base evals, IMO.

lewtun (Member, Author): OK, will move!

@clefourrier (Member) left a review:

LGTM, but I would prefer to put it in the base metrics.

```python
choices = [line["Incorrect Answer 1"], line["Incorrect Answer 2"], line["Incorrect Answer 3"]]
choices.insert(gold_index, line["Correct Answer"])  # gold_index is set earlier in the prompt function

instruction = "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering."
```
lewtun (Member, Author): @clefourrier @NathanHB how much would you like me to iterate on this prompt? I saw Llama 3 has a fairly detailed one I can test, but I'm unsure if it's just optimised for Llama: https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-Instruct-evals/viewer/Llama-3.1-8B-Instruct-evals__gpqa__details?row=0

Member: I think if we can replicate DeepSeek's numbers with this one, it's fine, but feel free to test!
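
To make the prompt discussion concrete, here is a minimal sketch of how the pieces from the diff could fit together into a full prompt builder. The function name, the random gold placement, and the assembled query layout are assumptions for illustration, not the PR's exact code:

```python
import random

INSTRUCTION = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'Answer: $LETTER' (without "
    "quotes) where LETTER is one of ABCD. Think step by step before answering."
)

def build_gpqa_prompt(line: dict) -> tuple[str, int]:
    """Assemble an instruct-style GPQA query; returns (query, gold_index)."""
    # Hypothetical: place the correct answer at a random position among the distractors
    gold_index = random.randint(0, 3)
    choices = [line["Incorrect Answer 1"], line["Incorrect Answer 2"], line["Incorrect Answer 3"]]
    choices.insert(gold_index, line["Correct Answer"])
    lettered = "\n".join(f"{letter}) {choice}" for letter, choice in zip("ABCD", choices))
    return f"{INSTRUCTION}\n\n{line['Question']}\n\n{lettered}", gold_index
```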

@lewtun merged commit 1ce7331 into main on Feb 5, 2025 (4 checks passed).