Conversation

@akshathmangudi (Contributor)

Overview

Resolves #740.

This PR consists of a single-file implementation, scicode.py, which uses SciCode's prompt structure found in background_comment_template.txt.

STATUS: ready for review.

HF space link: https://huggingface.co/spaces/akshathmangudi/gpt-4o-scicode

@akshathmangudi (Contributor, Author)

cc: @NathanHB would love your feedback on this.

@NathanHB (Member) left a comment

Hey! Looking nice :)
Only thing is that it only checks whether the code has been generated?

I think you can also reuse what is already there: https://github.com/scicode-bench/SciCode/blob/main/eval/inspect_ai/scicode.py

And give credit / ask them of course :)

Copilot AI review requested due to automatic review settings December 14, 2025 14:15
@akshathmangudi (Contributor, Author) commented Dec 14, 2025

hey @NathanHB,

thanks for pointing out the issue with how the benchmark checks the generated code; it seems I didn't dig into the original codebase enough.

I noticed that, for the target variables, the benchmark uses an HDF5 (.h5) file that isn't publicly accessible either through HuggingFace datasets or in the repository; I had to find it by browsing SciCode's commit history, where I found a Drive link.

I uploaded the h5 file to HuggingFace datasets; you can view it here: scicode-files.

That said, it may not be the most professional choice to keep a repository's supplementary files hosted under a personal account.
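
For reference, fetching that file programmatically could look roughly like the sketch below (the dataset repo id and filename are assumptions for illustration, not necessarily what the PR uses):

from huggingface_hub import hf_hub_download

# Minimal sketch: download the SciCode targets file from the dataset repo
# mentioned above. repo_id and filename are assumed for illustration.
h5_path = hf_hub_download(
    repo_id="akshathmangudi/scicode-files",
    filename="test_data.h5",
    repo_type="dataset",
)
print(h5_path)  # local cache path of the downloaded HDF5 file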

I also integrated multi-step evaluation, in contrast to the single sub-step evaluation I initially implemented, which wasn't faithful to the actual benchmark.
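
As a rough sketch of the multi-step idea (the helpers here are illustrative stubs, not the PR's actual API):

# Each sub-step's prompt carries the code generated for earlier sub-steps,
# so later steps can build on earlier ones.
# build_prompt / generate_code are illustrative stubs, not the PR's API.
def build_prompt(step_description, previous_code):
    return f"{previous_code}\n\n# Next sub-step:\n{step_description}"

def generate_code(prompt):
    # Placeholder for the real model call inside the solver.
    return "def placeholder():\n    pass"

sub_steps = ["Implement helper A", "Use helper A to compute quantity B"]
accumulated_code = ""
for step in sub_steps:
    prompt = build_prompt(step, accumulated_code)
    accumulated_code += "\n\n" + generate_code(prompt)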

I tried running the following command to test the benchmark:

lighteval eval openai/gpt-4o scicode \
    --bundle-dir gpt-4o-scicode \
    --repo-id akshathmangudi/gpt-4o-scicode \
    --public

However, my laptop keeps crashing in the middle of the benchmark run, and I'm not able to fix it. Could you please verify the changes and let me know whether it works on your end?

Copilot AI left a comment

Pull request overview

This PR implements support for the SciCode benchmark, a challenging evaluation suite for testing language models' ability to generate code for scientific research problems. The implementation follows the lighteval framework patterns and integrates with the inspect_ai evaluation system.

Key Changes:

  • Adds a complete SciCode evaluation task with multi-step problem solving support
  • Implements custom solver that processes sequential sub-steps with code generation
  • Includes a scoring system that executes generated code against test cases stored in HDF5 format (see the sketch after this list)
  • Adds scipy and h5py dependencies to handle sparse matrices and test data
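
As a rough illustration of the last two points, checking a generated result against a stored HDF5 target could look something like the sketch below (the file path and key layout are assumptions, not the PR's actual scorer):

import h5py
import numpy as np

def matches_target(h5_path, key, model_output):
    # Read the stored reference value for this problem/step and compare it
    # numerically to the model's output. Key naming is assumed for illustration.
    with h5py.File(h5_path, "r") as f:
        target = f[key][()]
    return np.allclose(model_output, target, rtol=1e-5, atol=1e-8)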

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 14 comments.

Summary per file:

  • src/lighteval/tasks/tasks/scicode/__init__.py: Package initialization exporting the scicode task and TASKS_TABLE
  • src/lighteval/tasks/tasks/scicode/main.py: Main task configuration with prompt functions and Sample/Doc conversion
  • src/lighteval/tasks/tasks/scicode/utils.py: Utility functions for code extraction and H5 file downloading from HuggingFace
  • src/lighteval/tasks/tasks/scicode/solver.py: Custom solver implementing multi-step sequential code generation with skip logic
  • src/lighteval/tasks/tasks/scicode/scorer.py: Scorer that executes generated code via subprocess and calculates correctness metrics
  • src/lighteval/tasks/tasks/scicode/prompts.py: Prompt template and generation functions for single- and multi-step problems
  • src/lighteval/tasks/tasks/scicode/parse.py: Parsing utilities for extracting functions, handling HDF5 data structures, and loading test targets
  • pyproject.toml: Adds scipy and h5py dependencies with inline comments explaining their purpose



Development

Successfully merging this pull request may close these issues.

[EVAL] SciCode: research coding benchmark
