Conversation

@akshathmangudi (Contributor)

Overview

Resolves #740.

This PR consists of a single-file implementation, scicode.py, which uses SciCode's prompt structure found in background_comment_template.txt.

STATUS: ready for review.

HF space link: https://huggingface.co/spaces/akshathmangudi/gpt-4o-scicode

@akshathmangudi (Contributor, Author)

cc: @NathanHB would love your feedback on this.

@NathanHB (Member) left a comment

Hey! Looking nice :)
Only thing is that it only checks whether the code has been generated?

I think you can also reuse what is already there: https://github.com/scicode-bench/SciCode/blob/main/eval/inspect_ai/scicode.py

And give credit / ask them of course :)

Copilot AI review requested due to automatic review settings December 14, 2025 14:15
@akshathmangudi (Contributor, Author) commented Dec 14, 2025

hey @NathanHB,

thanks for pointing out the issue with how the benchmark checks the generated code; it seems I didn't dig into the original codebase enough.

I noticed that, for the target variables, the benchmark uses an HDF5 (.h5) file that isn't publicly accessible either through HuggingFace datasets or in the repository; I had to find it by browsing SciCode's commit history, where I found a Drive link.

I uploaded the h5 file to HuggingFace datasets; you can view it here: scicode-files.

That said, it may not be the most professional choice to keep a repository's supplementary files hosted under a personal account.
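
For reference, fetching that file programmatically could look roughly like the sketch below (the dataset repo id and filename are assumptions for illustration, not necessarily what the PR uses):

from huggingface_hub import hf_hub_download

# Minimal sketch: download the SciCode targets file from the dataset repo
# mentioned above. repo_id and filename are assumed for illustration.
h5_path = hf_hub_download(
    repo_id="akshathmangudi/scicode-files",
    filename="test_data.h5",
    repo_type="dataset",
)
print(h5_path)  # local cache path of the downloaded HDF5 file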

I also integrated multi-step evaluation, in contrast to the single sub-step evaluation I initially implemented, which wasn't faithful to the actual benchmark.
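
As a rough sketch of the multi-step idea (the helpers here are illustrative stubs, not the PR's actual API):

# Each sub-step's prompt carries the code generated for earlier sub-steps,
# so later steps can build on earlier ones.
# build_prompt / generate_code are illustrative stubs, not the PR's API.
def build_prompt(step_description, previous_code):
    return f"{previous_code}\n\n# Next sub-step:\n{step_description}"

def generate_code(prompt):
    # Placeholder for the real model call inside the solver.
    return "def placeholder():\n    pass"

sub_steps = ["Implement helper A", "Use helper A to compute quantity B"]
accumulated_code = ""
for step in sub_steps:
    prompt = build_prompt(step, accumulated_code)
    accumulated_code += "\n\n" + generate_code(prompt)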

I tried running the following command to test the benchmark:

lighteval eval openai/gpt-4o scicode \
    --bundle-dir gpt-4o-scicode \
    --repo-id akshathmangudi/gpt-4o-scicode \
    --public

However, my laptop keeps crashing in the middle of the benchmark run, and I'm not able to fix it. Could you please verify the changes and let me know whether it works on your end?

Copilot AI left a comment

Pull request overview

This PR implements support for the SciCode benchmark, a challenging evaluation suite for testing language models' ability to generate code for scientific research problems. The implementation follows the lighteval framework patterns and integrates with the inspect_ai evaluation system.

Key Changes:

  • Adds a complete SciCode evaluation task with multi-step problem solving support
  • Implements custom solver that processes sequential sub-steps with code generation
  • Includes a scoring system that executes generated code against test cases stored in HDF5 format (see the sketch after this list)
  • Adds scipy and h5py dependencies to handle sparse matrices and test data
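
As a rough illustration of the last two points, checking a generated result against a stored HDF5 target could look something like the sketch below (the file path and key layout are assumptions, not the PR's actual scorer):

import h5py
import numpy as np

def matches_target(h5_path, key, model_output):
    # Read the stored reference value for this problem/step and compare it
    # numerically to the model's output. Key naming is assumed for illustration.
    with h5py.File(h5_path, "r") as f:
        target = f[key][()]
    return np.allclose(model_output, target, rtol=1e-5, atol=1e-8)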

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 14 comments.

Summary per file:

  • src/lighteval/tasks/tasks/scicode/__init__.py: Package initialization exporting the scicode task and TASKS_TABLE
  • src/lighteval/tasks/tasks/scicode/main.py: Main task configuration with prompt functions and Sample/Doc conversion
  • src/lighteval/tasks/tasks/scicode/utils.py: Utility functions for code extraction and H5 file downloading from HuggingFace
  • src/lighteval/tasks/tasks/scicode/solver.py: Custom solver implementing multi-step sequential code generation with skip logic
  • src/lighteval/tasks/tasks/scicode/scorer.py: Scorer that executes generated code via subprocess and calculates correctness metrics
  • src/lighteval/tasks/tasks/scicode/prompts.py: Prompt template and generation functions for single- and multi-step problems
  • src/lighteval/tasks/tasks/scicode/parse.py: Parsing utilities for extracting functions, handling HDF5 data structures, and loading test targets
  • pyproject.toml: Adds scipy and h5py dependencies with inline comments explaining their purpose



Development

Successfully merging this pull request may close these issues.

[EVAL] SciCode: research coding benchmark
