[EVAL] SciCode #1086
Conversation
cc: @NathanHB would love your feedback on this.
NathanHB left a comment
Hey! Looking nice :)
Only thing is that it only checks whether the code has been generated?
I think you can also reuse what is already there: https://github.com/scicode-bench/SciCode/blob/main/eval/inspect_ai/scicode.py
And give credit / ask them of course :)
Hey @NathanHB, thanks for flagging the issue with how I was checking whether the benchmark actually runs; it seems I didn't dig into the original codebase enough. I noticed that the benchmark uses an H5 file for the target variables, so I uploaded that file to Hugging Face datasets; you can view it here: scicode-files. (It may not be the most professional decision to keep a repository's supplementary files hosted under a personal account.) I also integrated multi-step evaluation, in contrast to the single sub-step evaluation I initially implemented, which was not faithful to the actual benchmark.

I tried to run the following command to test the benchmark:

```shell
lighteval eval openai/gpt-4o scicode \
    --bundle-dir gpt-4o-scicode \
    --repo-id akshathmangudi/gpt-4o-scicode \
    --public
```

However, my laptop keeps crashing in the middle of the run, and I'm not able to fix it. Could you please verify the changes and let me know whether it works on your end?
Pull request overview
This PR implements support for the SciCode benchmark, a challenging evaluation suite for testing language models' ability to generate code for scientific research problems. The implementation follows the lighteval framework patterns and integrates with the inspect_ai evaluation system.
Key Changes:
- Adds a complete SciCode evaluation task with multi-step problem solving support
- Implements custom solver that processes sequential sub-steps with code generation
- Includes scoring system that executes generated code against test cases stored in HDF5 format
- Adds scipy and h5py dependencies to handle sparse matrices and test data
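To make the scoring approach concrete, here is a minimal sketch of subprocess-based code execution. The function name and structure are hypothetical illustrations, not the PR's actual `scorer.py` (which also loads HDF5 target data):

```python
import pathlib
import subprocess
import sys
import tempfile


def run_candidate(code: str, test: str, timeout: float = 30.0) -> bool:
    """Run generated code plus its test assertions in a fresh interpreter.

    A zero exit status counts as a pass. Hypothetical helper for
    illustration only; the real scorer does more (e.g. HDF5 targets).
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = pathlib.Path(tmp) / "candidate.py"
        # Concatenate the model's code with the test assertions.
        script.write_text(code + "\n" + test + "\n")
        proc = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            timeout=timeout,
        )
        return proc.returncode == 0


good = run_candidate("def square(x):\n    return x * x", "assert square(3) == 9")
bad = run_candidate("def square(x):\n    return x + x", "assert square(3) == 9")
print(good, bad)
```

Running each candidate in a separate interpreter isolates crashes and infinite loops (via the timeout) from the evaluation harness itself.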
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| `src/lighteval/tasks/tasks/scicode/__init__.py` | Package initialization exporting the scicode task and TASKS_TABLE |
| `src/lighteval/tasks/tasks/scicode/main.py` | Main task configuration with prompt functions and Sample/Doc conversion |
| `src/lighteval/tasks/tasks/scicode/utils.py` | Utility functions for code extraction and H5 file downloading from HuggingFace |
| `src/lighteval/tasks/tasks/scicode/solver.py` | Custom solver implementing multi-step sequential code generation with skip logic |
| `src/lighteval/tasks/tasks/scicode/scorer.py` | Scorer that executes generated code via subprocess and calculates correctness metrics |
| `src/lighteval/tasks/tasks/scicode/prompts.py` | Prompt template and generation functions for single and multi-step problems |
| `src/lighteval/tasks/tasks/scicode/parse.py` | Parsing utilities for extracting functions, handling HDF5 data structures, and loading test targets |
| `pyproject.toml` | Adds scipy and h5py dependencies with inline comments explaining their purpose |
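As a rough illustration of the kind of code extraction `utils.py` performs (a sketch with a hypothetical function name, not the file's actual implementation), pulling the first fenced code block out of a model completion can look like:

```python
import re


def extract_code(completion: str) -> str:
    """Return the first fenced code block in a completion.

    Falls back to the raw text when no fence is found. Hypothetical
    sketch; the real utility in the PR may differ.
    """
    fence = "`" * 3  # built programmatically to keep this example readable
    pattern = re.escape(fence) + r"(?:python)?\n(.*?)" + re.escape(fence)
    match = re.search(pattern, completion, re.DOTALL)
    return match.group(1).strip() if match else completion.strip()


fence = "`" * 3
reply = f"Sure:\n{fence}python\ndef add(a, b):\n    return a + b\n{fence}\nHope that helps!"
print(extract_code(reply))
```

The non-greedy `(.*?)` with `re.DOTALL` stops at the first closing fence, so trailing prose after the block is discarded.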
Overview
Resolves #740.
Consists of a single-file implementation, `scicode.py`, which uses SciCode's prompt structure found in `background_comment_template.txt`.

STATUS: ready for review.
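To sketch what the multi-step structure implies, here is a rough illustration of how each sub-step's prompt can carry forward previously generated code. The function name and layout are illustrative assumptions, not the actual template from `background_comment_template.txt`:

```python
def build_step_prompt(problem: str, prior_code: list[str], step_desc: str) -> str:
    """Assemble the prompt for the next sub-step.

    Includes the original problem statement, code accepted for earlier
    sub-steps, and the new sub-step description. A sketch only; the
    real prompt template differs.
    """
    previous = "\n\n".join(prior_code) if prior_code else "# no prior steps"
    return (
        f"{problem}\n\n"
        f"# Code from previous sub-steps:\n{previous}\n\n"
        f"# Implement the next sub-step:\n{step_desc}"
    )


prompt = build_step_prompt(
    "Simulate a damped oscillator.",
    ["def force(x):\n    return -x"],
    "Integrate the equation of motion.",
)
print(prompt)
```

Feeding earlier sub-step code back into later prompts is what distinguishes the multi-step evaluation from scoring each sub-step in isolation.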
HF space link: https://huggingface.co/spaces/akshathmangudi/gpt-4o-scicode