[Alert!] CodeMind is a work in progress. We are actively modifying the code to make it easier for end users to reproduce the results or to add new models/datasets/reasoning tasks. Please open an issue if you run into difficulties using CodeMind.
Solely relying on test passing to evaluate Large Language Models (LLMs) for code synthesis may result in unfair assessment or the promotion of models with data leakage. As an alternative, we introduce CodeMind, a framework designed to gauge the code reasoning abilities of LLMs. CodeMind currently supports three inductive code reasoning tasks: (1) Independent Execution Reasoning (IER), (2) Dependent Execution Reasoning (DER), and (3) Specification Reasoning (SR). Please follow the instructions below to reproduce the results using the existing models, tasks, and datasets. We also support adding new models, tasks, and datasets.
To install all the dependencies, run the following command: pip install -r requirements.txt
CodeMind is designed to read the API keys required for API-access models from environment variables. Please modify and run setup.sh {OPENAI_KEY} (replacing {OPENAI_KEY} with your OpenAI API key) to automatically add the variable to your local machine.
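For reference, here is a minimal sketch of how a script could pick the key up once it is in the environment; the variable name OPENAI_KEY is an assumption, so check setup.sh for the exact name it exports:

```python
import os

# The variable name below is an assumption -- check setup.sh for the
# exact name it exports on your machine.
api_key = os.environ.get("OPENAI_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_KEY is not set; run setup.sh with your key first.")
```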
cd scripts
bash run_IER.sh {MODEL_ID} {CACHE_DIR} {DATASET} {PL}
## Below is the command to run Magicoder on MBPP:
bash run_IER.sh ise-uiuc/Magicoder-S-DS-6.7B ${path_to_store_checkpoints} mbpp Python
## Below is the command to run Magicoder on CodeNet Java:
bash run_IER.sh ise-uiuc/Magicoder-S-DS-6.7B ${path_to_store_checkpoints} CodeNet Java
MODEL_ID: currently, our framework supports the following OpenAI and HuggingFace models: gpt-3.5-turbo, gpt-4-1106-preview, codellama/CodeLlama-13b-Instruct-hf, codellama/CodeLlama-13b-hf, Qwen/CodeQwen1.5-7B-Chat, Qwen/CodeQwen1.5-7B, deepseek-ai/deepseek-coder-6.7b-instruct, deepseek-ai/deepseek-coder-6.7b-base, meta-llama/Llama-2-13b-hf, ise-uiuc/Magicoder-S-DS-6.7B, mistralai/Mistral-7B-Instruct-v0.1, bigcode/starcoder, bigcode/starcoder2-15b, WizardLM/WizardCoder-15B-V1.0
CACHE_DIR: path to store the downloaded pretrained HuggingFace model checkpoints.
DATASET: choose one from the following list: [CodeNet, Avatar, cruxeval, mbpp, humaneval]
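To make the task concrete: IER asks a model to predict the output of a given program for a given input, and the prediction is scored against the ground truth obtained by actually executing the code. Below is a minimal illustrative sketch of that idea, not CodeMind's actual prompt or scoring code:

```python
# Illustrative sketch only -- not CodeMind's actual prompting/scoring code.
# `model` stands for any callable mapping a prompt string to a completion string.
def ier_check(model, code: str, test_input: str, expected_output: str) -> bool:
    prompt = (
        "Given the following code and input, predict the output.\n\n"
        f"Code:\n{code}\n\nInput: {test_input}\nOutput:"
    )
    prediction = model(prompt).strip()
    return prediction == expected_output
```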
cd scripts
bash run_DER.sh {MODEL_ID} {DATASET} {CACHE_DIR} {TASK} [SRC_LANG] [TGT_LANG]
## Below is the command to run DER(TASK=Synthesis) for CodeLlama-instruct on MBPP
bash run_DER.sh codellama/CodeLlama-13b-Instruct-hf mbpp ${path_to_store_checkpoints} Synthesis
## Below is the command to run DER(TASK=Translation) for CodeLlama-instruct on CodeNet
bash run_DER.sh codellama/CodeLlama-13b-Instruct-hf CodeNet ${path_to_store_checkpoints} Translation Java Python
TASK: can be 'Synthesis' (code synthesis) or 'Translation' (code translation)
SRC_LANG and TGT_LANG: optional; required only when running code translation. Our framework currently supports Python and Java.
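For intuition: unlike IER, DER has the model first solve the synthesis or translation task and then reason about the execution of its own generated code. A minimal illustrative sketch, not CodeMind's actual code:

```python
# Illustrative sketch only -- not CodeMind's actual code. In DER, the model
# first solves the synthesis/translation task, then predicts the execution
# behavior of the code it generated itself.
def der_check(model, spec: str, test_input: str, expected_output: str) -> bool:
    generated_code = model(f"Write a program for the following task:\n{spec}")
    prompt = (
        "Predict the output of the following code for the given input.\n\n"
        f"Code:\n{generated_code}\n\nInput: {test_input}\nOutput:"
    )
    return model(prompt).strip() == expected_output
```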
cd scripts
bash run_SR.sh {MODEL_ID} {DATASET} {CACHE_DIR} {TASK} {SR_TYPE} [SRC_LANG] [TGT_LANG]
## Below is the command to run SR(TASK=Synthesis) for Deepseek-coder on MBPP under the 'no_test' setting
bash run_SR.sh deepseek-ai/deepseek-coder-6.7b-instruct mbpp ${path_to_store_checkpoints} Synthesis no_test
## Below is the command to run SR(TASK=Translation) for Deepseek-coder on CodeNet under the 'misleading_test' setting
bash run_SR.sh deepseek-ai/deepseek-coder-6.7b-instruct CodeNet ${path_to_store_checkpoints} Translation misleading_test Java Python
SR_TYPE: can be 'no_test', 'with_test', or 'misleading_test'
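For intuition, the three settings differ only in whether (and which) tests appear in the specification shown to the model. Below is a minimal sketch of how the prompt might vary per setting; this is illustrative only, and the real templates live in the create_prompt_* scripts:

```python
# Illustrative sketch only -- CodeMind's real templates live in the
# create_prompt_* scripts. `test` should be a ground-truth test for
# 'with_test' and a deliberately incorrect one for 'misleading_test'.
def build_sr_prompt(spec: str, sr_type: str, test: str = "") -> str:
    if sr_type == "no_test":
        return f"Write a program for the following task:\n{spec}"
    if sr_type in ("with_test", "misleading_test"):
        return (f"Write a program for the following task:\n{spec}\n"
                f"The program should satisfy this test:\n{test}")
    raise ValueError(f"unknown SR_TYPE: {sr_type}")
```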
You have two options to evaluate a new model using CodeMind:
Option 1: Open an issue on the repo's issue tracker and label it with "new_model." We will resolve the issue by adding the new model for each "new_model" request.
Option 2: You can modify "model_config.json" by adding the properties of the new model, such as its model ID and interface type. You should also modify the "create_prompt_ier.py" and "create_prompt_der.py" scripts, as different models may require additional information in the prompt that is not currently supported by our scripts. See the sketch below for what a config entry could look like.
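As an illustration only, adding an entry could look roughly like the following; the field names and file structure here are assumptions, so mirror an existing entry in model_config.json rather than copying this sketch:

```python
import json

# Hypothetical entry -- mirror the fields of an existing entry in
# model_config.json rather than trusting these exact field names.
new_model = {
    "model_id": "my-org/my-new-model-7b",  # HuggingFace model ID
    "interface": "huggingface",            # e.g. "openai" for API-access models
}

with open("model_config.json") as f:
    config = json.load(f)
# Assumes the file maps model IDs to property dicts; check the actual
# structure of model_config.json before running this.
config[new_model["model_id"]] = new_model
with open("model_config.json", "w") as f:
    json.dump(config, f, indent=2)
```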
To add a new reasoning task to CodeMind, please open an issue on the repo's issue tracker and label it with "new_task." We will provide additional information about how to integrate your new reasoning tasks into CodeMind.
Datasets are stored under the /Dataset directory. If a dataset contains instances from different programming languages, they are split into language-specific sub-directories (similar to the structure below).
+---Avatar
| |
| +---Avatar-java
| |
| +---Avatar-python
|
+---CodeNet
| |
| +---CodeNet-java
| |
| +---CodeNet-python
|
+---CRUXEval
|
+---HumanEval
|
+---MBPP
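As an illustration of this layout, here is a minimal sketch that enumerates the per-language sub-directories of a dataset; CodeMind's own data loaders differ per dataset and are not shown here:

```python
from pathlib import Path

# Enumerate the per-language sub-directories of a multi-language dataset,
# following the layout above (e.g. Dataset/Avatar/Avatar-java).
def language_splits(dataset_root: str) -> list[Path]:
    return sorted(p for p in Path(dataset_root).iterdir() if p.is_dir())

print(language_splits("Dataset/Avatar"))
```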
Given that test data is read differently for each dataset, please open an issue on the repo's issue tracker to add a new dataset to CodeMind.
Interested in reading more about CodeMind, the code reasoning tasks, and a grounded-theory study evaluating LLMs for code reasoning across five benchmarks and two programming languages? Please read the pre-print on arXiv: https://arxiv.org/pdf/2402.09664.pdf
Citation:
@article{liu2024codemind,
title={CodeMind: A Framework to Challenge Large Language Models for Code Reasoning},
author={Liu, Changshu and Zhang, Shizhuo Dylan and Ibrahimzada, Ali Reza and Jabbarvand, Reyhaneh},
journal={arXiv preprint arXiv:2402.09664},
year={2024}
}
We have also uploaded our artifact to Zenodo (DOI: 10.5281/zenodo.10699284).
CodeMind is an open-source project to promote the proper evaluation of LLMs for code-related tasks. If you are interested in building on top of CodeMind and adding more code reasoning tasks, please send an email to {cl144,reyhaneh}@illinois.edu.