EnterpriseBench

We present EnterpriseBench, a new commercially grounded benchmark designed to evaluate the capabilities of AI agents in solving real-world software engineering tasks.

Overview

To address limitations in existing benchmarks, we introduce two versions: one based on the SWE-bench methodology, featuring a curated set of high-quality tasks, and another employing a test-driven development (TDD) paradigm with 147 selected tasks across 3 repositories. Tasks originate from authentic enterprise Jira tickets and cover diverse issue types, including bug fixes and feature implementations. Visual task elements are transformed into textual descriptions using multimodal models. To improve experimentation efficiency, we propose a cost-efficient strategy based on early selection of agent-model pairs using a limited set of repositories. Additionally, we introduce an experimental stub-project methodology and accompanying data to assess agent performance in complex pipeline construction; each stub project offers a stripped-down project skeleton with matching tickets and tests. The benchmark was tested on state-of-the-art AI coding agents. Our dataset is unique in its exclusive use of proprietary commercial data, which prevents answer leakage and avoids contamination from current LLM training sets.

Prerequisites

# Clone the framework
$ git clone https://github.com/exadel-inc/EnterpriseBench.git
$ cd EnterpriseBench

Tool	Version	Notes
Ubuntu	20.04.6 LTS (tested)	Other OSes are not yet supported
Java Development Kit (JDK)	JDK 8 (AuthoringToolKit), JDK 17 (CompreFace), JDK 11 (DynamicMailboxes)	Set `JVM_DIR` to the JDK home if it is not on your PATH
Maven	Apache Maven 3.6.3	Used to build the target repo
Python	3.12	Required for the orchestration scripts
Git	latest	Required for checking out historical commits

🗂️ Note: When working with the AuthoringToolKit repository you must force the benchmark to use Java 8 by adding --java-major 8 to the command line of both 4_run_all_tickets.py and any direct calls to 3_run_ticket_test.py.
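
Before running anything, it can help to confirm that the tools from the table are reachable. The following is a minimal sketch and not part of EnterpriseBench: the function name `find_missing_tools` is hypothetical, the executable names are assumptions, and the check only looks for executables on PATH without verifying versions. If you rely on `JVM_DIR` instead of PATH, a missing `java` entry here is expected.

```python
import shutil

# Executable names assumed from the prerequisites table.
REQUIRED_TOOLS = ("git", "mvn", "java", "python3")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```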

Dataset Preparation

Automatic setup (recommended)

🗂️ Simply run utils/install_dependencies.sh to install the required system dependencies, then utils/prepare_dataverse.sh to download the Harvard Dataverse archive if needed. The latter creates the correct folder layout and renames/unpacks everything exactly as required.

The install_dependencies.sh script will:

Ensure bash, curl, git, and unzip are installed.
Install Apache Maven 3.6.3 if mvn is missing.
Install Java SDKs 8, 11, and 17 (on Ubuntu/Debian; other distros print a hint).
Ensure pip3 is available (installing python3-pip if missing).
Install the pandas Python package.

The prepare_dataverse.sh script will:

Download the Harvard Dataverse archive (DOI 10.7910/DVN/S4WOTJ) if dataverse_files.zip is not already present.
Extract the archive into a clean dataverse_files/ directory – unless that folder already exists and is non‑empty, in which case the script skips all extraction & rename work.
Inside every project subfolder it
• renames *.csv → pr_states.csv
• unpacks patches_neg*/patches_pos* ZIPs into flat patches_neg/ and patches_pos/ folders
• unzips the main repo archive into a flat project_repo/ folder
• creates a jvm symlink pointing to /usr/lib/jvm (so the benchmark finds all installed JDKs).

🗂️ Re‑running the script is idempotent: it detects an existing dataverse_files/ directory and exits without touching your data or reinstalling the JDKs.
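
The skip condition described above (an existing, non-empty dataverse_files/ directory) can be sketched in Python as follows. This is an illustration of the rule only, not the logic of prepare_dataverse.sh itself; the function name `needs_extraction` is hypothetical.

```python
from pathlib import Path


def needs_extraction(target):
    """Return True when `target` is missing or empty, so extraction should run.

    Hypothetical helper mirroring the documented rule: an existing,
    non-empty directory means all extraction and rename work is skipped.
    """
    target = Path(target)
    if not target.is_dir():
        return True
    # any() stops at the first entry, so large directories stay cheap to check.
    return not any(target.iterdir())


if __name__ == "__main__":
    print("extract" if needs_extraction("dataverse_files") else "skip")
```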

After the script finishes, point --project-root at one of the unpacked project sub‑folders (e.g., dataverse_files/CompreFace) and jump straight to the Running the Benchmark section.

Manual setup

EnterpriseBench expects the following artefacts for every benchmark run:

project_root – the root directory of the benchmark project, passed to 4_run_all_tickets.py via the --project-root argument.
pr_states.csv – the mapping between issue/ticket IDs and the commit SHA(s) that resolved them.
project_repo – the full Git history of the benchmark project.
patches_neg/ – negative git diff patches.
patches_ai/ – AI agent git diff patches (default; can be overridden via --ai-patches-dir).

🗂️ Rename / copy your dataset file to pr_states.csv (e.g. dataset_CF_anonymized.csv → pr_states.csv). The scripts look for that exact filename by default.

The directory layout with necessary files should look like this:

project_root/
├── pr_states.csv
├── project_repo/         # cloned target project
├── patches_neg/
│   ├── <ticket1>_non_test.diff
│   └── ...
├── patches_ai/
│   ├── <patch_set1>/
│   │   ├── <ticket1>_non_test.diff
│   │   └── ...
│   └── <patch_set2>/
│       ├── <ticket1>_non_test.diff
│       └── ...
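
When assembling this layout by hand, a quick check for missing entries can save a failed run. The sketch below is an assumption-laden helper, not part of the benchmark: `missing_entries` is a hypothetical name, and the required list covers only the entries named above (patches_ai/ is left out because it is needed only for AI runs).

```python
from pathlib import Path

# Entries taken from the layout above; patches_ai/ is only needed with --ai.
REQUIRED_ENTRIES = ("pr_states.csv", "project_repo", "patches_neg")


def missing_entries(project_root, required=REQUIRED_ENTRIES):
    """Return the required files/folders absent from project_root."""
    root = Path(project_root)
    return [name for name in required if not (root / name).exists()]


if __name__ == "__main__":
    gaps = missing_entries("dataverse_files/CompreFace")
    print("Layout looks complete." if not gaps else f"Missing: {gaps}")
```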

Running the Benchmark

$ python3 4_run_all_tickets.py --project-root dataverse_files/CompreFace

AI patches

The following command applies your AI‑generated patch sets to a benchmark project (CompreFace here); the same invocation works for each of the three projects that ship in the Harvard Dataverse archive.
Use the --ai-patches-dir argument to point at the directory that contains your <ticket>_non_test.diff files. If --ai-patches-dir is omitted, the script defaults to the patches_ai directory within the project root. Multiple AI patch sets are supported: if the AI patch directory contains subfolders, each one is treated as a distinct patch set and processed separately.

$ python3 4_run_all_tickets.py --ai --project-root dataverse_files/CompreFace
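
The patch-set rule (subfolders are separate sets, otherwise the directory itself is one set) can be approximated as below. This is a sketch of the documented behaviour, not the code of 4_run_all_tickets.py; `discover_patch_sets` is a hypothetical name, and the handling of a directory with both subfolders and loose diff files is an assumption.

```python
from pathlib import Path

PATCH_GLOB = "*_non_test.diff"


def discover_patch_sets(patches_dir):
    """Map patch-set name to its sorted list of diff files.

    Assumption: when subfolders exist, loose diff files next to them
    are ignored; the real script may treat that case differently.
    """
    root = Path(patches_dir)
    subdirs = sorted(p for p in root.iterdir() if p.is_dir())
    if subdirs:
        return {d.name: sorted(d.glob(PATCH_GLOB)) for d in subdirs}
    return {root.name: sorted(root.glob(PATCH_GLOB))}


if __name__ == "__main__":
    patches = Path("dataverse_files/CompreFace/patches_ai")
    if patches.is_dir():
        for name, diffs in discover_patch_sets(patches).items():
            print(f"{name}: {len(diffs)} patch(es)")
```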

Golden patches

Place the golden patches in the patches_pos/ directory under the project root (e.g., dataverse_files/CompreFace/patches_pos).

Single patch

$ python3 3_run_ticket_test.py MM-62925 patches_pos/MM-62925_non_test.diff
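
To run several single-patch tests from a script, the command line above can be assembled programmatically. The sketch below only builds the argument list; the helper names are hypothetical, and it assumes the ticket ID is the patch filename without the `_non_test.diff` suffix, as in the example command.

```python
import sys
from pathlib import Path

SUFFIX = "_non_test.diff"


def ticket_from_patch(patch_path):
    """Derive the ticket ID from a `<ticket>_non_test.diff` filename."""
    name = Path(patch_path).name
    if not name.endswith(SUFFIX):
        raise ValueError(f"unexpected patch filename: {name}")
    return name[: -len(SUFFIX)]


def build_ticket_command(patch_path, project_root=None, java_major=None):
    """Build the argv list for 3_run_ticket_test.py (not executed here)."""
    cmd = [sys.executable, "3_run_ticket_test.py",
           ticket_from_patch(patch_path), str(patch_path)]
    if project_root is not None:
        cmd += ["--project-root", str(project_root)]
    if java_major is not None:
        cmd += ["--java-major", str(java_major)]
    return cmd


if __name__ == "__main__":
    print(build_ticket_command("patches_pos/MM-62925_non_test.diff"))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the repository root.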

Measure scores

Results from each run are saved in test_results.csv and in the results/ directory. The helper script 5_measure_scores.py summarizes and displays the results of benchmark runs.

$ python3 5_measure_scores.py dataverse_files/CompreFace
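
For ad hoc inspection of a results file outside 5_measure_scores.py, a generic tally over one CSV column may be enough. This is a sketch under stated assumptions: the schema of test_results.csv is not described here, so the column name is passed in by the caller, and the `status` column used in the usage lines is purely hypothetical.

```python
import csv
from collections import Counter


def tally_column(csv_path, column):
    """Count how often each value appears in `column` of a CSV file.

    Raises KeyError if the column is not present in the header.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return Counter(row[column] for row in reader)


if __name__ == "__main__":
    import os

    path = "dataverse_files/CompreFace/test_results.csv"
    if os.path.exists(path):
        # "status" is a hypothetical column name; check the real header first.
        print(tally_column(path, "status"))
```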

Examples with the public dataverse_files/ dataset

# 1) AuthoringToolKit — this repo must be built with Java 8
python3 4_run_all_tickets.py \
  --project-root dataverse_files/AuthoringToolKit \
  --java-major 8 \
  --ai \
  --ai-patches-dir PATCHES_EAK_TDD_DEEPSEEK_mSWE_AGENT_CL2

# 2) CompreFace
python3 4_run_all_tickets.py \
  --project-root dataverse_files/CompreFace \
  --ai \
  --ai-patches-dir PATCHES_CF_classic_GPT_4o_MINI_mSWE_AGENT_CL_1

# 3) DynamicMailboxes
python3 4_run_all_tickets.py \
  --project-root dataverse_files/DynamicMailboxes \
  --ai \
  --ai-patches-dir PATCHES_DMB_classic_GPT_4o_MINI_mSWE_AGENT_CL_1

Command‑line flags

3_run_ticket_test.py

Flag	Purpose	Default
`TICKET`	PR ticket ID to test (positional)	required
`PATCH`	Optional diff file (`<ticket>_non_test.diff`)	—
`--ai`	Skip base + merge stages; run only negative/code stage	off
`--project-root PATH`	Root of the benchmark project	script’s folder
`--java-major N`	Force Java major version (e.g., 8, 17)	highest JDK found

4_run_all_tickets.py

Flag	Purpose	Default
`--ai`	Run only the AI‑patch stage (skips base + merge)	off
`--ai-patches-dir PATH`	Directory containing `<ticket>_non_test.diff` files, or subfolders of such files (one per patch set)	`patches_ai` under the project root
`--project-root PATH`	Root of the benchmark project	script’s folder
`--java-major N`	Force Java major version (e.g., 8, 17)	highest JDK found

5_measure_scores.py

Argument	Purpose	Default
`<folder_path>`	Directory containing CSV files to summarize and display	required
`-h`, `--help`	Show the help message	N/A

All parameters are documented via -h/--help.

Troubleshooting & FAQ

Symptom	Fix
`java: command not found`	Check your JDK installation and `JVM_DIR`.
Maven can’t resolve dependencies	Make sure the target project builds without EnterpriseBench first.
"FileNotFound: pr_states.csv"	Confirm you renamed your dataset correctly or pass `--dataset` to the script.

License

Distributed under the Apache 2.0 license – see LICENSE for details.
