Refresh AITK Bulk Run Page #8471

Merged
merged 4 commits into from
Jun 17, 2025
86 changes: 69 additions & 17 deletions docs/intelligentapps/bulkrun.md
@@ -1,37 +1,89 @@
---
ContentId: 1124d141-e893-4780-aba7-b6ca13628bc5
DateApproved: 12/11/2024
MetaDescription: Run a set of prompts in an imported dataset, individually or in a full batch towards the selected genAI models and parameters.
DateApproved: 06/16/2025
MetaDescription: Run a set of prompts with variables or function calls against the selected models and parameters, using an imported or synthetically generated dataset.
---
# Run multiple prompts in bulk

The bulk run feature in AI Toolkit enables you to run multiple prompts in batch. When you use the playground, you can only run one prompt manually at a time, in the order they're listed.
> [!NOTE]
> Bulk run was previously a standalone webview feature in AI Toolkit. It is now fully integrated into **Agent Builder** under the **Evaluation** tab. You can still access it through the AI Toolkit view by selecting **TOOLS** > **Bulk Run**.

Bulk run takes a dataset as input, where each row in the dataset has at least a prompt. Typically, the dataset has multiple rows. Once imported, you can select one or more prompts to run on the selected model. The responses are then displayed in the same dataset view. The results from running the dataset can be exported.
The bulk run feature in AI Toolkit lets you test agents and prompts against multiple test cases in batch mode. Unlike the playground, which runs one prompt at a time, bulk run automates the process by using a dataset as input and running all prompts sequentially.

After execution, AI responses appear in the dataset view next to your original prompts. You can review, compare, and export the complete dataset with responses for further analysis.

![Screenshot showing AI Toolkit interface with the bulk run feature. The dataset table displays multiple prompts and responses, with queries about weather in Paris France and Shanghai China.](./images/bulkrun/bulkrun.png)

## Start a bulk run

1. In the AI Toolkit view, select **TOOLS** > **Bulk Run** to open the Bulk Run view
To start a bulk run in AI Toolkit, follow these steps:

1. In the AI Toolkit view, select **Agent Builder** from the Activity Bar.
1. Enter your prompt and variables using the `{{your_variable}}` format. Select a model to run the prompt against.
1. Switch to the **Evaluation** tab in **Agent Builder**.

> [!NOTE]
> To generate datasets, AI Toolkit uses the same LLMs that you use for agents, which might incur costs. You can view the meta prompt used to generate datasets in the [AI Toolkit GitHub repository](https://github.com/microsoft/vscode-ai-toolkit/blob/main/doc/data_generator.md).

1. Select **Generate Data** to create a synthetic dataset.
1. Choose the number of rows to generate and view or modify the data generation logic.
![Screenshot showing Generate Data dialog in AI Toolkit.](./images/bulkrun/generate_data.png)
1. Select **Generate** to create the dataset.

> [!TIP]
> You can choose to run only the remaining queries that have not yet been run.

1. Once the dataset is loaded, select **Run** to run a single row or **Run All** to run all rows in the dataset.
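
The `{{your_variable}}` placeholders in the steps above can be read as plain template substitution: each dataset row supplies the values for one rendered prompt. The following is a minimal Python sketch of that idea, not AI Toolkit's actual implementation. The function name `render_prompt`, the regular expression, and the `city` and `country` variables are all assumptions made for illustration.

```python
import re

# Matches {{name}} with optional spaces inside the braces (assumed syntax).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, row: dict) -> str:
    """Replace each {{name}} placeholder with the value from one dataset row."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in row:
            # Fail loudly rather than send a half-filled prompt to a model.
            raise KeyError(f"dataset row has no value for variable '{name}'")
        return str(row[name])

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical prompt and dataset rows, one rendered prompt per row.
template = "What is the weather in {{city}}, {{country}}?"
rows = [
    {"city": "Paris", "country": "France"},
    {"city": "Shanghai", "country": "China"},
]
prompts = [render_prompt(template, row) for row in rows]
```

Raising on a missing variable is a design choice of this sketch; it makes an incomplete dataset row visible before any model call is made.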

## Operate on dataset

![Screenshot showing AI Toolkit interface with dataset operations and a table of evaluation results.](./images/bulkrun/dataset_operation.png)

AI Toolkit provides several operations to manage and analyze your dataset during a bulk run:

- **Generate Data**: Create a synthetic dataset based on a prompt and variables. Specify the number of rows and modify the data generation logic.
- **Add Row**: Add a new row to the dataset.
- **Delete Row**: Delete the selected row from the dataset.
- **Export Dataset**: Export the dataset to a CSV file for further analysis or reporting.
- **Import Dataset**: Import a dataset from a CSV file to use as input for the bulk run.
- **Run**: Execute a single row in the dataset against the selected model.
- **Run All**: Execute all rows in the dataset against the selected model.
- **Run Remaining**: Execute only the rows that have not yet been run against the selected model.
- **Manual Evaluation**: Mark responses as Thumb Up or Thumb Down to keep a record of manual evaluations.
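
To make the **Import Dataset**, **Run Remaining**, and **Export Dataset** operations concrete, here is a small Python sketch of the same workflow outside the UI. It is an illustration under stated assumptions, not AI Toolkit code: the column names `query` and `response`, the helper names, and the `fake_model` stub are all hypothetical.

```python
import csv
import io


def from_csv(text: str) -> list:
    """Parse CSV text into a list of row dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def to_csv(rows: list, fieldnames: list) -> str:
    """Serialize row dictionaries back to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def run_remaining(rows: list, run_model) -> int:
    """Run only rows with an empty response; return how many rows were run."""
    pending = [row for row in rows if not (row.get("response") or "").strip()]
    for row in pending:
        row["response"] = run_model(row["query"])
    return len(pending)


def fake_model(query: str) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    return f"(stub answer to: {query})"


# One row already has a response, one does not.
dataset = from_csv("query,response\nWeather in Paris?,Sunny\nWeather in Shanghai?,\n")
ran = run_remaining(dataset, fake_model)
exported = to_csv(dataset, ["query", "response"])
```

In this sketch, a row counts as "remaining" when its response cell is empty or whitespace, so rows that already have a response are left untouched.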

## Evaluate bulk run results

AI Toolkit lets you evaluate the results of your bulk run directly in the dataset view.

![Screenshot showing AI Toolkit interface in full screen mode with the Evaluation tab expanded. The dataset table displays multiple columns, including query prompts and AI responses, for detailed analysis.](./images/bulkrun/full_screen.png)

You can expand the **Evaluation** tab to full screen mode for a more detailed view of the results. Full screen mode provides the same functionality as the standard view, but with a larger display area for better visibility and analysis.

1. Select either a sample dataset or import a local [JSONL](https://jsonlines.org/) file with chat prompts
![Screenshot showing detailed view of evaluation results with a modal dialog displaying a full conversation between user and assistant about weather queries.](./images/bulkrun/view_detail.png)

The JSONL file needs to have a `query` field to represent a prompt.
Select **View Details** to see the full response for each query.

1. Once the dataset is loaded, select **Run** or **Rerun** on any prompt to run a single prompt.
In the detail view, you can:

Similar to testing a model in the playground, select a model, add context for your prompt, and change inference parameters.
- Review the full conversation between the user and the assistant.
- Analyze the AI's responses.
- Mark responses as good or bad to keep a record of manual evaluations.
- Navigate to previous or next queries in the dataset.
- Select **Exit** to return to the dataset overview.
- View the total number of queries in the dataset and the current query index.

![Bulk run prompts](./images/bulkrun/bulkrun_one.png)
## Manage data columns

1. Select **Run all** to automatically run through all queries.
![Screenshot showing AI Toolkit interface with dataset management options and column management controls.](./images/bulkrun/manage_columns.png)

The model responses are shown in the **response** column.
With data column management, you can customize the dataset view to focus on the most relevant information for your bulk run analysis.

![Run all](./images/bulkrun/runall.png)
You can:

> [!TIP]
> There is an option to only run the remaining queries that have not yet been run.
- **Add Columns**: Add columns to the left or right of the current column.
- **Edit Column Name**: Change the name of any column in the dataset.
- **Add Ground Truth Column**: Add a column for ground truth values to compare with AI responses.
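
Once a dataset with a ground truth column has been exported, one simple way to compare responses with the expected values is an exact-match rate. The Python sketch below shows that comparison. It is only an illustration, not a description of how AI Toolkit scores responses. The column names `response` and `ground_truth` and the normalization rule (trim whitespace, ignore case) are assumptions.

```python
def normalize(text: str) -> str:
    """Trim whitespace and lowercase so trivial differences do not count."""
    return text.strip().lower()


def exact_match_rate(rows, response_col="response", truth_col="ground_truth"):
    """Fraction of rows with a ground truth value whose response matches it."""
    scored = [row for row in rows if (row.get(truth_col) or "").strip()]
    if not scored:
        return 0.0  # nothing to compare against
    matches = sum(
        normalize(row.get(response_col) or "") == normalize(row[truth_col])
        for row in scored
    )
    return matches / len(scored)


# Hypothetical rows; the third row has no ground truth and is skipped.
rows = [
    {"query": "Capital of France?", "response": "Paris", "ground_truth": "paris"},
    {"query": "Capital of China?", "response": "Shanghai", "ground_truth": "Beijing"},
    {"query": "Say hello", "response": "Hello!", "ground_truth": ""},
]
rate = exact_match_rate(rows)
```

Exact matching is strict and suits short, closed-form answers; for free-form responses, the evaluators linked under **Next steps** are likely a better fit.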

1. Select the **Export** button to export the results to a JSONL format
## Next steps

1. Select **Import** to import another dataset in JSONL format for the bulk run
- [Run an evaluation](/docs/intelligentapps/evaluation.md) with the popular evaluators
3 changes: 3 additions & 0 deletions docs/intelligentapps/images/bulkrun/bulkrun.png
3 changes: 3 additions & 0 deletions docs/intelligentapps/images/bulkrun/dataset_operation.png
3 changes: 3 additions & 0 deletions docs/intelligentapps/images/bulkrun/full_screen.png
3 changes: 3 additions & 0 deletions docs/intelligentapps/images/bulkrun/generate_data.png
3 changes: 3 additions & 0 deletions docs/intelligentapps/images/bulkrun/manage_columns.png
3 changes: 3 additions & 0 deletions docs/intelligentapps/images/bulkrun/view_detail.png