
Surface red team scan errors in run results #45772

Open
slister1001 wants to merge 2 commits into Azure:main from slister1001:fix/redteam-error-surfacing

Conversation

@slister1001
Member

When all attacks fail due to a configuration error (e.g., unavailable model), the run previously completed with 0 results and no error message. Users had no way to understand what went wrong.

Changes:

  • Add error field to RedTeamRun TypedDict for run-level error reporting
  • Add _aggregate_run_errors() to ResultProcessor to collect per-category errors from red_team_info into a structured run-level error
  • Classify HTTP 400 as CONFIGURATION in ExceptionHandler so systemic config errors (unavailable_model, bad credentials) are detected
  • Add early abort in Foundry execution manager after 2 consecutive configuration errors to avoid wasting time on remaining categories
  • Add _is_configuration_error() helper to detect systemic issues
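The aggregation described above can be sketched roughly as follows. The nested red_team_info shape (strategy → risk category → details with "status"/"error" keys) is inferred from the diff snippets in this PR; the exact run-level error shape returned here is an assumption, not the PR's actual payload:

```python
from typing import Any, Dict, List, Optional

# Statuses treated as failures; this set is taken from the helper's docstring
# ("failed/incomplete/timeout") quoted elsewhere in this PR.
_FAILED_STATUSES = {"failed", "incomplete", "timeout"}


def aggregate_run_errors(red_team_info: Optional[Dict[str, Dict[str, Any]]]) -> Optional[Dict[str, Any]]:
    """Collect per-category error messages into one run-level error dict."""
    if not red_team_info:
        return None
    errors: List[str] = []
    for categories in red_team_info.values():
        if not isinstance(categories, dict):
            continue
        for risk_value, details in categories.items():
            if not isinstance(details, dict):
                continue
            if details.get("status") in _FAILED_STATUSES and details.get("error"):
                errors.append(f"{risk_value}: {details['error']}")
    if not errors:
        return None
    unique = list(dict.fromkeys(errors))  # dedupe identical entries, keep order
    return {"code": "ScanFailed", "message": "; ".join(unique)}
```

With this shape, a run where every category failed with the same configuration error collapses to one message per category rather than silently producing 0 results.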

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions bot added the "Evaluation" label (Issues related to the client library for Azure AI Evaluation) on Mar 18, 2026
@slister1001 slister1001 marked this pull request as ready for review March 19, 2026 19:40
@slister1001 slister1001 requested a review from a team as a code owner March 19, 2026 19:40
Copilot AI review requested due to automatic review settings March 19, 2026 19:40
Copilot AI (Contributor) left a comment

Pull request overview

This PR improves red team scan diagnostics by surfacing run-level failures (especially systemic configuration issues) in the final run results, and by aborting Foundry execution early when repeated configuration failures indicate the rest of the scan would be wasted work.

Changes:

  • Add an optional error field to the RedTeamRun TypedDict to report run-level failures.
  • Add ResultProcessor._aggregate_run_errors() to summarize per-category failures from red_team_info into a structured run-level error.
  • Improve systemic error detection by classifying HTTP 400 as a configuration error and aborting Foundry execution after consecutive configuration-like failures.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Files changed:

  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_utils/exception_utils.py: Updates HTTP status categorization to treat 400 as configuration-related.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_result_processor.py: Adds run-level error aggregation and attaches it to the run payload when status is failed.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_red_team_result.py: Extends RedTeamRun schema with an optional error field.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_execution_manager.py: Adds configuration-error detection and early abort logic after repeated config failures.

Comment on lines +1699 to +1702
error_msg = details.get("error")
if error_msg:
    errors.append(f"{risk_value}: {error_msg}")

Copilot AI Mar 19, 2026

errors.append(f"{risk_value}: {error_msg}") bakes the risk category into the string, which prevents the later dict.fromkeys(errors) step from actually deduplicating identical underlying errors across categories (each entry will differ by risk_value). If you want run-level dedup for systemic config issues, consider deduping on the raw error_msg (and optionally tracking which categories share it).
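One way to implement the suggested dedup, keying on the raw message while tracking which categories share it (all names here are hypothetical, not from the PR):

```python
from typing import Dict, Iterable, List, Tuple


def dedupe_errors(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Group risk categories by their raw error message so one systemic
    failure (e.g. an unavailable model) yields a single run-level entry."""
    grouped: Dict[str, List[str]] = {}
    for risk_value, error_msg in pairs:
        grouped.setdefault(error_msg, []).append(risk_value)
    # dict preserves insertion order (Python 3.7+), so output order is stable
    return [f"{msg} (categories: {', '.join(cats)})" for msg, cats in grouped.items()]
```

Unlike deduping on the pre-formatted "risk: message" string, this collapses an identical underlying error across every category into one entry.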

Comment on lines +1668 to +1676
@staticmethod
def _aggregate_run_errors(
    red_team_info: Optional[Dict],
) -> Optional[Dict[str, Any]]:
    """Collect error messages from failed risk categories into a run-level error.

    Scans red_team_info for tasks with failed/incomplete/timeout status and
    extracts their error messages. Deduplicates identical errors (common when
    a config issue like an unavailable model affects all categories).
Copilot AI Mar 19, 2026

This new run-level error aggregation logic doesn’t appear to be covered by unit tests. Since tests/unittests/test_redteam/ already has coverage for other ResultProcessor helpers, it would be good to add tests for: (1) status==failed producing a non-null error, (2) multiple category failures producing a stable summary message, and (3) dedup behavior for identical error messages.
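A sketch of the tests this comment asks for. Since ResultProcessor cannot be imported here, the sketch uses a stand-in aggregation function with the same contract; in the real suite these would call ResultProcessor._aggregate_run_errors instead:

```python
from typing import Any, Dict, Optional


def _aggregate(red_team_info: Optional[Dict]) -> Optional[Dict[str, Any]]:
    """Stand-in mirroring the contract of _aggregate_run_errors()."""
    errors = []
    for categories in (red_team_info or {}).values():
        for risk_value, details in categories.items():
            if details.get("status") in {"failed", "incomplete", "timeout"} and details.get("error"):
                errors.append(f"{risk_value}: {details['error']}")
    if not errors:
        return None
    return {"message": "; ".join(dict.fromkeys(errors))}


def test_failed_status_produces_error():
    info = {"Foundry": {"violence": {"status": "failed", "error": "boom"}}}
    assert _aggregate(info) is not None


def test_multiple_failures_summarized():
    info = {"Foundry": {
        "violence": {"status": "failed", "error": "boom"},
        "hate": {"status": "timeout", "error": "timed out"},
    }}
    assert _aggregate(info)["message"] == "violence: boom; hate: timed out"


def test_identical_entries_deduplicated():
    info = {"A": {"violence": {"status": "failed", "error": "boom"}},
            "B": {"violence": {"status": "failed", "error": "boom"}}}
    assert _aggregate(info)["message"] == "violence: boom"


def test_no_failures_returns_none():
    assert _aggregate({"Foundry": {"violence": {"status": "completed"}}}) is None
```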

Comment on lines +208 to +235
# Track consecutive failures to detect systemic issues
# (e.g., unavailable model, bad credentials)
if self._is_configuration_error(e):
    consecutive_config_failures += 1
    if consecutive_config_failures >= _MAX_CONSECUTIVE_CONFIG_FAILURES:
        remaining = [
            rc.value
            for rc in risk_categories
            if rc.value
            not in {rv for rd in red_team_info.values() if isinstance(rd, dict) for rv in rd}
        ]
        if remaining:
            abort_msg = (
                f"Aborting remaining {len(remaining)} risk categories "
                f"after {consecutive_config_failures} consecutive configuration errors. "
                f"Root cause: {e}"
            )
            self.logger.error(abort_msg)
            for rv in remaining:
                if "Foundry" not in red_team_info:
                    red_team_info["Foundry"] = {}
                red_team_info["Foundry"][rv] = {
                    "data_file": "",
                    "status": "failed",
                    "error": str(e),
                    "asr": 0.0,
                }
            break
Copilot AI Mar 19, 2026

The new early-abort behavior after consecutive configuration errors is not covered by tests. Since tests/unittests/test_redteam/test_foundry.py already exercises execute_attacks, please add an async test that forces two consecutive configuration-like failures (e.g., mocked exception with response.status_code=400) and asserts: remaining risk categories are marked failed in red_team_info, and the loop aborts without attempting further orchestrator executions.
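The requested async test could be shaped like this. The Foundry execution manager itself is not importable here, so the loop below is a minimal stand-in that mirrors the early-abort logic; every name in this sketch is hypothetical:

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


class ConfigError(Exception):
    """Stand-in for an HTTP error whose response carries a status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


async def run_categories(
    categories: List[str],
    attack: Callable[[str], Awaitable[None]],
    max_failures: int = 2,
) -> Dict[str, Dict[str, str]]:
    """Mirror of the early-abort loop: after `max_failures` consecutive
    configuration errors, mark remaining categories failed and stop."""
    info: Dict[str, Dict[str, str]] = {}
    failures = 0
    for i, rc in enumerate(categories):
        try:
            await attack(rc)
            info[rc] = {"status": "completed"}
            failures = 0  # any success resets the consecutive counter
        except ConfigError as e:
            failures += 1
            info[rc] = {"status": "failed", "error": str(e)}
            if failures >= max_failures:
                # Mark everything not yet attempted as failed, then abort
                for rv in categories[i + 1:]:
                    info[rv] = {"status": "failed", "error": str(e)}
                break
    return info
```

A test would then drive run_categories with an attack coroutine that always raises ConfigError(400) and assert that only the first two categories were attempted while all four end up marked failed.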


red_team_info: Dict[str, Dict[str, Any]] = {}
consecutive_config_failures = 0
_MAX_CONSECUTIVE_CONFIG_FAILURES = 2
Copilot AI Mar 19, 2026

_MAX_CONSECUTIVE_CONFIG_FAILURES is a function-local variable but is named like a module constant (uppercase with leading underscore). For readability/consistency, consider either making it a class-level constant (e.g., MAX_CONSECUTIVE_CONFIG_FAILURES) or renaming it to a local variable style (e.g., max_consecutive_config_failures).

Suggested change
_MAX_CONSECUTIVE_CONFIG_FAILURES = 2
max_consecutive_config_failures = 2

Comment on lines 104 to 112
        # HTTP status code specific errors
        if hasattr(exception, "response") and hasattr(exception.response, "status_code"):
            status_code = exception.response.status_code
-           if 500 <= status_code < 600:
+           if status_code == 400:
+               return ErrorCategory.CONFIGURATION
+           elif 500 <= status_code < 600:
                return ErrorCategory.NETWORK
            elif status_code == 401:
                return ErrorCategory.AUTHENTICATION
Copilot AI Mar 19, 2026

The new HTTP 400 handling here is likely never reached for common httpx errors (e.g., httpx.HTTPStatusError), because earlier in this method network_exceptions includes httpx.HTTPError and the code returns ErrorCategory.NETWORK before checking exception.response.status_code. Consider handling HTTPStatusError (or status codes) before the broad httpx.HTTPError isinstance check, or removing httpx.HTTPError from the network_exceptions tuple so 4xx/5xx responses can be categorized as intended.
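One possible shape for the ordering fix: inspect the status code before any broad network-exception isinstance check. This is a sketch only; the category names mirror the PR's ErrorCategory values as plain strings, and the real method would also include httpx.HTTPError in the fallback check:

```python
def categorize(exception: Exception) -> str:
    """Inspect a response status code first so 4xx errors are not
    swallowed by a broad network-exception isinstance check."""
    status_code = getattr(getattr(exception, "response", None), "status_code", None)
    if status_code is not None:
        if status_code == 400:
            return "CONFIGURATION"
        if status_code == 401:
            return "AUTHENTICATION"
        if 500 <= status_code < 600:
            return "NETWORK"
    # Broad network fallback runs only after status-code inspection;
    # the real code would list httpx.HTTPError here as well.
    if isinstance(exception, (ConnectionError, TimeoutError)):
        return "NETWORK"
    return "UNKNOWN"
```

With this ordering, an httpx.HTTPStatusError carrying a 400 response reaches the CONFIGURATION branch instead of being classified NETWORK by the earlier isinstance check.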
