
Surface red team scan errors in run results #45772

Open
slister1001 wants to merge 2 commits into Azure:main from slister1001:fix/redteam-error-surfacing

Conversation

@slister1001
Member

When all attacks fail due to a configuration error (e.g., unavailable model), the run previously completed with 0 results and no error message. Users had no way to understand what went wrong.

Changes:

  • Add error field to RedTeamRun TypedDict for run-level error reporting
  • Add _aggregate_run_errors() to ResultProcessor to collect per-category errors from red_team_info into a structured run-level error
  • Classify HTTP 400 as CONFIGURATION in ExceptionHandler so systemic config errors (unavailable_model, bad credentials) are detected
  • Add early abort in Foundry execution manager after 2 consecutive configuration errors to avoid wasting time on remaining categories
  • Add _is_configuration_error() helper to detect systemic issues
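The aggregation described above can be sketched roughly as follows. The nested red_team_info shape (strategy → risk category → details with "status"/"error" keys) is inferred from the diff snippets in this PR; the exact run-level error shape returned here is an assumption, not the PR's actual payload:

```python
from typing import Any, Dict, List, Optional

# Statuses treated as failures; this set is taken from the helper's docstring
# ("failed/incomplete/timeout") quoted elsewhere in this PR.
_FAILED_STATUSES = {"failed", "incomplete", "timeout"}


def aggregate_run_errors(red_team_info: Optional[Dict[str, Dict[str, Any]]]) -> Optional[Dict[str, Any]]:
    """Collect per-category error messages into one run-level error dict."""
    if not red_team_info:
        return None
    errors: List[str] = []
    for categories in red_team_info.values():
        if not isinstance(categories, dict):
            continue
        for risk_value, details in categories.items():
            if not isinstance(details, dict):
                continue
            if details.get("status") in _FAILED_STATUSES and details.get("error"):
                errors.append(f"{risk_value}: {details['error']}")
    if not errors:
        return None
    unique = list(dict.fromkeys(errors))  # dedupe identical entries, keep order
    return {"code": "ScanFailed", "message": "; ".join(unique)}
```

With this shape, a run where every category failed with the same configuration error collapses to one message per category rather than silently producing 0 results.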

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions bot added the "Evaluation" label (Issues related to the client library for Azure AI Evaluation) on Mar 18, 2026
@slister1001 slister1001 marked this pull request as ready for review March 19, 2026 19:40
@slister1001 slister1001 requested a review from a team as a code owner March 19, 2026 19:40
Copilot AI review requested due to automatic review settings March 19, 2026 19:40
Copilot AI (Contributor) left a comment

Pull request overview

This PR improves red team scan diagnostics by surfacing run-level failures (especially systemic configuration issues) in the final run results, and by aborting Foundry execution early when repeated configuration failures indicate the rest of the scan would be wasted work.

Changes:

  • Add an optional error field to the RedTeamRun TypedDict to report run-level failures.
  • Add ResultProcessor._aggregate_run_errors() to summarize per-category failures from red_team_info into a structured run-level error.
  • Improve systemic error detection by classifying HTTP 400 as a configuration error and aborting Foundry execution after consecutive configuration-like failures.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Files changed:

  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_utils/exception_utils.py: Updates HTTP status categorization to treat 400 as configuration-related.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_result_processor.py: Adds run-level error aggregation and attaches it to the run payload when status is failed.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_red_team_result.py: Extends RedTeamRun schema with an optional error field.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_execution_manager.py: Adds configuration-error detection and early abort logic after repeated config failures.

Comment on lines +1699 to +1702
error_msg = details.get("error")
if error_msg:
    errors.append(f"{risk_value}: {error_msg}")

Copilot AI Mar 19, 2026

errors.append(f"{risk_value}: {error_msg}") bakes the risk category into the string, which prevents the later dict.fromkeys(errors) step from actually deduplicating identical underlying errors across categories (each entry will differ by risk_value). If you want run-level dedup for systemic config issues, consider deduping on the raw error_msg (and optionally tracking which categories share it).
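One way to implement the suggested dedup, keying on the raw message while tracking which categories share it (all names here are hypothetical, not from the PR):

```python
from typing import Dict, Iterable, List, Tuple


def dedupe_errors(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Group risk categories by their raw error message so one systemic
    failure (e.g. an unavailable model) yields a single run-level entry."""
    grouped: Dict[str, List[str]] = {}
    for risk_value, error_msg in pairs:
        grouped.setdefault(error_msg, []).append(risk_value)
    # dict preserves insertion order (Python 3.7+), so output order is stable
    return [f"{msg} (categories: {', '.join(cats)})" for msg, cats in grouped.items()]
```

Unlike deduping on the pre-formatted "risk: message" string, this collapses an identical underlying error across every category into one entry.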

Comment on lines +1668 to +1676
@staticmethod
def _aggregate_run_errors(
    red_team_info: Optional[Dict],
) -> Optional[Dict[str, Any]]:
    """Collect error messages from failed risk categories into a run-level error.

    Scans red_team_info for tasks with failed/incomplete/timeout status and
    extracts their error messages. Deduplicates identical errors (common when
    a config issue like an unavailable model affects all categories).
Copilot AI Mar 19, 2026

This new run-level error aggregation logic doesn’t appear to be covered by unit tests. Since tests/unittests/test_redteam/ already has coverage for other ResultProcessor helpers, it would be good to add tests for: (1) status==failed producing a non-null error, (2) multiple category failures producing a stable summary message, and (3) dedup behavior for identical error messages.
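A sketch of the tests this comment asks for. Since ResultProcessor cannot be imported here, the sketch uses a stand-in aggregation function with the same contract; in the real suite these would call ResultProcessor._aggregate_run_errors instead:

```python
from typing import Any, Dict, Optional


def _aggregate(red_team_info: Optional[Dict]) -> Optional[Dict[str, Any]]:
    """Stand-in mirroring the contract of _aggregate_run_errors()."""
    errors = []
    for categories in (red_team_info or {}).values():
        for risk_value, details in categories.items():
            if details.get("status") in {"failed", "incomplete", "timeout"} and details.get("error"):
                errors.append(f"{risk_value}: {details['error']}")
    if not errors:
        return None
    return {"message": "; ".join(dict.fromkeys(errors))}


def test_failed_status_produces_error():
    info = {"Foundry": {"violence": {"status": "failed", "error": "boom"}}}
    assert _aggregate(info) is not None


def test_multiple_failures_summarized():
    info = {"Foundry": {
        "violence": {"status": "failed", "error": "boom"},
        "hate": {"status": "timeout", "error": "timed out"},
    }}
    assert _aggregate(info)["message"] == "violence: boom; hate: timed out"


def test_identical_entries_deduplicated():
    info = {"A": {"violence": {"status": "failed", "error": "boom"}},
            "B": {"violence": {"status": "failed", "error": "boom"}}}
    assert _aggregate(info)["message"] == "violence: boom"


def test_no_failures_returns_none():
    assert _aggregate({"Foundry": {"violence": {"status": "completed"}}}) is None
```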

Comment on lines +208 to +235
# Track consecutive failures to detect systemic issues
# (e.g., unavailable model, bad credentials)
if self._is_configuration_error(e):
    consecutive_config_failures += 1
    if consecutive_config_failures >= _MAX_CONSECUTIVE_CONFIG_FAILURES:
        remaining = [
            rc.value
            for rc in risk_categories
            if rc.value
            not in {rv for rd in red_team_info.values() if isinstance(rd, dict) for rv in rd}
        ]
        if remaining:
            abort_msg = (
                f"Aborting remaining {len(remaining)} risk categories "
                f"after {consecutive_config_failures} consecutive configuration errors. "
                f"Root cause: {e}"
            )
            self.logger.error(abort_msg)
            for rv in remaining:
                if "Foundry" not in red_team_info:
                    red_team_info["Foundry"] = {}
                red_team_info["Foundry"][rv] = {
                    "data_file": "",
                    "status": "failed",
                    "error": str(e),
                    "asr": 0.0,
                }
            break
Copilot AI Mar 19, 2026

The new early-abort behavior after consecutive configuration errors is not covered by tests. Since tests/unittests/test_redteam/test_foundry.py already exercises execute_attacks, please add an async test that forces two consecutive configuration-like failures (e.g., mocked exception with response.status_code=400) and asserts: remaining risk categories are marked failed in red_team_info, and the loop aborts without attempting further orchestrator executions.
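The requested async test could be shaped like this. The Foundry execution manager itself is not importable here, so the loop below is a minimal stand-in that mirrors the early-abort logic; every name in this sketch is hypothetical:

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


class ConfigError(Exception):
    """Stand-in for an HTTP error whose response carries a status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


async def run_categories(
    categories: List[str],
    attack: Callable[[str], Awaitable[None]],
    max_failures: int = 2,
) -> Dict[str, Dict[str, str]]:
    """Mirror of the early-abort loop: after `max_failures` consecutive
    configuration errors, mark remaining categories failed and stop."""
    info: Dict[str, Dict[str, str]] = {}
    failures = 0
    for i, rc in enumerate(categories):
        try:
            await attack(rc)
            info[rc] = {"status": "completed"}
            failures = 0  # any success resets the consecutive counter
        except ConfigError as e:
            failures += 1
            info[rc] = {"status": "failed", "error": str(e)}
            if failures >= max_failures:
                # Mark everything not yet attempted as failed, then abort
                for rv in categories[i + 1:]:
                    info[rv] = {"status": "failed", "error": str(e)}
                break
    return info
```

A test would then drive run_categories with an attack coroutine that always raises ConfigError(400) and assert that only the first two categories were attempted while all four end up marked failed.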


red_team_info: Dict[str, Dict[str, Any]] = {}
consecutive_config_failures = 0
_MAX_CONSECUTIVE_CONFIG_FAILURES = 2
Copilot AI Mar 19, 2026

_MAX_CONSECUTIVE_CONFIG_FAILURES is a function-local variable but is named like a module constant (uppercase with leading underscore). For readability/consistency, consider either making it a class-level constant (e.g., MAX_CONSECUTIVE_CONFIG_FAILURES) or renaming it to a local variable style (e.g., max_consecutive_config_failures).

Suggested change
_MAX_CONSECUTIVE_CONFIG_FAILURES = 2
max_consecutive_config_failures = 2

Comment on lines 104 to 112
        # HTTP status code specific errors
        if hasattr(exception, "response") and hasattr(exception.response, "status_code"):
            status_code = exception.response.status_code
-           if 500 <= status_code < 600:
+           if status_code == 400:
+               return ErrorCategory.CONFIGURATION
+           elif 500 <= status_code < 600:
                return ErrorCategory.NETWORK
            elif status_code == 401:
                return ErrorCategory.AUTHENTICATION
Copilot AI Mar 19, 2026

The new HTTP 400 handling here is likely never reached for common httpx errors (e.g., httpx.HTTPStatusError), because earlier in this method network_exceptions includes httpx.HTTPError and the code returns ErrorCategory.NETWORK before checking exception.response.status_code. Consider handling HTTPStatusError (or status codes) before the broad httpx.HTTPError isinstance check, or removing httpx.HTTPError from the network_exceptions tuple so 4xx/5xx responses can be categorized as intended.
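One possible shape for the ordering fix: inspect the status code before any broad network-exception isinstance check. This is a sketch only; the category names mirror the PR's ErrorCategory values as plain strings, and the real method would also include httpx.HTTPError in the fallback check:

```python
def categorize(exception: Exception) -> str:
    """Inspect a response status code first so 4xx errors are not
    swallowed by a broad network-exception isinstance check."""
    status_code = getattr(getattr(exception, "response", None), "status_code", None)
    if status_code is not None:
        if status_code == 400:
            return "CONFIGURATION"
        if status_code == 401:
            return "AUTHENTICATION"
        if 500 <= status_code < 600:
            return "NETWORK"
    # Broad network fallback runs only after status-code inspection;
    # the real code would list httpx.HTTPError here as well.
    if isinstance(exception, (ConnectionError, TimeoutError)):
        return "NETWORK"
    return "UNKNOWN"
```

With this ordering, an httpx.HTTPStatusError carrying a 400 response reaches the CONFIGURATION branch instead of being classified NETWORK by the earlier isinstance check.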
