Surface red team scan errors in run results #45772

slister1001 wants to merge 2 commits into Azure:main from
Conversation
When all attacks fail due to a configuration error (e.g., an unavailable model), the run previously completed with 0 results and no error message. Users had no way to understand what went wrong.

Changes:
- Add an `error` field to the `RedTeamRun` TypedDict for run-level error reporting
- Add `_aggregate_run_errors()` to `ResultProcessor` to collect per-category errors from `red_team_info` into a structured run-level error
- Classify HTTP 400 as `CONFIGURATION` in `ExceptionHandler` so systemic config errors (unavailable_model, bad credentials) are detected
- Add an early abort in the Foundry execution manager after 2 consecutive configuration errors to avoid wasting time on remaining categories
- Add an `_is_configuration_error()` helper to detect systemic issues

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR improves red team scan diagnostics by surfacing run-level failures (especially systemic configuration issues) in the final run results, and by aborting Foundry execution early when repeated configuration failures indicate the rest of the scan would be wasted work.
Changes:
- Add an optional `error` field to the `RedTeamRun` TypedDict to report run-level failures.
- Add `ResultProcessor._aggregate_run_errors()` to summarize per-category failures from `red_team_info` into a structured run-level error.
- Improve systemic error detection by classifying HTTP 400 as a configuration error and aborting Foundry execution after consecutive configuration-like failures.
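The optional run-level error field described above might look roughly like this. This is a sketch only; the field names inside the error payload (`code`, `message`, `categories`) are illustrative assumptions, and the real `RedTeamRun` TypedDict in `_red_team_result.py` has many more keys:

```python
from typing import Any, Dict, Optional, TypedDict


class RedTeamRun(TypedDict, total=False):
    """Sketch of the run payload with the new optional error field.

    total=False makes every key optional, so error can be omitted
    on successful runs.
    """

    scan_name: str
    status: str
    # Run-level error populated when the scan fails outright, e.g. when
    # every risk category hits the same configuration problem.
    error: Optional[Dict[str, Any]]


run: RedTeamRun = {
    "scan_name": "demo-scan",
    "status": "failed",
    "error": {
        "code": "configuration_error",
        "message": "unavailable_model: deployment not found",
        "categories": ["violence", "hate_unfairness"],
    },
}
print(run["status"], run["error"]["code"])
```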
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_utils/exception_utils.py | Updates HTTP status categorization to treat 400 as configuration-related. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_result_processor.py | Adds run-level error aggregation and attaches it to the run payload when status is failed. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_red_team_result.py | Extends RedTeamRun schema with an optional error field. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_execution_manager.py | Adds configuration-error detection and early abort logic after repeated config failures. |
```python
error_msg = details.get("error")
if error_msg:
    errors.append(f"{risk_value}: {error_msg}")
```
errors.append(f"{risk_value}: {error_msg}") bakes the risk category into the string, which prevents the later dict.fromkeys(errors) step from actually deduplicating identical underlying errors across categories (each entry will differ by risk_value). If you want run-level dedup for systemic config issues, consider deduping on the raw error_msg (and optionally tracking which categories share it).
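A minimal sketch of that suggestion — deduplicating on the raw message while tracking which categories share it. The helper name and return shape here are hypothetical, not the SDK's actual API:

```python
from typing import Dict, List


def dedupe_errors(per_category: Dict[str, str]) -> List[Dict[str, object]]:
    """Group identical error messages across risk categories.

    per_category maps risk_value -> raw error message. Returns one entry
    per unique message (first-seen order), with the list of categories
    that reported it, so a systemic config error collapses to one entry.
    """
    grouped: Dict[str, List[str]] = {}
    for risk_value, error_msg in per_category.items():
        grouped.setdefault(error_msg, []).append(risk_value)
    return [{"message": msg, "categories": cats} for msg, cats in grouped.items()]


errors = dedupe_errors(
    {
        "violence": "unavailable_model: deployment not found",
        "hate_unfairness": "unavailable_model: deployment not found",
        "self_harm": "request timed out",
    }
)
# Two unique messages instead of three category-prefixed strings.
print(len(errors), errors[0]["categories"])
```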
```python
@staticmethod
def _aggregate_run_errors(
    red_team_info: Optional[Dict],
) -> Optional[Dict[str, Any]]:
    """Collect error messages from failed risk categories into a run-level error.

    Scans red_team_info for tasks with failed/incomplete/timeout status and
    extracts their error messages. Deduplicates identical errors (common when
    a config issue like an unavailable model affects all categories).
    """
```
This new run-level error aggregation logic doesn’t appear to be covered by unit tests. Since tests/unittests/test_redteam/ already has coverage for other ResultProcessor helpers, it would be good to add tests for: (1) status==failed producing a non-null error, (2) multiple category failures producing a stable summary message, and (3) dedup behavior for identical error messages.
```python
# Track consecutive failures to detect systemic issues
# (e.g., unavailable model, bad credentials)
if self._is_configuration_error(e):
    consecutive_config_failures += 1
    if consecutive_config_failures >= _MAX_CONSECUTIVE_CONFIG_FAILURES:
        remaining = [
            rc.value
            for rc in risk_categories
            if rc.value
            not in {rv for rd in red_team_info.values() if isinstance(rd, dict) for rv in rd}
        ]
        if remaining:
            abort_msg = (
                f"Aborting remaining {len(remaining)} risk categories "
                f"after {consecutive_config_failures} consecutive configuration errors. "
                f"Root cause: {e}"
            )
            self.logger.error(abort_msg)
            for rv in remaining:
                if "Foundry" not in red_team_info:
                    red_team_info["Foundry"] = {}
                red_team_info["Foundry"][rv] = {
                    "data_file": "",
                    "status": "failed",
                    "error": str(e),
                    "asr": 0.0,
                }
            break
```
The new early-abort behavior after consecutive configuration errors is not covered by tests. Since tests/unittests/test_redteam/test_foundry.py already exercises execute_attacks, please add an async test that forces two consecutive configuration-like failures (e.g., mocked exception with response.status_code=400) and asserts: remaining risk categories are marked failed in red_team_info, and the loop aborts without attempting further orchestrator executions.
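A self-contained sketch of the abort pattern such a test could assert against. The stub names (`ConfigError`, `run_categories`) are illustrative, not the SDK's actual classes or the real test harness:

```python
from typing import Callable, Dict, List


class ConfigError(Exception):
    """Stub for an exception carrying an HTTP response, e.g. a 400."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def run_categories(
    categories: List[str],
    attack: Callable[[str], object],
    max_config_failures: int = 2,
) -> Dict[str, dict]:
    """Run the attack per category; once repeated config errors suggest a
    systemic problem, mark the untried categories failed and stop."""
    info: Dict[str, dict] = {}
    consecutive = 0
    for i, cat in enumerate(categories):
        try:
            info[cat] = {"status": "completed", "result": attack(cat)}
            consecutive = 0
        except ConfigError as e:
            info[cat] = {"status": "failed", "error": str(e)}
            consecutive += 1
            if consecutive >= max_config_failures:
                # Early abort: everything not yet attempted is marked failed.
                for rest in categories[i + 1 :]:
                    info[rest] = {"status": "failed", "error": str(e)}
                break
    return info


attempts: List[str] = []


def always_400(cat: str) -> object:
    attempts.append(cat)
    raise ConfigError(400)


info = run_categories(["violence", "hate", "self_harm", "sexual"], always_400)
# All four categories end up failed, but only two attacks were attempted.
print(len(attempts), sorted(info))
```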
```python
red_team_info: Dict[str, Dict[str, Any]] = {}
consecutive_config_failures = 0
_MAX_CONSECUTIVE_CONFIG_FAILURES = 2
```
_MAX_CONSECUTIVE_CONFIG_FAILURES is a function-local variable but is named like a module constant (uppercase with leading underscore). For readability/consistency, consider either making it a class-level constant (e.g., MAX_CONSECUTIVE_CONFIG_FAILURES) or renaming it to a local variable style (e.g., max_consecutive_config_failures).
```diff
- _MAX_CONSECUTIVE_CONFIG_FAILURES = 2
+ max_consecutive_config_failures = 2
```
```diff
 # HTTP status code specific errors
 if hasattr(exception, "response") and hasattr(exception.response, "status_code"):
     status_code = exception.response.status_code
-    if 500 <= status_code < 600:
+    if status_code == 400:
+        return ErrorCategory.CONFIGURATION
+    elif 500 <= status_code < 600:
         return ErrorCategory.NETWORK
     elif status_code == 401:
         return ErrorCategory.AUTHENTICATION
```
The new HTTP 400 handling here is likely never reached for common httpx errors (e.g., httpx.HTTPStatusError), because earlier in this method network_exceptions includes httpx.HTTPError and the code returns ErrorCategory.NETWORK before checking exception.response.status_code. Consider handling HTTPStatusError (or status codes) before the broad httpx.HTTPError isinstance check, or removing httpx.HTTPError from the network_exceptions tuple so 4xx/5xx responses can be categorized as intended.
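A sketch of the ordering fix the comment suggests — inspect the response status code before falling back to broad network-exception checks, so a 4xx response is never swallowed by the generic branch. `ErrorCategory` is stubbed here; the real enum and `network_exceptions` tuple live in `exception_utils.py`:

```python
from enum import Enum


class ErrorCategory(Enum):
    CONFIGURATION = "configuration"
    AUTHENTICATION = "authentication"
    NETWORK = "network"
    UNKNOWN = "unknown"


def categorize(exception: Exception) -> ErrorCategory:
    """Status-code checks run first so an exception carrying a 400
    response (e.g. httpx.HTTPStatusError) is classified CONFIGURATION
    instead of being caught by the generic network branch."""
    response = getattr(exception, "response", None)
    status_code = getattr(response, "status_code", None)
    if status_code is not None:
        if status_code == 400:
            return ErrorCategory.CONFIGURATION
        if status_code == 401:
            return ErrorCategory.AUTHENTICATION
        if 500 <= status_code < 600:
            return ErrorCategory.NETWORK
    # Broad network-exception isinstance checks (the SDK's tuple includes
    # httpx.HTTPError) should only run after status-specific handling.
    if isinstance(exception, (ConnectionError, TimeoutError)):
        return ErrorCategory.NETWORK
    return ErrorCategory.UNKNOWN


class FakeResponse:
    def __init__(self, status_code: int):
        self.status_code = status_code


class FakeHTTPStatusError(Exception):
    """Stand-in for an httpx-style error exposing .response."""

    def __init__(self, status_code: int):
        self.response = FakeResponse(status_code)


print(categorize(FakeHTTPStatusError(400)).name)
```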