Follow-up from PR #52 — surfaced during hands-on deploy-validation. Tracked here so the fix can land on its own merits, separately from the interactive-agents rev-6 narrative.
Functional description
When a task fails (e.g. the user's prompt is blocked by the safety filter), the system is supposed to emit exactly one task_failed event. In practice, a single logical failure can emit 5-6 task_failed events over a ~60 s window.
User-visible impact:
- Duplicate GitHub writes — each task_failed event produces a PATCH against the PR comment; one guardrail-blocked task produced 5 redundant GitHub writes in testing.
- Concurrency counter drift — the user's concurrency counter is decremented once per emission, so one failed task may be counted as N completions, and the user's "tasks currently running" count drifts away from reality (bounded but real).
- Event-stream noise — watch / CLI / dashboard consumers see a stream of duplicate task_failed events for one logical failure.
Observed in testing: task 01KQT96W0B8J48HDWGF4YA9V7D (PR #52 Scenario 7-extended take 3): 6 task_failed rows in TaskEventsTable for one guardrail-blocked submission, timestamps spread across 60 s with the classic Durable Execution retry backoff (1 s / 7 s / 17 s / 8 s / 23 s).
Technical root cause
failTask in cdk/src/handlers/shared/orchestrator.ts (lines 618-637) emits task_failed and calls decrementConcurrency unconditionally, even when transitionTask raises ConditionalCheckFailedException because the task is already in the FAILED state:
```ts
// cdk/src/handlers/shared/orchestrator.ts:618-637 (simplified)
try {
  await transitionTask(taskId, fromStatus, TaskStatus.FAILED, { ... });
} catch (err) {
  logger.warn('Failed to transition task to FAILED', { ... }); // ← swallowed; no re-throw, no guard flag
}
await emitTaskEvent(taskId, 'task_failed', { error_message }); // ← always fires
if (releaseConcurrency) await decrementConcurrency(userId);    // ← always fires
```
The caller at cdk/src/handlers/orchestrate-task.ts:117-119 catches the error, calls failTask, then re-throws (throw err) to signal step failure to the @aws/durable-execution-sdk-js runtime. The SDK retries context.step('hydrate-context', ...) with exponential backoff. Each retry re-enters failTask, which re-emits and re-decrements, because the ConditionalCheckFailed on the transition is swallowed without gating the downstream side effects.
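To make the failure mode concrete, here is a minimal, self-contained sketch of the retry interaction. It is hypothetical — the class, counters, and the synchronous retry loop stand in for DynamoDB, the event table, the concurrency counter, and the SDK's step retries; none of this is the real orchestrator code:

```typescript
// Hypothetical model of the bug: the conditional-write failure is swallowed,
// but the side effects run unconditionally, so each retry duplicates them.
class ConditionalCheckFailedException extends Error {}

let status = 'RUNNING';
let emitted = 0;     // stands in for emitTaskEvent(..., 'task_failed', ...)
let decremented = 0; // stands in for decrementConcurrency(userId)

// DynamoDB-style conditional write: succeeds only if the task is still in `from`.
function transitionTask(from: string, to: string): void {
  if (status !== from) {
    throw new ConditionalCheckFailedException('conditional request failed');
  }
  status = to;
}

// Buggy shape: the failed transition is caught and ignored,
// then the side effects fire anyway.
function failTaskBuggy(): void {
  try {
    transitionTask('RUNNING', 'FAILED');
  } catch {
    // swallowed; no re-throw, no guard flag
  }
  emitted++;
  decremented++;
}

// Each Durable Execution retry of the failing step re-enters failTask.
for (let attempt = 0; attempt < 6; attempt++) {
  failTaskBuggy();
}
console.log(`emitted=${emitted}, decremented=${decremented}`); // emitted=6, decremented=6
```

Only the first call actually transitions the task; the five retries all hit the conditional-check failure yet still emit and decrement, matching the six task_failed rows observed.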
CloudWatch evidence (abridged):
```
20:00:02 Content blocked by guardrail … PROMPT_ATTACK LOW
20:00:04 WARN Failed to transition task to FAILED … conditional request failed
20:00:11 WARN Failed to transition task to FAILED …
20:00:28 WARN Failed to transition task to FAILED …
20:00:36 WARN Failed to transition task to FAILED …
20:00:59 WARN Failed to transition task to FAILED …
```
Proposed fix
Gate the side effects in failTask behind a transitioned flag, mirroring the pattern already used at orchestrator.ts:478-499 (the heartbeat-stale path):
```ts
let transitioned = false;
try {
  await transitionTask(taskId, fromStatus, TaskStatus.FAILED, { ... });
  transitioned = true;
} catch (err) {
  logger.warn('Failed to transition task to FAILED', { ... });
}
if (transitioned) {
  await emitTaskEvent(taskId, 'task_failed', { error_message });
  if (releaseConcurrency) await decrementConcurrency(userId);
}
```
This is a missed consistency fix — the sibling code path already uses exactly this shape. PR #52's fanout-to-GitHub wiring just made the downstream effects visible on a real PR.
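Applying the same gate to the simplified model makes the idempotency visible. As before, this is a hypothetical sketch with stand-in names, not the real orchestrator:

```typescript
// Hypothetical model of the proposed fix: a `transitioned` flag gates the
// side effects, so retries that lose the conditional write do nothing.
class ConditionalCheckFailedException extends Error {}

let status = 'RUNNING';
let emitted = 0;     // stands in for emitTaskEvent(..., 'task_failed', ...)
let decremented = 0; // stands in for decrementConcurrency(userId)

function transitionTask(from: string, to: string): void {
  if (status !== from) {
    throw new ConditionalCheckFailedException('conditional request failed');
  }
  status = to;
}

function failTaskFixed(): void {
  let transitioned = false;
  try {
    transitionTask('RUNNING', 'FAILED');
    transitioned = true;
  } catch {
    // benign: the task is already FAILED; only a warning would be logged here
  }
  if (transitioned) {
    emitted++;
    decremented++;
  }
}

// Six step retries still happen — the retry policy is untouched —
// but the side effects now fire exactly once.
for (let attempt = 0; attempt < 6; attempt++) {
  failTaskFixed();
}
console.log(`emitted=${emitted}, decremented=${decremented}`); // emitted=1, decremented=1
```

The retries themselves remain; only the first winner of the conditional transition emits the event and releases concurrency.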
Acceptance criteria
- failTask runs its emit + decrement side effects only when transitionTask succeeded.
- Unit test: call failTask twice for the same task; assert exactly one emitTaskEvent + one decrementConcurrency invocation (the second call's ConditionalCheckFailed is benignly swallowed).
- Integration assertion: a guardrail-blocked path produces exactly one task_failed row in TaskEventsTable under Durable Execution step retry, not N.
- Audit the sibling call site at orchestrator.ts:155 (session-start) for the same pattern and fix if applicable.
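The unit-test criterion could take roughly this shape — the dependency-injection style, stub names, and makeFailTask factory are assumptions for illustration, not the project's actual test harness:

```typescript
// Hypothetical unit-test sketch: calling the gated failTask twice must
// invoke emitTaskEvent and decrementConcurrency exactly once.
class ConditionalCheckFailedException extends Error {}

type Deps = {
  transitionTask: (from: string, to: string) => void;
  emitTaskEvent: () => void;
  decrementConcurrency: () => void;
};

// Factory so the test can inject counting stubs in place of real dependencies.
function makeFailTask(deps: Deps): () => void {
  return () => {
    let transitioned = false;
    try {
      deps.transitionTask('RUNNING', 'FAILED');
      transitioned = true;
    } catch {
      // second call lands here with ConditionalCheckFailedException: benign
    }
    if (transitioned) {
      deps.emitTaskEvent();
      deps.decrementConcurrency();
    }
  };
}

// Counting stubs for the task table, event table, and concurrency counter.
let status = 'RUNNING';
let emits = 0;
let decrements = 0;
const failTask = makeFailTask({
  transitionTask: (from, to) => {
    if (status !== from) throw new ConditionalCheckFailedException();
    status = to;
  },
  emitTaskEvent: () => { emits++; },
  decrementConcurrency: () => { decrements++; },
});

failTask(); // transitions RUNNING → FAILED and fires both side effects
failTask(); // conditional check fails, is swallowed, no side effects
console.log(`emits=${emits}, decrements=${decrements}`); // emits=1, decrements=1
```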
Out of scope
- Changing the Durable Execution retry policy itself (the retries are legitimate — the fix is idempotent side effects, not "don't retry").
- Concurrency reconciler tuning (follow-up if concurrency drift is observed in production after this fix).
References
- cdk/src/handlers/shared/orchestrator.ts:618-637 — failTask (bug site)
- cdk/src/handlers/shared/orchestrator.ts:478-499 — sibling pattern to mirror
- cdk/src/handlers/orchestrate-task.ts:117-119 — caller that triggers the retry loop via throw err