Skip to content

failTask double-emits task_failed (and over-decrements concurrency) on Durable Execution step retry #55

@scoropeza

Description

@scoropeza

Follow-up from PR #52 — surfaced during hands-on deploy-validation. Tracked here so the fix can land on its own merits, separately from the interactive-agents rev-6 narrative.

Functional description

When a task fails (e.g. the user's prompt is blocked by the safety filter), the system is supposed to emit exactly one task_failed event. In practice, a single logical failure can emit 5-6 task_failed events over a ~60 s window.

User-visible impact:

  1. GitHub comment spam — once the fanout dispatcher is wired to GitHub (landed in PR feat(interactive-agents): async-only background task UX + Cedar HITL design #52), each task_failed event produces a PATCH against the PR comment. One guardrail-blocked task produced 5 redundant GitHub writes in testing.
  2. Concurrency counter drift — the user's concurrency counter is decremented once per emission, so one failed task may be counted as N completions. The user's "tasks currently running" count drifts away from reality (bounded but real).
  3. Event-stream noise — watch / CLI / dashboard consumers see a stream of duplicate task_failed events for one logical failure.

Observed in testing: task 01KQT96W0B8J48HDWGF4YA9V7D (PR #52 Scenario 7-extended take 3): 6 task_failed rows in TaskEventsTable for one guardrail-blocked submission, timestamps spread across 60 s with the classic Durable Execution retry backoff (1 s / 7 s / 17 s / 8 s / 23 s).

Technical root cause

failTask in cdk/src/handlers/shared/orchestrator.ts (lines 618-637) emits task_failed and calls decrementConcurrency unconditionally, even when transitionTask raises ConditionalCheckFailedException because the task is already in FAILED state:

// cdk/src/handlers/shared/orchestrator.ts:618-637 (simplified)
try {
  await transitionTask(taskId, fromStatus, TaskStatus.FAILED, {...});
} catch (err) {
  logger.warn('Failed to transition task to FAILED', {...});
  // ← swallowed; no re-throw, no guard flag
}
await emitTaskEvent(taskId, 'task_failed', { error_message });   // ← always fires
if (releaseConcurrency) await decrementConcurrency(userId);      // ← always fires

The caller at cdk/src/handlers/orchestrate-task.ts:117-119 catches, calls failTask, then throw err to signal step failure to the @aws/durable-execution-sdk-js runtime. The SDK retries context.step('hydrate-context', ...) with exponential backoff. Each retry re-enters failTask → re-emits and re-decrements, because the ConditionalCheckFailed on the transition is swallowed without gating downstream side effects.

CloudWatch evidence (abridged):

20:00:02  Content blocked by guardrail … PROMPT_ATTACK LOW
20:00:04  WARN Failed to transition task to FAILED … conditional request failed
20:00:11  WARN Failed to transition task to FAILED …
20:00:28  WARN Failed to transition task to FAILED …
20:00:36  WARN Failed to transition task to FAILED …
20:00:59  WARN Failed to transition task to FAILED …

Proposed fix

Gate the side effects in failTask behind a transitioned flag, mirroring the pattern already used at orchestrator.ts:478-499 (the heartbeat-stale path):

let transitioned = false;
try {
  await transitionTask(taskId, fromStatus, TaskStatus.FAILED, {...});
  transitioned = true;
} catch (err) {
  logger.warn('Failed to transition task to FAILED', {...});
}
if (transitioned) {
  await emitTaskEvent(taskId, 'task_failed', { error_message });
  if (releaseConcurrency) await decrementConcurrency(userId);
}

This is a missed consistency fix — the sibling code path already uses exactly this shape. PR #52's fanout-to-GitHub wiring just made the downstream effects visible on a real PR.

Acceptance criteria

  • failTask runs its emit + decrement side effects only when transitionTask succeeded.
  • Unit test: call failTask twice for the same task, assert exactly one emitTaskEvent + one decrementConcurrency invocation (the second call's ConditionalCheckFailed is benignly swallowed).
  • Integration assertion: a guardrail-blocked path produces exactly one task_failed row in TaskEventsTable under Durable Execution step retry, not N.
  • Audit the sibling call site at orchestrator.ts:155 (session-start) for the same pattern and fix if applicable.

Out of scope

  • Changing the Durable Execution retry policy itself (the retries are legitimate — the fix is idempotent side effects, not "don't retry").
  • Concurrency reconciler tuning (follow-up if concurrency drift is observed in production after this fix).

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions