failTask double-emits task_failed (and over-decrements concurrency) on Durable Execution step retry

> **Follow-up from [PR #52](https://github.com/aws-samples/sample-autonomous-cloud-coding-agents/pull/52)** — surfaced during hands-on deploy-validation. Tracked here so the fix can land on its own merits, separately from the interactive-agents rev-6 narrative.

## Functional description

When a task fails (e.g. the user's prompt is blocked by the safety filter), the system is supposed to emit exactly one `task_failed` event. In practice, a single logical failure can emit **5-6 `task_failed` events** over a ~60 s window.

**User-visible impact:**

1. **GitHub comment spam** — once the fanout dispatcher is wired to GitHub (landed in PR #52), each `task_failed` event produces a PATCH against the PR comment. One guardrail-blocked task produced 5 redundant GitHub writes in testing.
2. **Concurrency counter drift** — the user's concurrency counter is decremented once per emission, so one failed task may be counted as N completions. The user's "tasks currently running" count drifts away from reality (bounded but real).
3. **Event-stream noise** — watch / CLI / dashboard consumers see a stream of duplicate `task_failed` events for one logical failure.

**Observed in testing:** task `01KQT96W0B8J48HDWGF4YA9V7D` (PR #52 Scenario 7-extended take 3): 6 `task_failed` rows in `TaskEventsTable` for one guardrail-blocked submission, timestamps spread across 60 s with the classic Durable Execution retry backoff (1 s / 7 s / 17 s / 8 s / 23 s).

## Technical root cause

`failTask` in `cdk/src/handlers/shared/orchestrator.ts` (lines 618-637) emits `task_failed` and calls `decrementConcurrency` **unconditionally**, even when `transitionTask` raises `ConditionalCheckFailedException` because the task is already in `FAILED` state:

```ts
// cdk/src/handlers/shared/orchestrator.ts:618-637 (simplified)
try {
  await transitionTask(taskId, fromStatus, TaskStatus.FAILED, {...});
} catch (err) {
  logger.warn('Failed to transition task to FAILED', {...});
  // ← swallowed; no re-throw, no guard flag
}
await emitTaskEvent(taskId, 'task_failed', { error_message });   // ← always fires
if (releaseConcurrency) await decrementConcurrency(userId);      // ← always fires
```

The caller at `cdk/src/handlers/orchestrate-task.ts:117-119` catches, calls `failTask`, then `throw err` to signal step failure to the `@aws/durable-execution-sdk-js` runtime. The SDK retries `context.step('hydrate-context', ...)` with exponential backoff. Each retry re-enters `failTask` → re-emits and re-decrements, because the `ConditionalCheckFailed` on the transition is swallowed without gating downstream side effects.

**CloudWatch evidence (abridged):**

```
20:00:02  Content blocked by guardrail … PROMPT_ATTACK LOW
20:00:04  WARN Failed to transition task to FAILED … conditional request failed
20:00:11  WARN Failed to transition task to FAILED …
20:00:28  WARN Failed to transition task to FAILED …
20:00:36  WARN Failed to transition task to FAILED …
20:00:59  WARN Failed to transition task to FAILED …
```

## Proposed fix

Gate the side effects in `failTask` behind a `transitioned` flag, mirroring the pattern already used at `orchestrator.ts:478-499` (the heartbeat-stale path):

```ts
let transitioned = false;
try {
  await transitionTask(taskId, fromStatus, TaskStatus.FAILED, {...});
  transitioned = true;
} catch (err) {
  logger.warn('Failed to transition task to FAILED', {...});
}
if (transitioned) {
  await emitTaskEvent(taskId, 'task_failed', { error_message });
  if (releaseConcurrency) await decrementConcurrency(userId);
}
```

This is a **missed consistency fix** — the sibling code path already uses exactly this shape. PR #52's fanout-to-GitHub wiring just made the downstream effects visible on a real PR.

## Acceptance criteria

- `failTask` runs its emit + decrement side effects only when `transitionTask` succeeded.
- Unit test: call `failTask` twice for the same task, assert exactly one `emitTaskEvent` + one `decrementConcurrency` invocation (the second call's `ConditionalCheckFailed` is benignly swallowed).
- Integration assertion: a guardrail-blocked path produces exactly one `task_failed` row in `TaskEventsTable` under Durable Execution step retry, not N.
- Audit the sibling call site at `orchestrator.ts:155` (`session-start`) for the same pattern and fix if applicable.

## Out of scope

- Changing the Durable Execution retry policy itself (the retries are legitimate — the fix is idempotent side effects, not "don't retry").
- Concurrency reconciler tuning (follow-up if concurrency drift is observed in production after this fix).

## References

- `cdk/src/handlers/shared/orchestrator.ts:618-637` — `failTask` (bug site)
- `cdk/src/handlers/shared/orchestrator.ts:478-499` — sibling pattern to mirror
- `cdk/src/handlers/orchestrate-task.ts:117-119` — caller that triggers the retry loop via `throw err`
- PR #52 Scenario 7-extended take 3 deploy-validation transcript


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

failTask double-emits task_failed (and over-decrements concurrency) on Durable Execution step retry #55

Functional description

Technical root cause

Proposed fix

Acceptance criteria

Out of scope

References

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

failTask double-emits task_failed (and over-decrements concurrency) on Durable Execution step retry #55

Description

Functional description

Technical root cause

Proposed fix

Acceptance criteria

Out of scope

References

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions