
feat(interactive-agents): async-only background task UX + Cedar HITL design#52

Merged
krokoko merged 88 commits into aws-samples:main from scoropeza:feature/interactive-background-agents
May 5, 2026

Conversation


scoropeza (Contributor) commented Apr 29, 2026

Summary

Introduces interactive background agents — users can watch, steer, trace, and review agent runs through a unified async-only architecture. Ships five user-facing surfaces (watch, nudge, status, trace, webhook) plus the first real notification-plane dispatcher (GitHub edit-in-place). Includes the locked design for Phase 3 Cedar-driven HITL approval gates — no implementation; design only.

All shipped features are deployed to a dev stack and E2E-validated against scoropeza/agent-plugins PRs #32/#33/#43. Three runtime bugs were caught and fixed during deploy-validation — see Deploy validation below.

Branch state: ~82 commits. Will squash on merge.

What ships (deployed + tested)

Polling / observability

  • bgagent watch <task-id> — adaptive polling (500 ms active; 1 / 2 / 5 s idle backoff; resets on event arrival), cold-start retry wrapper, clean exit at terminal state. Streams turn-by-turn events from TaskEventsTable via GET /v1/tasks/{id}/events?after=<cursor>. A sketch of the backoff loop follows this list.
  • bgagent status <task-id> — deterministic snapshot (AD-6): Type, Channel, Description, Turn, Last milestone, Current, Cost, Reason, Trace S3, Last event. --wait is a pure blocking flag; same layout in both modes.
  • bgagent events <task-id> — full event log (paged).
  • --verbose — HTTP debug output on stderr.
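
The adaptive polling in bgagent watch can be pictured roughly as below — a minimal TypeScript sketch using the intervals described above; the function names and event shape are illustrative, not the actual CLI code.

```ts
// Minimal sketch of the adaptive polling loop: 500 ms while events arrive,
// 1 / 2 / 5 s backoff when idle, reset to active on the next event.
// fetchEvents / isTerminal stand in for the real API-client calls.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function watchLoop(
  fetchEvents: (after?: string) => Promise<Array<{ event_id: string }>>,
  isTerminal: () => boolean,
): Promise<void> {
  const idleBackoffMs = [1000, 2000, 5000];
  let idleStep = 0;
  let cursor: string | undefined;

  while (!isTerminal()) {
    const events = await fetchEvents(cursor);
    if (events.length > 0) {
      cursor = events[events.length - 1].event_id; // advance the ?after cursor
      idleStep = 0;                                // event arrival resets the backoff
      await sleep(500);
    } else {
      await sleep(idleBackoffMs[idleStep]);
      idleStep = Math.min(idleStep + 1, idleBackoffMs.length - 1);
    }
  }
}
```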

Interactive steering

  • bgagent nudge <task-id> "<message>" — mid-run steering via POST /v1/tasks/{id}/nudges → TaskNudgesTable → agent's between-turns hook. Combined-turn acknowledgement per AD-5 — the agent emits a nudge_acknowledged milestone before incorporating the nudge, giving users a visible handshake without the complexity of mid-turn interrupts. A handler sketch follows this list.
  • bgagent cancel <task-id> — terminates the running agent, persists task_cancelled, prevents PR push.
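
A rough sketch of the nudge write path, assuming the POST handler simply persists the message for the agent's between-turns hook to pick up later; the table name, key schema, and consumed flag here are assumptions, not the repo's actual schema.

```ts
// Hypothetical nudge-write path: the POST handler persists the message to
// TaskNudgesTable; the agent's between-turns hook later reads pending nudges
// and emits nudge_acknowledged before folding them into the next turn.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
import { ulid } from "ulid";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export async function putNudge(taskId: string, message: string): Promise<string> {
  const nudgeId = ulid(); // ULID keeps nudges sortable by arrival order
  await doc.send(new PutCommand({
    TableName: process.env.TASK_NUDGES_TABLE,
    Item: { task_id: taskId, nudge_id: nudgeId, message, consumed: false },
  }));
  return nudgeId;
}
```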

Trace / debug

  • --trace on submit — raises the progress-writer preview cap from 200 chars to 4 KB and uploads a full gzipped NDJSON trajectory to S3 when the task reaches a terminal state.
  • bgagent trace download <task-id> — streams gzipped NDJSON to stdout (pipe-friendly: | gunzip | jq -s .) or writes raw gzip to -o <file> with a --force overwrite guard.
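
Conceptually, the download path streams the raw gzip bytes from S3 straight to stdout and leaves decompression to the shell. A minimal AWS SDK v3 sketch — bucket/key resolution from the task record is omitted and assumed to happen elsewhere:

```ts
// Sketch: stream the gzipped NDJSON trace object to stdout so callers can
// pipe it (| gunzip | jq -s .) without buffering the whole object in memory.
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { pipeline } from "node:stream/promises";
import type { Readable } from "node:stream";

export async function streamTraceToStdout(bucket: string, key: string): Promise<void> {
  const s3 = new S3Client({});
  const { Body } = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  if (!Body) throw new Error(`No trace object found at s3://${bucket}/${key}`);
  await pipeline(Body as Readable, process.stdout);
}
```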

Notification plane (fanout)

  • FanOutConsumer Lambda on TaskEventsTable DDB Streams (ParallelizationFactor: 1 for per-task ordering) routes events to per-channel dispatchers with configurable default filters (AD-4). A CDK wiring sketch follows this list.
  • GitHub edit-in-place dispatcher — single status comment per task on the target PR, edited in place as progress events fire. Slack / Email dispatchers are log-only stubs pending integration.
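
In CDK terms, the fan-out wiring could look roughly like the sketch below; batch size and retry settings are illustrative, but parallelizationFactor: 1 is the property the per-task ordering guarantee rests on.

```ts
// Sketch of wiring the fan-out consumer to the TaskEventsTable DDB stream.
import { Table } from "aws-cdk-lib/aws-dynamodb";
import { IFunction, StartingPosition } from "aws-cdk-lib/aws-lambda";
import { DynamoEventSource } from "aws-cdk-lib/aws-lambda-event-sources";

export function wireFanOut(consumer: IFunction, taskEventsTable: Table): void {
  consumer.addEventSource(new DynamoEventSource(taskEventsTable, {
    startingPosition: StartingPosition.LATEST,
    parallelizationFactor: 1, // one in-flight batch per shard => events stay ordered per task
    batchSize: 10,            // illustrative
    retryAttempts: 2,         // illustrative
  }));
}
```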

Webhook submission

  • bgagent webhook create / list / revoke — registers external integrations. HMAC-SHA256 request verification (X-Webhook-Signature) + REQUEST authorizer for lookup. channel_source: "webhook" propagates through the task record for observability.
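
A minimal sketch of what the HMAC-SHA256 check on X-Webhook-Signature can look like; the hex encoding of the signature and the shared-secret lookup are assumptions about the contract, not the exact handler code.

```ts
// Sketch of HMAC-SHA256 request verification for X-Webhook-Signature.
import { createHmac, timingSafeEqual } from "node:crypto";

export function verifySignature(rawBody: string, signatureHeader: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHeader, "hex");
  // Constant-time comparison; check lengths first because timingSafeEqual throws on mismatch.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```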

Design-only (not implemented)

Phase 3 — Cedar-driven HITL approval gates

  • docs/design/PHASE3_CEDAR_HITL.md (1,883 lines) — 22 locked decisions across 17 sections: three-outcome model (allow / hard-deny / soft-deny), policy authoring guide, evaluate_tool_use skeleton, ApprovalAllowlist + RecentDecisionCache, REST API contract, CLI UX, state-machine concurrency, data model, observability, security, failure modes, scenarios, carry-forward list.
  • docs/diagrams/phase3-cedar-hitl.drawio — 12-page companion.
  • Plumbing only: cedar_policies is threaded through blueprintConfig → repoConfig → orchestrator → agent payload shape (agent/src/models.py), and agent/src/hooks.py has a documented PreToolUse stub calling out Cedar evaluation as the next step. No Cedar engine, no policy evaluation, no approval flow. Implementation is a separate PR (~3–4 weeks for 3a core).
  • @cedar-policy/cedar-wasm@4.10.0 locked as the engine choice after WASM vs native spike.

Architectural highlights

  • Async-only polling (rev-6). No SSE, no two-runtime split. Single AgentCore runtime, GET /events?after=<cursor> replaces the prior streaming surface.
  • DDB-Stream ordering for fan-out — ParallelizationFactor: 1 guarantees per-task ordering; no SQS FIFO, no ETag. (AD-9, with an explicit "rejected alternative" for GitHub's If-Match — see Deploy validation bug 3 below.)
  • Combined-turn nudge ack (AD-5) — agent emits nudge_acknowledged BEFORE folding the nudge into the next turn; single milestone, no consume-side event.
  • Deterministic status snapshot (AD-6) — formatStatusSnapshot derives every rendered field from TaskDetail + a small window of recent events; no LLM in the loop, no fabricated state.
  • agent_milestone unwrap — agent writes named checkpoints (pr_created, nudge_acknowledged, …) as agent_milestone events with metadata.milestone. Fanout unwraps against a narrow ROUTABLE_MILESTONES allowlist before matching channel filters.
  • Numeric coercion at the DDB boundary — shared coerceNumericOrNull helper handles the Document-client number-as-string quirk uniformly.
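
A sketch of what such a coercion helper typically looks like; the real cdk/src/handlers/shared/numeric.ts may differ in detail.

```ts
// Sketch of a shared coercion helper: Document-client values that should be
// numbers sometimes arrive as strings; normalize to number or null, never NaN.
export function coerceNumericOrNull(value: unknown): number | null {
  if (typeof value === "number") return Number.isFinite(value) ? value : null;
  if (typeof value === "string" && value.trim() !== "") {
    const n = Number(value);
    return Number.isFinite(n) ? n : null;
  }
  return null;
}

// Example: render cost only when a finite number could be recovered.
const cost = coerceNumericOrNull("0.0421");
const costLabel = cost === null ? "n/a" : `$${cost.toFixed(4)}`;
```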

Design doc

  • docs/design/INTERACTIVE_AGENTS.md — canonical rev-6 design. Sections: polling (§5), notification plane (§6), fanout routing (§6.2), GitHub edit-in-place (§6.4), observability / --trace (§10.1), concurrency rationale (§8.9), AD-1 through AD-9.
  • docs/diagrams/interactive-agents-phases.drawio — 5-page architecture (1. rev-6 topology, 2. watch polling, 3. nudge flow, 4. status + trace, 5. fanout + GitHub edit-in-place).
  • docs/design/PHASE3_CEDAR_HITL.md + 12-page drawio (design-only; see above).

Deploy validation

Every shipped surface was E2E-validated against a dev stack (backgroundagent-dev, us-east-1). The validation pass caught three runtime bugs that passed all unit tests but failed on live AWS — all are fixed in this PR:

  1. costUsd.toFixed is not a function — DDB Document-client returns Number attributes as strings; fanout dispatcher was not coercing. Fix: shared coerceNumericOrNull helper at the fanout + orchestrator boundaries.
  2. agent_milestone wrapper events never routed — fanout filter matched event_type literally; missed pr_created wrapped in agent_milestone. Fix: effectiveEventType unwrap against the ROUTABLE_MILESTONES allowlist (sketched after this list).
  3. GitHub rejects If-Match on PATCH /issues/comments/{id} with HTTP 400 ("Conditional request headers are not allowed in unsafe requests unless supported by the endpoint"). The earlier ETag optimistic-concurrency design never worked in production. Fix: drop If-Match entirely; rely on DDB-Stream ordering + single-writer invariant (AD-9 rewritten; §6.4 rewritten).
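
A sketch of the unwrap from bug 2, assuming the allowlist holds milestone names like pr_created; the exact allowlist contents and event shape live in the fanout handler.

```ts
// Sketch of the milestone unwrap: agent_milestone wrapper events are replaced
// by the named milestone before channel filters run, but only for an allowlist.
const ROUTABLE_MILESTONES = new Set(["pr_created", "nudge_acknowledged"]);

interface TaskEvent {
  event_type: string;
  metadata?: { milestone?: string };
}

export function effectiveEventType(event: TaskEvent): string {
  if (event.event_type === "agent_milestone") {
    const milestone = event.metadata?.milestone;
    if (milestone && ROUTABLE_MILESTONES.has(milestone)) {
      return milestone; // channel filters now match on the unwrapped milestone name
    }
  }
  return event.event_type;
}
```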

Also surfaced + fixed during validation: Bedrock Guardrail inputStrength over-blocked (HIGH → MEDIUM); fanout dispatcher silently ignored non-finite numeric input (now warns); bgagent watch didn't exit cleanly at terminal state (deferred-exit fallback); bgagent status snapshot missed Type / Channel / Description / Reason (all surfaced now); channel_source silently dropped by the API response (now included in TaskDetail); added specific classifiers for agent_status=error_max_turns / error_max_budget_usd / error_during_execution (these were falling through to a generic "Agent task did not succeed").

Test plan

Known limitations / follow-ups

Filed separately (see .l4-issue-drafts/ on this branch; 6 issue drafts ready to file post-merge):

  • failTask double-emits task_failed + over-decrements concurrency on Durable Execution step retry.
  • Bedrock Guardrail PROMPT_ATTACK over-triggers on imperative-mood prompts even at MEDIUM (example log in the draft).
  • Idempotent submit returns DUPLICATE_TASK error instead of the original task_id — non-standard semantics.
  • webhook_url missing from CreateWebhookResponse — users have to construct it by hand.
  • Pre-existing drafts: session-tag IAM tightening; container CVE rebase cadence.

Deferred feature work (tracked elsewhere):

  • Phase 3 Cedar HITL implementation — ~3–4 weeks for 3a core; separate PR after merge.
  • Slack / Email dispatchers (currently log-only stubs).
  • GitHub webhook body-shape adapter (today's webhook path expects {repo, task_description, ...}; GitHub sends IssueCommentEvent / PullRequestEvent).

Reviewers — suggested reading order

  1. docs/design/INTERACTIVE_AGENTS.md — canonical rev-6 design (ADs, architecture, §5 / §6 / §10).
  2. docs/diagrams/interactive-agents-phases.drawio — 5-page architecture.
  3. CDK handlers: cdk/src/handlers/fanout-task-events.ts, cdk/src/handlers/shared/github-comment.ts, cdk/src/handlers/shared/orchestrator.ts, cdk/src/handlers/shared/error-classifier.ts, cdk/src/handlers/shared/numeric.ts.
  4. CDK constructs: cdk/src/constructs/task-api.ts, cdk/src/constructs/task-events-table.ts, cdk/src/constructs/task-nudges-table.ts, cdk/src/stacks/agent.ts.
  5. Agent runtime: agent/src/pipeline.py, agent/src/runner.py, agent/src/hooks.py, agent/src/progress_writer.py, agent/src/telemetry.py.
  6. CLI: cli/src/commands/{watch,nudge,status,submit,trace,webhook}.ts, cli/src/format.ts, cli/src/wait.ts, cli/src/bin/bgagent.ts.
  7. Phase 3 design (optional; will be a separate PR): docs/design/PHASE3_CEDAR_HITL.md, docs/diagrams/phase3-cedar-hitl.drawio.

GitHub's Squash and merge is the intended merge strategy.

scoropeza and others added 30 commits April 28, 2026 13:53
- INTERACTIVE_AGENTS.md: comprehensive design for bidirectional
  agent-user communication (progress streaming, nudges, HITL
  approval gates, pause/resume)
- Architecture diagrams (4 pages): current state, SSE streaming,
  nudge flow, state machine extensions
- Research prompt that drove the design work
Detailed prompt for implementing DDB progress events + CLI watch
command. Covers: ProgressWriter, entrypoint integration, CLI watch
command, tests, and constraints. Ready for a fresh agent session.
…e 1a prompt

- Use subagents to parallelize agent-side and CLI-side work
- Keep main context clean by delegating reads/research to subagents
- CRITICAL: never deviate from approved design without explicit approval
- On blockers: research first, then surface for discussion
Introduces _ProgressWriter that emits structured AG-UI-style progress
events to the existing TaskEventsTable during agent execution. Events
cover turns, tool calls, tool results, milestones, cost updates, and
errors.

- progress_writer.py: lazy boto3 init, fail-open writes, circuit breaker
  after 3 consecutive DDB failures, ULID-sortable event IDs, 90-day TTL
- entrypoint.py: integrated into run_agent() message loop and run_task()
  pipeline; handles UserMessage branch to capture ToolResultBlocks
- Dockerfile: add progress_writer.py to the agent image COPY layer
- 25 unit tests cover all event types, truncation, fail-open, and
  circuit breaker behavior
Adds a new bgagent watch <task_id> command that polls GET /tasks/{id}/events
every 2 seconds, renders progress events in a human-readable format, and
exits cleanly when the task reaches a terminal state.

- watch.ts: polling loop, event rendering for all 6 progress event types,
  JSON output mode (--output json), Ctrl+C handling, last-seen event
  tracking to avoid duplicates
- bgagent.ts: register the new command
- 14 unit tests cover event rendering, polling, terminal state detection,
  deduplication, and JSON mode

Exit codes: 0 on COMPLETED, 1 on FAILED/CANCELLED/TIMED_OUT.
Enables end-to-end local testing of progress events without deploying
to AWS. The agent container connects to a local DynamoDB instance on
a shared Docker network; boto3 endpoint redirection via
AWS_ENDPOINT_URL_DYNAMODB means zero code changes — the same
_ProgressWriter code path runs locally and in production.

- docker-compose.yml: dynamodb-local service on agent-local network
- scripts/create-local-tables.sh: idempotent creation of TaskEventsTable
  and TaskTable schemas matching the CDK constructs
- mise.toml: local:up (start + create tables), local:down, local:events,
  local:events:json
- run.sh: new --local-events flag that joins the agent container to the
  agent-local network and sets the right env vars

Workflow:
  cd agent && mise run local:up
  ./run.sh --local-events "owner/repo" 42
  mise run local:events    # in another terminal
  mise run local:down
- docs/design/INTERACTIVE_AGENTS.md rev 3: adds Section 9.9 local testing
  subsection with the DDB Local workflow; updates "last updated" marker
- docs/guides/DEVELOPER_GUIDE.md: new "Testing with progress events
  (DynamoDB Local)" section under Local testing
- docs/guides/USER_GUIDE.md: new "Watching a task in real time"
  subsection documenting bgagent watch (behavior only — transport
  details intentionally omitted, will change in Phase 1b)
- AGENTS.md: routing table row for progress_writer.py so future coding
  sessions know where progress events are emitted from
- docs/src/content/docs/**: Starlight sync output from the above
Captures two issues discovered during Phase 1a E2E validation that are
scoped outside the interactive-agents feature. Each file contains
problem statement, reproduction, proposed design, test cases, and
acceptance criteria so a future session can pick it up independently.

- CLI_COGNITO_NEW_PASSWORD_CHALLENGE.md: bgagent login does not handle
  Cognito's NEW_PASSWORD_REQUIRED challenge, blocking first-time users
  without admin intervention
- EARLY_PROGRESS_MILESTONES.md: UX enhancement to emit finer-grained
  milestones during setup_repo (clone / install / baseline build) so
  watch output appears within seconds instead of minutes
Team discussion (Sam ↔ Alain, 2026-04-17) agreed to replace the
hardcoded 3-tier HITL model with Cedar policy-driven approvals,
reusing the in-process Cedar engine already in the repo (on branch
fix/validate-aws-before-docker-build as agent/src/policy.py).

The existing ALLOW/DENY decisions extend to include REQUIRE_APPROVAL —
same policy language, unified governance, supports workflows like
AI-DLC where users gate per phase and relax over time.

Phase 1a (done) and Phase 1b (next) are unaffected. Phase 3 (HITL)
blocked on design revision and the Cedar branch landing on main.
Re-integrates Phase 1a ProgressWriter event emission after rebasing onto
upstream/main, which decomposed agent/entrypoint.py into agent/src/
modules (pipeline.py, runner.py, etc.).

- pipeline.py: instantiate _ProgressWriter; emit repo_setup_complete,
  agent_execution_complete, pr_created milestones
- runner.py: instantiate _ProgressWriter; track tool_use_id -> tool_name
  for UserMessage ToolResultBlock correlation; emit agent_turn,
  agent_tool_call, agent_tool_result, agent_cost_update, agent_error

No behavior change vs. Phase 1a - same event shapes, same call sites
moved to their new module homes. All 305 agent tests pass.
Updates the interactive agents design doc and adds two draw.io files
documenting the Phase 1b architecture.

Design doc (rev 4, 2026-04-17):
- Revision history section added
- New: section 8.9 (control / streaming / fan-out planes + channel-fit
  matrix), 9.1.1 (two-runtime split, D1 resolved), 9.10 (thread + queue
  bridge, D2 resolved), 9.11 (CLI hybrid client, D3 resolved), 9.12
  (AgentCore streaming limits: 60-min cap, LifecycleConfiguration,
  no SSE-native resume, SDK timeout overrides, API GW REST
  incompatibility)
- Updated: section 4 Phase 1b parameters, 5 SSEAdapter sibling pattern,
  8.2 reconnection details, 9 heading bumped to rev 4, 10 Phase 1b
  implementation plan rewritten, Appendix C file change map completed
- Fixed: exec-summary inaccuracy about SSE 60-min cap (line 23)

Diagrams:
- docs/interactive-agents-phases.drawio — v1, 8 pages, documents the
  decision-space (options to choose from) for D1/D2/D3
- docs/interactive-agents-phases-v2.drawio — v2, 8 pages, documents the
  final resolved Phase 1b architecture with rejected alternatives
  captured on the rationale page for future reference

No code changes. No impact on deployed stack. Phase 1a continues to
pass all 305 agent / 721 CDK / 84 CLI tests.
Infrastructure foundation for Phase 1b SSE streaming. Introduces a
second AgentCore Runtime with Cognito JWT authorizer for direct
CLI-to-AgentCore SSE consumption, while preserving the existing
IAM-authed runtime for the orchestrator path. No agent or CLI code
changes in this step — those land in Steps 2-6.

Changes:
- cdk/src/stacks/agent.ts: rename Runtime -> RuntimeIam, add RuntimeJwt
  with RuntimeAuthorizerConfiguration.usingCognito(userPool,
  [appClient]). Same artifact, env vars, VPC, models, memory,
  filesystem mount, and log/trace wiring shared across both runtimes.
  Explicit LifecycleConfiguration (8h idle + 8h max) on both per
  §9.12 of design doc. Lazy.string breaks the TaskApi <-> Runtime-JWT
  <-> Orchestrator circular dependency. OrchestratorFn IAM policy
  scoped to RuntimeIam only. New CFN outputs RuntimeIamArn and
  RuntimeJwtArn; RuntimeArn kept as deprecated alias.
- cdk/src/constructs/task-api.ts: expose appClient as public attr for
  Runtime-JWT to reference.
- cdk/src/constructs/task-events-table.ts: enable DynamoDB Streams
  with NEW_IMAGE (in-place update, Phase 1a event data preserved).
  Prerequisite for the fan-out plane.
- cdk/test/**: 8 new tests covering two-runtime topology, JWT authorizer
  config, LifecycleConfiguration shape, scoped IAM, DDB Streams, new
  CFN outputs. Full suite 729 passed / 0 failed (--runInBand).

Refs docs/design/INTERACTIVE_AGENTS.md §9.1.1, §9.10, §9.12.
First deploy of Phase 1b Step 1 rolled back with:
  AWS::BedrockAgentCore::Runtime 'jean_cloude' already exists

Root cause: I had renamed the existing Runtime construct from 'Runtime'
to 'RuntimeIam'. CloudFormation interprets construct id changes as
replacement — CREATE new resource, then DELETE old. Because the old
Runtime has runtimeName 'jean_cloude' (immutable) and the new RuntimeIam
was declared with the same name, CFN tried to create a duplicate and
failed on AgentCore's account-level name uniqueness constraint.

Fix: revert the construct id to 'Runtime' so CFN updates the existing
resource in place (only the new LifecycleConfiguration is added). The
TS variable name `runtimeIam` is kept for readability + Phase 1b role
documentation. Test updated to match logical id shape via regex.

Verified via cdk diff: existing Runtime gets only [~] LifecycleConfig
(in-place), RuntimeJwt + log infra are net-new, TaskEventsTable gets
StreamSpecification in-place. No destroys of load-bearing resources.
…f ProgressWriter)

Introduces _SSEAdapter, the producer side of a per-task asyncio.Queue
that will feed the SSE handler in server.py (to be wired in Step 3).
Semantic API mirrors _ProgressWriter 1:1 so integration is symmetric:
pipeline.py / runner.py will call both adapters at the same sites.

Design contract (see docs/design/INTERACTIVE_AGENTS.md §9.10):
- Thread-safe: write_agent_* methods run in the pipeline background
  thread; bridge to the asyncio loop via loop.call_soon_threadsafe
  (chosen over run_coroutine_threadsafe — Queue.put_nowait is a plain
  sync call, no need for coroutine + Future wrapping).
- Backpressure: bounded asyncio.Queue, drop-oldest when full. Counter
  dropped_count only ever grows. Never blocks the pipeline thread.
- No-subscribers case: writes before attach_loop or after detach_loop
  are silent drops with counter increment. Pipeline must NEVER be
  affected by whether a client is connected.
- Fail-open: every enqueue path swallows exceptions and bumps the
  counter. Same philosophy as ProgressWriter. write_agent_error has a
  second-level try/except — it must NEVER raise.
- Close sentinel: object() sentinel distinguishable from any real
  event dict. get() returns None when drained for clean stream shutdown.

Wire-format translation (semantic dict → AG-UI TEXT_MESSAGE_* /
TOOL_CALL_* frames) is NOT done here — that's the SSE handler's job in
Step 3. This adapter traffics in semantic Python dicts.

Tests: 27 new in test_sse_adapter.py covering happy path, all 6
semantic methods, FIFO ordering, no-subscribers drop, post-detach
drop, re-attach, queue-full drop-oldest, close sentinel, post-close
behavior, thread-safe enqueue from non-loop thread, concurrent
producers, loop-closed scenarios, dropped_count monotonicity, payload
integrity, large-payload pass-through, bulletproof write_agent_error.

Test counts: 305 Phase 1a baseline + 27 new = 332 passed, 0 failed.
Lint (ruff check): clean. Format: clean. Type check (ty): clean.
Phase 1a ProgressWriter unchanged (deployed + working).

No integration in pipeline.py / runner.py / server.py yet — that lands
in Step 3.
…ions

Phase 1b Step 3 implementation resolves the pending tactical decision
in §9.10 and Appendix C: the SSE handler lives on the existing
/invocations endpoint via content-type negotiation rather than a new
/invocations/stream endpoint. Rationale: one endpoint contract, matches
AgentCore's documented AG-UI pattern, zero risk of the orchestrator's
sync path being misrouted.

- §9.10: replace "deferred to PR" sentence with the resolved choice —
  Accept: text/event-stream routes to SSE, anything else preserves the
  existing sync behavior byte-for-byte.
- Appendix C: update the server.py row to describe content-type
  negotiation explicitly. Add new row for agent/src/sse_wire.py —
  the pure-function translator from semantic events to AG-UI frames,
  kept separate from transport for testability.
- v2 drawio: surgical text updates to 7 labels that referenced
  /invocations/stream; no edge or layout changes. Validator passes.

v1 drawio (interactive-agents-phases.drawio, kept as decision history)
is untouched.
…erver.py

Makes Phase 1b SSE functional end-to-end inside the agent container.
Client-side CLI work lands in Steps 5/6.

Endpoint: content-type negotiation on existing /invocations (locked in
docs fa622b4). Accept: text/event-stream → new SSE handler. Any other
Accept (missing, application/json, */*) → existing sync acceptance
response, byte-for-byte preserved. Orchestrator's InvokeAgentRuntime
calls are unaffected because they do not send text/event-stream.

server.py:
- content-type negotiation via _wants_sse() — case-insensitive substring
  match so qualified types and lists work (e.g. 'text/event-stream;q=1').
- SSE path: create _SSEAdapter, attach_loop, spawn background thread
  running run_task(..., sse_adapter=adapter), return StreamingResponse.
- Async generator drains adapter.get() with 15s timeout; emits
  ': ping\n\n' keepalive on timeout, AG-UI 'data: <json>\n\n' frames on
  events. RUN_STARTED synthesised up front; RUN_FINISHED on clean close,
  RUN_ERROR if any agent_error observed during the stream.
- Client disconnect detection: GeneratorExit → detach_loop; background
  run_task keeps running so DDB durability is unaffected, events then
  drop silently until the adapter is closed.
- Idempotent close() in both server.py finally and pipeline.py finally
  (belt-and-braces — close is idempotent by design).

sse_wire.py (new): pure-function translator from semantic SSEAdapter
dicts to AG-UI wire-format events. Kept separate from transport for
testability. Mappings per design doc §9.10 and the AG-UI facts section
of the resolved-decisions memory:
- agent_turn → TEXT_MESSAGE_START / CONTENT / END (+ optional CUSTOM
  for non-empty thinking since AG-UI has no native thinking event)
- agent_tool_call → TOOL_CALL_START / ARGS / END
- agent_tool_result → TOOL_CALL_RESULT (role=tool, is_error propagated)
- agent_milestone → STEP_STARTED + STEP_FINISHED pair
- agent_cost_update → CUSTOM (no native cost event)
- agent_error → RUN_ERROR (terminal) or CUSTOM (non-terminal); Phase 1b
  treats all runner errors as terminal
Uses camelCase on the wire, SCREAMING_SNAKE_CASE event types, ULIDs
for message/tool_call ids, ms-since-epoch timestamps.

pipeline.py: add sse_adapter: _SSEAdapter | None = None kwarg; mirror
the 3 progress.write_agent_milestone calls on the adapter. finally
block calls adapter.close() so the consumer stream ends cleanly.

runner.py: add sse_adapter kwarg; mirror the 5 progress.write_X calls
(turn, tool_call, tool_result, cost_update, error). Zero change to
ProgressWriter behavior — DDB remains durable source of truth.

Tests (38 new, 371 total, 305 Phase 1a baseline preserved):
- test_sse_wire.py (29 tests): per-event-type roundtrip, camelCase,
  ULID shape, timestamp presence, CUSTOM fallback for cost, is_error
  on tool results, empty-field edge cases, terminality rule for errors.
- test_server.py (+6): content-type negotiation routing, SSE happy path,
  SSE terminates with RUN_ERROR on agent_error, 15s keepalive ping,
  client-disconnect detach_loop + background continues, sync path
  regression ×2 (no Accept / Accept: application/json).
- test_pipeline.py (+3): sse_adapter=None preserves current behavior,
  adapter mirrored at milestone sites + close() in finally, runner
  signature accepts sse_adapter kwarg.

Test hardening: test_background_thread_failure_503 now waits for the
mock to be called (race existed in baseline too — /ping flips 503
before task_state.write_terminal runs because of intervening print +
traceback.print_exc in the except block).

No CDK change (Step 1 shipped). No CLI change (Steps 5/6). Not
deployed yet — deployment gated on Step 4 (get-task-events ?after= for
catch-up). ProgressWriter unchanged.
…tch-up

Adds the backend + CLI client pieces for SSE reconnect catch-up. When
the CLI's SSE stream drops (network blip, AgentCore 60-min cap), it
queries REST /tasks/{id}/events?after=<last_seen_event_id> to replay
missed events from DynamoDB, then reopens the SSE stream. This is the
ONLY reconnection mechanism because AG-UI has no Last-Event-ID and
AgentCore has no SSE-native resume (design §9.12).

Handler (cdk/src/handlers/get-task-events.ts) — a query sketch follows this list:
- Accept ?after=<ulid> alongside existing ?next_token. Back-compat: if
  neither present, existing from-beginning behavior preserved.
- Query mode routing: after → KeyConditionExpression 'task_id AND
  event_id > :after'. ULIDs are Crockford Base32 lexicographically
  sortable, so string compare is correct.
- Collision policy: both ?after and ?next_token present → after wins
  + WARN log (indicates a client bug).
- ?after validation: Crockford Base32 regex, 26 chars. Invalid → 400
  VALIDATION_ERROR.
- Empty after= string falls through to from-beginning (matches how
  URLSearchParams omits empty values; avoids spurious 400s).
- limit-truncated responses still emit next_token when using after,
  so callers can paginate beyond the first page.
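
A sketch of the ?after query mode, assuming task_id (partition) and event_id (sort) keys and a TASK_EVENTS_TABLE env var, as implied by the key condition above; ULIDs sort lexicographically, so a plain string comparison on the sort key is correct.

```ts
// Sketch of the ?after query: validate the cursor as a ULID, then key-condition
// on event_id > :after so only newer events are returned.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ULID_PATTERN = /^[0-9ABCDEFGHJKMNPQRSTVWXYZ]{26}$/; // Crockford Base32, 26 chars
const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export async function queryEventsAfter(taskId: string, after: string, limit = 100) {
  if (!ULID_PATTERN.test(after)) {
    throw new Error("Invalid `after` parameter: must be a 26-character ULID");
  }
  const result = await doc.send(new QueryCommand({
    TableName: process.env.TASK_EVENTS_TABLE,
    KeyConditionExpression: "task_id = :tid AND event_id > :after",
    ExpressionAttributeValues: { ":tid": taskId, ":after": after },
    Limit: limit,
  }));
  return { events: result.Items ?? [], lastEvaluatedKey: result.LastEvaluatedKey };
}
```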

Shared validation (cdk/src/handlers/shared/validation.ts):
- New isValidUlid() + ULID_PATTERN. Exported for handler + future uses.

Shared types (cdk/src/handlers/shared/types.ts + cli/src/types.ts):
- New GetTaskEventsQuery type formalizing the query shape. Mirrored per
  AGENTS.md "cdk types must stay in sync with cli types" rule.

CLI (cli/src/api-client.ts):
- getTaskEvents now accepts { after?: string } in the options object
  alongside existing { next_token?, limit? }. URL-encoded properly.
- New catchUpEvents(taskId, afterEventId, pageSize=100) helper that
  paginates internally. First page uses ?after; subsequent pages use
  ?next_token (no re-sending after to avoid server WARN). Returns
  flattened event array — one-call API for Step 5/6 reconnect path.

Comprehensive structured logging (per user emphasis this step):
- INFO on handler entry: {request_id, task_id, limit, query_mode:
  'next_token' | 'after' | 'from_beginning'}.
- INFO on handler exit: {request_id, task_id, event_count, has_more}.
- WARN when both ?after and ?next_token provided, when ?after fails
  ULID validation, when DDB returns unexpectedly empty after a cursor.
- ERROR on DDB exceptions, malformed response, unexpected errors —
  all include error_type.
- DEBUG-style gated via LOG_LEVEL=DEBUG env; emits INFO with tag
  level_override='DEBUG' so CloudWatch filters distinguish them
  without requiring a logger-module change.

Tests: CDK 729 → 745 (+16: 9 handler + 7 validation). CLI 84 → 90
(+6). All pass, compile clean, lint clean. No agent code touched. No
deploy yet — chained with Steps 5/6/7.
…+ fetch)

Transport layer for bgagent watch's upcoming SSE path (Step 6). D3
hybrid bet locked in: import @ag-ui/core for event TYPES + Zod schemas
only; own transport with native fetch + eventsource-parser; own the
reconnection/backoff/JWT-refresh/60-min-restart logic.

No @ag-ui/client anywhere (only grep hits are docstrings explicitly
calling out why we do not use it).

Deps:
- @ag-ui/core@^0.0.52 (types + Zod schemas, pulls zod transitively)
- eventsource-parser@^3.0.8 (SSE frame parser, zero deps, ~32.7M weekly downloads)

cli/src/sse-client.ts (746 lines):
- runSseClient(options): Promise<SseRunResult>. Happy path, reconnect,
  catch-up, terminal detection, external cancellation.
- AgentCore URL builder + required header (x-amzn-bedrock-agentcore-
  runtime-session-id = task_id, matches server.py extraction).
- Frame parsing: eventsource-parser → JSON.parse → Zod
  EventSchemas.safeParse; Zod failures still forward by type sniff
  so future AG-UI additions do not break terminal detection.
- Keepalive watchdog (default 30s grace) resets on any byte incl
  ': ping\n\n' comments; on starve, aborts and reconnects.
- Proactive 60-min restart at maxStreamSeconds (default 3500) pre-
  empts AgentCore's hard cap.
- Exponential backoff reconnect (1s initial, factor 2, max 30s).
- Dedup by id chain: id → messageId → toolCallId → runId →
  stepName+ts → name+ts → type+ts fallback. Set + ordered array;
  capped at 10000 with oldest-half eviction (sketched after this list).
- 401 handling: one forced refresh via getAuthToken(), retry once;
  double-401 rejects with CliError(UNAUTHORIZED) + login hint.
- External AbortSignal support.
- Terminal codes fatal (no reconnect): AGENT_ERROR, AgentError,
  UNAUTHORIZED, ACCESS_DENIED, missing code; transient RUN_ERROR
  reconnects.
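
A sketch of the dedup-key chain and the capped seen-set, using the field names and the 10000 cap described above; the actual structure in sse-client.ts may differ.

```ts
// Sketch: derive a dedup key by falling through the id chain, then track seen
// keys in a Set plus an ordered array so the oldest half can be evicted.
type AgUiEvent = {
  type: string;
  id?: string;
  messageId?: string;
  toolCallId?: string;
  runId?: string;
  stepName?: string;
  name?: string;
  timestamp?: number;
};

function dedupKey(ev: AgUiEvent): string {
  return (
    ev.id ?? ev.messageId ?? ev.toolCallId ?? ev.runId ??
    (ev.stepName !== undefined ? `${ev.stepName}:${ev.timestamp}` : undefined) ??
    (ev.name !== undefined ? `${ev.name}:${ev.timestamp}` : undefined) ??
    `${ev.type}:${ev.timestamp}`
  );
}

class SeenEvents {
  private readonly seen = new Set<string>();
  private order: string[] = [];

  /** Returns true the first time a key is seen; evicts the oldest half past 10000 keys. */
  markSeen(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.order.push(key);
    if (this.order.length > 10000) {
      const evicted = this.order.splice(0, Math.floor(this.order.length / 2));
      for (const old of evicted) this.seen.delete(old);
    }
    return true;
  }
}
```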

Catch-up cursor architecture:
- SSE events from sse_wire.py do NOT carry DDB event_ids.
- Client accepts initialCatchUpCursor + delegates cursor management
  to the caller. Cursor advances only on events that carry an
  explicit 'id' field. watch.ts (Step 6) will inject DDB event_id as
  id on events returned from its catchUp closure; live events never
  advance the cursor; dedup handles replay overlap.

Comprehensive structured logging at DEBUG/INFO/WARN/ERROR as the user
requested this step.

Tests: 35 new (90 baseline -> 125 total). Coverage 95.6% stmts / 77%
branch / 88% funcs. mise //cli:build clean. No CDK, agent, or design-
doc changes.

Step 6 consumes this: watch.ts must inject DDB event_id as id on
catchUp() events so the cursor advances; absence only causes re-fetch
(dedup prevents double-emit).
…lback

Rewires bgagent watch to use the Step 5 SSE client as primary transport
with REST polling as fallback. Final client-side piece before deploy +
E2E (Step 7).

New --transport <sse|polling|auto> flag (default auto). --stream-
timeout-seconds passes through to maxStreamSeconds (default 3500 =
58 min; pre-empts AgentCore's 60-min cap). Existing --output and
existing text/JSON formatting preserved byte-for-byte.

New cli/src/ag-ui-translator.ts:
- translateDbRowToAgUi(row): mirror of agent/src/sse_wire.py in TS.
  Emits the same AG-UI triad/pair shapes (TEXT_MESSAGE_*, TOOL_CALL_*,
  STEP_*, CUSTOM for cost, RUN_ERROR) so live-SSE and catch-up-REST
  frames are indistinguishable to dedup + formatter.
- CRITICAL: every returned event carries the DDB event_id as `id` with
  a suffix (:start, :content, :end, :args, :call, :step-started,
  :step-finished, :thinking, :msg) so multi-frame groups dedup
  distinctly AND the SSE client's cursor advances on replay.
- agUiToSemantic(ev): inverse translator for formatter rendering
  (Option A boundary — text output byte-identical between transports).

watch.ts flow:
1. Snapshot fetch (getTaskEvents + getTask in parallel) serves three
   purposes: detect already-terminal task → print tail + exit;
   print history tail for late joiners; seed initialCatchUpCursor.
2. Transport selection:
   - --transport sse: runSseClient only; errors propagate.
   - --transport polling: pollTaskEvents only (Phase 1a behavior).
   - --transport auto (default): SSE first; on any error from
     runSseClient, fall back to polling with seeded cursor so no
     duplicates. Missing runtime_jwt_arn config → WARN + polling.
3. Post-SSE authoritative status: one more getTask after
   RUN_FINISHED/RUN_ERROR to set exit code from REST truth
   (COMPLETED → 0; anything else → 1).
4. Ctrl+C unification: single AbortController plumbed into both
   runSseClient({signal}) and pollTaskEvents' abortable sleep (sketched after this list).
5. Consecutive-reconnect counter → WARN "network may be flaky"
   after 3 in a row.
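
A sketch of the single-AbortController unification from step 4; the transport function signatures are stand-ins, not the actual watch.ts exports.

```ts
// Sketch: one Ctrl+C handler cancels whichever transport is active — the SSE
// fetch or the polling loop's abortable sleep — via a shared AbortSignal.
export async function runWithUnifiedCancel(
  useSse: boolean,
  runSseClient: (opts: { signal: AbortSignal }) => Promise<void>,
  pollTaskEvents: (opts: { signal: AbortSignal }) => Promise<void>,
): Promise<void> {
  const controller = new AbortController();
  const onSigint = () => controller.abort();
  process.once("SIGINT", onSigint);
  try {
    if (useSse) {
      await runSseClient({ signal: controller.signal });
    } else {
      await pollTaskEvents({ signal: controller.signal });
    }
  } finally {
    process.removeListener("SIGINT", onSigint);
  }
}
```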

configure.ts: new --runtime-jwt-arn <arn> flag persisting to config.
tryLoadConfig() helper enables partial-update merge — running
bgagent configure --runtime-jwt-arn <arn> alone works on an already-
configured install. First-time configure still requires the four
core fields (post-merge check). CliConfig's runtime_jwt_arn is
optional for back-compat; old config.json files without it load fine.

Comprehensive structured logging (per user emphasis this step):
- DEBUG: transport choice, config read, snapshot (N events), cursor
  seed, each SSE event (type + id), reconnect attempts, catch-up
  replay, fallback rationale.
- INFO: transport selected, reconnecting, replayed N events, task
  completed/failed, clean exit.
- WARN: fallback-to-polling (with reason), missing runtime_jwt_arn,
  duplicate skipped, consecutive-reconnect > 3.
- ERROR: terminal RUN_ERROR fatal, config invalid, not authenticated,
  --transport sse unrecoverable.
- JSON mode discipline: every log line routes to process.stderr;
  stdout stays pure NDJSON. Test 11 asserts every stdout line
  JSON.parses cleanly.

Tests: 51 new (125 baseline → 176 total, 14 → 15 suites).
- ag-ui-translator.test.ts: 29 tests (10 per-type translator + 11
  agUiToSemantic + 6 round-trip + 2 edge).
- watch.test.ts: 18 new transport tests covering all flag
  combinations, happy/sad paths, missing-config, already-terminal,
  SIGINT, JSON mode discipline, catch-up id injection, fallback
  ordering + no-duplicates.
- configure.test.ts: 4 new tests (flag accepted, merge, first-time
  required-field check, back-compat old config).

No CDK changes. No agent changes. No design-doc prose changes.
mise //cli:build clean. No new runtime deps.

Step 7 risks (for the parent to validate in E2E):
- AG-UI messageId/toolCallId mismatch between live SSE (fresh ULIDs)
  and catch-up REST (DDB event_id with suffix). Expected — live wins
  dedup-wise, catch-up fills gaps. Verify no double-rendering on a
  mid-turn reconnect.
- Polling snapshot re-render on SSE-fallback: cursor carries forward
  the SSE run's last emitted event id's DDB prefix; claimed
  correct + verified by Test 15, but worth eyeballing during E2E.
E2E of Phase 1b revealed that the two endpoints require different
Cognito JWT types because they check different claims:

- API Gateway's Cognito authorizer validates the 'aud' claim, which
  only Cognito ID tokens carry. It 401s on access tokens.
- AgentCore Runtime's customJWTAuthorizer (configured via
  RuntimeAuthorizerConfiguration.usingCognito(pool, [appClient]) in
  Step 1 CDK) renders CloudFormation with allowedClients set, which
  AgentCore validates against the 'client_id' claim — only present on
  access tokens. It rejects ID tokens.

Before this fix the CLI cached only the ID token and sent it to both
endpoints. REST worked; SSE to Runtime-JWT failed with UNAUTHORIZED.
The SSE client's 401 handler retried once and the --transport auto
path correctly fell back to polling, so the user-visible failure was
a warning line + a transport downgrade, not a hard error. But the
primary SSE path was effectively unusable.

Fix:
- Cache BOTH IdToken and AccessToken from Cognito in
  ~/.bgagent/credentials.json (new optional access_token field; old
  credential files continue to load and fall back to id_token with a
  debug line).
- Split the single getAuthToken() into:
  - getIdToken() → REST API (API Gateway Cognito authorizer)
  - getAccessToken() → AgentCore Runtime-JWT SSE endpoint
  - getAuthToken() kept as a backward-compat alias for getIdToken so
    api-client.ts and any other REST call path is unchanged.
- watch.ts's SSE client closure now calls getAccessToken() specifically.
- Login and refresh flows require both tokens from Cognito; tests
  updated to mock AccessToken alongside IdToken.

Verified with curl:
- API Gateway + ID token → HTTP 200
- API Gateway + access token → HTTP 401 (WHY we keep id_token for REST)
- AgentCore Runtime-JWT + ID token → would 401 (WHY we need access token)
- AgentCore Runtime-JWT + access token → JWT authenticates; request
  reaches the container (separate unrelated container 502 under
  investigation).

Tests: 176 → 177 (+1: legacy credentials fallback). All CLI tests
pass. No CDK, agent, design-doc, or diagram changes. Users must
re-run 'bgagent login' once to populate access_token in their
credentials file; otherwise SSE falls back to polling with a hint.
Exercising the SSE path end-to-end against the deployed stack surfaced
three production bugs that blocked streaming from reaching the client.
Each is fixed and verified: the latest test run received the full live
stream of 74 AG-UI events (turns, tool calls, tool results, cost,
milestones) with correct timestamps and a real PR created at
scoropeza/agent-plugins#10. Two known issues remain for a fresh session
— see "Remaining work" in the doc + memory entries updated elsewhere.

Bugs fixed in this commit:

1. agent/src/server.py — _debug_cw was blocking module import with
   synchronous boto3 CloudWatch Logs writes. AgentCore runs a /ping
   health check within ~1s of container boot; if uvicorn hasn't
   bound port 8080 by then, the container is marked unhealthy and
   the runtime returns 424 "Runtime health check failed or timed
   out". Fix: CW writes run in a daemon thread (fire-and-forget),
   and the boot-time log uses plain print() to avoid spawning any
   thread during import. The debug path (used after the first
   request arrives) still writes to CloudWatch.

2. cli/src/sse-client.ts — was sending the 26-char task_id ULID as
   the X-Amzn-Bedrock-AgentCore-Runtime-Session-Id header, but
   AgentCore validates >=33 chars and returns HTTP 400 Bad Request.
   Fix: new buildRuntimeSessionId() helper that prefixes the task_id
   with 'bgagent-watch-' (40 chars total). Deterministic so reconnect
   attempts re-use the same session and AgentCore routes back to the
   same microVM (preserving in-progress session state).

3. cli/src/commands/watch.ts — AG-UI translator mints suffixed event
   ids like "01KPPVWM...:step-started" to keep TEXT_MESSAGE/
   TOOL_CALL triads dedup-unique. On SSE reconnect, the CLI was
   passing the full suffixed id to GET /events?after=, which fails
   the ULID validator added in Step 4 with "Invalid `after`
   parameter: must be a 26-character ULID." Fix: strip ':suffix'
   in the catchUp closure before calling apiClient.catchUpEvents.

Debug infrastructure added (stays enabled by default for the rest of
Phase 1b development per explicit user direction — so adding new
debug lines later doesn't require redeploy to toggle a log level):

- _debug_cw helper in server.py writes to server_debug/<task_id>
  CloudWatch streams (AgentCore doesn't forward container stdout;
  only explicit CloudWatch Logs API writes land in APPLICATION_LOGS,
  matching the existing telemetry._emit_metrics_to_cloudwatch
  pattern). Wired through invoke_agent entry, _extract_invocation_
  params result, sync vs SSE routing decision, _invoke_sse entry,
  _SSEAdapter construction, attach_loop, _spawn_background,
  _run_task_background entry, _sse_event_stream entry/RUN_STARTED
  yield/event translations/keepalive pings/close sentinel/disconnect.
- Every CW write is non-blocking (daemon thread) so request latency
  is unaffected.
- Local-stdout print() is retained for docker-compose runs.

No test changes in this commit (tests still pass: agent 371, CLI 177,
CDK 745). No new deps. Known remaining issues documented separately:
(a) terminal RUN_FINISHED frame not delivered to the CLI after task
completion — CLI hits repeated 424s and never detects terminality;
(b) suspected duplicate-pipeline execution — REST-submit path fires
orchestrator → Runtime-IAM pipeline AND the subsequent SSE
invocation on Runtime-JWT spawns ANOTHER pipeline, yielding ~2×
expected event volume in DDB. These are Phase 1b design-level gaps
(not one-line fixes) flagged for the next session.
…esearch

First live SSE bring-up against the deployed stack surfaced a design gap
in the rev-4 plan: SSE invocations on Runtime-JWT were spawning a
SECOND pipeline when the task had already been launched via the
orchestrator on Runtime-IAM. Root cause: AgentCore's "same session_id
→ same microVM" routing is per-runtime-ARN only; cross-runtime live
attach requires an external pub/sub layer.

Competitive survey (docs/research/agent-streaming-patterns.md, new;
31 sources across CopilotKit, LangGraph Platform, OpenAI Assistants,
Mastra, Temporal, Vercel resumable-stream, AgentCore itself)
identified three dominant shapes: same-process streaming (used by
CopilotKit / Mastra / OpenAI), orchestrator+observer with pub/sub
(LangGraph join_stream, Vercel resumable-stream), and pull-based
(Temporal, OpenAI fallback). LangGraph Platform documented as the
clearest reference for orchestrator+observer (Postgres log + Redis
pubsub); Vercel resumable-stream is the simpler AWS-friendly
equivalent.

Branch A chosen for Phase 1b (§9.13, new in rev 5):

- Path 1 — `bgagent submit --watch` / `bgagent run`: direct-submit
  via POST /v1/tasks with `execution_mode=interactive` → Lambda
  admits + writes TaskTable, SKIPS orchestrator → CLI opens SSE to
  Runtime-JWT → server.py runs pipeline same-process. True real-time.
  Reconnection within microVM lifetime attaches to existing adapter
  via new {task_id: adapter} registry (attach-don't-spawn).
- Path 2 — `bgagent submit` plain: default `execution_mode=orchestrator`
  keeps Phase 1a behaviour (pipeline on Runtime-IAM). CLI `watch` on
  such a task = polling (no cross-runtime live attach; that's Phase 1c).
- Non-interactive (Slack/webhook/cron): Path 2 plus DDB Streams fan-out
  per §8.9.

Trade-offs documented:
- Pipeline lifetime for both paths bounded by AgentCore maxLifetime
  (8h in our CDK). DDB persists log only, not execution continuation.
- Tasks > 8h need to leave AgentCore Runtime (Fargate / Step Functions,
  out of Phase 1b scope).
- Direct-submit task dies with its microVM if CLI disconnects long
  enough for idle / maxLifetime eviction; orchestrator-submit is
  indifferent to CLI presence.

New concrete deliverables flagged in the doc (to be implemented next):

1. CDK: add `execution_mode` field to POST /v1/tasks.
2. CreateTask Lambda: if `interactive`, skip orchestrator Lambda.Invoke.
3. server.py: attach-don't-spawn logic via {task_id: _SSEAdapter}
   registry.
4. server.py: /ping HealthyBusy while pipeline thread alive.
5. CLI: bgagent submit --watch (or bgagent run) flow — POST /v1/tasks
   with execution_mode=interactive, then open SSE to Runtime-JWT.

Phase 1c roadmap: add pub/sub (IoT Core MQTT or ElastiCache Redis +
resumable-stream port) for real-time cross-runtime attach. Additive;
Branch A code does not change.

Diagram updates to v2 (surgical text edits, no layout change):
- Page 2 caption: explicit two-path architecture.
- Page 5 caption + sequence arrow: reflect same-process pipeline on
  Runtime-JWT for interactive path.

Research brief: docs/research/agent-streaming-patterns.md, ~2300 words,
31 sources, flagged time-sensitive claims (especially AgentCore quotas).

No code changes in this commit.
… + multi-subscriber

Implements the server-side Branch A design from §9.13:

1. _SSEAdapter multi-subscriber fan-out (sse_adapter.py):
   - New _Subscriber dataclass (queue + dropped_count).
   - Default subscriber created eagerly at __init__ (backward-compatible
     get() still works even when write_*() fires before first get()).
   - New subscribe() returns a fresh per-observer queue; unsubscribe(q)
     removes it.
   - _broadcast_from_loop writes to all subscriber queues with per-sub
     drop-oldest backpressure (one slow consumer can't stall others).
   - _broadcast_sentinel_from_loop fans out close sentinel to all.
   - dropped_count is the sum of per-sub drops + _undelivered_count
     (no-loop / closed cases).

2. server.py attach-don't-spawn logic:
   - New module-level _active_sse_adapters: dict[str, _SSEAdapter]
     tracked under _threads_lock.
   - _invoke_sse checks the registry first: if an adapter with active
     subscribers exists for this task_id, subscribe() a new queue and
     return a StreamingResponse backed by it — do NOT call
     _spawn_background. Solves the duplicate-pipeline bug (two
     pipelines running the same task, observed in the previous
     bring-up run).
   - Spawn path now: register in _active_sse_adapters BEFORE spawning
     (so a rapid reconnect race attaches instead of double-spawning),
     then subscribe() the observer's queue BEFORE spawning (so no
     events are missed between spawn and first drain iteration).
   - _run_task_background's finally block removes the adapter from
     the registry. Only removes if the current entry matches — guards
     against a newer adapter having replaced it.
   - Rollback on spawn failure: adapter removed from registry if
     _spawn_background raises.

3. _sse_event_stream takes sub_queue per-observer parameter:
   - Drains THIS observer's queue (not the adapter's default) so
     multiple observers attached to the same pipeline each receive
     the full event stream independently.
   - On client disconnect / generator cancellation, calls
     adapter.unsubscribe(sub_queue) — leaves the adapter and other
     subscribers intact. The background pipeline keeps running and
     ProgressWriter keeps writing to DDB.

4. /ping HealthyBusy (§9.13.2 idle-evict defense-in-depth):
   - Returns {"status": "HealthyBusy"} while any pipeline thread is
     alive, signalling AgentCore not to idle-evict the microVM. Per
     AWS runtime-long-run docs.
   - Returns {"status": "healthy"} otherwise.
   - 503 + "unhealthy" on _background_pipeline_failed unchanged.
   - _last_ping_status module-level: /ping transitions logged to
     CW once (healthy <-> HealthyBusy <-> unhealthy) so the stream
     is not spammed with per-probe lines.

5. Tests — 371 baseline → 377:
   - test_ping_reports_healthy_when_idle.
   - test_ping_reports_healthybusy_when_pipeline_alive.
   - test_sse_attach_does_not_spawn_second_pipeline: registry
     pre-populated, _invoke_sse returns StreamingResponse WITHOUT
     calling _spawn_background; subscriber_count increments.
   - test_multi_subscriber_broadcast: two subscribers both receive
     the same event.
   - test_multi_subscriber_close_sentinel_fans_out: close()
     sentinel reaches every subscriber.
   - test_registry_cleanup_on_pipeline_completion: finally block
     removes the adapter from the registry.
   - Existing test_sse_stream_client_disconnect_calls_detach renamed
     and updated to verify unsubscribe(queue) instead of detach_loop.

Comprehensive CW debug logs at every state transition (registry
insert / remove / attach / rollback, ping status transitions,
subscribe / unsubscribe). All via the existing _debug_cw helper
(writes in daemon thread so container startup is not blocked).

No CLI changes, no CDK changes, no design-doc changes (already
landed in 48837af). Part 1 (CDK admission execution_mode +
bgagent run command) will ship in a separate commit.

Known gap: the CDK POST /v1/tasks still unconditionally fires the
orchestrator Lambda when a task is created, so `bgagent run` (Part 1,
not yet implemented) cannot yet use this path to submit a task
without ALSO firing the orchestrator. Part 1 adds the
execution_mode field to skip the orchestrator invoke when the CLI
is going to drive the pipeline itself via SSE.
…xport helpers

Partial landing of the CDK admission + CLI helper-export work for rev 5
Branch A. Stops short of the final `bgagent run` command — that needs
one more session of work. Everything below is test-green and in a
clean state for handoff.

CDK changes:

1. types.ts + cli/src/types.ts (kept in sync per AGENTS.md):
   - New ExecutionMode = 'orchestrator' | 'interactive'.
   - CreateTaskRequest now has optional execution_mode.

2. create-task-core.ts:
   - New optional allowedExecutionModes parameter (default
     ['orchestrator']). Validates body.execution_mode against the
     allowlist; 400 VALIDATION_ERROR with a clear message if not
     allowed.
   - execution_mode='interactive' SKIPS the orchestrator Lambda.Invoke
     and logs "Admission: interactive mode, orchestrator invoke
     skipped". Default/undefined/='orchestrator' preserves Phase 1a
     behaviour byte-for-byte.

3. create-task.ts (Cognito-authed POST /v1/tasks):
   - Passes ['orchestrator', 'interactive'] → both modes allowed.
   - webhook-create-task.ts unchanged → default ['orchestrator'] →
     webhook callers cannot request interactive mode (returns 400).

4. Tests (create-task-core.test.ts): 5 new cases — interactive skips
   orchestrator, explicit orchestrator still fires, undefined is
   orchestrator (regression), webhook rejects interactive, bogus
   mode rejected. Full CDK suite: 750 passed / 0 failed (745
   baseline + 5 new).

CLI changes (helper exports for the upcoming `bgagent run` command):

5. cli/src/commands/watch.ts:
   - Exported makeFormatter, fetchInitialSnapshot, runSse, RunSseArgs
     so a new run command can reuse the same SSE-watch machinery
     without duplicating the renderer / reconnect / terminal-state
     logic.
   - No behaviour change; purely public-API widening.

6. cli/test/sse-client.test.ts: updated the session-id header
   expectation to match the buildRuntimeSessionId() prefix (fix
   landed earlier in cd093b2 — this test was still expecting the
   bare task_id).

Not yet in this commit (next session):

- cli/src/commands/run.ts — composes apiClient.createTask({execution_mode:
  'interactive'}) + runSse(). Should be a small file (~100-150 lines).
- cli/src/bin/bgagent.ts — register run command alongside submit/watch.
- cli/src/api-client.ts — verify createTask forwards execution_mode
  (current shape spreads ...CreateTaskRequest so it already does, but
  needs a test assertion).
- cli/test/commands/run.test.ts — new tests for the run flow.
- Deploy + E2E validation on the deployed stack.

Test counts: CDK 745 -> 750 (+5). CLI unchanged at 177 (no new tests
yet for run command). Agent unchanged at 377 (Part 2 committed in
3dab225 already).
…RE guard

Completes Phase 1b rev-5 Branch A (design §9.13). Adds the direct-submit
interactive CLI path and the cross-runtime safety guard that prevents
duplicate pipeline execution when a watcher opens SSE against a task
that was submitted via the orchestrator.

CLI — `bgagent run`:
* New command composes createTask({execution_mode: 'interactive'}) + runSse
  so the pipeline executes same-process with the SSE stream on Runtime-JWT
  (real-time, no orchestrator hop).
* Requires runtime_jwt_arn in config; errors with a clear pointer to
  `bgagent configure` otherwise.
* Registered in bin/bgagent.ts between `submit` and `list`.
* 11 tests in cli/test/commands/run.test.ts.
* 1 api-client test: execution_mode is forwarded in POST body.

RUN_ELSEWHERE guard (§9.13.4):
* CDK: TaskRecord gains execution_mode; create-task-core persists it on
  the TaskTable record so server.py can read it. 1 new test asserts
  persistence for both interactive and orchestrator modes.
* Agent: _invoke_sse now checks task_state.get_task() before spawning.
  If the task record says execution_mode != 'interactive', returns
  HTTP 409 {code: RUN_ELSEWHERE, execution_mode} instead of spawning.
  Fails open when record is missing (blueprints / legacy tasks).
  3 new tests cover orchestrator-rejected, interactive-allowed, and
  fail-open behaviours.
* CLI: sse-client recognizes 409 RUN_ELSEWHERE as a non-retriable error,
  throws CliError so `watch --transport auto` catch-block falls back to
  polling cleanly (no reconnect storm). Generic 409s still take the
  regular reconnect path. 2 new sse-client tests cover both paths.

Test status: agent 380 (was 377), CDK 751 (was 750), CLI 191 (was 177).
Live bring-up of the rev-5 Branch A path surfaced three production bugs
after the initial rev-5 deploy. All three are fixed here and validated
end-to-end against backgroundagent-dev/us-east-1.

1. CDK: RuntimeJwt execution role was missing ECR pull perms.
   Root cause: the L2 `AgentRuntimeArtifact.fromAsset` AssetImage.bind()
   guards against double-grant with `this.bound = true`, so when the
   SAME artifact instance is attached to two Runtimes the second
   runtime's role never receives the ECR pull statements. Image pull
   failed with "no basic auth credentials", AgentCore then returned
   424 Failed Dependency on every /invocations for the JWT runtime.
   Fix: create two AssetImage instances (one per runtime). The
   DockerImageAsset dedupes on asset hash so only one image is
   published to ECR.

2. Agent: server.py on Runtime-JWT could not run pipelines for
   CLI-submitted tasks. The rev-5 design (§9.13.4) requires server.py
   to hydrate repo_url + task description + max_turns / max_budget /
   task_type / branch_name / pr_number from TaskTable when the SSE
   body only carries `{task_id}`. Previously `_extract_params` read
   only from the invocation body, so pipelines blew up with
   `ValueError: repo_url is required` immediately after spawn.
   Fix: extend the RUN_ELSEWHERE guard block in `_invoke_sse` to
   also hydrate missing params from the TaskTable record when the
   mode is `'interactive'`. Preserves the orchestrator contract
   (which passes full params in the invocation body — we only fill
   fields the caller didn't send).

3. CLI+CDK: `bgagent watch --transport auto` against an orchestrator-
   submitted task hit a reconnect storm. server.py correctly returns
   409 RUN_ELSEWHERE, but AgentCore wraps non-2xx responses as 424
   to the client, so the CLI never saw the 409 code and fell into
   generic http_error retry. Design-correct fix: expose `execution_mode`
   on the TaskDetail REST response so watch can choose the transport
   from the snapshot instead of probing Runtime-JWT. watch.ts now
   routes directly to polling whenever `snapshot.executionMode !==
   'interactive'` under `--transport auto`. The 409 RUN_ELSEWHERE
   guard stays server-side as defence-in-depth.

CDK: TaskDetail gains `execution_mode: ExecutionMode | null`
(null for legacy tasks that predate rev 5). `toTaskDetail` forwards
the field.

CLI: watch.ts SnapshotResult gains `executionMode`. Two new tests
cover orchestrator-mode → polling and legacy (null) → polling.

E2E results against backgroundagent-dev:
* E2E-A (`bgagent run`, interactive): task 01KPRN3XKQ2V4AVNT6GV1E3PEN
  → PR scoropeza/agent-plugins#11, COMPLETED in 54s, 61 real-time
  events streamed, exit 0.
* E2E-B (`submit` + `watch --transport auto`, orchestrator):
  snapshot resolves execution_mode=orchestrator, CLI skips SSE,
  polling streams turn events and tool calls, final status FAILED
  (max-turns exhausted — orchestrator-side behaviour unaffected by
  this fix), exit 1. No 424 reconnect storm.

Tests: agent 380, CDK 751, CLI 193. TS typecheck clean.
… nits)

Multi-agent validation pass (code-reviewer, silent-failure-hunter,
pr-test-analyzer, type-design-analyzer) surfaced five P0 concerns and
several key nits. Per user direction (Option C — no cutting corners),
land all P0s inline + key nits; defer P1s and design refactors to
`docs/design/PHASE_1B_REV5_FOLLOWUPS.md` with full rationale.

P0 fixes:

* P0-a — latent AttributeError in _SSEAdapter.write_agent_error:
  referenced non-existent `self._dropped_count` inside a "last-ditch"
  except block. Fix to the real `_undelivered_count` + regression test
  (test_write_agent_error_fallback_uses_undelivered_counter).
* P0-b — task_state.get_task conflated NotFound with FetchFailed (both
  returned None via bare `except Exception`). Rev-5 RUN_ELSEWHERE guard
  on Runtime-JWT would spawn a duplicate pipeline on orchestrator-mode
  tasks during a DDB blip. Introduce new `TaskFetchError` exception;
  `get_task` raises on DDB/boto failures, returns None only on real
  absence. server.py `_invoke_sse` now returns 503
  TASK_STATE_UNAVAILABLE on fetch failure (fail-closed) and still
  fails-open for legacy tasks with no record. Four new tests cover
  the three return shapes + the 503 response.
* P0-d — `bgagent run` wraps `runSse` in try/catch. On fatal SSE
  error: cancels the task (so it doesn't sit stranded in SUBMITTED
  occupying a concurrency slot), prints the task_id + a
  `bgagent status <task_id>` resume hint, and exits non-zero. Three
  new tests: SSE-fails-immediately-and-cancels, cancel-also-fails-
  bubbles-SSE-error, SIGINT-propagates-abort-to-runSseClient.
* P0-e — post-hydration validation in server.py `_invoke_sse`. New
  `_validate_required_params` helper checks for the minimum viable
  param set (repo_url; new_task needs issue OR description;
  pr_iteration/pr_review need pr_number). Returns 500
  TASK_RECORD_INCOMPLETE with a list of missing fields instead of
  letting the pipeline crash inside setup_repo. Two new tests
  (hydration-fills-missing, explicit-caller-wins-over-record) + one
  for the 500 + one for the helper's type-dispatch logic.

Key nits:

* Lift `validateStreamTimeout` + `DEFAULT_STREAM_TIMEOUT_SECONDS=3500`
  into `cli/src/commands/_stream.ts` (shared between run + watch).
* `SnapshotResult.executionMode: string | null` → `ExecutionMode |
  null` (compile-time exhaustiveness restored).
* `TaskDetail.execution_mode?:` in CLI → required `ExecutionMode |
  null` (matches CDK side, which always sets it).
* server.py now exports `EXECUTION_MODE_ORCHESTRATOR` and
  `EXECUTION_MODE_INTERACTIVE` constants; inline string literals
  replaced.
* Heartbeat 45s magic number → `_HEARTBEAT_INTERVAL_SECONDS`.
* `logInfo` no-op `if/else` branches in watch.ts removed; parameter
  renamed `_isJson` for call-site documentation.
* `run.ts` `logInfo(message)` signature → `logInfo(isJson, message)`
  to match watch.ts; `Verbose mode:` line gated on `isVerbose()`.
* CDK two-artifact workaround comment: replace `<check>` placeholder
  with a pointer to `DEFER_FOLLOWUPS.md` CDK-1 + detailed location of
  the L2 bug (`AssetImage.bind` guard in the vendored module).

Tests added:

* agent: +9 tests (hydration-full, hydration-preserves-caller, 503-
  TASK_STATE_UNAVAILABLE, 500-TASK_RECORD_INCOMPLETE, validator
  dispatch for pr_iteration/pr_review/new_task, get_task-notfound,
  get_task-found, get_task-empty-item, get_task-raises-TaskFetchError,
  write_agent_error-fallback).
* cdk: +1 test (both runtime execution roles carry ECR pull perms —
  regression for the L2 double-attach bug).
* cli: +3 tests (run-SSE-fatal-cancels-task, run-cancel-also-fails-
  bubbles-SSE-error, run-SIGINT-forwards-abort-to-runSseClient).

Other:

* New docs/interactive-agents-phases-v3.drawio (generated by
  diagram-specialist) with execution_mode fork on POST /tasks,
  RUN_ELSEWHERE guard + hydration annotation, Path-1 vs Path-2
  sequence, per-microVM reconnection-attach note, and a "Rev-5
  invariants" callout. v2 kept as a reference baseline.
* New docs/design/PHASE_1B_REV5_FOLLOWUPS.md catalogs the 14 deferred
  items (P0-c stranded-task reconciler, all P1s, type refactors,
  observability gaps, polling-interval design deviation, upstream
  L2 bug filing).

Test totals: agent 380 → 390 (+10), CDK 751 → 752 (+1), CLI 193 →
196 (+3). TS typecheck clean.

Re-validated against deployed backgroundagent-dev:
* POST-P0 E2E-A (interactive/JWT/SSE) — task 01KPS5A90S6PPX8D9083BCC8P3
  → PR scoropeza/agent-plugins#14, COMPLETED, hydration log confirms
  5 params filled from TaskTable.
* POST-P0 E2E-B (orchestrator/IAM/polling) — task
  01KPS6TM1K74QAV2ZGPBB8GDTG → snapshot reads execution_mode=
  orchestrator, CLI prints "using polling (SSE only supported for
  interactive tasks)", no SSE attempt, 33 turns streamed via polling,
  terminal status FAILED (max_turns=6 exhaustion — pipeline-side,
  unrelated to watch).
…S_PER_USER 3→10

Follow-up to fe84de5 pre-push hardening: land the last P0 from the
multi-agent validation pass. User feedback during the review exposed a
live instance of the stranded-task problem (2 interactive tasks
occupying concurrency slots after CLI-side failures pre-P0-d), which
made the case to fix it in-scope rather than deferring to a follow-up
PR.

Stranded-task reconciler:

* New construct `cdk/src/constructs/stranded-task-reconciler.ts` wires an
  EventBridge schedule to a new Lambda running every 5 minutes by
  default.
* New handler `cdk/src/handlers/reconcile-stranded-tasks.ts` queries
  `TaskTable.StatusIndex` for `status IN (SUBMITTED, HYDRATING)` with
  `created_at < cutoff`, classifies each row by `execution_mode`, and
  fails any exceeding its applicable timeout:
    - interactive:       300 s (STRANDED_INTERACTIVE_TIMEOUT_SECONDS)
    - orchestrator:     1200 s (STRANDED_ORCHESTRATOR_TIMEOUT_SECONDS)
    - legacy (missing): 1200 s (treated as orchestrator)
  Thresholds are env-tunable per Lambda without code changes.
* Transition to FAILED is conditional on the current status to avoid
  racing a legitimate transition; on ConditionalCheckFailedException
  (CCFE) we log `advanced_during_reconcile` and move on.
* Emits two events on fail: `task_stranded` carrying
  `{code: 'STRANDED_NO_HEARTBEAT', prior_status, execution_mode,
  age_seconds}` for observability, and the standard `task_failed` that
  existing consumers already react to.
* Decrements the user's concurrency counter conditionally. If the
  counter is already 0 (double-release or drift), we swallow the CCFE —
  the existing concurrency reconciler catches lingering drift.
* Does NOT touch RUNNING / FINALIZING — those remain handled by
  `pollTaskStatus`'s `agent_heartbeat_at` timeout path in
  `orchestrator.ts`.
* Handler `entry` uses `@aws-sdk/client-dynamodb` inline (not
  `lib-dynamodb`) to match the sibling reconciler pattern and keep the
  Lambda bundle lean.

8 new handler unit tests cover: no-candidates no-op, interactive stale
→ fails + events + decrement, orchestrator young → ignored,
orchestrator old → fails, legacy no-mode → orchestrator threshold, race
with legitimate transition (CCFE clean skip), both SUBMITTED + HYDRATING
queried, pagination via `ExclusiveStartKey`.
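
A sketch of the staleness classification, assuming an illustrative record
shape (`CandidateTask` and its field names are hypothetical; the deployed
handler reads candidates from `TaskTable.StatusIndex` and the thresholds
from the env vars named above):

```typescript
// Hypothetical record shape for illustration only.
interface CandidateTask {
  taskId: string;
  status: 'SUBMITTED' | 'HYDRATING';
  executionMode: 'interactive' | 'orchestrator' | null; // null = legacy record
  createdAtEpochSeconds: number;
}

const INTERACTIVE_TIMEOUT_SECONDS = 300;   // STRANDED_INTERACTIVE_TIMEOUT_SECONDS default
const ORCHESTRATOR_TIMEOUT_SECONDS = 1200; // STRANDED_ORCHESTRATOR_TIMEOUT_SECONDS default

// Legacy tasks with no execution_mode are treated as orchestrator-submitted.
function applicableTimeoutSeconds(task: CandidateTask): number {
  return task.executionMode === 'interactive'
    ? INTERACTIVE_TIMEOUT_SECONDS
    : ORCHESTRATOR_TIMEOUT_SECONDS;
}

function isStranded(task: CandidateTask, nowEpochSeconds: number): boolean {
  const ageSeconds = nowEpochSeconds - task.createdAtEpochSeconds;
  return ageSeconds > applicableTimeoutSeconds(task);
}
```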

MAX_CONCURRENT_TASKS_PER_USER 3 → 10:

* `cdk/src/constructs/task-orchestrator.ts:163` default raised. 3 was
  too tight for power-user CLI flows (`bgagent run` during iterative
  development triggered the cliff routinely). The stranded-task
  reconciler now prevents abandoned tasks from occupying slots, so the
  bump is ergonomic rather than load-bearing.
* Existing test updated (`task-orchestrator.test.ts`).
* The `?? 3` fallback → `?? 10` only; callers that pass an explicit
  value are unaffected.

Docs:

* `docs/design/PHASE_1B_REV5_FOLLOWUPS.md` — move P0-c from "deferred"
  to "LANDED" with the actual design delivered. Same for
  MAX_CONCURRENT_TASKS_PER_USER. Also add new entry DATA-1 (split DDB
  `turns` field into `turns_attempted` + derived `turns_completed` —
  user flagged during review; SDK reports attempted count including
  the cap-check; cosmetic, not a bug).

Tests: agent unchanged at 390, CDK 752 → 760 (+8 reconciler handler
tests; existing task-orchestrator test updated for new default), CLI
unchanged at 196. TS typecheck clean.

Deployed to backgroundagent-dev: StrandedTaskReconciler Lambda live,
env verified (INTERACTIVE=300 ORCHESTRATOR=1200). Running every 5
minutes on EventBridge.
Round 1 of the pre-push follow-ups. Three correctness fixes from the
silent-failure-hunter pass, plus the observability gap the concurrency
incident exposed. Plan in docs/design/PHASE_1B_REV5_FOLLOWUPS.md.

P1-3 — attach-path subscribe() failure no longer duplicates pipeline
(agent/src/server.py):
* When `_invoke_sse` finds a live adapter in the registry but
  `adapter.subscribe()` raises (close race, queue-lifecycle bug), we
  now return HTTP 503 `{"code": "SSE_ATTACH_RACE"}` instead of falling
  through to spawn. Falling through would have duplicated the pipeline
  inside a single microVM.
* New test
  `test_sse_attach_subscribe_failure_returns_503_no_spawn` asserts no
  spawn is called and the registry is left untouched on subscribe
  failure.

P1-1 — 409 on the SSE path is always terminal (cli/src/sse-client.ts):
* Prior behavior: if the 409 body wasn't JSON or didn't carry
  `code: RUN_ELSEWHERE`, the code fell through to the generic
  reconnect path — i.e., retried against a server that was
  deliberately rejecting the request (reconnect storm). Now any 409 is
  terminal: RUN_ELSEWHERE → `CliError('RUN_ELSEWHERE...')` (caller
  falls back to polling), other 409 → `CliError('HTTP 409 ... body:
  ...')` with a 500-byte body excerpt so operators can see what the
  server said. Non-retriable.
* Updated existing test and added
  `HTTP 409 with non-JSON body → terminal CliError`.
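
Roughly the decision the new branch makes, sketched with a hypothetical
stand-in for the CLI's `CliError` and a simplified body parameter; the real
sse-client.ts operates on its own response plumbing:

```typescript
// Hypothetical stand-in for the CLI's error type.
class CliError extends Error {}

// Any 409 from the SSE endpoint is terminal: RUN_ELSEWHERE tells the caller
// to fall back to polling; anything else surfaces a body excerpt so operators
// can see what the server said. Neither case is retried.
function handle409(bodyText: string): never {
  let code: string | undefined;
  try {
    code = JSON.parse(bodyText)?.code;
  } catch {
    // non-JSON body — still terminal, fall through to the generic branch
  }
  if (code === 'RUN_ELSEWHERE') {
    throw new CliError('RUN_ELSEWHERE: task is not attachable over SSE');
  }
  throw new CliError(`HTTP 409 (not retriable) body: ${bodyText.slice(0, 500)}`);
}
```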

OBS-4 — interactive path records session_id on TaskTable
(agent/src/task_state.py + agent/src/server.py):
* New helper `task_state.write_session_info(task_id, session_id,
  agent_runtime_arn)`. Conditional UpdateItem on
  `status IN (SUBMITTED, HYDRATING)` so a concurrent legitimate
  transition wins cleanly.
* `_invoke_sse` calls it just before spawning the pipeline for
  `execution_mode='interactive'` tasks, passing the AgentCore
  session header value + (for now) an empty runtime ARN.
* Cancel-task Lambda resolves the runtime ARN from `execution_mode`
  against two new env vars `RUNTIME_IAM_ARN` + `RUNTIME_JWT_ARN`
  (cdk/src/handlers/cancel-task.ts + cdk/src/constructs/task-api.ts).
  `StopRuntimeSession` now picks the correct runtime for interactive
  tasks that only have `session_id` on the record.
* Four new tests in test_task_state.py (writes both fields, no-op on
  empty, ConditionalCheckFailed skip, partial session-only), two in
  test_server.py (spawn-writes-session-info + skip-when-both-missing).

Important design note added to agent.ts: an earlier attempt injected
each runtime's own ARN as an env var on the runtime itself. That
creates a CFN cycle (the Runtime's properties reference its own
`AgentRuntimeArn` GetAtt). The solution moved to cancel-task where
both runtime ARNs are already in scope at CDK synth time with no
cycle. `agent_runtime_arn` stays null on the interactive TaskTable
record; cancel resolves from `execution_mode`.

Test totals: agent 390 → 397 (+7), CDK 760 → 760 (existing tests
unchanged), CLI 196 → 197 (+1 non-JSON 409 test). TS typecheck clean.

Deployed to backgroundagent-dev and validated with a fresh E2E-A run:
task 01KPSEF0G8C2141JFF0CVGY4CD → PR scoropeza/agent-plugins#15,
COMPLETED, session_id "bgagent-watch-01KPSEF..." now recorded on
TaskTable (was null on all prior interactive-path tasks).
Two follow-ups from the silent-failure-hunter pass that surface
triage-critical information previously lost at debug-level only.

P1-2 — post-SSE getTask failure no longer silent
(cli/src/commands/watch.ts):

Before, if the authoritative-status REST call after the SSE terminal
event failed, the CLI silently inferred status from the AG-UI frame
type, logged only via `debug()` (invisible without `--verbose`), and
printed "Task completed." / "Task failed." without any indication
that the status was inferred. In practice this means a user sees a
green "completed" even when REST is down and the server might have
finalized as CANCELLED or FAILED.

Now on getTask failure we:
* Emit a `WARN:` line to stderr with the underlying error and a
  `bgagent status <task_id>` suggestion.
* Suffix the terminal line with ` (inferred)` so it is self-evident
  that the status may not match the server's view.
* Exit code still uses the inferred status (nothing else to do).

Two new tests: RUN_FINISHED + getTask fails → warn + inferred
completed (exit 0); RUN_ERROR + getTask fails → warn + inferred
failed (exit 1).
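
A sketch of the fallback behaviour, with hypothetical helper and type names
(`terminalLineAfterSse`, `InferredStatus`) rather than the actual watch.ts
internals:

```typescript
type InferredStatus = 'COMPLETED' | 'FAILED';

// Hypothetical helper: render the terminal line and exit code when the
// post-SSE getTask call has failed and the status is only inferred.
function terminalLineAfterSse(
  inferred: InferredStatus,
  taskId: string,
  getTaskError: Error | null,
): { line: string; exitCode: number } {
  let suffix = '';
  if (getTaskError !== null) {
    // Surface the failure instead of hiding it behind --verbose.
    process.stderr.write(
      `WARN: could not confirm final status (${getTaskError.message}); ` +
        `run \`bgagent status ${taskId}\` to check.\n`,
    );
    suffix = ' (inferred)';
  }
  const line = inferred === 'COMPLETED' ? `Task completed.${suffix}` : `Task failed.${suffix}`;
  return { line, exitCode: inferred === 'COMPLETED' ? 0 : 1 };
}
```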

P1-5 — rev-5 except blocks now include tracebacks in CloudWatch
(agent/src/server.py):

The AgentCore runtime container's stdout is not forwarded to
APPLICATION_LOGS (server.py docstring already flags this). Before,
every bare `except Exception` in the rev-5 paths logged only
`{type(exc).__name__}: {exc}` via `_debug_cw`, losing the call-site
traceback and making programming-bug triage expensive.

* New helper `_debug_cw_exc(message, exc, *, task_id=None)` formats
  the full traceback and hands it to `_debug_cw` so it reaches the
  custom CloudWatch stream alongside the exception.
* Replaced the rev-5-era call sites:
  `_extract_invocation_params FAILED`, `_invoke_sse FAILED`,
  attach-path `subscribe() FAILED`, spawn-path `_SSEAdapter
  construction FAILED`, `attach_loop FAILED`, `subscribe FAILED in
  spawn path`, `_spawn_background FAILED`. Removed now-redundant
  `traceback.print_exc()` calls at those sites (stdout was
  swallowed in production anyway).

Test totals: agent 397 (unchanged — no new behavior assertion; the
helper is exercised indirectly through the existing P1-3 + SSE
construction tests), CDK 760 (unchanged), CLI 197 → 199 (+2 P1-2
tests). TS typecheck clean.

No infra change — agent/src changes but no deploy needed until we
actually want to exercise P1-5 in live CloudWatch.
bgagent added 6 commits May 4, 2026 15:33
…_usd specifically

Pre-fix, a task that hit ``--max-turns`` or ``--max-budget`` showed the
generic "Agent task did not succeed" reason in ``bgagent status``:

  Task 01KQ… — FAILED (20s total)
    Reason: agent: Agent task did not succeed

… even though the agent's raw error_message already carried the
specific ``agent_status='error_max_budget_usd'`` signal. The classifier
at ``error-classifier.ts:205`` matched the outer ``Task did not
succeed.*agent_status=`` pattern and swallowed the specific literal.

Fix: add specific classifier patterns BEFORE the generic catch-all for
the three concrete ``agent_status`` literals the agent emits (see
``agent/src/pipeline.py::_resolve_overall_task_status``):

  - ``agent_status=error_max_turns``        → TIMEOUT: Exceeded max turns
  - ``agent_status=error_max_budget_usd``   → TIMEOUT: Exceeded max budget
  - ``agent_status=error_during_execution`` → AGENT: Agent errored during execution

The new titles surface the cap hit directly in ``bgagent status``'s
Reason line and point users at the concrete remedy (``--max-turns`` /
``--max-budget`` on the submit call). Category is TIMEOUT for the caps
(consistent with the existing TIMEOUT classifier for orchestrator poll
timeouts) and AGENT for mid-turn errors.
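
The ordering guarantee in miniature, assuming a simplified pattern-table
shape (the real error-classifier.ts entries carry more metadata):

```typescript
// Simplified sketch: first matching pattern wins, so the specific
// agent_status literals must be registered before the generic catch-all.
interface ClassifierPattern {
  pattern: RegExp;
  category: 'TIMEOUT' | 'AGENT';
  title: string;
}

const patterns: ClassifierPattern[] = [
  // '? tolerates both the current single-quoted repr values and a future unquoted form.
  { pattern: /agent_status='?error_max_turns'?/, category: 'TIMEOUT', title: 'Exceeded max turns' },
  { pattern: /agent_status='?error_max_budget_usd'?/, category: 'TIMEOUT', title: 'Exceeded max budget' },
  { pattern: /agent_status='?error_during_execution'?/, category: 'AGENT', title: 'Agent errored during execution' },
  // Generic catch-all stays last so it can no longer swallow the specifics.
  { pattern: /Task did not succeed.*agent_status=/, category: 'AGENT', title: 'Agent task did not succeed' },
];

function classify(errorMessage: string): ClassifierPattern | undefined {
  return patterns.find((p) => p.pattern.test(errorMessage));
}
```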

Tests: +4 regression tests covering each specific literal, the
ordering guarantee vs. the generic catch-all, and quoted-vs-unquoted
variants (the agent currently emits single-quoted repr values but a
future refactor could drop them). CDK suite: 1000 passing.

Refs: PR aws-samples#52 hands-on testing (Scenario 11 generic-reason feedback)
…king flag

Pre-fix, ``bgagent status <id>`` and ``bgagent status <id> --wait``
rendered fundamentally different views of the same task:

  - Default: the compact ``formatStatusSnapshot`` template
  - ``--wait``: the verbose ``formatTaskDetail`` key-value dump

Hands-on testing surfaced the UX trap: users asking "why does the
terse snapshot not show me the Type / error Reason?" had to learn
that adding a blocking flag was the only way to unlock a richer view.
The previous rationale ("snapshot is recency-biased and less useful
once the task has landed") is weaker now that commit 78f6cda added
Type + Reason to the snapshot itself — the snapshot is already the
complete human view.

New contract: ``--wait`` is a pure blocking flag. Both paths render
the SAME ``formatStatusSnapshot`` output. Only the blocking behavior
(and the status-derived exit code) differ.

Behavior matrix:

  status <id>                     → snapshot now
  status <id> --wait              → snapshot at terminal
  status <id> --output json       → TaskDetail JSON now
  status <id> --wait --output json → TaskDetail JSON at terminal

JSON mode is unchanged so scripting consumers see no contract break.

Changes:
  - ``cli/src/commands/status.ts``: drop the ``formatTaskDetail``
    branch in the ``--wait`` path. When ``--wait`` lands a terminal
    task, fetch the recent-events window and pass it through
    ``formatStatusSnapshot`` (cheap follow-up call; the task has
    already terminated so the window is stable).
  - ``--wait`` flag description clarified to state it's a blocking +
    exit-code flag, not a layout switch.

Tests: +2 regression tests covering (a) ``--wait`` renders the same
snapshot layout as the default path, and (b) ``--wait --output json``
still returns raw ``TaskDetail`` unchanged for scripting. CLI suite:
183 passing (was 181).

Carry-forward: ``bgagent submit --wait`` uses the same
``waitForTask`` helper but its final render path is simpler (already
just a single-task dump). Left unchanged for this commit — its
``--wait`` semantics are already honest (block-until-terminal with a
progress indicator).

Refs: PR aws-samples#52 hands-on testing (status ``--wait`` naming feedback)
Hands-on testing uncovered a data-visibility bug: ``channel_source`` is
computed by ``create-task-core.ts`` (``api`` for CLI / Cognito submits,
``webhook`` for HMAC-signed webhook submits) and persisted on the DDB
TaskRecord, but ``toTaskDetail`` never mapped it into the API response
and neither the CDK ``TaskDetail`` type nor the CLI mirror declared the
field. Every ``GET /tasks/{id}`` response silently dropped the
provenance signal.

Observed impact: Scenario 12 of PR aws-samples#52 deploy-validation asked ``bgagent
status <webhook-created-task> --output json | jq .channel_source`` to
verify the webhook path. That returned ``null`` even though the
CloudWatch log for the handler clearly showed ``channel_source:
"webhook"`` at creation.

Fix:

  - ``cdk/src/handlers/shared/types.ts::TaskDetail``: declare
    ``channel_source: string``. Non-optional — every TaskRecord has
    the field set at creation time (the ``create-task-core`` writer
    has required it since day one), so the API guarantee is honest.
  - ``cdk/src/handlers/shared/types.ts::toTaskDetail``: map
    ``record.channel_source`` through verbatim.
  - ``cli/src/types.ts::TaskDetail``: mirror the field per the
    documented CLI types-sync contract.
  - Test fixtures in ``cli/test/format.test.ts`` and
    ``cli/test/format-status-snapshot.test.ts`` get ``channel_source:
    'api'`` to satisfy the non-optional declaration.
  - ``cdk/test/handlers/get-task.test.ts`` grows two regression
    guards: (a) the default response carries ``channel_source``
    through, and (b) a webhook-flagged TaskRecord surfaces
    ``channel_source=webhook``. Pins the
    previously-silently-dropped field.

Not in scope for this commit:
  - No change to the compact ``formatStatusSnapshot`` rendering.
    Channel provenance is scripting / audit metadata, not a
    status-line concern. Available via ``bgagent status <id>
    --output json | jq .channel_source``.
  - ``TaskSummary`` (the list-response shape) intentionally omits
    ``channel_source`` — the list view is already terse.

CDK suite: 1001 passing (was 1000; +1 from the webhook-flagged
regression test, default-case assertion folded into the existing
``returns task detail successfully`` test). CLI suite: 183 passing.

Refs: PR aws-samples#52 hands-on testing (Scenario 12 webhook verification)
Hands-on testing feedback: after earlier PR aws-samples#52 polish added Type and
Reason to the default status snapshot (commit 78f6cda), the two
remaining fields users had to chase through ``--output json`` were
``channel_source`` (useful to tell webhook-submitted tasks apart from
CLI-submitted tasks at a glance) and ``task_description`` (the prompt
the user actually typed — the highest-signal reminder of what this
task is about).

Changes in ``cli/src/format.ts::formatStatusSnapshot``:

  - ``Channel:`` — always shown (``api`` / ``webhook``), immediately
    after ``Repo:``. Unlike ``Type:`` which is conditional on a
    non-default ``task_type``, provenance is always meaningful and
    always one word, so the line is cheap and uniform.
  - ``Description:`` — shown when ``task_description`` is populated,
    between ``Type:`` and ``Turn:``. Word-wrapped to 60-char columns
    with a 17-char continuation indent so a long prompt stays
    readable at 80-col terminals without truncating the user's input.
    Whitespace inside the description collapses to single spaces; no
    other reflow.

Intentionally NOT adding a ``Duration:`` row: the header already
shows ``(<elapsed>)`` for running tasks and ``(<total>)`` for
terminal tasks via ``elapsedDescription``. A dedicated Duration
line would duplicate that signal.

Examples:

  Running task, api channel, short description (fits on one line):

    Task 01KQ… — RUNNING (47s elapsed)
      Repo:          scoropeza/agent-plugins
      Channel:       api
      Description:   Make a small tweak to README.md
      Turn:          7 / 20
      ...

  Running task, webhook channel, long description (wraps):

    Task 01KQ… — RUNNING (12s elapsed)
      Repo:          scoropeza/agent-plugins
      Channel:       webhook
      Description:   This is a much longer description of a task that
                     the user submitted and needs to be wrapped across
                     multiple lines rather than truncated
      Turn:          2 / 20
      ...

Tests: +7 regression tests covering Channel rendering for api /
webhook / empty-degrade-to-placeholder, Description short /
wrapped-multi-line / null-omits-line / whitespace-trimming. The
golden ``happy path renders the full template`` snapshot updated to
include both new rows. CLI suite: 190 passing (was 183).

Refs: PR aws-samples#52 hands-on testing (snapshot visibility follow-up)
… async-only

Two combined changes in one file, committed together because pages 1-4
and page 5 lived in the same drawio multi-page file:

1. Full regeneration of pages 1-4 for the rev-6 async-only shape.

   The previous 11-page file predated PR aws-samples#52's rev-6 rework. It showed
   SSE streaming, two-runtime splits, a mid-turn interrupt proposal,
   and a WebSocket upgrade proposal that were all removed or
   superseded during this PR. Pages 1-4 now render:

     1. Rev-6 async-only architecture (full topology — CLI + webhook
        → API Gateway → 7 Lambda handlers → OrchestratorFn →
        AgentCore Runtime → TaskEventsTable spine → FanOutConsumer →
        per-channel dispatchers, with an explicit "Removed in rev-6"
        callout).
     2. ``bgagent watch`` adaptive polling (500 ms active, 1/2/5 s
        idle ladder, cursor advancement).
     3. Nudge flow (combined-turn ack per AD-5, with SDK-constraint
        rationale box).
     4. ``bgagent status`` + ``bgagent trace download`` (deterministic
        StatusFn, presigned S3 trace URL).

2. Page 5 rewrite for the ``If-Match`` removal (commit 2c2eda0).

   Removed: ``PATCH ... If-Match: <etag>``, 412 retry branch, GET-
   before-PATCH cold start, ``github_comment_etag`` persistence,
   ``last_etag`` UpdateItem.

   Added: single-shot PATCH (no conditional headers), ``agent_milestone``
   unwrap with ``ROUTABLE_MILESTONES`` allowlist note, DDB numeric
   coercion note, ``ConditionExpression`` guards on the persist path
   (``attribute_not_exists(github_comment_id)`` for first POST,
   ``github_comment_id = :prev`` for 404 re-POST), explicit
   ``fanout.dispatcher.rejected`` error-isolation callout, and an
   "AD-9 rationale" box citing the HTTP 400 finding that motivated
   the rewrite.

File size: 277 KB → 107 KB (−61%). Page count: 11 → 5. The dropped
pages were either obsolete (SSE / WebSocket / mid-turn-interrupt) or
covered in dedicated files (Phase 3 Cedar HITL stays in
``phase3-cedar-hitl.drawio``).

Not touched:
  - ``docs/src/content/docs/`` (generated; not affected by drawio).
  - ``phase3-cedar-hitl.drawio`` (separate file, still accurate).

Refs: PR aws-samples#52 rev-6 async-only rework + Scenario 7-extended If-Match
finding.
The upstream blueprint at ``cdk/src/stacks/agent.ts`` points at
``krokoko/agent-plugins``. During PR aws-samples#52 deploy-validation it was
flipped to ``scoropeza/agent-plugins`` so the test stack would onboard
Sam's own fork for hands-on testing. That override leaked into an
earlier PR commit (``1c87094``) and has to come back off before push.

This commit is a pure revert of the single ``repo:`` line. Everything
else on the branch stays as-is.
@scoropeza scoropeza changed the title feat(interactive-agents): Phase 1a/1b/2 (Phase 3 design only) feat(interactive-agents): async-only background task UX + Phase 3 Cedar HITL design May 4, 2026
@scoropeza scoropeza changed the title feat(interactive-agents): async-only background task UX + Phase 3 Cedar HITL design feat(interactive-agents): async-only background task UX + Cedar HITL design May 4, 2026
@scoropeza
Contributor Author

Hi @krokoko — this PR is ready for another look.

Since we locked the rev-6 async-only architecture on 2026-04-30, the branch has been fully implemented, deployed to a dev stack, and E2E-validated. A few notable things happened along the way:

Deploy validation caught three real runtime bugs (all fixed on-branch)

All three passed 100% of the unit tests but failed on live AWS. Documented in the updated PR description:

  1. costUsd.toFixed is not a function — DDB Document-client returns Number as strings; fanout dispatcher was not coercing. Every terminal event on pr_iteration / pr_review was silently rejected in production.
  2. agent_milestone wrapper events never routed — fanout filter matched event_type literally; missed pr_created wrapped in an agent_milestone event. Zero comments ever posted on any PR.
  3. GitHub rejects If-Match on PATCH /issues/comments/{id} with HTTP 400. The ETag optimistic-concurrency scheme from the original §6.4 design never worked in production. AD-9 and §6.4 have been rewritten to rely on DDB-Stream ParallelizationFactor: 1 + single-writer invariant instead. The "rejected alternative" is documented in the design doc with the verbatim GitHub error string so the next person to reach for ETag hits the explanation immediately.

Hands-on testing — 15 scenarios against scoropeza/agent-plugins

All CLI surfaces validated end-to-end: bgagent submit / list / status / cancel / nudge / events / watch / trace download / webhook, plus --trace / --wait / --force / --idempotency-key / --verbose / --max-turns / --max-budget / --pr / --review-pr. Each hands-on session also surfaced UX gaps that are now fixed on-branch (snapshot missing Type / Channel / Description / Reason; generic error classifications; channel_source silently dropped from API responses; --wait formatter bifurcation; stale watch terminal message).

Test baselines updated

  • CDK: 1001 passing (was 879)
  • CLI: 190 passing (was 215 — count dropped because we deleted SSE tests during rev-6, then rose again with the hardening)
  • Agent: 468 passing

Phase 3 Cedar HITL — design + plumbing only, no engine

docs/design/PHASE3_CEDAR_HITL.md (1,883 lines) and the 12-page companion diagram are fully locked. Orchestrator + agent payload shape threads cedar_policies through end-to-end so the next PR that implements the engine doesn't need to touch the transport layer. No Cedar evaluation; no approval flow.

Follow-ups

6 carry-forward issues drafted to .l4-issue-drafts/ on the branch. Notable ones for post-merge:

  • failTask double-emits task_failed + over-decrements concurrency on Durable Execution step retry (pre-existing bug this PR surfaced; documented with root cause + fix plan).
  • Bedrock Guardrail PROMPT_ATTACK over-triggers on imperative-mood prompts even at MEDIUM (several example prompts captured).
  • Idempotent submit returns DUPLICATE_TASK error instead of the original task_id — non-standard semantics.
  • webhook_url missing from CreateWebhookResponse.

Ready for review whenever you have time. Happy to walk through any section or specific commit in a sync if that's faster.

@krokoko
Contributor

krokoko commented May 5, 2026

Awesome, just a few findings from automated review; after that let's merge

  1. Fanout handler never returns DynamoDBBatchResponse despite reportBatchItemFailures: true

[CDK Code Review] cdk/src/constructs/fanout-consumer.ts:146 + cdk/src/handlers/fanout-task-events.ts:697

The construct sets reportBatchItemFailures: true but the handler returns void. When any record fails (e.g., resolveTokenSecretArn throws
AccessDeniedException), Lambda retries the entire batch instead of isolating the poisonous record. This defeats the per-task ordering guarantee and can cause
cascading retries.

Fix: Change handler return type to Promise<DynamoDBBatchResponse>, wrap per-record processing in try/catch, and return { batchItemFailures: [{ itemIdentifier:
record.eventID }] } for failed records.

  2. turns_attempted / turns_completed bypass coerceNumericOrNull — same bug class as the fixed costUsd.toFixed crash

[Type Design + Error Handling] cdk/src/handlers/shared/types.ts:304-305

The toTaskDetail mapper passes these fields through with only ?? null:
turns_attempted: record.turns_attempted ?? null,
turns_completed: record.turns_completed ?? null,
These are written via the same DDB Document-client path that caused Bug #1 (costUsd.toFixed is not a function). Any downstream caller doing arithmetic on
these will crash at runtime.

Fix: Apply coerceNumericOrNull consistently:
turns_attempted: coerceNumericOrNull(record.turns_attempted, 'turns_attempted', logger),
turns_completed: coerceNumericOrNull(record.turns_completed, 'turns_completed', logger),

  3. Cancel hook doesn't short-circuit nudge processing — nudges consumed but never delivered

[Agent Runtime] agent/src/hooks.py:374-385

When the cancel hook fires and sets ctx["_cancel_requested"] = True, the loop continues to _nudge_between_turns_hook, which reads DDB, marks nudges as
consumed, and adds them to _INJECTED_NUDGES. Since cancel takes precedence, these nudges are consumed (DDB state changed) but never actually injected into
the agent's context. The user sees 202 Accepted for their nudge but it silently disappears.

Fix: Check if ctx.get("_cancel_requested"): break after the cancel hook, or have _nudge_between_turns_hook early-return when cancel is set.

  4. CLI/CDK type drift: turns_attempted and turns_completed optionality mismatch

[Type Design + CLI Review] cli/src/types.ts:66-69

CDK TaskDetail declares these as required (number | null), but CLI marks them optional (?). This violates the documented contract that these files must stay
in sync.

Fix: Remove the ? from both fields in cli/src/types.ts.


Important Issues (8 found — should fix)

  5. Unhandled exception in routeEvent crashes entire fanout batch

[CDK Code Review] cdk/src/handlers/fanout-task-events.ts:734

routeEvent is called with no try/catch inside the record loop. If resolveTokenSecretArn throws (intentional for AccessDeniedException), the entire handler
crashes — combined with Issue #1, all records in the batch retry.

  6. progress_writer.py circuit breaker trips on ValidationException — kills all progress writes permanently

[Error Handling] agent/src/progress_writer.py:138-152

The bare except Exception feeds all errors into the same circuit breaker. A persistent schema error (e.g., item > 400KB) trips the breaker for all event
types, silencing the entire progress stream. Transient vs. permanent errors should be distinguished.

  7. Reconciler silently succeeds when ALL transitions fail (e.g., IAM/DDB outage)

[Error Handling] cdk/src/handlers/reconcile-stranded-tasks.ts:261-270

Per-task failures are caught and logged at WARN, but the Lambda returns success. A systemic DDB failure strands all tasks indefinitely with only WARN-level
logs as signal.

  8. Dual _ProgressWriter instances with independent circuit breakers

[Agent Runtime] agent/src/runner.py:240 + agent/src/pipeline.py:303

Two independent writers (runner-level for turn events, pipeline-level for milestones) each have their own failure counter. One can silently disable while the
other continues, creating invisible event gaps.

  9. renderCommentBody not self-defending against uncoerced DDB strings

[Error Handling] cdk/src/handlers/shared/github-comment.ts:329

Defense is entirely at the single fanout call site. A future caller passing raw DDB values will hit the same toFixed is not a function crash.

  10. channel_source typed as string instead of 'api' | 'webhook' literal union

[Type Design] cdk/src/handlers/shared/types.ts:54,158

Loses exhaustiveness checking in downstream switches. The internal CreateTaskContext correctly narrows it but the external types don't.

  11. Python TaskConfig missing @model_validator for trace=True + user_id=""

[Type Design] agent/src/models.py:111-124

This combination causes a runtime failure at S3 upload time rather than at construction. A validator would surface it immediately.

  12. Markdown injection possible via prUrl in GitHub comment body

[CDK Code Review] cdk/src/handlers/shared/github-comment.ts:315-325

While sanitizeEventType strips dangerous characters, prUrl goes unsanitized into a markdown link. A compromised agent writing a crafted URL could break the
table layout or inject content.

bgagent added 7 commits May 5, 2026 10:05
…rokoko review aws-samples#1, aws-samples#5, aws-samples#9, aws-samples#12)

Four related fixes on the fanout + github-comment surface, from the
code review on PR aws-samples#52. Grouped because they share the narrative
"defense-in-depth on the fanout dispatcher" — any one landing without
the others leaves a hole.

## Findings addressed

**aws-samples#1 — Fanout handler returns void despite reportBatchItemFailures: true**

The ``FanOutConsumer`` construct (``cdk/src/constructs/fanout-consumer.ts:146``)
has ``reportBatchItemFailures: true`` on its DDB Stream event-source
mapping. The handler returned ``void``, so Lambda retried the entire
batch on any unhandled throw instead of isolating the poisonous
record. Combined with aws-samples#5 this could cascade into retry storms and
violated the per-task ordering guarantee we rely on (§6.4, AD-9).

Fix: handler return type becomes ``Promise<DynamoDBBatchResponse>``.
Per-record processing is wrapped in try/catch; caught throws push
``{ itemIdentifier: record.eventID }`` to ``batchItemFailures`` and
emit ``fanout.record.failed`` warn. Final ``fanout.batch.complete``
log grows a ``failed`` count.

Note: ``DynamoDBStreamHandler`` constrains return to
``void | Promise<void>``, so the handler is typed as a plain 3-arg
async function. Lambda's runtime accepts either shape; existing
tests (passing ``event, context, cb``) work unchanged.

**aws-samples#5 — Unhandled exception in routeEvent crashes batch**

``routeEvent`` uses ``Promise.allSettled`` internally, but
``resolveTokenSecretArn`` can throw ``AccessDeniedException``
SYNCHRONOUSLY before the ``allSettled`` guard is reached. The new
per-record try/catch from aws-samples#1 catches these too.

**aws-samples#9 — renderCommentBody not self-defending against uncoerced DDB strings**

The ``.toFixed(4)`` call on ``costUsd`` is the same bug class as the
``toFixed is not a function`` crash we fixed at the fanout boundary
in commit 9fe704e. Today the sole call site coerces via the shared
helper; a future caller that forgets to would crash.

Fix: ``renderCommentBody`` coerces ``durationS`` and ``costUsd``
internally via the shared ``coerceNumericOrNull`` helper (second
line of defense; caller's coercion remains the first). Widened
``CommentBodyInput`` fields to ``number | string | null`` to
honestly model the DDB Document-client boundary.

**aws-samples#12 — Markdown injection possible via prUrl in GitHub comment body**

``prUrl`` was interpolated directly into a Markdown link target
(``[link](${input.prUrl})``). A crafted URL containing ``)`` / ``|``
/ ``\n`` could break the table layout or inject content, and a
``javascript:`` scheme could produce a click-to-execute link on some
Markdown renderers.

Fix: new exported ``sanitizeMarkdownLinkTarget`` helper in
``shared/github-comment.ts`` rejects URLs containing
``\r\n\t\s)|]"<>`` characters, validates via ``new URL()``, and
rejects non-http(s) schemes. Returns ``null`` on rejection so
``renderCommentBody`` omits the Pull-request row entirely rather
than emitting a broken or unsafe link.
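
A sketch of the helper under the rules above (reject suspicious characters,
require a parseable http(s) URL, return null otherwise); the exported
function in shared/github-comment.ts is authoritative and may differ in
detail:

```typescript
// Sketch only — mirrors the rules described above, not the exact implementation.
export function sanitizeMarkdownLinkTarget(raw: string | null | undefined): string | null {
  if (!raw) {
    return null;
  }
  // Characters that can break out of a [text](url) link or a Markdown table cell.
  if (/[\r\n\t\s)|\]"<>]/.test(raw)) {
    return null;
  }
  let parsed: URL;
  try {
    parsed = new URL(raw);
  } catch {
    return null; // malformed URL
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return null; // javascript:, data:, file:, ftp:, ...
  }
  return raw;
}
```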

## Tests

+22 regression tests net (fanout 7 for aws-samples#1+aws-samples#5 + 3 for aws-samples#9; github-comment
12 for aws-samples#12):

- Fanout partial-batch: poison-pill isolation, mixed-batch (good
  record NOT in failures), observability warn, empty-failures
  regression guard, baseline pin that today's ``Promise.allSettled``
  containment still works.
- renderCommentBody numeric self-defense: string-typed values render
  correctly; non-finite strings collapse to null with warn; null
  does NOT warn.
- sanitizeMarkdownLinkTarget unit tests: accept clean http/https,
  reject 9 injection patterns, reject 4 non-http schemes
  (``javascript:``, ``data:``, ``file:``, ``ftp:``), reject
  malformed, handle null/undefined. Plus end-to-end assertions on
  ``renderCommentBody`` proving the PR row is omitted on rejection.

CDK suite: 1029 passing (was 1001).

Refs: krokoko code review on PR aws-samples#52 (findings 1, 5, 9, 12)
…al union (krokoko review aws-samples#2, aws-samples#4, aws-samples#10)

Three related type-safety fixes from the code review on PR aws-samples#52. Grouped
because they all touch the same contract surface
(``TaskDetail`` on both sides of the wire) and share the theme "honest
types at the DDB + API boundary".

## Findings addressed

**aws-samples#2 — turns_attempted / turns_completed bypass coerceNumericOrNull**

``toTaskDetail`` at ``cdk/src/handlers/shared/types.ts`` was passing
``record.turns_attempted`` and ``record.turns_completed`` through
with only ``?? null`` — same bug class as the
``costUsd.toFixed is not a function`` crash fixed at the fanout
boundary in commit ``9fe704e`` / consolidated in ``c09bfd7``. The
DDB Document-client deserializes ``Number`` attributes as ``string``
on some code paths; any downstream caller doing arithmetic on these
fields would crash at runtime.

Fix extends beyond the two fields the review called out — ALL
numeric fields sourced from the DDB record now route through
``coerceNumericOrNull``: ``duration_s``, ``cost_usd``, ``max_turns``,
``max_budget_usd``, ``turns_attempted``, ``turns_completed``. A new
JSDoc block on ``toTaskDetail`` documents the contract ("all numeric
fields coerced through shared helper; do not bypass") so a future
field addition has a clear pattern to follow. Scope-bounded: the
non-numeric fields (``task_description``, ``pr_url``, …) and
request-body-validated ints (``issue_number``, ``pr_number``) stay
untouched.
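
The pattern at the mapper boundary, sketched with an assumed
``coerceNumericOrNull`` implementation inferred from the behaviour described
here and in the review (the real shared helper is the source of truth):

```typescript
// Sketch of the boundary coercion; the real helper logs through the structured logger.
type Logger = { warn: (msg: string) => void };

function coerceNumericOrNull(value: unknown, field: string, logger: Logger): number | null {
  if (value === null || value === undefined) {
    return null; // absent field — nothing to warn about
  }
  const n = typeof value === 'number' ? value : Number(value);
  if (!Number.isFinite(n)) {
    logger.warn(`dropping non-numeric value for ${field}`);
    return null;
  }
  return n;
}

// All numeric fields sourced from the DDB record route through the helper;
// request-body-validated ints (issue_number, pr_number) stay untouched.
function toTaskDetailNumericSlice(record: Record<string, unknown>, logger: Logger) {
  return {
    duration_s: coerceNumericOrNull(record.duration_s, 'duration_s', logger),
    cost_usd: coerceNumericOrNull(record.cost_usd, 'cost_usd', logger),
    max_turns: coerceNumericOrNull(record.max_turns, 'max_turns', logger),
    max_budget_usd: coerceNumericOrNull(record.max_budget_usd, 'max_budget_usd', logger),
    turns_attempted: coerceNumericOrNull(record.turns_attempted, 'turns_attempted', logger),
    turns_completed: coerceNumericOrNull(record.turns_completed, 'turns_completed', logger),
  };
}
```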

**aws-samples#4 — CLI/CDK type drift on turns_attempted / turns_completed**

Per AGENTS.md the CDK and CLI ``TaskDetail`` declarations must stay
in sync. CDK declared both fields as required (``number | null``);
CLI marked them optional (``number | null | undefined``). The
tightening means the server's guarantee (``toTaskDetail`` always
emits both fields, defaulting to ``null`` when absent on the record)
now flows honestly to the CLI type.

**aws-samples#10 — channel_source typed as string instead of literal union**

Added an exported ``ChannelSource = 'api' | 'webhook'`` literal union
in both CDK and CLI ``types.ts``. ``TaskRecord.channel_source`` and
``TaskDetail.channel_source`` on both sides now use the narrow type.
Downstream switches/predicates that read ``channel_source`` get
exhaustiveness checking at compile time, matching the internal
``CreateTaskContext.channelSource`` literal already in use at
``create-task-core.ts``. Reviewer's comment: "the internal
CreateTaskContext correctly narrows it but the external types don't"
— now they do.
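
The narrowing in miniature: with the literal union, a downstream switch gets
compile-time exhaustiveness. The consumer below is hypothetical:

```typescript
export type ChannelSource = 'api' | 'webhook';

// Hypothetical consumer: adding a third source later turns the missing case
// into a compile error instead of a silent fallthrough.
function describeProvenance(source: ChannelSource): string {
  switch (source) {
    case 'api':
      return 'submitted via CLI / Cognito';
    case 'webhook':
      return 'submitted via HMAC-signed webhook';
    default: {
      const exhausted: never = source;
      return exhausted;
    }
  }
}
```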

## Tests

+3 regression tests in
``cdk/test/handlers/shared/error-classifier.test.ts`` under the
``toTaskDetail integration`` block:

  - String-typed DDB numeric fields coerce to finite numbers on
    the ``TaskDetail`` output (``typeof === 'number'`` + exact value
    for all six coerced fields).
  - Unparseable numeric strings collapse to ``null`` without crash
    (defence test for the non-finite branch in ``coerceNumericOrNull``).
  - ``channel_source`` narrows to the literal union — uses
    ``@ts-expect-error`` to pin that a widened ``'slack'`` fails to
    compile, so a future ``ChannelSource`` regression surfaces in CI
    immediately.

CLI test fixtures updated to include ``channel_source: 'api'`` and
``turns_attempted: null`` / ``turns_completed: null`` where the
non-optional fields were previously omitted. No CLI test count
change — the fixture additions satisfy the stricter contract
without requiring new test bodies.

CDK suite: 1032 passing (was 1029). CLI suite: 190 passing.

Refs: krokoko code review on PR aws-samples#52 (findings 2, 4, 10)
…iew aws-samples#3)

Pre-fix, the between-turns-hooks dispatcher at ``agent/src/hooks.py``
ran EVERY registered hook in the list before checking
``ctx["_cancel_requested"]``. The registered order is cancel-hook
first, nudge-hook second — so when cancel fired, the nudge hook had
already run and:

  - Read ``TaskNudgesTable`` via ``read_pending``.
  - Marked the nudge rows ``status=consumed`` via ``mark_consumed``.
  - Added them to the in-memory ``_INJECTED_NUDGES`` dedup set.

But the dispatcher's return value discarded everything when cancel
was detected, so the nudge was silently lost. The user's ``bgagent
nudge`` had returned 202 Accepted, the DDB row was now ``consumed``,
yet the agent never saw the nudge content. A cancelled task leaks
a nudge on every invocation.

## Fix — belt-and-braces

Two independent guards so the invariant is preserved even if a
future refactor reorders the hook registry:

1. **Loop-level break (primary).** The between-turns-hooks dispatcher
   checks ``ctx.get("_cancel_requested")`` immediately after each
   hook returns and breaks out of the loop. Cancel-hook runs first
   by convention, flips the flag, and the nudge-hook never runs. No
   DDB reads, no status mutation.

2. **Internal early-return in the nudge hook (secondary).**
   ``_nudge_between_turns_hook`` checks ``ctx.get("_cancel_requested")``
   up front (right after the empty-task-id guard, BEFORE any DDB
   call and BEFORE ``_emit_nudge_milestone`` would write a spurious
   ``nudge_acknowledged``). Also guards against the pre-loop-cancel
   case (cancel flag already set when the dispatcher enters).

The two guards are independent: if the registry ordering breaks (e.g.
Phase 3 prepends an approval hook that flips cancel after the
registered cancel hook), the internal early-return still protects
against data loss.

Hardened the registry-declaration comment at ``agent/src/hooks.py``
to state the ordering is load-bearing ("cancel MUST come before
nudge") and documented how future hooks should be appended. Did NOT
add a runtime sort — the list literal preserves insertion order
deterministically, and forcing an implicit sort would hide real bugs
from anyone intentionally reordering.

## Tests

+5 regression tests in a new ``agent/tests/test_cancel_hook.py``:

  - ``test_nudge_hook_not_invoked_when_cancel_fires_first`` — spy
    hook verifies invocation count is 0 after cancel; dedup set
    untouched.
  - ``test_real_nudge_reader_not_touched_on_cancel`` — end-to-end
    with mocked DDB; asserts neither ``table.query`` nor
    ``update_item`` is called when cancel fires first.
  - ``test_preloop_cancel_skips_all_hooks_via_internal_guard`` —
    the nudge hook invoked directly with ``_cancel_requested=True``;
    no DDB I/O, no dedup mutation.
  - ``test_nudge_hook_internal_guard_fires_even_if_registry_reordered``
    — asserts ``write_agent_milestone`` is NOT called when cancel is
    set (no spurious ack milestone for a cancelled turn).
  - ``test_running_task_nudge_still_consumed_normally`` —
    negative-control regression: RUNNING tasks still flow through
    cancel→nudge→inject path with DDB calls firing exactly once.

Agent suite: 473 passing (was 468).

Refs: krokoko code review on PR aws-samples#52 (finding 3)
…aker (krokoko review aws-samples#6, aws-samples#8)

Two related hardening changes on ``agent/src/progress_writer.py``.
Grouped because the shared circuit breaker reuses the error-
classification decision to know when NOT to flip itself, and
separating the commits would force an awkward "half-fix" intermediate
state.

## Findings addressed

**aws-samples#6 — Circuit breaker trips on ValidationException (permanent errors)**

The pre-fix ``except Exception`` branch fed ALL errors into the same
``_failure_count`` counter. A persistent schema/size error
(e.g. ``ValidationException`` from an item >400 KB under a
trace-heavy event) counted against the transient-failure budget and
tripped the breaker within 3 events, silencing the entire progress
stream for the rest of the task — even though most subsequent
events were normal size and would have written fine.

New behaviour classifies each DDB error into three buckets:

  - **Permanent (drop event, keep stream alive):**
    ``ValidationException``, ``ItemCollectionSizeLimitExceededException``
    — the individual event is malformed or oversized, retrying or
    treating as transient would not help. Log at WARN, skip the
    event, do NOT increment the failure counter.
  - **Immediate-disable (fatal, don't even try to retry):**
    ``ResourceNotFoundException``, ``AccessDeniedException``,
    ``UnauthorizedOperation`` — wrong deploy or IAM misconfig. Disable
    the breaker on the first occurrence instead of waiting for 3
    failures; log at WARN with the error code. Avoids spamming
    operator dashboards with 3 copies of the same permissions error.
  - **Transient (trip the breaker on repeated failures, as today):**
    ``ProvisionedThroughputExceededException``,
    ``RequestLimitExceeded``, ``ServiceUnavailable``,
    ``InternalServerError``, plus network-layer (``ConnectionError``,
    ``EndpointConnectionError``, ``ReadTimeoutError``). Same counter
    semantics as pre-fix.
  - **Unknown (default conservative):** counted as transient (counter
    increments) but logged at ERROR with an explicit ``UNKNOWN``
    marker so operators notice and can add new codes to the
    permanent/transient lists. Does NOT instant-disable — over-
    correcting from pre-fix behaviour would swap one failure mode
    for another.

Uses ``botocore.exceptions.ClientError`` + ``err.response["Error"]["Code"]``
for AWS errors; class-name matching for non-ClientError (network-
layer) paths. Helper: ``_classify_ddb_error(exc) ->
Literal["permanent", "immediate_disable", "transient", "unknown"]``.

**aws-samples#8 — Dual _ProgressWriter instances with independent circuit breakers**

Pre-fix, the runner-level writer (turn/tool events at ``runner.py:240``)
and the pipeline-level writer (milestones at ``pipeline.py:303``) each
held their own ``_failure_count`` / ``_disabled`` state. If throttling
tripped one writer, the other kept writing — creating visible event
gaps in the stream that operators could not distinguish from agent
activity (milestones firing after turn events stop, or vice versa).

Fix: consolidate circuit-breaker state into a module-level
``_SharedCircuitBreaker`` singleton keyed by ``task_id``. Both writers
for the same task read/write the same ``(_failure_count, _disabled)``
pair through named methods (``is_disabled``, ``record_failure``,
``record_success``, ``disable``). One task's stream is either healthy
(all events flow from both writers) or degraded (no events flow from
either). Cannot have a half-alive stream.

Semantics notes:

  - ``record_success`` resets the shared counter but NOT the
    ``_disabled`` flag. Re-enabling mid-task would let a flaky
    minute burn the failure budget repeatedly and defeat the
    breaker's purpose.
  - Empty-string ``task_id`` (``runner.py`` falls back to sentinel
    ``"unknown"``) collapses to shared state for all ``"unknown"``
    writers. Real task_ids stay isolated.
  - Writers retain ``_disabled`` / ``_failure_count`` as properties
    that proxy to the shared map. Existing callers (``hooks.py``
    does ``getattr(progress, "_disabled", False)``) and tests that
    assign ``writer._failure_count = 0`` keep working unchanged —
    no constructor signature change required, no
    ``runner.py`` / ``pipeline.py`` edits.
  - Single ``threading.Lock()`` protects the shared map; DDB write
    rate (single-digit events/sec) never contends meaningfully.
  - Test hygiene: ``_reset_circuit_breakers()`` helper rebinds the
    module global so autouse fixtures give each test a clean slate.

## Tests

+24 regression tests net (36 → 60 in ``test_progress_writer.py``).
Coverage:

  - Finding aws-samples#6 classification:
    ``test_permanent_error_does_not_trip_breaker`` (10 consecutive
    ``ValidationException`` writes keep ``_disabled=False``),
    ``test_transient_error_trips_breaker_as_before``,
    ``test_access_denied_disables_writer_immediately_with_loud_log``,
    ``test_unknown_exception_treated_as_transient_with_error_log``.
  - Finding aws-samples#8 sharing:
    ``test_shared_circuit_breaker_across_writers_same_task_id``
    (writer-A trips the breaker; writer-B sees ``is_disabled`` and
    skips the DDB call),
    ``test_separate_tasks_have_independent_breakers``,
    ``test_unknown_sentinel_task_id_is_isolated``,
    ``test_reset_helper_clears_shared_state_between_tests``.
  - Plus edge cases: success-interleave resets the counter across
    writers; ``_disabled`` stays open after re-enabling a success
    mid-task; thread-safety via concurrent writes.

Agent suite: 497 passing (was 473; +24).

Refs: krokoko code review on PR aws-samples#52 (findings 6, 8)
…-user_id validator (krokoko review aws-samples#7, aws-samples#11)

Two small hardening changes paired under the theme "surface silent
failures at construction time / at run time". Both make existing
latent bugs visible instead of silently breaking at the worst
possible moment.

## Findings addressed

**aws-samples#7 — Reconciler silently succeeds when ALL transitions fail**

``reconcile-stranded-tasks`` previously logged per-task failures at
WARN and returned success unconditionally. A systemic failure (DDB
throttling at the shard level, IAM outage, schema corruption) could
strand 100% of candidates — and the only signal was a WARN log per
task plus a final INFO that looked like a healthy run. Operators
dashboarding "reconciler completed" counts would not notice the
outage.

New behaviour classifies the run result into three cases and picks
a log level accordingly:

  - ``stranded > 0 AND failed == 0 AND errors > 0`` → ERROR with
    ``error_id: 'RECONCILER_TOTAL_FAILURE'``. Systemic failure; alarm
    on the error_id string.
  - ``errors > 0 AND failed > 0`` → WARN with
    ``error_id: 'RECONCILER_PARTIAL_FAILURE'``. Dashboards signal;
    not alarm-worthy on its own.
  - Otherwise → INFO, as today.

The handler still does NOT throw — event-source-mapping invocations
complete normally. The log-level escalation IS the alarm signal,
matching the ``error_id`` convention already used in
``fanout-task-events.ts`` (``FANOUT_GITHUB_PERSIST_FAILED``).
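
The run-result classification above, sketched with hypothetical counter
names; the deployed handler derives these counts while iterating candidates:

```typescript
// Hypothetical run summary — names are illustrative, not the handler's actual fields.
interface RunSummary {
  stranded: number; // candidates that exceeded their timeout
  failed: number;   // candidates successfully transitioned to FAILED
  errors: number;   // per-task transition attempts that threw
}

type LogLevel = 'ERROR' | 'WARN' | 'INFO';

function reconcilerLogLevel(s: RunSummary): { level: LogLevel; errorId?: string } {
  if (s.stranded > 0 && s.failed === 0 && s.errors > 0) {
    // Nothing transitioned and everything errored — systemic failure, alarm-worthy.
    return { level: 'ERROR', errorId: 'RECONCILER_TOTAL_FAILURE' };
  }
  if (s.errors > 0 && s.failed > 0) {
    return { level: 'WARN', errorId: 'RECONCILER_PARTIAL_FAILURE' };
  }
  return { level: 'INFO' };
}
```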

**aws-samples#11 — TaskConfig missing @model_validator for trace=True + user_id=""**

The trace trajectory is uploaded to
``traces/<user_id>/<task_id>.jsonl.gz`` (design §10.1), and the
``get-trace-url`` handler refuses presigned keys outside the
caller's own ``traces/<user_id>/`` prefix. Pre-fix, a TaskConfig
built with ``trace=True`` and an empty ``user_id`` sentinel would
construct fine and fail later at S3 upload time — mid-task, when
the agent had already paid the cost of running.

Added a ``@model_validator(mode='after')`` on ``TaskConfig`` that
raises a descriptive ``ValueError`` when ``trace=True`` and
``user_id`` is empty. Construction fails immediately; local/dev
misconfigurations surface before the agent wastes tokens.

The error message cites design §10.1 + the get-trace-url handler's
prefix guard so the remedy is clear without cross-referencing other
files.

## Tests

**CDK reconciler (+4 tests):**
  - Total-failure case logs ERROR + ``RECONCILER_TOTAL_FAILURE``.
  - Partial-failure case logs WARN + ``RECONCILER_PARTIAL_FAILURE``.
  - Full-success case logs INFO (happy-path regression).
  - Empty-candidate case logs INFO (not alarming — absence of
    stranded tasks is the target state).

CDK suite: 1036 passing (was 1032).

**Agent TaskConfig (+3 tests, +2 pipeline realignments):**
  - ``test_trace_true_with_empty_user_id_raises_at_construction`` —
    validator fires immediately with the documented message fragment.
  - ``test_trace_true_with_valid_user_id_constructs_cleanly`` —
    happy-path regression.
  - ``test_trace_false_allows_empty_user_id`` — negative control;
    local/batch runs without an orchestrator still work as long as
    they do not opt into trace capture.
  - Two existing ``test_pipeline.py`` tests constructed
    ``TaskConfig(trace=True, user_id="")`` directly. These now
    either (a) pass a real ``user_id`` for the happy path, or (b)
    assert that construction raises — the tightened contract is
    strictly stronger than the previous "defensive-at-upload skip".

Agent suite: 500 passing (was 498; +3 new, −1 obsolete, +0 net from
pipeline realignment).

Refs: krokoko code review on PR aws-samples#52 (findings 7, 11)
…s callback-style signatures

The partial-batch-response fix in commit ``1b8ffa6`` typed the
FanOutConsumer handler as a 3-arg async function
(``handler(event, _context?, _callback?)``) to keep existing test
call sites — which pass ``handler(event, {} as never, () => undefined)``
— working without a sweeping test edit.

nodejs24.x's Lambda runtime detects any 3-arg handler shape as
callback-style at init and rejects it with
``Runtime.CallbackHandlerDeprecated``. Every invocation of the
FanOutConsumer since redeploy at 2026-05-05 18:46 UTC crashlooped,
observed in CloudWatch:

  Runtime.CallbackHandlerDeprecated: ERROR: AWS Lambda has removed
  support for callback-based function handlers starting with
  Node.js 24. You need to modify this function to use a supported
  handler signature to use Node.js 24 or later.

End-to-end impact: zero DDB-stream events processed; no comment
edits fired on any task since the redeploy. The partial-batch code
the commit was supposed to enable was never reached.

Fix: drop ``_context`` and ``_callback`` from the handler signature.
Handler is now ``handler(event): Promise<DynamoDBBatchResponse>``.
Tests that invoked with trailing args (``handler(event, ctx, cb)``)
are stripped back to ``handler(event)`` to satisfy ts-jest's strict
arity check — JS itself would silently ignore the extras, but TS2554
surfaces at compile time.
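
Net shape after the partial-batch commit plus this hotfix, sketched with
simplified event types (the real handler also does the per-channel routing
inside ``routeEvent``):

```typescript
// Simplified sketch of the post-hotfix shape: single-argument async handler
// returning a partial-batch response.
interface StreamRecord {
  eventID: string;
}

interface StreamEvent {
  Records: StreamRecord[];
}

interface DynamoDBBatchResponse {
  batchItemFailures: { itemIdentifier: string }[];
}

// Hypothetical stand-in for the real per-record routing logic.
async function routeEvent(record: StreamRecord): Promise<void> {
  void record;
}

export async function handler(event: StreamEvent): Promise<DynamoDBBatchResponse> {
  const batchItemFailures: { itemIdentifier: string }[] = [];
  for (const record of event.Records) {
    try {
      await routeEvent(record);
    } catch {
      // Isolate the poison record; healthy records in the batch are not retried.
      batchItemFailures.push({ itemIdentifier: record.eventID });
    }
  }
  return { batchItemFailures };
}
```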

Tests: 74 pass in ``fanout-task-events.test.ts`` (no count change;
arity-only edits). Full CDK suite: 1036 passing.

Refs: integration-pass report on commit ``331e283`` redeploy;
original handler signature introduced in ``1b8ffa6``.
@scoropeza
Contributor Author

Thanks @krokoko for the thorough 12-finding review — this was exactly the kind of scrutiny the change needed. All 12 are now addressed across 5 thematic commits plus one hotfix caught by integration re-testing.

Finding → commit map

| # | Finding | Commit |
| --- | --- | --- |
| 1, 5, 9, 12 | Fanout partial-batch retry + GitHub-comment defense-in-depth | 1b8ffa6 |
| 2, 4, 10 | TaskDetail numeric coercion + ChannelSource literal union | d6ad9a5 |
| 3 | Cancel hook short-circuits nudge consumption | 9e6c23f |
| 6, 8 | Progress writer error classification + shared circuit breaker | db55bfa |
| 7, 11 | Reconciler alarming + TaskConfig trace-without-user_id validator | 331e283 |
| Hotfix | Drop unused _context/_callback args (nodejs24 rejects callback-style signatures) | d8a98d6 |

Notes on the hotfix

The hotfix landed after integration re-testing of the partial-batch fix. Under nodejs24.x, Lambda rejects the legacy 3-arg (event, context, callback) handler signature and throws Runtime.CallbackHandlerDeprecated on every invocation. The fanout consumer was crashlooping silently — visible only in CloudWatch. Switching to the 2-arg async (event) form restored it. Post-hotfix CloudWatch shows zero CallbackHandlerDeprecated errors and fanout.batch.complete reports failed: 0.

Updated test counts

  • CDK: 1001 → 1036
  • CLI: 190 (unchanged)
  • Agent: 468 → 500

All suites green at tip 7e170da.

Live integration validation

Post-hotfix re-run against a deployed dev stack confirmed:

  • Fanout partial-batch: fanout.batch.complete with failed: 0; POST + PATCH dispatched against the same comment_id across pr_created → task_completed; zero fanout.dispatcher.rejected.
  • Cancel-vs-nudge: task_cancelled observed without a corresponding nudge_acknowledged when cancel won the race.
  • Progress writer: no regressions in error classification path; circuit breaker holds.
  • Reconciler alarming: INFO with errors: 0 on cold-start sweep.

Ready for another pass when you have time.

…ix set

CI (build workflow) failed on the pushed review-fix commits with:
- ruff E501 line too long in progress_writer.py (extracted set literal to
  named constant)
- ruff RUF100 unused noqa directives (removed)
- ty unresolved attribute on Exception.response in progress_writer.py and
  test_progress_writer.py (refactored to type-narrowed flow; used
  setattr in test helper)
- eslint no-require-imports in reconcile-stranded-tasks.test.ts (added
  rule to existing disable comment)

No behavioral changes; all 500 agent tests + 1036 CDK tests pass.