Runner error handling #2093

Open
AndreiCravtov wants to merge 38 commits into main from andrei/error-handling

@AndreiCravtov (Collaborator) commented May 14, 2026

Runner error handling

Motivation

Runner failures were mostly surfaced as plain shutdown messages, which made the root cause hard to spot from API errors or runner status.

This adds an MVP path for preserving runner crash context and attaching known stderr diagnostics to failure reports.

Changes

  • Added RunnerTerminationError for Python exceptions raised inside runner bootstrap
  • Changed runner bootstrap to send Event | RunnerTerminationError over the private runner channel
  • Moved public RunnerFailed emission back into supervisor
  • Added stderr-only RunnerDiagnosticCollector
    • Added known diagnostics for Metal GPU timeout, ring socket receive errno, and ring transport abort
  • Added diagnostics to RunnerFailed and ErrorChunk
  • Tweaked async process termination to join briefly before terminate/kill
  • Updated tests/fixtures for new failure payload shape
  • Added Ruff VS Code formatter settings
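
The "join briefly before terminate/kill" item above could look roughly like the following. This is a minimal sketch, not the code in this PR: `ProcessLike`, `shutdown_process`, and the `grace` / `term_wait` defaults are hypothetical names and values chosen for illustration. The only assumption about the real process object is that it exposes the `join`, `is_alive`, `terminate`, and `kill` methods that `multiprocessing.Process` provides.

```python
from __future__ import annotations

import asyncio
from typing import Protocol


class ProcessLike(Protocol):
    """Subset of the multiprocessing.Process interface used below."""

    def join(self, timeout: float | None = None) -> None: ...
    def is_alive(self) -> bool: ...
    def terminate(self) -> None: ...
    def kill(self) -> None: ...


async def shutdown_process(
    proc: ProcessLike,
    grace: float = 0.5,      # hypothetical: short window for a clean exit
    term_wait: float = 2.0,  # hypothetical: wait after terminate()
) -> str:
    """Stop proc, escalating only as far as needed.

    Returns which step ended the process: "joined", "terminated", or "killed".
    """
    # Step 1: join briefly so a child that is already exiting can finish
    # by itself and keep its real exit code.
    await asyncio.to_thread(proc.join, grace)
    if not proc.is_alive():
        return "joined"

    # Step 2: ask politely, then wait again.
    proc.terminate()
    await asyncio.to_thread(proc.join, term_wait)
    if not proc.is_alive():
        return "terminated"

    # Step 3: last resort.
    proc.kill()
    await asyncio.to_thread(proc.join)
    return "killed"
```

Joining first matters for error reporting: a child that crashed with its own exit code or signal is not overwritten by a supervisor-initiated terminate.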

Why It Works

The runner child now reports near-raw failure context to the supervisor instead of publishing the failed status directly.
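
As a rough illustration of that hand-off, the sketch below sends either an event or a termination error over one channel and lets the receiving side decide what to publish. Everything here is an assumption for illustration: the fields of `Event` and `RunnerTerminationError`, the `bootstrap` and `drain` helpers, and the use of `queue.Queue` as the private channel are hypothetical and may not match the PR.

```python
from __future__ import annotations

import queue
import traceback
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Event:
    """Stand-in for the project's Event type (field is hypothetical)."""
    name: str


@dataclass(frozen=True)
class RunnerTerminationError:
    """Hypothetical shape: crash context captured inside the runner."""
    exception_type: str
    message: str
    traceback_text: str


# The private channel carries either kind of message.
RunnerMessage = Union[Event, RunnerTerminationError]


def bootstrap(
    run: Callable[["queue.Queue[RunnerMessage]"], None],
    channel: "queue.Queue[RunnerMessage]",
) -> None:
    """Run the runner body; on failure, report context instead of a status."""
    try:
        run(channel)
    except Exception as exc:
        channel.put(
            RunnerTerminationError(
                exception_type=type(exc).__name__,
                message=str(exc),
                traceback_text=traceback.format_exc(),
            )
        )


def drain(channel: "queue.Queue[RunnerMessage]") -> list[str]:
    """Supervisor side: turn channel messages into public-facing strings."""
    out: list[str] = []
    while True:
        try:
            msg = channel.get_nowait()
        except queue.Empty:
            return out
        if isinstance(msg, RunnerTerminationError):
            # Only the supervisor decides what the public failure looks like.
            out.append(f"RunnerFailed: {msg.exception_type}: {msg.message}")
        else:
            out.append(f"event: {msg.name}")
```

Because the union is explicit, a type checker can flag any supervisor code path that forgets to handle the error variant.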

The supervisor still owns process lifecycle, exit code/signal handling, in-flight task error chunks, and final runner status. Stderr diagnostics stay best-effort, and only known root-cause variants are surfaced.
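
A best-effort, known-variants-only collector might be sketched as below. This is not the `RunnerDiagnosticCollector` from the PR: the class name `SketchDiagnosticCollector`, the `Diagnostic` enum, and especially the regular expressions are placeholders invented for illustration, since the real stderr message formats are not shown here.

```python
from __future__ import annotations

import re
from enum import Enum


class Diagnostic(Enum):
    """The three known variants named in the change list."""
    METAL_GPU_TIMEOUT = "metal_gpu_timeout"
    RING_SOCKET_RECV_ERRNO = "ring_socket_recv_errno"
    RING_TRANSPORT_ABORT = "ring_transport_abort"


# Placeholder patterns: the real log formats are an assumption.
_PATTERNS: list[tuple[re.Pattern[str], Diagnostic]] = [
    (re.compile(r"gpu timeout", re.IGNORECASE), Diagnostic.METAL_GPU_TIMEOUT),
    (re.compile(r"recv.*errno", re.IGNORECASE), Diagnostic.RING_SOCKET_RECV_ERRNO),
    (re.compile(r"ring.*abort", re.IGNORECASE), Diagnostic.RING_TRANSPORT_ABORT),
]


class SketchDiagnosticCollector:
    """Scan stderr lines and remember only recognised root causes."""

    def __init__(self) -> None:
        self._found: list[Diagnostic] = []

    def feed(self, line: str) -> None:
        """Inspect one stderr line; unrecognised lines are ignored."""
        for pattern, diagnostic in _PATTERNS:
            if pattern.search(line) and diagnostic not in self._found:
                self._found.append(diagnostic)

    def diagnostics(self) -> list[Diagnostic]:
        """Return recognised diagnostics in first-seen order."""
        return list(self._found)
```

Unmatched lines produce nothing, so a failure report built this way carries either a known diagnostic or none at all, never a guess.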

Test Plan

Manual Testing

Hardware: remote runner logs from e16/e11/e4/e2

What you did:

  • inspected live runner stderr logs
  • used observed Metal GPU timeout and ring socket errors as initial diagnostic targets

Automated Testing

  • nix flake check
  • supervisor test covers error chunk + failed status emission
  • plan lifecycle test updated for failed runner diagnostics
  • type/lint checks cover new runner channel union

@AndreiCravtov requested a review from Evanev7 May 14, 2026 16:40
@AndreiCravtov requested a review from rltakashige May 14, 2026 16:45
@AndreiCravtov requested a review from ciaranbor May 14, 2026 17:04