Runner error handling by AndreiCravtov · Pull Request #2093 · exo-explore/exo

AndreiCravtov · 2026-05-14T16:36:04Z

Runner error handling

Motivation

Runner failures were mostly surfaced as plain shutdown messages, which made the root cause hard to spot from API errors or runner status.

This adds an MVP path for preserving runner crash context and attaching known stderr diagnostics to failure reports.

Changes

Added RunnerTerminationError for Python exceptions raised inside runner bootstrap
Changed runner bootstrap to send Event | RunnerTerminationError over the private runner channel
Moved public RunnerFailed emission back into supervisor
Added stderr-only RunnerDiagnosticCollector
- Added known diagnostics for Metal GPU timeout, ring socket receive errno, and ring transport abort
Added diagnostics to RunnerFailed and ErrorChunk
Tweaked async process termination to join briefly before terminate/kill
Updated tests/fixtures for new failure payload shape
Added Ruff VS Code formatter settings
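
The "join briefly before terminate/kill" item above could look roughly like the sketch below. It is an illustration only: it assumes an `asyncio` subprocess and uses `wait()` as a stand-in for join, while the PR's actual process type, helper names, and grace periods are not shown on this page. `stop_process` and `demo` are hypothetical names.

```python
import asyncio
import contextlib
import sys


async def stop_process(proc: asyncio.subprocess.Process, grace: float = 0.5) -> str:
    """Stop `proc`, escalating only as far as needed (hypothetical helper).

    Returns "joined", "terminated", or "killed" to say which step worked.
    """
    # Step 1: wait briefly so a child that is already exiting can finish cleanly.
    try:
        await asyncio.wait_for(proc.wait(), timeout=grace)
        return "joined"
    except asyncio.TimeoutError:
        pass

    # Step 2: ask politely (SIGTERM on POSIX) and wait again.
    # The child may exit between the timeout and this call, hence the suppress.
    with contextlib.suppress(ProcessLookupError):
        proc.terminate()
    try:
        await asyncio.wait_for(proc.wait(), timeout=grace)
        return "terminated"
    except asyncio.TimeoutError:
        pass

    # Step 3: force the exit (SIGKILL on POSIX) and reap the process.
    with contextlib.suppress(ProcessLookupError):
        proc.kill()
    await proc.wait()
    return "killed"


async def demo() -> tuple[str, str]:
    # One child that exits on its own, one that has to be stopped.
    quick = await asyncio.create_subprocess_exec(sys.executable, "-c", "pass")
    slow = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )
    return await stop_process(quick, grace=5.0), await stop_process(slow, grace=1.0)
```

Joining first means a runner that is already on its way out keeps its own exit code instead of being reported as killed by a signal, which is presumably what the supervisor wants to see when it builds a failure report.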

Why It Works

The runner child now reports mostly raw failure context to the supervisor instead of publishing the failed status directly.
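
A minimal sketch of that hand-off, under stated assumptions: the type names `Event`, `RunnerTerminationError`, and `RunnerFailed` come from this PR, but every field, plus the helpers `run_bootstrap` and `supervise`, is hypothetical and chosen only to make the example self-contained.

```python
import traceback
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple, Union


@dataclass(frozen=True)
class Event:
    """Stand-in for the existing Event type (fields are assumed)."""
    name: str


@dataclass(frozen=True)
class RunnerTerminationError:
    """Private message: a Python exception raised inside runner bootstrap."""
    exception_type: str
    message: str
    traceback_text: str


@dataclass(frozen=True)
class RunnerFailed:
    """Public status, emitted only by the supervisor (fields are assumed)."""
    error_message: str
    diagnostics: Tuple[str, ...] = ()


# Equivalent to `Event | RunnerTerminationError` on Python 3.10+.
RunnerMessage = Union[Event, RunnerTerminationError]


def run_bootstrap(
    body: Callable[[], Iterator[Event]], send: Callable[[RunnerMessage], None]
) -> None:
    """Child side: forward events, and report a crash instead of a status."""
    try:
        for event in body():
            send(event)
    except Exception as exc:
        send(
            RunnerTerminationError(
                exception_type=type(exc).__name__,
                message=str(exc),
                traceback_text=traceback.format_exc(),
            )
        )


def supervise(
    messages: Iterable[RunnerMessage], diagnostics: Iterable[str] = ()
) -> List[Union[Event, RunnerFailed]]:
    """Supervisor side: turn the private error into the public RunnerFailed."""
    published: List[Union[Event, RunnerFailed]] = []
    for msg in messages:
        if isinstance(msg, RunnerTerminationError):
            published.append(
                RunnerFailed(
                    error_message=f"{msg.exception_type}: {msg.message}",
                    diagnostics=tuple(diagnostics),
                )
            )
            break  # nothing after a crash is forwarded
        published.append(msg)
    return published
```

Keeping the union on the private channel means the child never has to decide what the public failure looks like; the supervisor can combine the exception with whatever else it knows before publishing.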

The supervisor still owns process lifecycle, exit code/signal handling, in-flight task error chunks, and final runner status. Stderr diagnostics stay best-effort, and only known root-cause variants are surfaced.
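
To illustrate "best effort, known variants only", here is a rough sketch of a stderr collector. The class name `RunnerDiagnosticCollector` and the three diagnostic categories come from this PR; the regular expressions, `DiagnosticKind`, `Diagnostic`, and the method names are assumptions, and the patterns are placeholders rather than the strings the PR actually matches.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class DiagnosticKind(Enum):
    METAL_GPU_TIMEOUT = "metal_gpu_timeout"
    RING_SOCKET_RECV_ERRNO = "ring_socket_recv_errno"
    RING_TRANSPORT_ABORT = "ring_transport_abort"


@dataclass(frozen=True)
class Diagnostic:
    kind: DiagnosticKind
    line: str


# Placeholder patterns (assumed); the first match wins for each line.
_PATTERNS = [
    (DiagnosticKind.METAL_GPU_TIMEOUT, re.compile(r"gpu timeout", re.I)),
    (DiagnosticKind.RING_SOCKET_RECV_ERRNO, re.compile(r"ring.*recv.*errno", re.I)),
    (DiagnosticKind.RING_TRANSPORT_ABORT, re.compile(r"ring.*abort", re.I)),
]


class RunnerDiagnosticCollector:
    """Name from the PR; this implementation is a guess at the idea."""

    def __init__(self) -> None:
        self._found: List[Diagnostic] = []

    def feed_line(self, line: str) -> None:
        """Record a stderr line only if it matches a known root cause."""
        for kind, pattern in _PATTERNS:
            if pattern.search(line):
                diagnostic = Diagnostic(kind, line.strip())
                if diagnostic not in self._found:  # skip exact repeats
                    self._found.append(diagnostic)
                return
        # Unknown lines are dropped on purpose: no guessing at root causes.

    def diagnostics(self) -> Tuple[Diagnostic, ...]:
        return tuple(self._found)
```

A collector like this never raises on unfamiliar output, so a noisy or unexpected stderr stream cannot turn diagnostics gathering into a second failure.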

Test Plan

Manual Testing

Hardware: remote runner logs from e16/e11/e4/e2

What you did:

inspected live runner stderr logs
used observed Metal GPU timeout and ring socket errors as initial diagnostic targets

Automated Testing

nix flake check
supervisor test covers error chunk + failed status emission
plan lifecycle test updated for failed runner diagnostics
type/lint checks cover new runner channel union

AndreiCravtov added 30 commits May 11, 2026 19:19

initial changes

ad2c1e7

integrated it

9c7908f

fmt

d58ab11

fmt

197cbf4

fmt

dbd5ca0

fmt

278bea1

Merge branch 'main' into andrei/error-handling

ab17232

yes

2f9ba7d

yes

9686c02

yes

c3aa1f4

yes

1c7f514

yes

1d17f4d

yes

4ea2532

yes

9636e86

yes

163ef39

yes

32155bd

yes

49d7555

yes

9c4cc08

yes

3fc56b2

yes

6c13b45

yes

2969b7d

yes

faafc6f

ff

b1d09b0

ff

36b1b82

ff

547b58b

ff

3611c33

fff

feef63b

fff

ddc6e95

ff

cd22516

ff

9ec09ec

AndreiCravtov added 5 commits May 14, 2026 17:12

fff

88d4c37

fff

2f50616

fff

bab2457

ffff

86398a2

Merge branch 'main' into andrei/error-handling

38258fb

# Conflicts: # uv.lock

AndreiCravtov requested a review from Evanev7 May 14, 2026 16:40

Merge branch 'main' into andrei/error-handling

fbe962b

AndreiCravtov requested a review from rltakashige May 14, 2026 16:45

AndreiCravtov added 2 commits May 14, 2026 17:55

yes

1649708

yes

42b26af

AndreiCravtov requested a review from ciaranbor May 14, 2026 17:04

AndreiCravtov wants to merge 38 commits into main from andrei/error-handling
