Skip to content

OpenClaw Windows Companion App Setup Robustness Improvements#523

Open
RBrid wants to merge 12 commits into
openclaw:masterfrom
RBrid:user/regisb/WSLInstall1
Open

OpenClaw Windows Companion App Setup Robustness Improvements#523
RBrid wants to merge 12 commits into
openclaw:masterfrom
RBrid:user/regisb/WSLInstall1

Conversation

@RBrid
Copy link
Copy Markdown
Contributor

@RBrid RBrid commented May 23, 2026

Summary

Make the local-gateway setup wizard work end-to-end on a clean Windows machine — including auto-installing the WSL platform when it's missing — so first-run users get from "Set up locally" to a paired node without manual wsl --install + reboot rituals or terminal commands.

End-state validation: full 7-step onboarding succeeds first-try on a clean ARM64 box, with no manual intervention.

What this PR delivers

1. Elevated WSL platform install (the core feature)

  • New WslPlatformProbe (locale-tolerant wsl --status parser anchored on the aka.ms/wslinstall URL) detects "platform missing" vs "platform present but unhealthy" without hanging on the default 30s WSL runner timeout — uses an independent 5s probe ceiling.
  • New ElevatedWslPlatformInstaller shells out to wsl --install --no-distribution via ShellExecute "runas" (UAC prompt). Recognises ERROR_SUCCESS_REBOOT_REQUIRED (3010), retries the post-install probe to ride out the lifted-WSL warmup race (typically 1–2 probes, worst case ~33s on slow ARM64), and surfaces Cancelled cleanly when the user declines UAC.
  • Engine wiring (LocalGatewaySetup.cs): preflight emits wsl_platform_not_installed as a Warning (not blocking); EnsureWslEnabled phase consumes the warning, runs the installer, then re-runs preflight. Self-heal logic resets stale Status=RequiresRestart / RequiresAdmin / Cancelled / FailedTerminal / Blocked entries on engine entry so the wizard can re-evaluate the host after the user reboots / re-launches.
  • New WslInstallCycleIssueCodes static class extracts the 7 cycle-related issue codes from string literals into named constants — used in 5 sites for the self-heal RemoveAll + outcome-mapping in EnsureWslEnabled.

2. Onboarding-window suppression of manager-side node auto-approve

During the V2 onboarding engine's Phase-14 (PairWindowsTrayNode), the engine drives canonical role-upgrade approval via the WSL CLI (openclaw devices approve). The GatewayConnectionManager's own RPC-based node.pair.approve auto-approver would race against that with the wrong requestId namespace (DEVICE-pair requestIds surface on node-pair pending events during role-upgrade), causing the gateway to "unknown requestId" and then service-restart mid-pair.

  • IGatewayConnectionManager.SuppressNodeAutoApprove (bool, process-global, racy) → AcquireNodeAutoApproveSuppression(): IDisposable. Reference-counted via Interlocked CAS; tokens are idempotent (Interlocked.Exchange(ref _owner, null) in Dispose); over-release is a no-op (saturating decrement). Scope is gated to local-loopback gateways only via IsLocalLoopbackGatewayUrl(url) (proper Uri.IsLoopback parse — rejects ws://localhost.evil.example.com, ws://0.0.0.0, accepts ws://[::1] and decimal-encoded loopback).
  • App.xaml.cs.ShowOnboardingAsync: acquires a suppression token for the onboarding window's lifetime; releases it from both OnboardingCompleted and Closed handlers (idempotent). Release happens before _onboardingWindow=null so observers using window-null as a "finished" signal also observe suppression released. Wire-up + Activate() are wrapped in try/catch so a throw can't leak the token; the catch path also restores the previously-connected gateway if onboarding had disconnected it (otherwise the user would be silently stranded since ShowOnboardingAsync is called fire-and-forget).

3. State-machine reset on retry (fixes a real Phase-14 false-failure)

Surfaced by manual ARM64 testing:

  • Symptom: First-run Phase 14 showed red X with "Pairing failed", but the actual pairing succeeded ~25ms later (logs: Received device token - we are now paired!).
  • Root cause: After the first connect got PAIRING_REQUIRED (role-upgrade) and the engine approved it via WSL CLI, the retry connect's StartNodeConnectionAsync was kicked off but the state machine still showed NodeState=PairingRequired from the previous attempt. The retry's first event (SimulateStatus(Connected)) didn't transition (NodeConnected rejected from PairingRequired); EmitStateChanged still fired with the stale snapshot; EnsureNodeConnectedAsync's Handler tripped the suppression-fast-fail on the stale state.
  • Fix: ConnectionStateMachine.StartNodeConnecting() now also resets from PairingRequired/PairingRejected/RateLimited/Error/Idle (was only Idle/Error). RebuildSnapshot clears NodePairingRequestId and resets NodePairingStatus to Unknown on the way out of PairingRequired. StartNodeConnectionAsync now calls StartNodeConnecting() so every new attempt starts from a clean slate.
  • Also added RateLimited to both EnsureNodeConnectedAsync's Handler switch and its synchronous isTerminal gate (latent — NodeRateLimited is not currently wired, but the moment it gets wired the wait would hang the 35s default).

4. V2 onboarding bridge

  • OnEngineStateChanged now surfaces error message for RequiresAdmin (was only RequiresRestart + FailedRetryable/Terminal/Blocked).
  • LocalizeFailureMessage now returns (string?, out bool isGenericFallback); MaybeAppendSetupDiagnosticsHint skips the "Setup diagnostics: …" wrap when the V2_Progress_GenericFailure placeholder was used (already directs to logs; double-hint clutters the small error card).
  • ComputeInfoMessage phase predicate tightened from phase is NotStarted or Preflight or EnsureWslEnabled to status == Running && phase <= EnsureWslEnabled so the generic "Check system" hint stops once the engine moves past EnsureWslEnabled (downstream phases own their own labels).
  • Added the new wsl_* failure codes to LocalizeFailureMessage and corresponding strings to all five locale resw files (en-us, fr-fr, nl-nl, zh-cn, zh-tw).

5. Smaller engine fixes

  • LooksLikePostInstallKernelIssue(detail, postFreshInstall) overload — when caller knows WSL was just installed in this session AND result.Detail is blank (common: WslRegisterDistribution HRESULTs go to the Windows event log, not stderr), default to true so the user gets the correct "restart your PC" recommendation. Strict signature match still wins for known-unrelated failures (apt/script/permission). Reset _wslPlatformJustInstalled at the top of every RunLocalOnlyAsync.
  • OPENCLAW_PAIR_RETRY_DELAY_MS env var override on SettingsWindowsTrayNodeProvisioner ctor (clamped to 60s upper bound; warn on unparseable values).
  • CreateWslInstance phase now routed through the same _wslPlatformJustInstalled + LooksLikePostInstallKernelIssue reboot-remap as ConfigureWslInstance (symmetric: kernel-state errors hit during instance install on fresh WSL surface as "restart your PC" instead of "Couldn't download Ubuntu").
  • Preflight: host-side port-probe (_portProbe.IsPortAvailable) now runs unconditionally — only the WSL-side IsExistingGatewayPortAsync recovery path is gated on platform.State == Installed. Users find out about host-side port conflicts before spending 30+ seconds on WSL install.
  • SetupExistingGatewayClassifier: ClassifyAsync accepts optional IOpenClawLogger; Unknown platform state is now treated separately from NotInstalled (flaky probe no longer mislabels setup as AppOwnedLocalWsl from local-setup evidence alone).
  • ElevatedWslPlatformInstaller.RunDefaultAsync catches OperationCanceledException, attempts best-effort Kill(entireProcessTree:true), rethrows — documented limitation that the elevated child runs under a different token and may complete in the background.

6. Process / hygiene

  • Dropped volatile on _suppressNodeAutoApproveCount (fixes CS0420 — Interlocked already provides full barriers); bare reads use Volatile.Read for explicit acquire fence.

Test coverage

All behavioural changes are pinned by regression tests. New / modified test files:

  • tests/OpenClaw.Connection.Tests/IsLocalLoopbackGatewayUrlTests.cs (NEW, 27 theory rows) — pins URI-parsing replacement against hostile lookalikes, wildcards, IPv6, decimal-encoded loopback, percent-encoded host.
  • tests/OpenClaw.Connection.Tests/GatewayConnectionManagerTests.cs (+102 lines) — EnsureNodeConnectedAsync_StaleSnapshotAfterPrevPairingRequired_DoesNotFastFailNewAttempt (Phase-14 false-fail regression) + EnsureNodeConnectedAsync_FreshPairingRequiredAfterReset_FastFailsWhenSuppressed (companion pinning the suppression-fast-fail path still works).
  • tests/OpenClaw.Tray.Tests/WslPlatformInstallTests.cs (+945 lines) — covers WslPlatformProbe (locale-tolerant banner match, status timeout fast-fail, RecordingWslCommandRunner / HangingWslCommandRunner), ElevatedWslPlatformInstaller (RequiresRestart exit code, NotInstalled-after-success → RequiresRestart, Unknown post-probe → RequiresRestart, clamp-attempts, finalize-race retry, RequiresAdmin → RequiresAdmin not Failed, post-reboot resume of persisted state, self-heal of Cancelled/FailedTerminal/Blocked/RequiresAdmin, Engine_ConfigureFails_AfterJustInstalled_DoesNotRemapUnrelatedFailure, LooksLikePostInstallKernelIssue strict-signature theory rows, preflight port-probe unconditional, preflight timeout fast-fail).
  • tests/OpenClaw.Tray.Tests/WindowsTrayNodePairingApprovalTests.cs (+75 lines) — PairAsync_AwaitsConfiguredDelay, PairAsync_CancellationDuringDelay_Propagates.

Validation

  • ./build.ps1 — green
  • dotnet test tests/OpenClaw.Shared.Tests/...1891 pass / 29 skipped
  • dotnet test tests/OpenClaw.Tray.Tests/...1235 pass
  • dotnet test tests/OpenClaw.Connection.Tests/...262 pass
  • dotnet publish ... -r win-arm64 --self-contained — green
  • Manual ARM64 smoke — clean machine, full 7-step onboarding succeeds first-try.

Files changed

Major source-code surface:

  • src/OpenClaw.Connection/GatewayConnectionManager.cs (+208 lines)
  • src/OpenClaw.Connection/IGatewayConnectionManager.cs (+27 lines)
  • src/OpenClaw.Connection/ConnectionStateMachine.cs (+17 lines)
  • src/OpenClaw.Tray.WinUI/App.xaml.cs (+171 lines)
  • src/OpenClaw.Tray.WinUI/Onboarding/V2/OnboardingV2Bridge.cs (+182 lines)
  • src/OpenClaw.Tray.WinUI/Services/LocalGatewaySetup/LocalGatewaySetup.cs (+588 lines)
  • src/OpenClaw.Tray.WinUI/Services/LocalGatewaySetup/WslPlatformInstall.cs (+406 lines, new)
  • src/OpenClaw.Tray.WinUI/Services/SetupExistingGatewayClassifier.cs (+53 lines)
  • src/OpenClaw.Tray.WinUI/Strings/{en-us,fr-fr,nl-nl,zh-cn,zh-tw}/Resources.resw (+53 each)
  • src/OpenClawTray.OnboardingV2/Pages/LocalSetupProgressPage.cs (+78)
  • src/OpenClawTray.OnboardingV2/V2Strings.cs (+37)
  • docs/ONBOARDING_WIZARD.md (+13), docs/WINDOWS_NODE_TESTING.md (+9)

Total: 27 files changed, +3158 / −136 lines.

Regis Behmo and others added 12 commits May 21, 2026 13:24
Detects the 'Windows Subsystem for Linux is not installed' case in <2s during preflight via a new WslPlatformProbe (anchors on aka.ms/wslinstall URL + falls back to English banner + cheap wsl.exe-on-disk check), short-circuiting the otherwise-30s wsl --list --verbose probe that was causing the ~1-minute hang on hosts without WSL.

When the platform is missing, preflight now emits a non-blocking wsl_platform_not_installed Warning instead of a generic blocking wsl_unavailable, so the engine proceeds past preflight. The existing EnsureWslEnabled phase (previously a no-op) runs wsl.exe --install --no-distribution elevated via ShellExecute (UAC prompt) through a new IWslPlatformInstaller. Outcomes are classified by re-probing after exit:

- InstalledNoRestart -> engine continues to CreateWslInstance
- InstalledRequiresRestart (exit 3010 or probe-still-missing on exit 0) -> Status=RequiresRestart, FailureCode=wsl_install_requires_restart
- UserDeclinedElevation -> retryable wsl_install_elevation_declined
- Failed -> retryable wsl_install_failed with stderr tail

No new LocalGatewaySetupPhase enum value (avoids shifting ordinals that would break persisted setup-state.json and the LocalSetupProgressStageMap ordering). The install runs under the existing 'Check system' wizard stage.

Docs updated in ONBOARDING_WIZARD.md and WINDOWS_NODE_TESTING.md.

Validation: build.ps1 OK, Shared.Tests 1803 passed/28 skipped, Tray.Tests 1171 passed (was 1149 baseline + 22 new tests covering probe banner matching, installer outcome classification, and engine EnsureWslEnabled wiring for each outcome).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Real-world testing on a Cadmus ARM64 host without WSL installed surfaced a
sequence of issues the initial implementation didn't cover. Each fix below
was driven by a specific log + screenshot from those test rounds, then
narrowed at the root.

Engine / setup flow
- HasDistroAsync + SetupExistingGatewayClassifier.HasAppOwnedLocalWslGateway
  now short-circuit on a fast WslPlatformProbe so the "Set up locally" click
  path no longer hangs ~30s on the wsl --list 30s timeout when WSL is missing.
- Engine self-heal generalised: on RunLocalOnlyAsync entry, any persisted
  non-Pending/Running/Complete status (RequiresRestart, FailedRetryable,
  FailedTerminal, Blocked, Cancelled, RequiresAdmin) is reset to Pending +
  NotStarted so Try-again and post-reboot relaunches actually retry instead
  of replaying a cached failure.
- OpenClawCliGatewayServiceManager.InstallAsync now stops any pre-existing
  openclaw-gateway service before the port-conflict probe, so a Try-again
  after a half-completed run doesn't fail on its own previous gateway.
- 3s pause before the wsl --install UAC popup so the user has time to read
  the inline "Windows will ask for permission" hint (skippable via env var,
  excluded in test builds).
- _wslPlatformJustInstalled flag turns ConfigureWslInstance failures right
  after a fresh wsl --install into a specific "Windows Subsystem for Linux
  was just installed - restart your PC" error code instead of the generic
  "follow aka.ms/wsllogs" message.
- WslStoreInstanceInstaller emits wsl_instance_install_no_network when
  wsl --install <distro> output matches known Microsoft Store / offline
  HRESULTs, with a friendlier message.
- WslPlatformProbe bounds wsl --status with its own 5s timeout via linked
  CTS so a hung wsl.exe can't block the wizard for the engine-wide 30s.

Connection manager (root-cause for the pairing cascade)
- New IGatewayConnectionManager.SuppressNodeAutoApprove flag, set true
  while OnboardingWindow is open. While set, the manager skips its own
  operator-side node.pair.approve RPC (which was the wrong RPC for
  role-upgrade pendings: those carry a DEVICE-pair requestId, so the
  call hits "unknown requestId" and triggers a gateway service restart).
- EnsureNodeConnectedAsync now throws immediately on RoleConnectionState
  PairingRequired when SuppressNodeAutoApprove is set, so the V2 engine
  catches and drives `openclaw devices approve` via the WSL CLI well
  inside the gateway's ~10s pending-approval window. Without this the
  35s default wait guaranteed the gateway was already gone by the time
  the engine got control to approve.
- App.OnGatewayConnectionStatusChanged suppresses its post-operator-
  connect TryConnectLocalNodeServiceAsync auto-fire while the V2
  onboarding engine is alive so a single sequenced node-pair runs in
  the foreground.

V2 wizard UI
- BuildStageRow now embeds the optional inline hint AND inline error card
  INSIDE the row container instead of as sibling VStack children. Fixes
  FunctionalUI position-based reconciliation reusing the previous
  render's BorderElement (background brush) when the failed/running row
  index shifts between renders. This is what produced the "blue box
  around Installing Ubuntu" and "pink box around random rows" the user
  reported across several rounds.
- InvisibleBadge explicitly sets Background(Transparent) so the
  reconciler can't leak a previous render's green CheckmarkBadge or
  pink ErrorBadge fill onto an idle row (the "phantom dots").
- ComputeStageState + ShouldShowErrorRow now handle RequiresRestart so
  the offending stage shows a clear failure marker + error card with
  the reboot instruction instead of an infinite spinner.
- LocalSetupProgressPage seeds CheckSystem=Running + a localised
  "Checking your PC; Windows will ask for permission to install WSL"
  hint BEFORE the engine starts so the user gets immediate feedback
  while preflight runs.

Localisation
- 14 new V2_Progress_* keys in V2Strings.cs and all five .resw locales
  (en-us, fr-fr, nl-nl, zh-cn, zh-tw): CheckSystem_Hint, Wsl_Installing,
  Wsl_RequiresRestart, Wsl_Failed, Wsl_ElevationDeclined, Wsl_Unavailable,
  Wsl_NotResponding, Wsl_FirstBootAfterInstall, Wsl_NoNetwork,
  Wsl_ConfigFailed, Wsl_InstanceInstallFailed, Preflight_NotReady,
  OpenClawInstallFailed, GatewayPortInUse, Node_PairingFailed,
  GenericFailure.
- OnboardingV2Bridge.LocalizeFailureMessage maps every known engine
  FailureCode that can plausibly surface during local setup to one of
  the resw keys, with a generic "Setup failed. See logs for details"
  fallback so an empty engine UserMessage can never render an empty red
  card.

Tests
- WslPlatformInstallTests: probe / installer / engine wiring coverage,
  including locale-stable banner matching, post-install probe retry
  finalize-race, UAC decline, generic-exception classification, fast-
  fail on probe timeout, and engine resume after a stale RequiresRestart
  state.
- SettingsWindowsTrayNodeProvisioner: injectable pairRetryDelay/delayAsync
  so the role-upgrade approve+retry path stays fast in tests.
- LocalSetupProgressStageMapTests updated for RequiresRestart -> error
  row visibility.
- All up: Shared.Tests 1813/28 skipped, Tray.Tests 1208, Connection.Tests
  229.

Docs
- ONBOARDING_WIZARD.md documents the new InstallWslPlatform behaviour
  under the existing Check system stage.
- WINDOWS_NODE_TESTING.md adds a manual smoke for the fresh-Windows-
  without-WSL case + the "Set up locally stalls / This PC is not ready"
  troubleshooting path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bring in upstream master (origin/master @ ece297a, 64 commits) so the WSL
local-setup wizard work stays current. Resolved conflicts in 3 files:

- docs/WINDOWS_NODE_TESTING.md — kept both the new 'Set up locally stalls'
  troubleshooting section and the upstream 'Local sandbox validation' section.
- src/OpenClaw.Tray.WinUI/Onboarding/V2/OnboardingV2Bridge.cs — composed
  localized failure message and the new AppendSetupDiagnosticsHint helper
  (errorMessage = AppendSetupDiagnosticsHint(LocalizeFailureMessage(code, msg))).
  Kept the new RequiresRestart entry in the status check.
- src/OpenClaw.Tray.WinUI/Services/LocalGatewaySetup/LocalGatewaySetup.cs —
  threaded both the new wslPlatformInstaller and diagnosticsSink ctor params,
  preserved the engine self-heal + WslPlatformProbe-gated HasDistroAsync,
  and kept the upstream _diagnostics.RunStarted call in RunLocalOnlyAsync.
  Factory now passes both wslPlatformInstaller and diagnosticsSink.

Validation: build OK, Shared.Tests 1891/29 skipped, Tray.Tests 1212,
Connection.Tests 229.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes from dual-model code review on user/regisb/WSLInstall1 branch.

openclaw#1 + openclaw#2 + openclaw#6: SuppressNodeAutoApprove (process-global bool, racy, idempotency-fragile) replaced with IDisposable AcquireNodeAutoApproveSuppression() using volatile int + Interlocked counter. Token Dispose is idempotent via Interlocked.Exchange. Local-loopback URL gate scopes suppression to ws://localhost / ws://127.* so manual operator connects to remote gateways still surface their pending-approval toast.

openclaw#3: ElevatedWslPlatformInstaller.RunDefaultAsync now wraps WaitForExitAsync in try/catch(OperationCanceledException), attempts best-effort Kill(entireProcessTree:true) on the launcher (will fail for the elevated child — documented), then rethrows so the engine surfaces Cancelled cleanly.

openclaw#4: ConfigureWslInstance failure remap to wsl_firstboot_config_failed_after_install is now gated on a stderr-signature check (LooksLikePostInstallKernelIssue: WslRegisterDistribution, Hyper-V, 0x80370102, kernel, vmcompute, LxssManager, instance-is-corrupted, etc). Unrelated configure failures (apt repo issues, script bugs) keep their original error code so users don't get a misleading 'reboot to fix' suggestion.

openclaw#5 + openclaw#15: New tests — Engine_SelfHeal_ResetsCancelledStatus, Engine_ConfigureFails_AfterJustInstalled_DoesNotRemapUnrelatedFailure, LooksLikePostInstallKernelIssue_MatchesKnownKernelSignatures, PairAsync_AwaitsConfiguredDelay, PairAsync_CancellationDuringDelay_Propagates.

openclaw#7: OnboardingV2Bridge now treats LocalGatewaySetupStatus.RequiresAdmin like other terminal/blocking statuses — surfaces error message via LocalizeFailureMessage and suppresses the running-phase info card.

openclaw#8: WslPlatformInstall DefaultPostInstallProbeAttempts comment rewritten to reflect real worst case (~33s, but typically 1–2 probes) rather than the misleading '6 × 500ms = 3s'.

openclaw#9: Extracted WslInstallCycleIssueCodes static class with const issue codes + IReadOnlySet<string> All. Replaced 5 duplicated string literals (top-level self-heal RemoveAll + 4 EnsureWslEnabled outcome branches) with the constants.

openclaw#10: CreateLocalOnly factory now reads OPENCLAW_PAIR_RETRY_DELAY_MS env var at wire-up and passes it as SettingsWindowsTrayNodeProvisioner pairRetryDelay so operators can tune the retry without rebuilding.

openclaw#11: LocalizeFailureMessage now exposes an isGenericFallback signal; MaybeAppendSetupDiagnosticsHint skips the 'Setup diagnostics: …' wrap when we returned the V2_Progress_GenericFailure placeholder (the generic message already directs to logs).

openclaw#13: SetupExistingGatewayClassifier.ClassifyAsync accepts an optional IOpenClawLogger so WSL probe failures surface in the diagnostic stream instead of being silently swallowed by NullLogger.

openclaw#14: ComputeInfoMessage phase predicate tightened to (status==Running && phase<=EnsureWslEnabled) so the 'Check system' hint stops once the engine moves past EnsureWslEnabled, including transient non-Running states between phases.

openclaw#16: Preflight port-probe (host-side TCP check) now runs unconditionally rather than being gated on WSL platform.State==Installed, so users find out about host-side port conflicts before spending 30+ seconds on WSL install.

Validation: build.ps1 green; Shared 1891/29 skipped, Tray 1217 (+5), Connection 229 all pass; ARM64 publish succeeds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Round 2 dual-model review of commit 6bf187f produced 12 findings; this commit fixes 8 (A–F, H, I). G (probe i18n heuristic) was attempted but reverted — the proposed change to treat unknown wsl --status output as NotInstalled mis-classified valid Unknown cases (policy-blocked, generic failures) that have pinning tests.

A — IsLocalLoopbackGatewayUrl now parses URI properly via Uri.TryCreate + IPAddress.IsLoopback; rejects ws://localhost.evil.com and ws://127.attacker.com; accepts ws://[::1] IPv6 loopback.

B — LooksLikePostInstallKernelIssue tightened: returns false for null/whitespace Detail (was true), no longer matches bare 'kernel' substring (now requires 'WSL kernel'/'WSL2 kernel'/'kernel component'/'kernel image'). Test cases extended.

C — App.xaml.cs onboarding window: release suppression token BEFORE clearing _onboardingWindow=null in both Completed and Closed handlers (race window with gateway event pump); wrap Activate() in try/catch to release token on setup failure.

D — Counter clamp on over-release replaced with CAS loop saturating decrement; eliminates Interlocked.Exchange(0) clobbering a concurrent Acquire after over-release.

E — _wslPlatformJustInstalled reset at top of RunLocalOnlyAsync so a hypothetical reused engine instance doesn't carry the flag into a second run.

F — CreateWslInstance failures now also gated through the post-install reboot remap (symmetric with ConfigureWslInstance); kernel-state errors hit during the instance install on a fresh WSL platform now surface as 'restart your PC' instead of 'Couldn't download Ubuntu'.

H — Post-install probe retry loop now continues on Unknown state too (was bailing on first state change); final (Unknown, exitCode==0) maps to InstalledRequiresRestart instead of Failed, so successful installs on slow ARM64 hosts aren't mis-reported as Failed. Updated test Installer_RemapsUnknownProbeToRequiresRestart_WhenExitCodeIsZero.

I — OPENCLAW_PAIR_RETRY_DELAY_MS now clamped to 60s upper bound; unparseable values logged as Warn instead of silently ignored.

Validation: build.ps1 green; Shared 1891/29 skipped, Tray 1217, Connection 229 all pass; ARM64 publish succeeds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Round 3 dual-model review of commit 6473c1a produced 11 findings; this commit fixes 7 (A, B, C, D, F, G, I). E (retry-loop early-break on stable NotInstalled) was attempted but reverted — conflicts with the pinned Installer_RetriesPostProbe_RidingOutFinalizeRace test that exercises a 3-NotInstalled-then-Installed convergence.

A — Added IsLocalLoopbackGatewayUrlTests (27 theory cases) pinning the URI-parsing replacement against hostile lookalikes (ws://localhost.evil.com, ws://127.attacker.com, ws://0.0.0.0), wrong schemes (http/file), private IPs, malformed input, plus positive coverage for ws://localhost, ws://127.0.0.2, ws://[::1], wss://[::ffff:127.0.0.1].

B — LooksLikePostInstallKernelIssue gained a (detail, postFreshInstall) overload. When the caller already knows WSL was just installed in this session AND the engine's collected Detail is blank, default to true. This restores the Round-1 protection for the common case where WslRegisterDistribution HRESULTs / host-compute warmup errors write to the Windows event log rather than stderr, leaving result.Detail empty. Strict-signature gate still wins for confirmed-unrelated errors (apt/script/permission). All 4 callers updated to pass postFreshInstall: true since they already gate on _wslPlatformJustInstalled.

C — Widened the App.xaml.cs onboarding-window try/catch to cover BOTH event-handler wirings AND Activate(). The Round-2 try/catch only covered Activate, leaving a token-leak window if += wiring threw. Catch handler also nulls _onboardingWindow so the next ShowOnboardingAsync() doesn't trip over a half-broken reference. Updated misleading comment (Round-3 I) to describe the actual invariant (no-onboarding-in-progress => no-suppression-active) instead of an impossible intra-thread race.

D — Dropped 'volatile' on _suppressNodeAutoApproveCount; Interlocked already provides full barriers and pairing with Interlocked.*(ref volatile-field) triggers CS0420 warnings that would break the build under TreatWarningsAsErrors. Bare reads now use Volatile.Read for explicit acquire fence.

F — Replaced manual bracket-strip + IPAddress.TryParse with uri.IsLoopback. Subsumes localhost / 127.0.0.0/8 / [::1] / [::ffff:127.0.0.1] in one call; correctly handles IPv6 zone IDs like [fe80::1%eth0] that the previous code mishandled. Wildcard 0.0.0.0 stays rejected (it's the 'any' bind address; not loopback).

G — SetupExistingGatewayClassifier now distinguishes WslPlatformState.Unknown from NotInstalled. Unknown (transient probe timeout / policy block) returns false (no app-owned distro confirmed) instead of falling back to local-setup evidence alone, preventing a flaky probe from steering the user into the 'replace existing local gateway' UX path.

I — Rewrote the misleading Round-2 comment on the OnboardingCompleted reorder; new text describes the real invariant rather than the impossible intra-thread race claim.

Validation: build.ps1 green; Shared 1891/29 skipped, Tray 1217, Connection 256 (+27 IsLocalLoopback theory rows) all pass; ARM64 publish succeeds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Round 4 dual-model review of commit e2f0d7c found no CRITICAL/HIGH issues — branch is in good shape. This commit applies the 3 minor findings both reviewers agreed were worth fixing.

A — Extended IsLocalLoopbackGatewayUrlTests with 4 more theory rows to pin behavior so future .NET runtime changes to Uri.IsLoopback can't silently shift the security gate: case-insensitive host (LOCALHOST), userinfo (user@localhost), decimal-encoded loopback (2130706433 -> 127.0.0.1, accepted), and percent-encoded host (%6cocalhost, rejected at parse).

B — SetupExistingGatewayClassifier Unknown-branch warning now routes through probeLogger instead of the static Logger sink, matching the deliberate pattern (and its own comment ~30 lines above) that ensures WSL probe breadcrumbs reach the same diagnostic stream as the rest of setup.

C — App.xaml.cs ShowOnboardingAsync catch block now restores the previously-connected gateway (if disconnectedForOnboarding && restoreGatewayId != null) before rethrowing. Without this, an exception during onboarding-window setup left the user silently disconnected with no UI feedback since callers fire-and-forget (_ = ShowOnboardingAsync()).

Validation: build.ps1 green; Shared 1891/29 skipped, Tray 1217, Connection 260 (+4 coverage rows) all pass; ARM64 publish succeeds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Manual test 2026-05-22 (Round-5 ship-readiness check): Phase 14 fails with 'Pairing this PC as a node failed' on first attempt, but the actual pairing succeeds ~25ms later — UI shows red X while logs show 'Received device token - we are now paired!'. Try-again works.

Root cause: GatewayConnectionManager.EnsureNodeConnectedAsync stale-state race.

Sequence:

1. First Phase-14 connect → gateway returns PAIRING_REQUIRED (role-upgrade)

2. Engine approves via WSL CLI; 5s wait

3. Retry connect: StartNodeConnectionAsync kicks off new connect attempt

4. State machine STILL shows NodeState=PairingRequired from attempt 1 (state machine doesn't reset on a new connect)

5. New connect's SimulateStatus(Connected) → NodeConnected trigger REJECTED from PairingRequired guard → EmitStateChanged still fires with stale snapshot

6. Handler's PairingRequired-when-suppressed case trips → tcs.SetException → engine reports FailedRetryable

7. ~25ms later the real hello-ok arrives with deviceToken → pairing actually succeeds — but the engine has already given up

Fix: capture the pre-existing NodePairingRequestId at EnsureNodeConnectedAsync entry. Handler ignores PairingRequired events whose requestId MATCHES the captured stale one — only fast-fails on a FRESH PairingRequired event (different requestId) tied to the new attempt. Also tightened the synchronous Handler re-eval after StartNodeConnectionAsync to only act on truly terminal states (Connected+Paired, PairingRejected, Error).

Test: EnsureNodeConnectedAsync_StaleSnapshotAfterPrevPairingRequired_DoesNotFastFailNewAttempt — poisons state machine with PairingRequired + suppression-active, then verifies a successful retry connect transitions to Connected+Paired instead of fast-failing on the stale snapshot.

Validation: build.ps1 green; Shared 1891/29 skipped, Tray 1217, Connection 261 (+1 regression test) all pass; ARM64 publish succeeds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…uality patch

Round 6 dual-model review found the requestId-equality guard introduced in 373c24a is fragile — if the gateway re-emits PairingRequired with the same requestId on a fast retry, the guard wrongly swallows the legitimate fresh fast-fail and the engine hangs the 35s timeout. Both reviewers flagged the same edge case; both pointed at the cleaner structural fix (commit 373c24a's own message even hinted at it: 'the state machine doesn't reset on a new connect').

A — Replaced the EnsureNodeConnectedAsync stale-requestId guard with a structural reset at the source. StartNodeConnectionAsync now calls _stateMachine.StartNodeConnecting() which transitions the node sub-FSM to Connecting (also from PairingRequired / PairingRejected / RateLimited / Idle / Error) and RebuildSnapshot clears the stale NodePairingRequestId. The Handler can now treat any PairingRequired-when-suppressed event as authoritative without requestId disambiguation — subsumes the same-requestId edge case, the stale-null edge case, and the SetNodeInfo null-coalesce side-effect.

C — Added RateLimited as a terminal state in both the Handler switch and the synchronous isTerminal gate. NodeRateLimited trigger is not currently wired into GatewayConnectionManager, but the moment it gets wired up the prior code would hang the 35s default instead of surfacing a useful error.

B — Added EnsureNodeConnectedAsync_FreshPairingRequiredAfterReset_FastFailsWhenSuppressed companion test. The existing StaleSnapshotAfterPrevPairingRequired test only exercises the synchronous-terminal-gate half of the fix; this new test pins the fresh-PairingRequired-fast-fail path so a future refactor can't silently break it.

Validation: build.ps1 green; Shared 1891/29 skipped, Tray 1217, Connection 262 (+1 fresh-PairingRequired test) all pass; ARM64 publish succeeds.

Manual smoke test of 373c24a's UI symptom (Phase 14 false-failure) already succeeded on clean ARM64 machine before this round; this commit retains the user-visible behavior while making the underlying mechanism more robust.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Round 7 reviewers (Opus + Codex consensus) flagged that review_diff_373c24a.txt and review_diff_373c24a_test.txt were committed to the repo root in e48a15e — reviewer-artifact patches that should never have been tracked.

Remove the two stray files and add review_diff_*.txt to .gitignore so future Hanselman rounds don't accidentally re-introduce them.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per PR-prep feedback, scrub all branch-introduced 'Round-N' / 'Hanselman' / dated 'manual test YYYY-MM-DD' references from source and test comments. The comments now describe the behavior and rationale without referring to specific review rounds or dates.

Also revert the .gitignore change that added 'review_diff_*.txt' — reviewer-artifact files don't need a project-level ignore rule.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented May 23, 2026

Codex review: needs real behavior proof before merge.

Latest ClawSweeper review: 2026-05-23 01:51 UTC / May 22, 2026, 9:51 PM ET.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR adds elevated WSL platform auto-install to local onboarding, scoped node auto-approve suppression, node retry-state resets, localized setup messaging, docs, and regression tests.

Reproducibility: yes. from source inspection for the missing-WSL path: current main turns a failed wsl --status into blocking wsl_unavailable. I did not run a clean Windows reproduction from this Linux review environment.

PR rating
Overall: 🦪 silver shellfish
Proof: 🦪 silver shellfish
Patch quality: 🐚 platinum hermit
Summary: The patch has substantial source coverage and no confirmed correctness finding, but missing real behavior proof caps merge confidence for the elevated Windows setup flow.

Rank-up moves:

  • Add redacted real behavior proof for the clean-machine ARM64 onboarding path, including the UAC-driven WSL install or restart outcome.
  • Confirm maintainers accept in-app UAC elevation for WSL platform install.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR body reports green validation and a manual ARM64 smoke, but no inspectable screenshot, recording, terminal output, linked artifact, or redacted logs are present for the after-fix clean-machine onboarding path. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Mantis proof suggestion
A visible Windows desktop proof would materially help because the central change is first-run onboarding plus UAC-driven WSL install. A maintainer can ask Mantis to capture proof by posting a new PR comment that starts with the OpenClaw Mantis account mention, followed by:

visual task: verify clean Windows local setup prompts for WSL install, reaches the restart-or-continue state, and resumes onboarding after restart if required.

Risk before merge

  • The PR adds a new UAC-elevated process launch from onboarding to install WSL, so maintainers should explicitly accept that product and security boundary before merge.
  • The PR body reports manual ARM64 validation, but there is no inspectable proof artifact showing the clean-machine WSL install, restart-or-continue state, and resumed onboarding path.

Maintainer options:

  1. Require installer-path proof (recommended)
    Ask for redacted runtime proof showing the UAC prompt, WSL install outcome, and resumed onboarding on a clean Windows machine before merging the elevated install path.
  2. Accept the elevated installer boundary
    Maintainers may intentionally accept in-app UAC elevation once proof shows the flow is understandable and recoverable for first-run users.
  3. Split out the non-elevated fixes
    If the elevated installer UX is not ready, keep or close this PR in favor of a narrower branch containing only the pairing/state reset and messaging fixes that do not launch UAC.

Next step before merge
The remaining blocker is maintainer acceptance plus contributor proof for an elevated Windows installer flow, not a narrow automation repair.

Security
Cleared: No concrete exploit or supply-chain defect was found, but the elevated WSL installer path is security-sensitive and needs maintainer acceptance.

Review details

Best possible solution:

Land the onboarding robustness work after maintainers approve the elevated-installer boundary and the contributor adds inspectable clean-Windows proof; otherwise split the pairing/state-machine fixes from the auto-install UX.

Do we have a high-confidence way to reproduce the issue?

Yes from source inspection for the missing-WSL path: current main turns a failed wsl --status into blocking wsl_unavailable. I did not run a clean Windows reproduction from this Linux review environment.

Is this the best way to solve the issue?

Unclear until maintainer product/security review: the implementation is plausible and well covered by source tests, but in-app elevated WSL installation is a new user-facing boundary and needs real setup proof.

Label justifications:

  • P2: First-run local setup reliability is a normal-priority user-facing improvement with limited blast radius.
  • merge-risk: 🚨 security-boundary: The PR adds a new UAC-elevated process launch from the onboarding wizard to install WSL.
  • rating: 🦪 silver shellfish: Current PR rating is 🦪 silver shellfish because proof is 🦪 silver shellfish, patch quality is 🐚 platinum hermit, and The patch has substantial source coverage and no confirmed correctness finding, but missing real behavior proof caps merge confidence for the elevated Windows setup flow.
  • feature: ✨ showcase: ClawSweeper spotlight: unusually compelling feature idea for maintainer attention. Automatic WSL platform install could remove a major first-run setup cliff for Windows companion users if the elevated flow is accepted.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body reports green validation and a manual ARM64 smoke, but no inspectable screenshot, recording, terminal output, linked artifact, or redacted logs are present for the after-fix clean-machine onboarding path. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

What I checked:

Likely related people:

  • shanselman: Blame ties the current preflight and EnsureNodeConnectedAsync surfaces to Scott Hanselman, and the prior WSL validation blocker issue was authored and closed from the same account. (role: introduced behavior and prior WSL setup reviewer; confidence: high; commits: 33848b1365cf, 45dccec60c51; files: src/OpenClaw.Tray.WinUI/Services/LocalGatewaySetup/LocalGatewaySetup.cs, src/OpenClaw.Connection/GatewayConnectionManager.cs, src/OpenClaw.Connection/ConnectionStateMachine.cs)
  • Mike Harsh: Recent history shows multiple changes in the local gateway setup, V2 bridge, diagnostics, and WSL retry areas touched by this PR. (role: recent local setup contributor; confidence: high; commits: 8446ea552779, 7e11bc02778b, 4baf01d768c0; files: src/OpenClaw.Tray.WinUI/Services/LocalGatewaySetup/LocalGatewaySetup.cs, src/OpenClaw.Tray.WinUI/Onboarding/V2/OnboardingV2Bridge.cs)
  • ranjeshj: Recent connection-manager history includes own-node pairing auto-approve work adjacent to the PR's suppression and retry-state changes. (role: adjacent node auto-approve contributor; confidence: medium; commits: 7747e6075052, b69579f1f10f; files: src/OpenClaw.Connection/GatewayConnectionManager.cs, src/OpenClaw.Tray.WinUI/App.xaml.cs, src/OpenClaw.Tray.WinUI/Services/NodeService.cs)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 8714381248a2.

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. feature: ✨ showcase ClawSweeper spotlight: unusually compelling feature idea for maintainer attention. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. labels May 23, 2026
@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented May 23, 2026

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@RBrid RBrid marked this pull request as ready for review May 23, 2026 01:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature: ✨ showcase ClawSweeper spotlight: unusually compelling feature idea for maintainer attention. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. P2 Normal priority bug or improvement with limited blast radius. rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant