OpenClaw Windows Companion App Setup Robustness Improvements#523
OpenClaw Windows Companion App Setup Robustness Improvements#523RBrid wants to merge 12 commits into
Conversation
Detects the 'Windows Subsystem for Linux is not installed' case in <2s during preflight via a new WslPlatformProbe (anchors on aka.ms/wslinstall URL + falls back to English banner + cheap wsl.exe-on-disk check), short-circuiting the otherwise-30s wsl --list --verbose probe that was causing the ~1-minute hang on hosts without WSL. When the platform is missing, preflight now emits a non-blocking wsl_platform_not_installed Warning instead of a generic blocking wsl_unavailable, so the engine proceeds past preflight. The existing EnsureWslEnabled phase (previously a no-op) runs wsl.exe --install --no-distribution elevated via ShellExecute (UAC prompt) through a new IWslPlatformInstaller. Outcomes are classified by re-probing after exit: - InstalledNoRestart -> engine continues to CreateWslInstance - InstalledRequiresRestart (exit 3010 or probe-still-missing on exit 0) -> Status=RequiresRestart, FailureCode=wsl_install_requires_restart - UserDeclinedElevation -> retryable wsl_install_elevation_declined - Failed -> retryable wsl_install_failed with stderr tail No new LocalGatewaySetupPhase enum value (avoids shifting ordinals that would break persisted setup-state.json and the LocalSetupProgressStageMap ordering). The install runs under the existing 'Check system' wizard stage. Docs updated in ONBOARDING_WIZARD.md and WINDOWS_NODE_TESTING.md. Validation: build.ps1 OK, Shared.Tests 1803 passed/28 skipped, Tray.Tests 1171 passed (was 1149 baseline + 22 new tests covering probe banner matching, installer outcome classification, and engine EnsureWslEnabled wiring for each outcome). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Real-world testing on a Cadmus ARM64 host without WSL installed surfaced a sequence of issues the initial implementation didn't cover. Each fix below was driven by a specific log + screenshot from those test rounds, then narrowed at the root. Engine / setup flow - HasDistroAsync + SetupExistingGatewayClassifier.HasAppOwnedLocalWslGateway now short-circuit on a fast WslPlatformProbe so the "Set up locally" click path no longer hangs ~30s on the wsl --list 30s timeout when WSL is missing. - Engine self-heal generalised: on RunLocalOnlyAsync entry, any persisted non-Pending/Running/Complete status (RequiresRestart, FailedRetryable, FailedTerminal, Blocked, Cancelled, RequiresAdmin) is reset to Pending + NotStarted so Try-again and post-reboot relaunches actually retry instead of replaying a cached failure. - OpenClawCliGatewayServiceManager.InstallAsync now stops any pre-existing openclaw-gateway service before the port-conflict probe, so a Try-again after a half-completed run doesn't fail on its own previous gateway. - 3s pause before the wsl --install UAC popup so the user has time to read the inline "Windows will ask for permission" hint (skippable via env var, excluded in test builds). - _wslPlatformJustInstalled flag turns ConfigureWslInstance failures right after a fresh wsl --install into a specific "Windows Subsystem for Linux was just installed - restart your PC" error code instead of the generic "follow aka.ms/wsllogs" message. - WslStoreInstanceInstaller emits wsl_instance_install_no_network when wsl --install <distro> output matches known Microsoft Store / offline HRESULTs, with a friendlier message. - WslPlatformProbe bounds wsl --status with its own 5s timeout via linked CTS so a hung wsl.exe can't block the wizard for the engine-wide 30s. Connection manager (root-cause for the pairing cascade) - New IGatewayConnectionManager.SuppressNodeAutoApprove flag, set true while OnboardingWindow is open. While set, the manager skips its own operator-side node.pair.approve RPC (which was the wrong RPC for role-upgrade pendings: those carry a DEVICE-pair requestId, so the call hits "unknown requestId" and triggers a gateway service restart). - EnsureNodeConnectedAsync now throws immediately on RoleConnectionState PairingRequired when SuppressNodeAutoApprove is set, so the V2 engine catches and drives `openclaw devices approve` via the WSL CLI well inside the gateway's ~10s pending-approval window. Without this the 35s default wait guaranteed the gateway was already gone by the time the engine got control to approve. - App.OnGatewayConnectionStatusChanged suppresses its post-operator- connect TryConnectLocalNodeServiceAsync auto-fire while the V2 onboarding engine is alive so a single sequenced node-pair runs in the foreground. V2 wizard UI - BuildStageRow now embeds the optional inline hint AND inline error card INSIDE the row container instead of as sibling VStack children. Fixes FunctionalUI position-based reconciliation reusing the previous render's BorderElement (background brush) when the failed/running row index shifts between renders. This is what produced the "blue box around Installing Ubuntu" and "pink box around random rows" the user reported across several rounds. - InvisibleBadge explicitly sets Background(Transparent) so the reconciler can't leak a previous render's green CheckmarkBadge or pink ErrorBadge fill onto an idle row (the "phantom dots"). - ComputeStageState + ShouldShowErrorRow now handle RequiresRestart so the offending stage shows a clear failure marker + error card with the reboot instruction instead of an infinite spinner. - LocalSetupProgressPage seeds CheckSystem=Running + a localised "Checking your PC; Windows will ask for permission to install WSL" hint BEFORE the engine starts so the user gets immediate feedback while preflight runs. Localisation - 14 new V2_Progress_* keys in V2Strings.cs and all five .resw locales (en-us, fr-fr, nl-nl, zh-cn, zh-tw): CheckSystem_Hint, Wsl_Installing, Wsl_RequiresRestart, Wsl_Failed, Wsl_ElevationDeclined, Wsl_Unavailable, Wsl_NotResponding, Wsl_FirstBootAfterInstall, Wsl_NoNetwork, Wsl_ConfigFailed, Wsl_InstanceInstallFailed, Preflight_NotReady, OpenClawInstallFailed, GatewayPortInUse, Node_PairingFailed, GenericFailure. - OnboardingV2Bridge.LocalizeFailureMessage maps every known engine FailureCode that can plausibly surface during local setup to one of the resw keys, with a generic "Setup failed. See logs for details" fallback so an empty engine UserMessage can never render an empty red card. Tests - WslPlatformInstallTests: probe / installer / engine wiring coverage, including locale-stable banner matching, post-install probe retry finalize-race, UAC decline, generic-exception classification, fast- fail on probe timeout, and engine resume after a stale RequiresRestart state. - SettingsWindowsTrayNodeProvisioner: injectable pairRetryDelay/delayAsync so the role-upgrade approve+retry path stays fast in tests. - LocalSetupProgressStageMapTests updated for RequiresRestart -> error row visibility. - All up: Shared.Tests 1813/28 skipped, Tray.Tests 1208, Connection.Tests 229. Docs - ONBOARDING_WIZARD.md documents the new InstallWslPlatform behaviour under the existing Check system stage. - WINDOWS_NODE_TESTING.md adds a manual smoke for the fresh-Windows- without-WSL case + the "Set up locally stalls / This PC is not ready" troubleshooting path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bring in upstream master (origin/master @ ece297a, 64 commits) so the WSL local-setup wizard work stays current. Resolved conflicts in 3 files: - docs/WINDOWS_NODE_TESTING.md — kept both the new 'Set up locally stalls' troubleshooting section and the upstream 'Local sandbox validation' section. - src/OpenClaw.Tray.WinUI/Onboarding/V2/OnboardingV2Bridge.cs — composed localized failure message and the new AppendSetupDiagnosticsHint helper (errorMessage = AppendSetupDiagnosticsHint(LocalizeFailureMessage(code, msg))). Kept the new RequiresRestart entry in the status check. - src/OpenClaw.Tray.WinUI/Services/LocalGatewaySetup/LocalGatewaySetup.cs — threaded both the new wslPlatformInstaller and diagnosticsSink ctor params, preserved the engine self-heal + WslPlatformProbe-gated HasDistroAsync, and kept the upstream _diagnostics.RunStarted call in RunLocalOnlyAsync. Factory now passes both wslPlatformInstaller and diagnosticsSink. Validation: build OK, Shared.Tests 1891/29 skipped, Tray.Tests 1212, Connection.Tests 229. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes from dual-model code review on user/regisb/WSLInstall1 branch. openclaw#1 + openclaw#2 + openclaw#6: SuppressNodeAutoApprove (process-global bool, racy, idempotency-fragile) replaced with IDisposable AcquireNodeAutoApproveSuppression() using volatile int + Interlocked counter. Token Dispose is idempotent via Interlocked.Exchange. Local-loopback URL gate scopes suppression to ws://localhost / ws://127.* so manual operator connects to remote gateways still surface their pending-approval toast. openclaw#3: ElevatedWslPlatformInstaller.RunDefaultAsync now wraps WaitForExitAsync in try/catch(OperationCanceledException), attempts best-effort Kill(entireProcessTree:true) on the launcher (will fail for the elevated child — documented), then rethrows so the engine surfaces Cancelled cleanly. openclaw#4: ConfigureWslInstance failure remap to wsl_firstboot_config_failed_after_install is now gated on a stderr-signature check (LooksLikePostInstallKernelIssue: WslRegisterDistribution, Hyper-V, 0x80370102, kernel, vmcompute, LxssManager, instance-is-corrupted, etc). Unrelated configure failures (apt repo issues, script bugs) keep their original error code so users don't get a misleading 'reboot to fix' suggestion. openclaw#5 + openclaw#15: New tests — Engine_SelfHeal_ResetsCancelledStatus, Engine_ConfigureFails_AfterJustInstalled_DoesNotRemapUnrelatedFailure, LooksLikePostInstallKernelIssue_MatchesKnownKernelSignatures, PairAsync_AwaitsConfiguredDelay, PairAsync_CancellationDuringDelay_Propagates. openclaw#7: OnboardingV2Bridge now treats LocalGatewaySetupStatus.RequiresAdmin like other terminal/blocking statuses — surfaces error message via LocalizeFailureMessage and suppresses the running-phase info card. openclaw#8: WslPlatformInstall DefaultPostInstallProbeAttempts comment rewritten to reflect real worst case (~33s, but typically 1–2 probes) rather than the misleading '6 × 500ms = 3s'. openclaw#9: Extracted WslInstallCycleIssueCodes static class with const issue codes + IReadOnlySet<string> All. Replaced 5 duplicated string literals (top-level self-heal RemoveAll + 4 EnsureWslEnabled outcome branches) with the constants. openclaw#10: CreateLocalOnly factory now reads OPENCLAW_PAIR_RETRY_DELAY_MS env var at wire-up and passes it as SettingsWindowsTrayNodeProvisioner pairRetryDelay so operators can tune the retry without rebuilding. openclaw#11: LocalizeFailureMessage now exposes an isGenericFallback signal; MaybeAppendSetupDiagnosticsHint skips the 'Setup diagnostics: …' wrap when we returned the V2_Progress_GenericFailure placeholder (the generic message already directs to logs). openclaw#13: SetupExistingGatewayClassifier.ClassifyAsync accepts an optional IOpenClawLogger so WSL probe failures surface in the diagnostic stream instead of being silently swallowed by NullLogger. openclaw#14: ComputeInfoMessage phase predicate tightened to (status==Running && phase<=EnsureWslEnabled) so the 'Check system' hint stops once the engine moves past EnsureWslEnabled, including transient non-Running states between phases. openclaw#16: Preflight port-probe (host-side TCP check) now runs unconditionally rather than being gated on WSL platform.State==Installed, so users find out about host-side port conflicts before spending 30+ seconds on WSL install. Validation: build.ps1 green; Shared 1891/29 skipped, Tray 1217 (+5), Connection 229 all pass; ARM64 publish succeeds. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Round 2 dual-model review of commit 6bf187f produced 12 findings; this commit fixes 8 (A–F, H, I). G (probe i18n heuristic) was attempted but reverted — the proposed change to treat unknown wsl --status output as NotInstalled mis-classified valid Unknown cases (policy-blocked, generic failures) that have pinning tests. A — IsLocalLoopbackGatewayUrl now parses URI properly via Uri.TryCreate + IPAddress.IsLoopback; rejects ws://localhost.evil.com and ws://127.attacker.com; accepts ws://[::1] IPv6 loopback. B — LooksLikePostInstallKernelIssue tightened: returns false for null/whitespace Detail (was true), no longer matches bare 'kernel' substring (now requires 'WSL kernel'/'WSL2 kernel'/'kernel component'/'kernel image'). Test cases extended. C — App.xaml.cs onboarding window: release suppression token BEFORE clearing _onboardingWindow=null in both Completed and Closed handlers (race window with gateway event pump); wrap Activate() in try/catch to release token on setup failure. D — Counter clamp on over-release replaced with CAS loop saturating decrement; eliminates Interlocked.Exchange(0) clobbering a concurrent Acquire after over-release. E — _wslPlatformJustInstalled reset at top of RunLocalOnlyAsync so a hypothetical reused engine instance doesn't carry the flag into a second run. F — CreateWslInstance failures now also gated through the post-install reboot remap (symmetric with ConfigureWslInstance); kernel-state errors hit during the instance install on a fresh WSL platform now surface as 'restart your PC' instead of 'Couldn't download Ubuntu'. H — Post-install probe retry loop now continues on Unknown state too (was bailing on first state change); final (Unknown, exitCode==0) maps to InstalledRequiresRestart instead of Failed, so successful installs on slow ARM64 hosts aren't mis-reported as Failed. Updated test Installer_RemapsUnknownProbeToRequiresRestart_WhenExitCodeIsZero. I — OPENCLAW_PAIR_RETRY_DELAY_MS now clamped to 60s upper bound; unparseable values logged as Warn instead of silently ignored. Validation: build.ps1 green; Shared 1891/29 skipped, Tray 1217, Connection 229 all pass; ARM64 publish succeeds. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Round 3 dual-model review of commit 6473c1a produced 11 findings; this commit fixes 7 (A, B, C, D, F, G, I). E (retry-loop early-break on stable NotInstalled) was attempted but reverted — conflicts with the pinned Installer_RetriesPostProbe_RidingOutFinalizeRace test that exercises a 3-NotInstalled-then-Installed convergence. A — Added IsLocalLoopbackGatewayUrlTests (27 theory cases) pinning the URI-parsing replacement against hostile lookalikes (ws://localhost.evil.com, ws://127.attacker.com, ws://0.0.0.0), wrong schemes (http/file), private IPs, malformed input, plus positive coverage for ws://localhost, ws://127.0.0.2, ws://[::1], wss://[::ffff:127.0.0.1]. B — LooksLikePostInstallKernelIssue gained a (detail, postFreshInstall) overload. When the caller already knows WSL was just installed in this session AND the engine's collected Detail is blank, default to true. This restores the Round-1 protection for the common case where WslRegisterDistribution HRESULTs / host-compute warmup errors write to the Windows event log rather than stderr, leaving result.Detail empty. Strict-signature gate still wins for confirmed-unrelated errors (apt/script/permission). All 4 callers updated to pass postFreshInstall: true since they already gate on _wslPlatformJustInstalled. C — Widened the App.xaml.cs onboarding-window try/catch to cover BOTH event-handler wirings AND Activate(). The Round-2 try/catch only covered Activate, leaving a token-leak window if += wiring threw. Catch handler also nulls _onboardingWindow so the next ShowOnboardingAsync() doesn't trip over a half-broken reference. Updated misleading comment (Round-3 I) to describe the actual invariant (no-onboarding-in-progress => no-suppression-active) instead of an impossible intra-thread race. D — Dropped 'volatile' on _suppressNodeAutoApproveCount; Interlocked already provides full barriers and pairing with Interlocked.*(ref volatile-field) triggers CS0420 warnings that would break the build under TreatWarningsAsErrors. Bare reads now use Volatile.Read for explicit acquire fence. F — Replaced manual bracket-strip + IPAddress.TryParse with uri.IsLoopback. Subsumes localhost / 127.0.0.0/8 / [::1] / [::ffff:127.0.0.1] in one call; correctly handles IPv6 zone IDs like [fe80::1%eth0] that the previous code mishandled. Wildcard 0.0.0.0 stays rejected (it's the 'any' bind address; not loopback). G — SetupExistingGatewayClassifier now distinguishes WslPlatformState.Unknown from NotInstalled. Unknown (transient probe timeout / policy block) returns false (no app-owned distro confirmed) instead of falling back to local-setup evidence alone, preventing a flaky probe from steering the user into the 'replace existing local gateway' UX path. I — Rewrote the misleading Round-2 comment on the OnboardingCompleted reorder; new text describes the real invariant rather than the impossible intra-thread race claim. Validation: build.ps1 green; Shared 1891/29 skipped, Tray 1217, Connection 256 (+27 IsLocalLoopback theory rows) all pass; ARM64 publish succeeds. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Round 4 dual-model review of commit e2f0d7c found no CRITICAL/HIGH issues — branch is in good shape. This commit applies the 3 minor findings both reviewers agreed were worth fixing. A — Extended IsLocalLoopbackGatewayUrlTests with 4 more theory rows to pin behavior so future .NET runtime changes to Uri.IsLoopback can't silently shift the security gate: case-insensitive host (LOCALHOST), userinfo (user@localhost), decimal-encoded loopback (2130706433 -> 127.0.0.1, accepted), and percent-encoded host (%6cocalhost, rejected at parse). B — SetupExistingGatewayClassifier Unknown-branch warning now routes through probeLogger instead of the static Logger sink, matching the deliberate pattern (and its own comment ~30 lines above) that ensures WSL probe breadcrumbs reach the same diagnostic stream as the rest of setup. C — App.xaml.cs ShowOnboardingAsync catch block now restores the previously-connected gateway (if disconnectedForOnboarding && restoreGatewayId != null) before rethrowing. Without this, an exception during onboarding-window setup left the user silently disconnected with no UI feedback since callers fire-and-forget (_ = ShowOnboardingAsync()). Validation: build.ps1 green; Shared 1891/29 skipped, Tray 1217, Connection 260 (+4 coverage rows) all pass; ARM64 publish succeeds. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Manual test 2026-05-22 (Round-5 ship-readiness check): Phase 14 fails with 'Pairing this PC as a node failed' on first attempt, but the actual pairing succeeds ~25ms later — UI shows red X while logs show 'Received device token - we are now paired!'. Try-again works. Root cause: GatewayConnectionManager.EnsureNodeConnectedAsync stale-state race. Sequence: 1. First Phase-14 connect → gateway returns PAIRING_REQUIRED (role-upgrade) 2. Engine approves via WSL CLI; 5s wait 3. Retry connect: StartNodeConnectionAsync kicks off new connect attempt 4. State machine STILL shows NodeState=PairingRequired from attempt 1 (state machine doesn't reset on a new connect) 5. New connect's SimulateStatus(Connected) → NodeConnected trigger REJECTED from PairingRequired guard → EmitStateChanged still fires with stale snapshot 6. Handler's PairingRequired-when-suppressed case trips → tcs.SetException → engine reports FailedRetryable 7. ~25ms later the real hello-ok arrives with deviceToken → pairing actually succeeds — but the engine has already given up Fix: capture the pre-existing NodePairingRequestId at EnsureNodeConnectedAsync entry. Handler ignores PairingRequired events whose requestId MATCHES the captured stale one — only fast-fails on a FRESH PairingRequired event (different requestId) tied to the new attempt. Also tightened the synchronous Handler re-eval after StartNodeConnectionAsync to only act on truly terminal states (Connected+Paired, PairingRejected, Error). Test: EnsureNodeConnectedAsync_StaleSnapshotAfterPrevPairingRequired_DoesNotFastFailNewAttempt — poisons state machine with PairingRequired + suppression-active, then verifies a successful retry connect transitions to Connected+Paired instead of fast-failing on the stale snapshot. Validation: build.ps1 green; Shared 1891/29 skipped, Tray 1217, Connection 261 (+1 regression test) all pass; ARM64 publish succeeds. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…uality patch Round 6 dual-model review found the requestId-equality guard introduced in 373c24a is fragile — if the gateway re-emits PairingRequired with the same requestId on a fast retry, the guard wrongly swallows the legitimate fresh fast-fail and the engine hangs the 35s timeout. Both reviewers flagged the same edge case; both pointed at the cleaner structural fix (commit 373c24a's own message even hinted at it: 'the state machine doesn't reset on a new connect'). A — Replaced the EnsureNodeConnectedAsync stale-requestId guard with a structural reset at the source. StartNodeConnectionAsync now calls _stateMachine.StartNodeConnecting() which transitions the node sub-FSM to Connecting (also from PairingRequired / PairingRejected / RateLimited / Idle / Error) and RebuildSnapshot clears the stale NodePairingRequestId. The Handler can now treat any PairingRequired-when-suppressed event as authoritative without requestId disambiguation — subsumes the same-requestId edge case, the stale-null edge case, and the SetNodeInfo null-coalesce side-effect. C — Added RateLimited as a terminal state in both the Handler switch and the synchronous isTerminal gate. NodeRateLimited trigger is not currently wired into GatewayConnectionManager, but the moment it gets wired up the prior code would hang the 35s default instead of surfacing a useful error. B — Added EnsureNodeConnectedAsync_FreshPairingRequiredAfterReset_FastFailsWhenSuppressed companion test. The existing StaleSnapshotAfterPrevPairingRequired test only exercises the synchronous-terminal-gate half of the fix; this new test pins the fresh-PairingRequired-fast-fail path so a future refactor can't silently break it. Validation: build.ps1 green; Shared 1891/29 skipped, Tray 1217, Connection 262 (+1 fresh-PairingRequired test) all pass; ARM64 publish succeeds. Manual smoke test of 373c24a's UI symptom (Phase 14 false-failure) already succeeded on clean ARM64 machine before this round; this commit retains the user-visible behavior while making the underlying mechanism more robust. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Round 7 reviewers (Opus + Codex consensus) flagged that review_diff_373c24a.txt and review_diff_373c24a_test.txt were committed to the repo root in e48a15e — reviewer-artifact patches that should never have been tracked. Remove the two stray files and add review_diff_*.txt to .gitignore so future Hanselman rounds don't accidentally re-introduce them. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per PR-prep feedback, scrub all branch-introduced 'Round-N' / 'Hanselman' / dated 'manual test YYYY-MM-DD' references from source and test comments. The comments now describe the behavior and rationale without referring to specific review rounds or dates. Also revert the .gitignore change that added 'review_diff_*.txt' — reviewer-artifact files don't need a project-level ignore rule. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Codex review: needs real behavior proof before merge. Latest ClawSweeper review: 2026-05-23 01:51 UTC / May 22, 2026, 9:51 PM ET. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: yes. from source inspection for the missing-WSL path: current main turns a failed PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. Real behavior proof Mantis proof suggestion Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land the onboarding robustness work after maintainers approve the elevated-installer boundary and the contributor adds inspectable clean-Windows proof; otherwise split the pairing/state-machine fixes from the auto-install UX. Do we have a high-confidence way to reproduce the issue? Yes from source inspection for the missing-WSL path: current main turns a failed Is this the best way to solve the issue? Unclear until maintainer product/security review: the implementation is plausible and well covered by source tests, but in-app elevated WSL installation is a new user-facing boundary and needs real setup proof. Label justifications:
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 8714381248a2. |
|
ClawSweeper PR egg 🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat. Where did the egg go?
|
Summary
Make the local-gateway setup wizard work end-to-end on a clean Windows machine — including auto-installing the WSL platform when it's missing — so first-run users get from "Set up locally" to a paired node without manual
wsl --install+ reboot rituals or terminal commands.End-state validation: full 7-step onboarding succeeds first-try on a clean ARM64 box, with no manual intervention.
What this PR delivers
1. Elevated WSL platform install (the core feature)
WslPlatformProbe(locale-tolerantwsl --statusparser anchored on theaka.ms/wslinstallURL) detects "platform missing" vs "platform present but unhealthy" without hanging on the default 30s WSL runner timeout — uses an independent 5s probe ceiling.ElevatedWslPlatformInstallershells out towsl --install --no-distributionviaShellExecute "runas"(UAC prompt). RecognisesERROR_SUCCESS_REBOOT_REQUIRED(3010), retries the post-install probe to ride out the lifted-WSL warmup race (typically 1–2 probes, worst case ~33s on slow ARM64), and surfacesCancelledcleanly when the user declines UAC.LocalGatewaySetup.cs): preflight emitswsl_platform_not_installedas aWarning(not blocking);EnsureWslEnabledphase consumes the warning, runs the installer, then re-runs preflight. Self-heal logic resets staleStatus=RequiresRestart / RequiresAdmin / Cancelled / FailedTerminal / Blockedentries on engine entry so the wizard can re-evaluate the host after the user reboots / re-launches.WslInstallCycleIssueCodesstatic class extracts the 7 cycle-related issue codes from string literals into named constants — used in 5 sites for the self-heal RemoveAll + outcome-mapping inEnsureWslEnabled.2. Onboarding-window suppression of manager-side node auto-approve
During the V2 onboarding engine's Phase-14 (
PairWindowsTrayNode), the engine drives canonical role-upgrade approval via the WSL CLI (openclaw devices approve). TheGatewayConnectionManager's own RPC-basednode.pair.approveauto-approver would race against that with the wrong requestId namespace (DEVICE-pair requestIds surface on node-pair pending events during role-upgrade), causing the gateway to "unknown requestId" and then service-restart mid-pair.IGatewayConnectionManager.SuppressNodeAutoApprove(bool, process-global, racy) →AcquireNodeAutoApproveSuppression(): IDisposable. Reference-counted viaInterlockedCAS; tokens are idempotent (Interlocked.Exchange(ref _owner, null)inDispose); over-release is a no-op (saturating decrement). Scope is gated to local-loopback gateways only viaIsLocalLoopbackGatewayUrl(url)(properUri.IsLoopbackparse — rejectsws://localhost.evil.example.com,ws://0.0.0.0, acceptsws://[::1]and decimal-encoded loopback).App.xaml.cs.ShowOnboardingAsync: acquires a suppression token for the onboarding window's lifetime; releases it from bothOnboardingCompletedandClosedhandlers (idempotent). Release happens before_onboardingWindow=nullso observers using window-null as a "finished" signal also observe suppression released. Wire-up +Activate()are wrapped intry/catchso a throw can't leak the token; the catch path also restores the previously-connected gateway if onboarding had disconnected it (otherwise the user would be silently stranded sinceShowOnboardingAsyncis called fire-and-forget).3. State-machine reset on retry (fixes a real Phase-14 false-failure)
Surfaced by manual ARM64 testing:
Received device token - we are now paired!).PAIRING_REQUIRED(role-upgrade) and the engine approved it via WSL CLI, the retry connect'sStartNodeConnectionAsyncwas kicked off but the state machine still showedNodeState=PairingRequiredfrom the previous attempt. The retry's first event (SimulateStatus(Connected)) didn't transition (NodeConnectedrejected fromPairingRequired);EmitStateChangedstill fired with the stale snapshot;EnsureNodeConnectedAsync's Handler tripped the suppression-fast-fail on the stale state.ConnectionStateMachine.StartNodeConnecting()now also resets fromPairingRequired/PairingRejected/RateLimited/Error/Idle(was onlyIdle/Error).RebuildSnapshotclearsNodePairingRequestIdand resetsNodePairingStatustoUnknownon the way out ofPairingRequired.StartNodeConnectionAsyncnow callsStartNodeConnecting()so every new attempt starts from a clean slate.RateLimitedto bothEnsureNodeConnectedAsync's Handler switch and its synchronousisTerminalgate (latent —NodeRateLimitedis not currently wired, but the moment it gets wired the wait would hang the 35s default).4. V2 onboarding bridge
OnEngineStateChangednow surfaces error message forRequiresAdmin(was onlyRequiresRestart+ FailedRetryable/Terminal/Blocked).LocalizeFailureMessagenow returns(string?, out bool isGenericFallback);MaybeAppendSetupDiagnosticsHintskips the "Setup diagnostics: …" wrap when the V2_Progress_GenericFailure placeholder was used (already directs to logs; double-hint clutters the small error card).ComputeInfoMessagephase predicate tightened fromphase is NotStarted or Preflight or EnsureWslEnabledtostatus == Running && phase <= EnsureWslEnabledso the generic "Check system" hint stops once the engine moves past EnsureWslEnabled (downstream phases own their own labels).wsl_*failure codes toLocalizeFailureMessageand corresponding strings to all five locale resw files (en-us,fr-fr,nl-nl,zh-cn,zh-tw).5. Smaller engine fixes
LooksLikePostInstallKernelIssue(detail, postFreshInstall)overload — when caller knows WSL was just installed in this session ANDresult.Detailis blank (common:WslRegisterDistributionHRESULTs go to the Windows event log, not stderr), default totrueso the user gets the correct "restart your PC" recommendation. Strict signature match still wins for known-unrelated failures (apt/script/permission). Reset_wslPlatformJustInstalledat the top of everyRunLocalOnlyAsync.OPENCLAW_PAIR_RETRY_DELAY_MSenv var override onSettingsWindowsTrayNodeProvisionerctor (clamped to 60s upper bound; warn on unparseable values).CreateWslInstancephase now routed through the same_wslPlatformJustInstalled + LooksLikePostInstallKernelIssuereboot-remap asConfigureWslInstance(symmetric: kernel-state errors hit during instance install on fresh WSL surface as "restart your PC" instead of "Couldn't download Ubuntu")._portProbe.IsPortAvailable) now runs unconditionally — only the WSL-sideIsExistingGatewayPortAsyncrecovery path is gated onplatform.State == Installed. Users find out about host-side port conflicts before spending 30+ seconds on WSL install.SetupExistingGatewayClassifier:ClassifyAsyncaccepts optionalIOpenClawLogger;Unknownplatform state is now treated separately fromNotInstalled(flaky probe no longer mislabels setup asAppOwnedLocalWslfrom local-setup evidence alone).ElevatedWslPlatformInstaller.RunDefaultAsynccatchesOperationCanceledException, attempts best-effortKill(entireProcessTree:true), rethrows — documented limitation that the elevated child runs under a different token and may complete in the background.6. Process / hygiene
volatileon_suppressNodeAutoApproveCount(fixes CS0420 —Interlockedalready provides full barriers); bare reads useVolatile.Readfor explicit acquire fence.Test coverage
All behavioural changes are pinned by regression tests. New / modified test files:
tests/OpenClaw.Connection.Tests/IsLocalLoopbackGatewayUrlTests.cs(NEW, 27 theory rows) — pins URI-parsing replacement against hostile lookalikes, wildcards, IPv6, decimal-encoded loopback, percent-encoded host.tests/OpenClaw.Connection.Tests/GatewayConnectionManagerTests.cs(+102 lines) —EnsureNodeConnectedAsync_StaleSnapshotAfterPrevPairingRequired_DoesNotFastFailNewAttempt(Phase-14 false-fail regression) +EnsureNodeConnectedAsync_FreshPairingRequiredAfterReset_FastFailsWhenSuppressed(companion pinning the suppression-fast-fail path still works).tests/OpenClaw.Tray.Tests/WslPlatformInstallTests.cs(+945 lines) — coversWslPlatformProbe(locale-tolerant banner match, status timeout fast-fail, RecordingWslCommandRunner / HangingWslCommandRunner),ElevatedWslPlatformInstaller(RequiresRestart exit code, NotInstalled-after-success → RequiresRestart, Unknown post-probe → RequiresRestart, clamp-attempts, finalize-race retry, RequiresAdmin → RequiresAdmin not Failed, post-reboot resume of persisted state, self-heal ofCancelled/FailedTerminal/Blocked/RequiresAdmin,Engine_ConfigureFails_AfterJustInstalled_DoesNotRemapUnrelatedFailure,LooksLikePostInstallKernelIssuestrict-signature theory rows, preflight port-probe unconditional, preflight timeout fast-fail).tests/OpenClaw.Tray.Tests/WindowsTrayNodePairingApprovalTests.cs(+75 lines) —PairAsync_AwaitsConfiguredDelay,PairAsync_CancellationDuringDelay_Propagates.Validation
./build.ps1— greendotnet test tests/OpenClaw.Shared.Tests/...— 1891 pass / 29 skippeddotnet test tests/OpenClaw.Tray.Tests/...— 1235 passdotnet test tests/OpenClaw.Connection.Tests/...— 262 passdotnet publish ... -r win-arm64 --self-contained— greenFiles changed
Major source-code surface:
src/OpenClaw.Connection/GatewayConnectionManager.cs(+208 lines)src/OpenClaw.Connection/IGatewayConnectionManager.cs(+27 lines)src/OpenClaw.Connection/ConnectionStateMachine.cs(+17 lines)src/OpenClaw.Tray.WinUI/App.xaml.cs(+171 lines)src/OpenClaw.Tray.WinUI/Onboarding/V2/OnboardingV2Bridge.cs(+182 lines)src/OpenClaw.Tray.WinUI/Services/LocalGatewaySetup/LocalGatewaySetup.cs(+588 lines)src/OpenClaw.Tray.WinUI/Services/LocalGatewaySetup/WslPlatformInstall.cs(+406 lines, new)src/OpenClaw.Tray.WinUI/Services/SetupExistingGatewayClassifier.cs(+53 lines)src/OpenClaw.Tray.WinUI/Strings/{en-us,fr-fr,nl-nl,zh-cn,zh-tw}/Resources.resw(+53 each)src/OpenClawTray.OnboardingV2/Pages/LocalSetupProgressPage.cs(+78)src/OpenClawTray.OnboardingV2/V2Strings.cs(+37)docs/ONBOARDING_WIZARD.md(+13),docs/WINDOWS_NODE_TESTING.md(+9)Total: 27 files changed, +3158 / −136 lines.