Log Helix queue health summary on job submission (opt-in) #16922
chcosta wants to merge 5 commits into
Surfaces the new `queueStats` block returned by `POST /api/jobs` (queue depth, estimated wait, average run duration, snapshot time) so that users who submit Helix jobs can see queue health at submit time.

Behavior:
* Off by default. Opt in via MSBuild `EnableShowHelixQueueStats=true` or, for direct JobSender consumers, `.WithShowQueueStats(true)` on the fluent `IJobDefinition`.
* Emits a short, human-readable summary with snapshot time converted to the local timezone.
* Emits an MSBuild `warning :` line when the estimated wait exceeds the 30-minute SLA (the queue is at capacity / unhealthy) or when the stats snapshot is more than 15 minutes stale.
* Prints a one-time preview-feature banner and a one-time link to the dnceng First Responders Teams channel per process to keep log noise low across multi-job builds.

Changes:
* New `Models.QueueStats` partial bound to the `queueStats` JSON.
* Partial `Models.JobCreationResult` extension exposing `QueueStats` (keeps the generated client file untouched).
* New `IJobDefinition.WithShowQueueStats(bool)` fluent method and `JobDefinition` implementation gating the new log output.
* New `SendHelixJob.EnableShowHelixQueueStats` task input plumbed via `Microsoft.DotNet.Helix.Sdk.MonoQueue.targets` and defaulted to `false` in `Microsoft.DotNet.Helix.Sdk.props`.
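The two warning conditions described above (estimated wait over the 30-minute SLA, snapshot older than 15 minutes) can be sketched as a small pure function. This is a minimal sketch under stated assumptions, not the PR's code: `QueueHealth`, `Evaluate`, `SlaThreshold`, and the parameter names are hypothetical, and only the two threshold values come from the description.

```csharp
using System;

// Hypothetical helper for illustration; names are not taken from the PR.
public static class QueueHealth
{
    // Threshold values as stated in the PR description.
    public static readonly TimeSpan SlaThreshold = TimeSpan.FromMinutes(30);
    public static readonly TimeSpan SnapshotStaleThreshold = TimeSpan.FromMinutes(15);

    // Returns which of the two warnings would apply for a given snapshot.
    // "now" is passed in so the function stays deterministic and testable.
    public static (bool ExceedsSla, bool IsStale) Evaluate(
        TimeSpan estimatedWait, DateTimeOffset generatedAt, DateTimeOffset now)
    {
        bool exceedsSla = estimatedWait > SlaThreshold;
        bool isStale = (now - generatedAt) > SnapshotStaleThreshold;
        return (exceedsSla, isStale);
    }
}
```

Passing the current time as a parameter, rather than reading the clock inside the function, would also make the stale-snapshot path easy to cover in a unit test.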
Pull request overview
Adds an opt-in “Helix queue health” summary that surfaces queue wait/depth/snapshot info returned by the Helix POST /api/jobs response, so Arcade/Helix job submitters can see queue conditions at submit time without visiting the Helix portal.
Changes:
- Introduces a new MSBuild property (`EnableShowHelixQueueStats`, default `false`) and plumbs it through `SendHelixJob` into `IJobDefinition.WithShowQueueStats(...)`.
- Extends the Helix client model (`JobCreationResult`) with a `queueStats` block and adds a `QueueStats` JSON model.
- Implements logging in `JobDefinition.SendAsync` to print a compact summary plus SLA/staleness warnings (once-per-process banner and support link).
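One common way to get a once-per-process banner is an atomic flag. The sketch below illustrates that technique only; `OncePerProcess` and `TryClaimBanner` are hypothetical names, and the PR may implement the gating differently.

```csharp
using System;
using System.Threading;

// Hypothetical once-per-process gate; not the PR's actual implementation.
public static class OncePerProcess
{
    private static int s_bannerShown; // 0 = not yet shown, 1 = already shown

    // Returns true only for the first caller in this process.
    // Interlocked.Exchange makes the check-and-set atomic, so concurrent
    // job submissions cannot both print the banner.
    public static bool TryClaimBanner() =>
        Interlocked.Exchange(ref s_bannerShown, 1) == 0;
}
```

A caller would print the banner only when `TryClaimBanner()` returns `true`; every later submission in the same process then skips it.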
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/Microsoft.DotNet.Helix/Sdk/tools/Microsoft.DotNet.Helix.Sdk.props | Adds defaulted opt-in property EnableShowHelixQueueStats=false. |
| src/Microsoft.DotNet.Helix/Sdk/tools/Microsoft.DotNet.Helix.Sdk.MonoQueue.targets | Passes EnableShowHelixQueueStats into the SendHelixJob task invocation. |
| src/Microsoft.DotNet.Helix/Sdk/SendHelixJob.cs | Adds a task parameter and forwards it to IJobDefinition.WithShowQueueStats(...). |
| src/Microsoft.DotNet.Helix/JobSender/IJobDefinition.cs | Adds the fluent API method WithShowQueueStats(bool). |
| src/Microsoft.DotNet.Helix/JobSender/JobDefinition.cs | Implements the flag and logs queue health after job submission. |
| src/Microsoft.DotNet.Helix/Client/CSharp/QueueStats.cs | Adds the QueueStats JSON-bound model for the new response block. |
| src/Microsoft.DotNet.Helix/Client/CSharp/JobCreationResult.cs | Partial-class extension to expose QueueStats from the job creation response. |
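The partial-class approach in the last row can be illustrated with a minimal sketch. Both halves are shown in one block for brevity; in practice they would live in separate files. `JobCreationResultSketch`, `QueueStatsStub`, and the `Name` property are stand-ins invented for this example, not the real client types.

```csharp
using System;

// Stand-in for the stats model; the real QueueStats has more fields.
public sealed class QueueStatsStub
{
    public string QueueName { get; set; }
}

// Stand-in for the generated half. In the real repo this file is produced
// by a code generator and would be overwritten on regeneration.
public partial class JobCreationResultSketch
{
    public string Name { get; set; }
}

// Hand-written half. Because the class is partial, this adds a member
// without editing the generated file.
public partial class JobCreationResultSketch
{
    public QueueStatsStub QueueStats { get; set; }
}
```

The compiler merges both declarations into a single type, so consumers see `Name` and `QueueStats` side by side.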
prefer server returned queue name rather than client side queue name Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Addresses reviewer feedback on dotnet#16922: the boolean parameter was a 'boolean trap' at call sites. Renamed to a no-arg WithQueueStats() opt-in method; SendHelixJob now branches on the EnableShowHelixQueueStats MSBuild flag instead of piping the bool through the fluent API.
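The shape of that change can be sketched as follows. This is a simplified illustration, not the PR's code: `ISketchJobDefinition`, `SketchJobDefinition`, `SketchSender`, and `ShowQueueStats` are hypothetical stand-ins, and only `WithQueueStats()` and the `EnableShowHelixQueueStats` flag are named in the source.

```csharp
using System;

// Simplified stand-in for the fluent job definition; the real
// IJobDefinition has many more members.
public interface ISketchJobDefinition
{
    // No-arg opt-in: the call site reads as intent, not as a bare "true".
    ISketchJobDefinition WithQueueStats();
    bool ShowQueueStats { get; }
}

public sealed class SketchJobDefinition : ISketchJobDefinition
{
    public bool ShowQueueStats { get; private set; }

    public ISketchJobDefinition WithQueueStats()
    {
        ShowQueueStats = true;
        return this;
    }
}

public static class SketchSender
{
    // The caller branches on the MSBuild flag instead of passing
    // the bool through the fluent chain.
    public static ISketchJobDefinition Configure(
        ISketchJobDefinition def, bool enableShowHelixQueueStats)
    {
        if (enableShowHelixQueueStats)
        {
            def = def.WithQueueStats();
        }
        return def;
    }
}
```

Compared with `.WithShowQueueStats(true)`, a reader of `.WithQueueStats()` does not have to look up what the boolean argument means.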
        return "unknown";
    }

    DateTime local = utc.LocalDateTime;
Using local time seems inconsistent. Is UTC better? I haven't looked around but I feel that UTC is much more common in our builds.
It's true that UTC is more common, and it's how we store data on the server. Given that this is a QoL (and client-side) type of change, I personally found it annoying to constantly do the translation. If there's a strong opinion, I can move this back to UTC.
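For reference, the two options under discussion differ by only a few lines. The sketch below is illustrative: `SnapshotTime`, `FormatUtc`, `FormatLocal`, and the format string are assumptions, not the PR's code; only the use of `LocalDateTime` on a `DateTimeOffset` is taken from the diff.

```csharp
using System;
using System.Globalization;

// Hypothetical formatters comparing the two display choices.
public static class SnapshotTime
{
    private const string Format = "yyyy-MM-dd HH:mm";

    // UTC rendering: identical on every machine.
    public static string FormatUtc(DateTimeOffset generatedAt) =>
        generatedAt.UtcDateTime.ToString(Format, CultureInfo.InvariantCulture) + " UTC";

    // Local rendering: converts to the machine's timezone and appends the
    // zone's display name, picking the daylight or standard variant.
    public static string FormatLocal(DateTimeOffset generatedAt)
    {
        DateTime local = generatedAt.LocalDateTime;
        TimeZoneInfo zone = TimeZoneInfo.Local;
        string zoneName = zone.IsDaylightSavingTime(local)
            ? zone.DaylightName
            : zone.StandardName;
        return local.ToString(Format, CultureInfo.InvariantCulture) + " " + zoneName;
    }
}
```

The trade-off is that local output reads naturally for the person who ran the build but depends on the build machine's timezone, while UTC output is the same everywhere.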
…sta/add-helix-queue-stats-logging
Summary
Surfaces the new `queueStats` block that `POST /api/jobs` now returns from the Helix service, so anyone submitting a Helix job through the Arcade SDK can see the current health of their target queue at submit time without having to dig into the portal.

The feature is opt-in via a new MSBuild property and a matching fluent-API method, so it adds zero log noise to builds that haven't opted in.
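User-visible change
Opt-in
Set the MSBuild property `EnableShowHelixQueueStats` to `true`, or pass `/p:EnableShowHelixQueueStats=true` on the command line. Direct `Microsoft.DotNet.Helix.JobSender` callers use `.WithQueueStats()` on `IJobDefinition`.
What lands in the build log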
Healthy queue:
At-capacity queue (estimated wait exceeds the 30-minute SLA):
Stale snapshot (> 15 min old): a second `warning :` line is emitted noting the age and threshold; the snapshot line also gets a `(stale)` tag.
Noise-control choices
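* The preview-feature banner (`note :`) prints once per process.
* The snapshot time is shown in the local timezone (for example, `Pacific Daylight Time`), so there is no more guessing UTC offsets.
* Warnings use the MSBuild `warning :` prefix so they show up in PR build summaries and AzDO timeline issues.
Implementation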
| File | Description |
|---|---|
| `src/Microsoft.DotNet.Helix/Client/CSharp/QueueStats.cs` (new) | Model bound to the `queueStats` JSON (`queueName`, `depth`, `averageRunDuration`, `estimatedWait`, `estimatedWaitMethod`, `generatedAt`). |
| `src/Microsoft.DotNet.Helix/Client/CSharp/JobCreationResult.cs` (new) | Partial-class extension exposing `QueueStats`. Keeps the generated `generated-code/Models/JobCreationResult.cs` untouched. |
| `src/Microsoft.DotNet.Helix/JobSender/IJobDefinition.cs` | Adds the `WithQueueStats()` fluent method. |
| `src/Microsoft.DotNet.Helix/JobSender/JobDefinition.cs` | After `JobApi.NewAsync(...)` returns, if the flag is on, calls `LogQueueStats(...)`, which formats and emits the summary, SLA + stale warnings, and once-per-process banner / Teams link. |
| `src/Microsoft.DotNet.Helix/Sdk/SendHelixJob.cs` | New `EnableShowHelixQueueStats` task input piped into `.WithQueueStats()`. |
| `src/Microsoft.DotNet.Helix/Sdk/tools/Microsoft.DotNet.Helix.Sdk.MonoQueue.targets` | Passes `EnableShowHelixQueueStats="$(EnableShowHelixQueueStats)"` onto the `<SendHelixJob />` element. |
| `src/Microsoft.DotNet.Helix/Sdk/tools/Microsoft.DotNet.Helix.Sdk.props` | Defaults `EnableShowHelixQueueStats=false` (mirrors the existing `EnableHelixJobMonitor` pattern). |

Validation
Unit tests
`Microsoft.DotNet.Helix.Sdk.Tests`: 121/121 pass with these changes.
End-to-end against live `helix.dot.net`
I packed the modified packages locally (`11.0.0-dev`) into `arcade/artifacts/packages/Debug/NonShipping/` and consumed them from a separate scratch repo (`c:\repos\arcade-helix-stats-validate`) that imports `Microsoft.DotNet.Helix.Sdk` via a fork-pinned `NuGet.config`. From that repo:
* Gate OFF (default): submitted a tiny job to `windows.10.amd64.open`. Build log contained only the existing `Sending Job to ...` / `Created job list at ...` / `Sent Helix Job; see work items at ...` lines. Zero queue-health output, confirming the opt-in gate works.
* Gate ON via `/p:EnableShowHelixQueueStats=true`: same submit, against the same queue. Build log included the full preview banner + health summary + First Responders link shown above. Queue happened to be healthy (depth 0), so no SLA warning.
* Gate ON against a backlogged queue (`osx.15.amd64.open`): exercised separately via a local programmatic probe consuming `Microsoft.DotNet.Helix.JobSender` directly. Output included the `[AT CAPACITY]` tag and the `warning : ... exceeds the 30-minute SLA - ...` line shown above. Confirmed the once-per-process gating on both the banner and the Teams link by submitting twice in the same process and observing that the second submission's output omits both.
Stale-snapshot path
Live snapshots from the service were < 1 min old during validation, so the `(stale)` tag and "snapshot is N old" warning weren't observed live; both were checked by code review of `JobDefinition.cs` and are gated by `SnapshotStaleThreshold = TimeSpan.FromMinutes(15)`.
Out of scope
The Helix service is shipping additional `queueStats` fields (e.g. `P50Wait`, `P95Wait`, `HealthyCores`, `ObserverUiUrl`, `EstimatedWaitConfidence`). This PR intentionally only binds the fields needed to render the current preview summary; surfacing the richer telemetry can land in a follow-up once the field set stabilizes.
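Binding only a subset of fields is generally safe when the deserializer skips properties it does not recognize. The sketch below shows this with `System.Text.Json`, which skips unmapped JSON properties by default. This is an assumption-laden illustration: the real Helix client may use a different serializer, `QueueStatsSketch` and `QueueStatsParser` are hypothetical names, and any sample payload passed to it would be made up rather than a captured service response.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical model that binds only some of the queueStats fields.
public sealed class QueueStatsSketch
{
    [JsonPropertyName("queueName")]
    public string QueueName { get; set; }

    [JsonPropertyName("depth")]
    public int? Depth { get; set; }

    [JsonPropertyName("generatedAt")]
    public DateTimeOffset? GeneratedAt { get; set; }
}

public static class QueueStatsParser
{
    // Properties in the payload that the model does not declare are
    // skipped by System.Text.Json's default settings, so newly added
    // service fields do not break older clients.
    public static QueueStatsSketch Parse(string json) =>
        JsonSerializer.Deserialize<QueueStatsSketch>(json);
}
```

Under this approach, fields the service adds later are simply ignored until a follow-up adds matching properties to the model.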