fix: treat failed Nomad allocations as completed #401
ThetaSinner wants to merge 1 commit into main
Conversation
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
The following will be added to the changelog: [0.5.0-dev.1] - 2025-12-19, Bug Fixes
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nomad/scripts/wait_for_jobs.sh (1)
29-42: `get_status()` uses `$alloc_id` without defining/passing it (can break under `set -u` in some paths).
Today it "works" only because `alloc_id` happens to exist globally in the main loop, but this is brittle and can produce misleading error messages. Suggested fix:

```diff
-function get_status() {
-  local nomad_status="$1"
+function get_status() {
+  local alloc_id="$1"
+  local nomad_status="$2"
   # Fetch JSON and extract ClientStatus; fail fast if the command errors.
   local client_status
   if ! client_status="$(echo "$nomad_status" | jq -r '.ClientStatus')"; then
     echo "Failed to retrieve status for allocation ID: $alloc_id" >&2
     exit 1
   fi
@@
   echo "$client_status"
 }
```

And update the call site:

```diff
-  status="$(get_status "$nomad_status")"
+  status="$(get_status "$alloc_id" "$nomad_status")"
```
🧹 Nitpick comments (3)
nomad/scripts/wait_for_jobs.sh (3)
117-125: Treating failed allocations as “completed” matches the goal; tighten the failure path for robustness.
Now that failure is a "successful" exit path, please make sure the logs are consistently useful and don't accidentally trip `set -e`. Concretely:
- Consider prefixing failure log lines with a clear marker (so the summariser can reliably detect them).
- Consider explicitly handling `print_failed_tasks_and_logs` errors so a logging failure doesn't kill the script.

Proposed tweak (keeps semantics but avoids logging causing an exit):

```diff
-    print_failed_tasks_and_logs "$alloc_id" "$nomad_status"
+    print_failed_tasks_and_logs "$alloc_id" "$nomad_status" || \
+      echo "Warning: failed to print task failures/logs for $scenario_name ($alloc_id)" >&2
```
59-87: Avoid global leakage: make `failed_tasks` local; consider redacting or suppressing sensitive logs.
Because failures are now treated as completion, this function will run more often; avoid polluting global state (`failed_tasks` is currently global). Also, `nomad alloc logs` output can include secrets depending on workloads, so it is worth an explicit acknowledgement/guardrail if this runs in CI logs.

```diff
 function print_failed_tasks_and_logs() {
   local alloc_id="$1"
   local status="$2"
@@
-  failed_tasks=$(jq -r '
+  local failed_tasks
+  failed_tasks=$(jq -r '
@@
   echo "Failed task: $failed_tasks"
 }
```
89-129: Scope `ELAPSED_SECS` as local to `wait_for_job()`.
Prevents accidental reuse/corruption if this script grows.

```diff
-  ELAPSED_SECS=0
+  local ELAPSED_SECS=0
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nomad/scripts/wait_for_jobs.sh (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: test (macos-latest)
- GitHub Check: test (ubuntu-latest)
nomad/scripts/wait_for_jobs.sh
Outdated
```bash
else
  echo "Scenario $scenario_name ($alloc_id) finished with status=$status after $ELAPSED_SECS seconds."
  print_failed_tasks_and_logs "$alloc_id" "$nomad_status"
  echo "Note: Failed allocations are treated as completed. The summariser will catch any issues."
```
The summariser will catch any issues
That statement sounds a bit absolute to me. Are you sure that the summariser would really catch any issues? If a runner dies half-way, for example, this may not show up as a summariser failure I think, would it?
The rabbit picked up on this comment and I responded below. It's not that the summariser catches this in a try/catch/retry kind of sense. We do report the number of agents that ran to completion, which is supposed to be tied to these numbers. So if 1/10 succeeded in running to completion then we would be able to see that in the data.
matthme left a comment
It looks okay to me but I have a question regarding whether we really catch all error cases still.
Force-pushed 9d64836 to f7e6630
Actionable comments posted: 2
🧹 Nitpick comments (1)
nomad/scripts/wait_for_jobs.sh (1)
166-171: Consider adding summary logging for partial failures.
The exit logic correctly implements the all-or-nothing failure strategy described in the PR objectives. However, when some (but not all) allocations fail, the script exits successfully with no summary message about the partial failure.
Consider adding a summary before line 172:
```diff
+# Log summary of results
+if [[ $FAILED_ALLOCATIONS -gt 0 ]]; then
+  echo "Warning: $FAILED_ALLOCATIONS out of $TOTAL_ALLOCATIONS allocation(s) failed, but continuing as not all failed." >&2
+fi
+
 # If all allocations failed, exit with error
 if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
   echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
   exit 1
 fi

 exit 0
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nomad/scripts/wait_for_jobs.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: veeso
Repo: holochain/wind-tunnel PR: 245
File: .github/workflows/nomad.yaml:101-110
Timestamp: 2025-08-12T12:34:18.851Z
Learning: When extracting multiple allocation IDs from Nomad job output in wind-tunnel, use `grep -oE` with extended regex instead of `grep -oP` with PCRE for better portability. The pattern should capture full hex IDs including hyphens: `[0-9a-f-]+`. AWK-based solutions may not properly handle multiple allocation IDs across multiple lines.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: test (ubuntu-latest)
- GitHub Check: test (macos-latest)
🔇 Additional comments (2)
nomad/scripts/wait_for_jobs.sh (2)
118-127: LGTM! Correctly changed exit behavior to return codes.
The change from `break`/`exit` to `return 0`/`return 1` properly allows the script to continue processing remaining allocations after encountering a failed one, which is the core objective of this PR.
150-161: LGTM! Aggregation logic is correct.
The tracking of total and failed allocations is implemented correctly and integrates well with the per-allocation return codes from `wait_for_job`.
```diff
 # See https://developer.hashicorp.com/nomad/commands/job/status#running
 function is_running() {
   local status="$1"
   case "$status" in
-    complete|failed|lost|stopped|dead|unknown)
-      return 1 ;; # not running
-    *)
+    running)
       return 0 ;; # still running (e.g., pending, running)
+    *)
+      return 1 ;; # not running
   esac
 }
```
🧩 Analysis chain
🏁 Script executed:

```shell
cat -n nomad/scripts/wait_for_jobs.sh | head -100
```

Repository: holochain/wind-tunnel
Length of output: 3359

🏁 Script executed:

```shell
cat -n nomad/scripts/wait_for_jobs.sh | sed -n '100,180p'
```

Repository: holochain/wind-tunnel
Length of output: 2786
Fix handling of "pending" status in is_running() function.
The is_running() function on line 48 only checks for "running" status, but Nomad allocations can also be in "pending" state when waiting to be scheduled. When an allocation is pending, the function returns 1, causing the wait loop to exit and treat the allocation as completed. Since pending ≠ complete, it gets marked as failed.
Additionally, the comment on line 49 states "still running (e.g., pending, running)" but the code no longer handles "pending" as running, creating an inconsistency.
Update the function to treat both pending and running as active states:
```diff
 function is_running() {
   local status="$1"
   case "$status" in
-    running)
-      return 0 ;; # still running (e.g., pending, running)
+    pending|running)
+      return 0 ;; # still running
     *)
       return 1 ;; # not running
   esac
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
The commented code range:

```diff
 # See https://developer.hashicorp.com/nomad/commands/job/status#running
 function is_running() {
   local status="$1"
   case "$status" in
-    complete|failed|lost|stopped|dead|unknown)
-      return 1 ;; # not running
-    *)
+    running)
       return 0 ;; # still running (e.g., pending, running)
+    *)
+      return 1 ;; # not running
   esac
 }
```

Suggested replacement:

```bash
# See https://developer.hashicorp.com/nomad/commands/job/status#running
function is_running() {
  local status="$1"
  case "$status" in
    pending|running)
      return 0 ;; # still running
    *)
      return 1 ;; # not running
  esac
}
```
🤖 Prompt for AI Agents
In nomad/scripts/wait_for_jobs.sh around lines 44 to 53, the is_running()
function only treats "running" as active while the comment claims "pending"
should also be considered active; update the case pattern to treat both
"pending" and "running" as active states (return 0) and adjust the inline
comment to match, so the wait loop continues when allocations are pending or
running and only exits for non-active statuses.
According to the latest documentation, the status returned by the CLI is simpler than that https://developer.hashicorp.com/nomad/commands/job/status
🧩 Analysis chain
🌐 Web query:
Nomad alloc status ClientStatus values
💡 Result:
Common ClientStatus values seen on Nomad allocations include: pending, running, complete, failed, lost, and unknown. These appear in the API/CLI outputs for allocations (e.g., /v1/allocations, nomad job allocs and operator client-state examples). [1][2]
Sources:
- Nomad API — Allocations (examples showing ClientStatus = "running" / "pending"). [1]
- nomad job allocs / operator client-state documentation (CLI examples showing ClientStatus values such as "pending", "running", "complete"). [2]
🌐 Web query:
Nomad alloc status vs job status ClientStatus allocation
💡 Result:
Short answer — they’re different scopes:
- Job status: a high-level view of the job as submitted to Nomad (CLI values: running or dead; UI shows richer states like Healthy, Recovering, Stopped, Scaled Down). It summarizes how the job's allocations compare to the desired state. [1][2]
- Allocation (alloc) ClientStatus: the emergent, per-allocation runtime state reported by the client for a specific allocation (common values: pending, running, complete, failed, lost). Allocation objects also include Desired Status (the scheduler's goal for that alloc, e.g., run or stop). Use nomad alloc status to inspect an alloc's ClientStatus and task events. [3][1]
Practical mapping:
- A job can be “running” while some of its allocations are “pending” (being placed/restarted) or some are “failed” — the job status reflects the overall picture across its allocations, while ClientStatus is the concrete state of one allocation on a client. [2][4]
References:
[1] Nomad alloc status (command reference).
[2] Nomad Job concepts / job status.
[3] Node/alloc Client Status description (Desired vs Client Status).
[4] Job status / allocation examples showing pending vs running.
@ThetaSinner, I appreciate you checking the documentation! However, there's an important distinction here:
The script uses nomad alloc status (line 27) and extracts .ClientStatus (line 34), which refers to allocation status, not job status. These are different:
- Job status (from `nomad job status`): high-level view with simpler values like "running" or "dead" [as shown in the link you provided]
- Allocation ClientStatus (from `nomad alloc status`): per-allocation runtime state with values including `pending`, `running`, `complete`, `failed`, `lost`, and `unknown` [1][2]
According to the Nomad documentation, allocations can be in a pending state when they're waiting to be scheduled or placed on a client node. With the current code only checking for running, any allocation in pending state would be treated as "not running" and potentially marked as failed even though it's still active.
Could you verify in your testing whether you've observed allocations in pending state, or if they transition directly from another state to running? If allocations do pass through pending, the function should handle it to avoid premature failure detection.
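For reference, the per-allocation check under discussion can be reproduced from the command line. This is a minimal sketch assuming the Nomad CLI and jq are installed; the allocation ID shown is purely a placeholder:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder allocation ID for illustration only.
alloc_id="00000000-0000-0000-0000-000000000000"

# ClientStatus is the per-allocation runtime state (pending, running, complete,
# failed, lost, unknown), as opposed to the coarser job-level status.
client_status="$(nomad alloc status -json "$alloc_id" | jq -r '.ClientStatus')"
echo "Allocation $alloc_id ClientStatus: $client_status"
```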
Force-pushed f7e6630 to 7de0e33
Actionable comments posted: 0
♻️ Duplicate comments (1)
nomad/scripts/wait_for_jobs.sh (1)
44-53: Comment-code mismatch: "pending" mentioned but not handled.
The comment on line 49 states "(e.g., pending, running)" but only `running` is matched. If an allocation is in `pending` state (waiting to be scheduled), `is_running()` returns 1, causing the wait loop to exit and treat the allocation as failed since it's not "complete". Either update the case to include `pending`, or correct the comment if the intent is to treat pending as not-running:

```diff
 function is_running() {
   local status="$1"
   case "$status" in
-    running)
-      return 0 ;; # still running (e.g., pending, running)
+    pending|running)
+      return 0 ;; # still running
     *)
       return 1 ;; # not running
   esac
 }
```
🧹 Nitpick comments (1)
nomad/scripts/wait_for_jobs.sh (1)
166-172: Consider a minimum success threshold.
Currently, the script succeeds if at least one allocation completes successfully. This means a run with 9/10 failed allocations still exits 0, potentially masking infrastructure issues.
Consider adding a configurable threshold (e.g., at least 50% must succeed):

```diff
+# Minimum percentage of allocations that must succeed (default: 50%)
+MIN_SUCCESS_PERCENT="${MIN_SUCCESS_PERCENT:-50}"
+
 # If all allocations failed, exit with error
-if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
+SUCCEEDED=$((TOTAL_ALLOCATIONS - FAILED_ALLOCATIONS))
+SUCCESS_PERCENT=$((SUCCEEDED * 100 / TOTAL_ALLOCATIONS))
+
+if [[ $SUCCESS_PERCENT -lt $MIN_SUCCESS_PERCENT ]]; then
+  echo "Error: Only $SUCCEEDED/$TOTAL_ALLOCATIONS allocation(s) succeeded ($SUCCESS_PERCENT% < $MIN_SUCCESS_PERCENT% threshold)." >&2
-  echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
   exit 1
 fi
+
+if [[ $FAILED_ALLOCATIONS -gt 0 ]]; then
+  echo "Warning: $FAILED_ALLOCATIONS/$TOTAL_ALLOCATIONS allocation(s) failed, but success threshold met."
+fi
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nomad/scripts/wait_for_jobs.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: veeso
Repo: holochain/wind-tunnel PR: 245
File: .github/workflows/nomad.yaml:101-110
Timestamp: 2025-08-12T12:34:18.851Z
Learning: When extracting multiple allocation IDs from Nomad job output in wind-tunnel, use `grep -oE` with extended regex instead of `grep -oP` with PCRE for better portability. The pattern should capture full hex IDs including hyphens: `[0-9a-f-]+`. AWK-based solutions may not properly handle multiple allocation IDs across multiple lines.
📚 Learning: 2025-11-20T22:30:58.294Z
Learnt from: pdaoust
Repo: holochain/wind-tunnel PR: 349
File: summary-visualiser/templates/helpers/scenarioSummary.html.tmpl:19-31
Timestamp: 2025-11-20T22:30:58.294Z
Learning: In the wind-tunnel project's summary visualiser templates (summary-visualiser/templates/), peer_count and peer_end_count will never be 0 in practice, so truthiness checks like `{{ if (index .data "peer_count") }}` are acceptable and don't need to be changed to explicit presence checks.
Applied to files:
nomad/scripts/wait_for_jobs.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test (ubuntu-latest)
🔇 Additional comments (2)
nomad/scripts/wait_for_jobs.sh (2)
118-127: LGTM!
Good refactoring to use return codes instead of exiting, enabling proper aggregation of allocation results. Failed allocations are logged with details before returning.
150-164: LGTM!
The allocation counting logic is correct. Using `if ! wait_for_job` properly handles the return code in a conditional context, avoiding issues with `set -e`.
Force-pushed 7de0e33 to 6586d18
Actionable comments posted: 1
🧹 Nitpick comments (1)
nomad/scripts/wait_for_jobs.sh (1)
166-172: Consider logging a summary when some (but not all) allocations failed.
When partial failures occur, the script exits 0 silently. Adding a summary line would improve observability without changing the exit behavior:

```diff
 # If all allocations failed, exit with error
 if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
   echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
   exit 1
 fi
+
+if [[ $FAILED_ALLOCATIONS -gt 0 ]]; then
+  echo "Warning: $FAILED_ALLOCATIONS of $TOTAL_ALLOCATIONS allocation(s) failed. Proceeding to summariser."
+fi

 exit 0
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nomad/scripts/wait_for_jobs.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: veeso
Repo: holochain/wind-tunnel PR: 245
File: .github/workflows/nomad.yaml:101-110
Timestamp: 2025-08-12T12:34:18.851Z
Learning: When extracting multiple allocation IDs from Nomad job output in wind-tunnel, use `grep -oE` with extended regex instead of `grep -oP` with PCRE for better portability. The pattern should capture full hex IDs including hyphens: `[0-9a-f-]+`. AWK-based solutions may not properly handle multiple allocation IDs across multiple lines.
📚 Learning: 2025-11-20T22:30:58.294Z
Learnt from: pdaoust
Repo: holochain/wind-tunnel PR: 349
File: summary-visualiser/templates/helpers/scenarioSummary.html.tmpl:19-31
Timestamp: 2025-11-20T22:30:58.294Z
Learning: In the wind-tunnel project's summary visualiser templates (summary-visualiser/templates/), peer_count and peer_end_count will never be 0 in practice, so truthiness checks like `{{ if (index .data "peer_count") }}` are acceptable and don't need to be changed to explicit presence checks.
Applied to files:
nomad/scripts/wait_for_jobs.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: test (ubuntu-latest)
- GitHub Check: test (macos-latest)
🔇 Additional comments (3)
nomad/scripts/wait_for_jobs.sh (3)
118-126: LGTM!
The refactored completion handling correctly returns 0/1 instead of exiting, enabling proper aggregation of results. Failed allocations are still logged with details before returning.
150-152: LGTM!
Counter initialization is correctly placed after the `shift` that removes the scenario name, so `$#` accurately reflects the allocation count.
155-164: LGTM!
The `if !` pattern correctly handles the non-zero return from `wait_for_job` under `set -e`, and the failure counting logic is sound.
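To make the pattern these comments describe concrete, here is a minimal sketch of the counting loop under `set -e`. The argument handling and parameter order are inferred from the review comments rather than copied from the script, so treat it as illustrative only:

```bash
# Sketch of the aggregation described above (names and parameter order assumed).
scenario_name="$1"
shift   # after this, "$@" holds only allocation IDs, so $# is the total count

TOTAL_ALLOCATIONS=$#
FAILED_ALLOCATIONS=0

for alloc_id in "$@"; do
  # The `if !` form consumes the non-zero return here instead of letting
  # `set -e` terminate the whole script.
  if ! wait_for_job "$scenario_name" "$alloc_id"; then
    FAILED_ALLOCATIONS=$((FAILED_ALLOCATIONS + 1))
  fi
done

# Only fail the run if every allocation failed (as in the quoted exit logic).
if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
  echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
  exit 1
fi
```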
Force-pushed 6586d18 to 5543323
Previously, the wait_for_jobs.sh script would exit with an error code when an allocation reached a "failed" state, causing the CI pipeline to fail. This was problematic because machines can lose connectivity or power during scenario runs, resulting in failed allocations that have still finished running. This change treats "failed" allocations as completed states. The script now logs the failure details but continues successfully, allowing the downstream summariser to catch any real issues. This prevents transient infrastructure problems from unnecessarily failing the entire CI run.
Force-pushed 5543323 to 2a04d3b
✔️ 2a04d3b - Conventional commits check succeeded.
Actionable comments posted: 2
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nomad/scripts/wait_for_jobs.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: veeso
Repo: holochain/wind-tunnel PR: 245
File: .github/workflows/nomad.yaml:101-110
Timestamp: 2025-08-12T12:34:18.851Z
Learning: When extracting multiple allocation IDs from Nomad job output in wind-tunnel, use `grep -oE` with extended regex instead of `grep -oP` with PCRE for better portability. The pattern should capture full hex IDs including hyphens: `[0-9a-f-]+`. AWK-based solutions may not properly handle multiple allocation IDs across multiple lines.
📚 Learning: 2025-11-20T22:30:58.294Z
Learnt from: pdaoust
Repo: holochain/wind-tunnel PR: 349
File: summary-visualiser/templates/helpers/scenarioSummary.html.tmpl:19-31
Timestamp: 2025-11-20T22:30:58.294Z
Learning: In the wind-tunnel project's summary visualiser templates (summary-visualiser/templates/), peer_count and peer_end_count will never be 0 in practice, so truthiness checks like `{{ if (index .data "peer_count") }}` are acceptable and don't need to be changed to explicit presence checks.
Applied to files:
nomad/scripts/wait_for_jobs.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: test (ubuntu-latest)
- GitHub Check: test (macos-latest)
🔇 Additional comments (1)
nomad/scripts/wait_for_jobs.sh (1)
118-126: LGTM! Return values enable proper aggregation.
The change from `exit` to `return` is correct and necessary for the new partial-failure handling. Both success and failure paths properly log allocation status and return appropriate codes (0 for success, 1 for failure), allowing the main loop to aggregate results across all allocations.
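For readers without the file open, here is a hedged reconstruction of the success/failure branches this comment refers to, pieced together from the hunks quoted earlier in the thread; the exact success message and surrounding code are assumptions, not the verbatim script:

```bash
# Inside wait_for_job, after the polling loop has decided the allocation is no
# longer running (sketch only).
if is_run_success "$status"; then
  echo "Scenario $scenario_name ($alloc_id) completed after $ELAPSED_SECS seconds."
  return 0
else
  echo "Scenario $scenario_name ($alloc_id) finished with status=$status after $ELAPSED_SECS seconds."
  print_failed_tasks_and_logs "$alloc_id" "$nomad_status"
  return 1
fi
```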
```diff
 # See https://developer.hashicorp.com/nomad/commands/job/status#running
 function is_running() {
   local status="$1"
   case "$status" in
-    complete|failed|lost|stopped|dead|unknown)
-      return 1 ;; # not running
+    running)
+      return 0 ;; # still running
     *)
-      return 0 ;; # still running (e.g., pending, running)
+      return 1 ;; # not running
   esac
 }
```
The is_running() function should treat "pending" as an active state.
Nomad allocations can have a ClientStatus of "pending", which represents allocations waiting to be scheduled on a client node. The official Nomad documentation shows examples of allocations with DesiredStatus="run" and ClientStatus="pending" as valid intermediate states.
The current implementation only returns 0 for "running", treating "pending" allocations as not running. This causes:
- Loop exits when allocation enters "pending" state
- `is_run_success("pending")` returns `false`
- Allocation marked as failed
HashiCorp's own documentation recommends filtering for allocations with ClientStatus of "running" or "pending", which contradicts the current implementation that only checks "running".
Update the function to handle both states:
```bash
function is_running() {
  local status="$1"
  case "$status" in
    running|pending)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```
🤖 Prompt for AI Agents
In nomad/scripts/wait_for_jobs.sh around lines 44-53, the is_running() helper
currently returns success only for "running", treating "pending" as not active
and causing loops to exit and allocations to be marked failed; update the case
pattern so that both "running" and "pending" return 0 (i.e., treat pending as an
active intermediate state), leaving other statuses to return 1; after this
change, ensure any callers (e.g., is_run_success) rely on is_running() semantics
or are adjusted if they separately check for "pending".
```bash
# If all allocations failed, exit with error
if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
  echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
  exit 1
fi
```
🧩 Analysis chain
🏁 Script executed:

```shell
find . -name "ci_allocs.sh" -o -name "wait_for_jobs.sh" 2>/dev/null | head -20
```

Repository: holochain/wind-tunnel
Length of output: 126

🏁 Script executed:

```shell
cat nomad/scripts/ci_allocs.sh | head -100
```

Repository: holochain/wind-tunnel
Length of output: 3424

🏁 Script executed:

```shell
grep -n "peer_end_count\|peer_count\|ClientStatus\|failed.*alloc" nomad/scripts/ci_allocs.sh -A 3 -B 3
```

Repository: holochain/wind-tunnel
Length of output: 656

🏁 Script executed:

```shell
grep -n "make_allocs_csv\|generate_run_summary" nomad/scripts/ci_allocs.sh -A 10 -B 5
```

Repository: holochain/wind-tunnel
Length of output: 1446

🏁 Script executed:

```shell
grep -n "make_allocs_csv\|alloc_ids" nomad/scripts/wait_for_jobs.sh -A 5 -B 5
```

Repository: holochain/wind-tunnel
Length of output: 369

🏁 Script executed:

```shell
sed -n '146,180p' nomad/scripts/wait_for_jobs.sh
```

Repository: holochain/wind-tunnel
Length of output: 735

🏁 Script executed:

```shell
rg "wait_for_jobs" --type sh -A 5 -B 5 | head -80
```

Repository: holochain/wind-tunnel
Length of output: 47

🏁 Script executed:

```shell
grep -rn "wait_for_jobs\|nomad/scripts/wait_for_jobs.sh" . --include="*.sh" --include="*.nomad" 2>/dev/null | head -20
```

Repository: holochain/wind-tunnel
Length of output: 47

🏁 Script executed:

```shell
find . -name "*.nomad" -o -name "*.hcl" 2>/dev/null | xargs grep -l "wait_for_jobs" 2>/dev/null
```

Repository: holochain/wind-tunnel
Length of output: 49

🏁 Script executed:

```shell
cat nomad/scripts/wait_for_jobs.sh
```

Repository: holochain/wind-tunnel
Length of output: 4779

🏁 Script executed:

```shell
grep -rn "make_allocs_csv\|generate_run_summary" . --include="*.sh" --include="*.nomad" 2>/dev/null | grep -v "Binary"
```

Repository: holochain/wind-tunnel
Length of output: 574

🏁 Script executed:

```shell
sed -n '25,50p' nomad/scripts/ci_allocs.sh
```

Repository: holochain/wind-tunnel
Length of output: 980

🏁 Script executed:

```shell
grep -rn "make_allocs_csv\|alloc_ids" . --include="*.sh" -B 10 -A 2 | grep -A 15 "make_allocs_csv"
```

Repository: holochain/wind-tunnel
Length of output: 2840

🏁 Script executed:

```shell
grep -rn "wait_for_jobs\|make_allocs_csv" . --include="*.sh" --include="*.yml" --include="*.yaml" 2>/dev/null | grep -v "Binary"
```

Repository: holochain/wind-tunnel
Length of output: 772

🏁 Script executed:

```shell
sed -n '155,170p' .github/workflows/nomad.yaml
```

Repository: holochain/wind-tunnel
Length of output: 799

🏁 Script executed:

```shell
sed -n '125,160p' .github/workflows/nomad.yaml
```

Repository: holochain/wind-tunnel
Length of output: 1850
Orchestration does not filter failed allocations; peer_end_count is set statically from config.
The risk scenario in your review is valid. In ci_allocs.sh, the generate_run_summary function sets both peer_count and peer_end_count to the behavior count from the job config (line 105: peer_end_count: ($bs | length)), regardless of allocation completion status.
The allocation IDs passed to make_allocs_csv come directly from Nomad's job run output without status filtering. If an allocation is created by Nomad but fails mid-execution, it remains in the alloc_ids list and gets written to the CSV. Since peer_end_count is derived statically from the configured behaviors array—not from actual work completion—the summarizer cannot detect if work was incomplete.
The script's assumption that the downstream summarizer will detect partial failures via peer_end_count is unfounded: that metric reflects only the configured behavior count, not the actual completion count. If allocations fail before completing their work, the summarizer has no signal to detect this.
🤖 Prompt for AI Agents
In nomad/scripts/wait_for_jobs.sh around lines 166-171, the script exits only on
all-allocations-failed but downstream summarization uses a statically set
peer_end_count from job config; to fix, change the pipeline that builds
alloc_ids and generates the run summary so peer_end_count is derived from actual
completed/healthy allocations rather than the configured behaviors count: query
Nomad for each allocation's status (e.g., running/complete/failed) after job
run, filter alloc_ids to include only allocations that reached a
successful/completed state (or add a status column to the CSV), and set
peer_end_count to the count of those successful allocations (or completed tasks)
before calling generate_run_summary/make_allocs_csv so partial failures are
reflected accurately.
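As one way to picture the fix the prompt describes, here is a hedged sketch that derives the end count from actual allocation states instead of the configured behaviour list. `alloc_ids` is assumed to be the array passed to `make_allocs_csv`, and the `.ClientStatus` extraction mirrors what `wait_for_jobs.sh` already does; this is an illustration, not the project's implementation:

```bash
# Sketch only: count allocations whose ClientStatus reached "complete" and use
# that as peer_end_count instead of the static ($bs | length) from the config.
completed_count=0
for alloc_id in "${alloc_ids[@]}"; do
  client_status="$(nomad alloc status -json "$alloc_id" | jq -r '.ClientStatus')"
  if [[ "$client_status" == "complete" ]]; then
    completed_count=$((completed_count + 1))
  fi
done

# Hypothetical hand-off: feed the measured count into the run summary rather
# than the configured behaviour count.
peer_end_count="$completed_count"
echo "peer_end_count=$peer_end_count (of ${#alloc_ids[@]} allocations)"
```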
Summary
Previously, the wait_for_jobs.sh script would exit with an error code when an allocation reached a "failed" state, causing the CI pipeline to fail. This was problematic because machines can lose connectivity or power during scenario runs, resulting in failed allocations that have still finished running.
This change treats "failed" allocations as completed states. The script now logs the failure details but continues successfully, allowing the downstream summariser to catch any real issues. This prevents transient infrastructure problems from unnecessarily failing the entire CI run.
TODO:
Note that all commits in a PR must follow Conventional Commits before it can be merged, as these are used to generate the changelog