
fix: treat failed Nomad allocations as completed #401

Open

ThetaSinner wants to merge 1 commit into main from fix/treat-failed-allocations-as-complete

Conversation

ThetaSinner (Member) commented Dec 12, 2025

Summary

Previously, the wait_for_jobs.sh script would exit with an error code when an allocation reached a "failed" state, causing the CI pipeline to fail. This was problematic because machines can lose connectivity or power during scenario runs, resulting in failed allocations that have still finished running.

This change treats "failed" allocations as completed states. The script now logs the failure details but continues successfully, allowing the downstream summariser to catch any real issues. This prevents transient infrastructure problems from unnecessarily failing the entire CI run.
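
For illustration, here is a minimal sketch of the behaviour described above. It is not the actual script: the real wait_for_job polls nomad alloc status in a loop, whereas this sketch takes the final status as a parameter and stubs out print_failed_tasks_and_logs.

#!/usr/bin/env bash
set -euo pipefail

# Stub standing in for the real helper that prints failed task details and logs.
print_failed_tasks_and_logs() {
    echo "  (failed task details for allocation $1 would be printed here)"
}

# Sketch: log the failure and return 1 instead of exiting, so the caller can
# keep waiting on the remaining allocations.
wait_for_job() {
    local scenario_name="$1" alloc_id="$2" status="$3"
    if [[ "$status" == "complete" ]]; then
        echo "Scenario $scenario_name ($alloc_id) completed."
        return 0
    fi
    echo "Scenario $scenario_name ($alloc_id) finished with status=$status."
    print_failed_tasks_and_logs "$alloc_id"
    echo "Note: Failed allocations are treated as completed. The summariser will catch any issues."
    return 1
}

# Example: a failed allocation is logged but no longer aborts the run.
wait_for_job "my_scenario" "alloc-123" "failed" || echo "Recorded one failed allocation; continuing."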

TODO:

  • All code changes are reflected in docs, including module-level docs
  • I ran the Nomad CI workflow successfully on my branch

Note that all commits in a PR must follow Conventional Commits before it can be merged, as these are used to generate the changelog

Summary by CodeRabbit

  • Bug Fixes
    • Only the "running" status is treated as active; other statuses are considered not running.
    • Per-allocation waits now report clear completion or failure messages and capture failed-task details for troubleshooting.
    • Tracks total vs failed allocations, marks individual failures, and only exits with error when all allocations fail; otherwise proceeds normally.


coderabbitai bot commented Dec 12, 2025

Walkthrough

The nomad/scripts/wait_for_jobs.sh script was updated to treat only "running" as running, make wait_for_job return status codes (0/1) instead of exiting, and aggregate per-allocation results to exit non-zero only if every allocation failed.

Changes

Cohort / File(s): Nomad wait script (nomad/scripts/wait_for_jobs.sh)

Summary:
  • is_running now considers only "running" as running.
  • wait_for_job returns 0 on success and 1 on failure, prints completion/failure messages, and calls print_failed_tasks_and_logs on failure instead of exiting.
  • The main flow counts TOTAL_ALLOCATIONS and FAILED_ALLOCATIONS, marks allocations failed when wait_for_job returns non-zero, and exits non-zero only if all allocations failed.
  • Per-allocation messaging and the overall summary/control flow are updated to reflect aggregated results (a sketch of this flow follows below).
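
A minimal sketch of that aggregation flow, for illustration only: wait_for_job is reduced to a status check and the allocation list is hard-coded, whereas the real script derives both from Nomad.

#!/usr/bin/env bash
set -euo pipefail

TOTAL_ALLOCATIONS=0
FAILED_ALLOCATIONS=0

# Stand-in for the real wait_for_job: succeeds only when the final status is "complete".
wait_for_job() { [[ "$2" == "complete" ]]; }

# Each entry stands in for "<alloc_id> <final status>".
for entry in "alloc-1 complete" "alloc-2 failed" "alloc-3 complete"; do
    read -r alloc_id status <<< "$entry"
    TOTAL_ALLOCATIONS=$((TOTAL_ALLOCATIONS + 1))
    # "if !" keeps a non-zero return from tripping set -e.
    if ! wait_for_job "$alloc_id" "$status"; then
        FAILED_ALLOCATIONS=$((FAILED_ALLOCATIONS + 1))
    fi
done

# Exit non-zero only when every allocation failed.
if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
    echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
    exit 1
fi

echo "$FAILED_ALLOCATIONS of $TOTAL_ALLOCATIONS allocation(s) failed; continuing to the summariser."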

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify callers (CI or other scripts) handle the script's new return codes instead of relying on process exit.
  • Confirm is_running change doesn't break any consumers expecting other states as "running".
  • Validate aggregation/counting logic and final exit condition.
  • Review print_failed_tasks_and_logs output for clarity and side effects.

Suggested reviewers

  • veeso
  • jost-s
  • cdunster

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 33.33%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title clearly and concisely summarizes the main change: treating failed Nomad allocations as completed rather than as errors.
  • Description check (✅ Passed): The description includes a complete summary explaining the problem, solution, and rationale. Both TODO items from the template are present and addressed.


github-actions bot (Contributor) commented Dec 12, 2025

The following will be added to the changelog


[0.5.0-dev.1] - 2025-12-19

Bug Fixes

  • Treat failed Nomad allocations as completed
    • Previously, the wait_for_jobs.sh script would exit with an error code when an allocation reached a "failed" state, causing the CI pipeline to fail. This was problematic because machines can lose connectivity or power during scenario runs, resulting in failed allocations that have still finished running.
    • This change treats "failed" allocations as completed states. The script now logs the failure details but continues successfully, allowing the downstream summariser to catch any real issues. This prevents transient infrastructure problems from unnecessarily failing the entire CI run.

coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nomad/scripts/wait_for_jobs.sh (1)

29-42: get_status() uses $alloc_id without defining/passing it (can break under set -u in some paths).
Today it “works” only because alloc_id happens to exist globally in the main loop, but this is brittle and can produce misleading error messages.

Suggested fix:

-function get_status() {
-    local nomad_status="$1"
+function get_status() {
+    local alloc_id="$1"
+    local nomad_status="$2"
     # Fetch JSON and extract ClientStatus; fail fast if the command errors.
     local client_status
     if ! client_status="$(echo "$nomad_status" | jq -r '.ClientStatus')"; then
         echo "Failed to retrieve status for allocation ID: $alloc_id" >&2
         exit 1
     fi
@@
     echo "$client_status"
 }

And update the call site:

-        status="$(get_status "$nomad_status")"
+        status="$(get_status "$alloc_id" "$nomad_status")"
🧹 Nitpick comments (3)
nomad/scripts/wait_for_jobs.sh (3)

117-125: Treating failed allocations as “completed” matches the goal; tighten the failure path for robustness.
Now that failure is a “successful” exit path, please make sure the logs are consistently useful and don’t accidentally trip set -e. Concretely:

  • Consider prefixing failure log lines with a clear marker (so the summariser can reliably detect them).
  • Consider explicitly handling print_failed_tasks_and_logs errors so a logging failure doesn’t kill the script.

Proposed tweak (keeps semantics but avoids logging causing an exit):

-            print_failed_tasks_and_logs "$alloc_id" "$nomad_status"
+            print_failed_tasks_and_logs "$alloc_id" "$nomad_status" || \
+              echo "Warning: failed to print task failures/logs for $scenario_name ($alloc_id)" >&2

59-87: Avoid global leakage: make failed_tasks local; consider redacting potentially sensitive logs.
Because failures are now treated as completion, this function will run more often; avoid polluting global state (failed_tasks is currently global). Also, nomad alloc logs output can include secrets depending on workloads—worth an explicit acknowledgement/guardrail if this runs in CI logs.

 function print_failed_tasks_and_logs() {
   local alloc_id="$1"
   local status="$2"
@@
-  failed_tasks=$(jq -r '
+  local failed_tasks
+  failed_tasks=$(jq -r '
@@
   echo "Failed task: $failed_tasks"
 }

89-129: Scope ELAPSED_SECS as local to wait_for_job().
Prevents accidental reuse/corruption if this script grows.

-    ELAPSED_SECS=0
+    local ELAPSED_SECS=0
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 234dbbf and 774e61e.

📒 Files selected for processing (1)
  • nomad/scripts/wait_for_jobs.sh (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test (macos-latest)
  • GitHub Check: test (ubuntu-latest)

        else
            echo "Scenario $scenario_name ($alloc_id) finished with status=$status after $ELAPSED_SECS seconds."
            print_failed_tasks_and_logs "$alloc_id" "$nomad_status"
            echo "Note: Failed allocations are treated as completed. The summariser will catch any issues."
Contributor:

The summariser will catch any issues

That statement sounds a bit absolute to me. Are you sure that the summariser would really catch any issues? If a runner dies half-way, for example, I think this may not show up as a summariser failure, would it?

ThetaSinner (Member Author):

The rabbit picked up on this comment and I responded below. It's not that the summariser catches this in a try/catch/retry kind of sense. We do report the number of agents that ran to completion, which is supposed to be tied to these numbers. So if 1/10 succeeded in running to completion then we would be able to see that in the data.

matthme previously approved these changes Dec 15, 2025

matthme (Contributor) left a comment:

It looks okay to me but I have a question regarding whether we really catch all error cases still.

ThetaSinner force-pushed the fix/treat-failed-allocations-as-complete branch 2 times, most recently from 9d64836 to f7e6630 on December 16, 2025 at 13:09.
coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
nomad/scripts/wait_for_jobs.sh (1)

166-171: Consider adding summary logging for partial failures.

The exit logic correctly implements the all-or-nothing failure strategy described in the PR objectives. However, when some (but not all) allocations fail, the script exits successfully with no summary message about the partial failure.

Consider adding a summary before line 172:

+# Log summary of results
+if [[ $FAILED_ALLOCATIONS -gt 0 ]]; then
+    echo "Warning: $FAILED_ALLOCATIONS out of $TOTAL_ALLOCATIONS allocation(s) failed, but continuing as not all failed." >&2
+fi
+
 # If all allocations failed, exit with error
 if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
     echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
     exit 1
 fi

 exit 0
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d64836 and f7e6630.

📒 Files selected for processing (1)
  • nomad/scripts/wait_for_jobs.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: veeso
Repo: holochain/wind-tunnel PR: 245
File: .github/workflows/nomad.yaml:101-110
Timestamp: 2025-08-12T12:34:18.851Z
Learning: When extracting multiple allocation IDs from Nomad job output in wind-tunnel, use `grep -oE` with extended regex instead of `grep -oP` with PCRE for better portability. The pattern should capture full hex IDs including hyphens: `[0-9a-f-]+`. AWK-based solutions may not properly handle multiple allocation IDs across multiple lines.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test (macos-latest)
🔇 Additional comments (2)
nomad/scripts/wait_for_jobs.sh (2)

118-127: LGTM! Correctly changed exit behavior to return codes.

The change from break/exit to return 0/return 1 properly allows the script to continue processing remaining allocations after encountering a failed one, which is the core objective of this PR.


150-161: LGTM! Aggregation logic is correct.

The tracking of total and failed allocations is implemented correctly and integrates well with the per-allocation return codes from wait_for_job.

Comment on lines 44 to 53

 # See https://developer.hashicorp.com/nomad/commands/job/status#running
 function is_running() {
     local status="$1"
     case "$status" in
-      complete|failed|lost|stopped|dead|unknown)
-        return 1 ;;  # not running
-      *)
+      running)
         return 0 ;;  # still running (e.g., pending, running)
+      *)
+        return 1 ;;  # not running
     esac
 }
coderabbitai bot commented Dec 16, 2025

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n nomad/scripts/wait_for_jobs.sh | head -100

Repository: holochain/wind-tunnel

Length of output: 3359


🏁 Script executed:

cat -n nomad/scripts/wait_for_jobs.sh | sed -n '100,180p'

Repository: holochain/wind-tunnel

Length of output: 2786


Fix handling of "pending" status in is_running() function.

The is_running() function on line 48 only checks for "running" status, but Nomad allocations can also be in "pending" state when waiting to be scheduled. When an allocation is pending, the function returns 1, causing the wait loop to exit and treat the allocation as completed. Since pending ≠ complete, it gets marked as failed.

Additionally, the comment on line 49 states "still running (e.g., pending, running)" but the code no longer handles "pending" as running, creating an inconsistency.

Update the function to treat both pending and running as active states:

 function is_running() {
     local status="$1"
     case "$status" in
-      running)
-        return 0 ;;  # still running (e.g., pending, running)
+      pending|running)
+        return 0 ;;  # still running
       *)
         return 1 ;;  # not running
     esac
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

 # See https://developer.hashicorp.com/nomad/commands/job/status#running
 function is_running() {
     local status="$1"
     case "$status" in
-      running)
-        return 0 ;;  # still running (e.g., pending, running)
+      pending|running)
+        return 0 ;;  # still running
       *)
         return 1 ;;  # not running
     esac
 }
🤖 Prompt for AI Agents
In nomad/scripts/wait_for_jobs.sh around lines 44 to 53, the is_running()
function only treats "running" as active while the comment claims "pending"
should also be considered active; update the case pattern to treat both
"pending" and "running" as active states (return 0) and adjust the inline
comment to match, so the wait loop continues when allocations are pending or
running and only exits for non-active statuses.

ThetaSinner (Member Author):

According to the latest documentation, the status returned by the CLI is simpler than that https://developer.hashicorp.com/nomad/commands/job/status

coderabbitai bot:

🧩 Analysis chain

🌐 Web query:

Nomad alloc status ClientStatus values

💡 Result:

Common ClientStatus values seen on Nomad allocations include: pending, running, complete, failed, lost, and unknown. These appear in the API/CLI outputs for allocations (e.g., /v1/allocations, nomad job allocs and operator client-state examples). [1][2]

Sources:

  • Nomad API — Allocations (examples showing ClientStatus = "running" / "pending"). [1]
  • nomad job allocs / operator client-state documentation (CLI examples showing ClientStatus values such as "pending", "running", "complete"). [2]

🌐 Web query:

Nomad alloc status vs job status ClientStatus allocation

💡 Result:

Short answer — they’re different scopes:

  • Job status: a high-level view of the job as submitted to Nomad (CLI values: running or dead; UI shows richer states like Healthy, Recovering, Stopped, Scaled Down). It summarizes how the job’s allocations compare to the desired state. [1][2]

  • Allocation (alloc) ClientStatus: the emergent, per-allocation runtime state reported by the client for a specific allocation (common values: pending, running, complete, failed, lost). Allocation objects also include Desired Status (the scheduler’s goal for that alloc, e.g., run or stop). Use nomad alloc status to inspect an alloc’s ClientStatus and task events. [3][1]

Practical mapping:

  • A job can be “running” while some of its allocations are “pending” (being placed/restarted) or some are “failed” — the job status reflects the overall picture across its allocations, while ClientStatus is the concrete state of one allocation on a client. [2][4]

References:
[1] Nomad alloc status (command reference).
[2] Nomad Job concepts / job status.
[3] Node/alloc Client Status description (Desired vs Client Status).
[4] Job status / allocation examples showing pending vs running.


@ThetaSinner, I appreciate you checking the documentation! However, there's an important distinction here:

The script uses nomad alloc status (line 27) and extracts .ClientStatus (line 34), which refers to allocation status, not job status. These are different:

  • Job status (from nomad job status): High-level view with simpler values like "running" or "dead" [as shown in the link you provided]
  • Allocation ClientStatus (from nomad alloc status): Per-allocation runtime state with values including: pending, running, complete, failed, lost, and unknown [1][2]

According to the Nomad documentation, allocations can be in a pending state when they're waiting to be scheduled or placed on a client node. With the current code only checking for running, any allocation in pending state would be treated as "not running" and potentially marked as failed even though it's still active.

Could you verify in your testing whether you've observed allocations in pending state, or if they transition directly from another state to running? If allocations do pass through pending, the function should handle it to avoid premature failure detection.
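
For reference, the per-allocation status being discussed here can be inspected directly from a shell, using the same nomad alloc status and jq extraction the script relies on; the allocation ID below is a placeholder.

# Illustrative only: the allocation ID is a placeholder. ClientStatus is the
# per-allocation runtime state (pending, running, complete, failed, lost, or
# unknown), as opposed to the coarser job status.
nomad alloc status -json "<alloc-id>" | jq -r '.ClientStatus'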

ThetaSinner force-pushed the fix/treat-failed-allocations-as-complete branch from f7e6630 to 7de0e33 on December 16, 2025 at 22:04.
coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
nomad/scripts/wait_for_jobs.sh (1)

44-53: Comment-code mismatch: "pending" mentioned but not handled.

The comment on line 49 states "(e.g., pending, running)" but only running is matched. If an allocation is in pending state (waiting to be scheduled), is_running() returns 1, causing the wait loop to exit and treat the allocation as failed since it's not "complete".

Either update the case to include pending, or correct the comment if the intent is to treat pending as not-running:

 function is_running() {
     local status="$1"
     case "$status" in
-      running)
-        return 0 ;;  # still running (e.g., pending, running)
+      pending|running)
+        return 0 ;;  # still running
       *)
         return 1 ;;  # not running
     esac
 }
🧹 Nitpick comments (1)
nomad/scripts/wait_for_jobs.sh (1)

166-172: Consider a minimum success threshold.

Currently, the script succeeds if at least one allocation completes successfully. This means a run with 9/10 failed allocations still exits 0, potentially masking infrastructure issues.

Consider adding a configurable threshold (e.g., at least 50% must succeed):

+# Minimum percentage of allocations that must succeed (default: 50%)
+MIN_SUCCESS_PERCENT="${MIN_SUCCESS_PERCENT:-50}"
+
 # If all allocations failed, exit with error
-if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
+SUCCEEDED=$((TOTAL_ALLOCATIONS - FAILED_ALLOCATIONS))
+SUCCESS_PERCENT=$((SUCCEEDED * 100 / TOTAL_ALLOCATIONS))
+
+if [[ $SUCCESS_PERCENT -lt $MIN_SUCCESS_PERCENT ]]; then
+    echo "Error: Only $SUCCEEDED/$TOTAL_ALLOCATIONS allocation(s) succeeded ($SUCCESS_PERCENT% < $MIN_SUCCESS_PERCENT% threshold)." >&2
-    echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
     exit 1
 fi
+
+if [[ $FAILED_ALLOCATIONS -gt 0 ]]; then
+    echo "Warning: $FAILED_ALLOCATIONS/$TOTAL_ALLOCATIONS allocation(s) failed, but success threshold met."
+fi
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f7e6630 and 7de0e33.

📒 Files selected for processing (1)
  • nomad/scripts/wait_for_jobs.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: veeso
Repo: holochain/wind-tunnel PR: 245
File: .github/workflows/nomad.yaml:101-110
Timestamp: 2025-08-12T12:34:18.851Z
Learning: When extracting multiple allocation IDs from Nomad job output in wind-tunnel, use `grep -oE` with extended regex instead of `grep -oP` with PCRE for better portability. The pattern should capture full hex IDs including hyphens: `[0-9a-f-]+`. AWK-based solutions may not properly handle multiple allocation IDs across multiple lines.
📚 Learning: 2025-11-20T22:30:58.294Z
Learnt from: pdaoust
Repo: holochain/wind-tunnel PR: 349
File: summary-visualiser/templates/helpers/scenarioSummary.html.tmpl:19-31
Timestamp: 2025-11-20T22:30:58.294Z
Learning: In the wind-tunnel project's summary visualiser templates (summary-visualiser/templates/), peer_count and peer_end_count will never be 0 in practice, so truthiness checks like `{{ if (index .data "peer_count") }}` are acceptable and don't need to be changed to explicit presence checks.

Applied to files:

  • nomad/scripts/wait_for_jobs.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test (ubuntu-latest)
🔇 Additional comments (2)
nomad/scripts/wait_for_jobs.sh (2)

118-126: LGTM!

Good refactoring to use return codes instead of exiting, enabling proper aggregation of allocation results. Failed allocations are logged with details before returning.


150-164: LGTM!

The allocation counting logic is correct. Using if ! wait_for_job properly handles the return code in a conditional context, avoiding issues with set -e.

ThetaSinner force-pushed the fix/treat-failed-allocations-as-complete branch from 7de0e33 to 6586d18 on December 17, 2025 at 10:38.
coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
nomad/scripts/wait_for_jobs.sh (1)

166-172: Consider logging a summary when some (but not all) allocations failed.

When partial failures occur, the script exits 0 silently. Adding a summary line would improve observability without changing the exit behavior:

 # If all allocations failed, exit with error
 if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
     echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
     exit 1
 fi
+
+if [[ $FAILED_ALLOCATIONS -gt 0 ]]; then
+    echo "Warning: $FAILED_ALLOCATIONS of $TOTAL_ALLOCATIONS allocation(s) failed. Proceeding to summariser."
+fi

 exit 0
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7de0e33 and 6586d18.

📒 Files selected for processing (1)
  • nomad/scripts/wait_for_jobs.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: veeso
Repo: holochain/wind-tunnel PR: 245
File: .github/workflows/nomad.yaml:101-110
Timestamp: 2025-08-12T12:34:18.851Z
Learning: When extracting multiple allocation IDs from Nomad job output in wind-tunnel, use `grep -oE` with extended regex instead of `grep -oP` with PCRE for better portability. The pattern should capture full hex IDs including hyphens: `[0-9a-f-]+`. AWK-based solutions may not properly handle multiple allocation IDs across multiple lines.
📚 Learning: 2025-11-20T22:30:58.294Z
Learnt from: pdaoust
Repo: holochain/wind-tunnel PR: 349
File: summary-visualiser/templates/helpers/scenarioSummary.html.tmpl:19-31
Timestamp: 2025-11-20T22:30:58.294Z
Learning: In the wind-tunnel project's summary visualiser templates (summary-visualiser/templates/), peer_count and peer_end_count will never be 0 in practice, so truthiness checks like `{{ if (index .data "peer_count") }}` are acceptable and don't need to be changed to explicit presence checks.

Applied to files:

  • nomad/scripts/wait_for_jobs.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test (macos-latest)
🔇 Additional comments (3)
nomad/scripts/wait_for_jobs.sh (3)

118-127: LGTM!

The refactored completion handling correctly returns 0/1 instead of exiting, enabling proper aggregation of results. Failed allocations are still logged with details before returning.


150-152: LGTM!

Counter initialization is correctly placed after the shift that removes the scenario name, so $# accurately reflects the allocation count.


155-164: LGTM!

The if ! pattern correctly handles the non-zero return from wait_for_job under set -e, and the failure counting logic is sound.

ThetaSinner force-pushed the fix/treat-failed-allocations-as-complete branch from 6586d18 to 5543323 on December 19, 2025 at 11:49.
Previously, the wait_for_jobs.sh script would exit with an error code
when an allocation reached a "failed" state, causing the CI pipeline to
fail. This was problematic because machines can lose connectivity or
power during scenario runs, resulting in failed allocations that have
still finished running.

This change treats "failed" allocations as completed states. The script
now logs the failure details but continues successfully, allowing the
downstream summariser to catch any real issues. This prevents transient
infrastructure problems from unnecessarily failing the entire CI run.
ThetaSinner force-pushed the fix/treat-failed-allocations-as-complete branch from 5543323 to 2a04d3b on December 19, 2025 at 11:51.
cocogitto-bot commented Dec 19, 2025

✔️ 2a04d3b - Conventional commits check succeeded.

coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6586d18 and 2a04d3b.

📒 Files selected for processing (1)
  • nomad/scripts/wait_for_jobs.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: veeso
Repo: holochain/wind-tunnel PR: 245
File: .github/workflows/nomad.yaml:101-110
Timestamp: 2025-08-12T12:34:18.851Z
Learning: When extracting multiple allocation IDs from Nomad job output in wind-tunnel, use `grep -oE` with extended regex instead of `grep -oP` with PCRE for better portability. The pattern should capture full hex IDs including hyphens: `[0-9a-f-]+`. AWK-based solutions may not properly handle multiple allocation IDs across multiple lines.
📚 Learning: 2025-11-20T22:30:58.294Z
Learnt from: pdaoust
Repo: holochain/wind-tunnel PR: 349
File: summary-visualiser/templates/helpers/scenarioSummary.html.tmpl:19-31
Timestamp: 2025-11-20T22:30:58.294Z
Learning: In the wind-tunnel project's summary visualiser templates (summary-visualiser/templates/), peer_count and peer_end_count will never be 0 in practice, so truthiness checks like `{{ if (index .data "peer_count") }}` are acceptable and don't need to be changed to explicit presence checks.

Applied to files:

  • nomad/scripts/wait_for_jobs.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test (macos-latest)
🔇 Additional comments (1)
nomad/scripts/wait_for_jobs.sh (1)

118-126: LGTM! Return values enable proper aggregation.

The change from exit to return is correct and necessary for the new partial-failure handling. Both success and failure paths properly log allocation status and return appropriate codes (0 for success, 1 for failure), allowing the main loop to aggregate results across all allocations.

Comment on lines +44 to 53

 # See https://developer.hashicorp.com/nomad/commands/job/status#running
 function is_running() {
     local status="$1"
     case "$status" in
-      complete|failed|lost|stopped|dead|unknown)
-        return 1 ;;  # not running
+      running)
+        return 0 ;;  # still running
       *)
-        return 0 ;;  # still running (e.g., pending, running)
+        return 1 ;;  # not running
     esac
 }

⚠️ Potential issue | 🔴 Critical

The is_running() function should treat "pending" as an active state.

Nomad allocations can have a ClientStatus of "pending", which represents allocations waiting to be scheduled on a client node. The official Nomad documentation shows examples of allocations with DesiredStatus="run" and ClientStatus="pending" as valid intermediate states.

The current implementation only returns 0 for "running", treating "pending" allocations as not running. This causes:

  1. Loop exits when allocation enters "pending" state
  2. is_run_success("pending") returns false
  3. Allocation marked as failed

HashiCorp's own documentation recommends filtering for allocations with ClientStatus of "running" or "pending", which contradicts the current implementation that only checks "running".

Update the function to handle both states:

function is_running() {
    local status="$1"
    case "$status" in
      running|pending)
        return 0 ;;
      *)
        return 1 ;;
    esac
}
🤖 Prompt for AI Agents
In nomad/scripts/wait_for_jobs.sh around lines 44-53, the is_running() helper
currently returns success only for "running", treating "pending" as not active
and causing loops to exit and allocations to be marked failed; update the case
pattern so that both "running" and "pending" return 0 (i.e., treat pending as an
active intermediate state), leaving other statuses to return 1; after this
change, ensure any callers (e.g., is_run_success) rely on is_running() semantics
or are adjusted if they separately check for "pending".

Comment on lines +166 to +171

 # If all allocations failed, exit with error
 if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
     echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
     exit 1
 fi

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "ci_allocs.sh" -o -name "wait_for_jobs.sh" 2>/dev/null | head -20

Repository: holochain/wind-tunnel

Length of output: 126


🏁 Script executed:

cat nomad/scripts/ci_allocs.sh | head -100

Repository: holochain/wind-tunnel

Length of output: 3424


🏁 Script executed:

grep -n "peer_end_count\|peer_count\|ClientStatus\|failed.*alloc" nomad/scripts/ci_allocs.sh -A 3 -B 3

Repository: holochain/wind-tunnel

Length of output: 656


🏁 Script executed:

grep -n "make_allocs_csv\|generate_run_summary" nomad/scripts/ci_allocs.sh -A 10 -B 5

Repository: holochain/wind-tunnel

Length of output: 1446


🏁 Script executed:

grep -n "make_allocs_csv\|alloc_ids" nomad/scripts/wait_for_jobs.sh -A 5 -B 5

Repository: holochain/wind-tunnel

Length of output: 369


🏁 Script executed:

sed -n '146,180p' nomad/scripts/wait_for_jobs.sh

Repository: holochain/wind-tunnel

Length of output: 735


🏁 Script executed:

rg "wait_for_jobs" --type sh -A 5 -B 5 | head -80

Repository: holochain/wind-tunnel

Length of output: 47


🏁 Script executed:

grep -rn "wait_for_jobs\|nomad/scripts/wait_for_jobs.sh" . --include="*.sh" --include="*.nomad" 2>/dev/null | head -20

Repository: holochain/wind-tunnel

Length of output: 47


🏁 Script executed:

find . -name "*.nomad" -o -name "*.hcl" 2>/dev/null | xargs grep -l "wait_for_jobs" 2>/dev/null

Repository: holochain/wind-tunnel

Length of output: 49


🏁 Script executed:

cat nomad/scripts/wait_for_jobs.sh

Repository: holochain/wind-tunnel

Length of output: 4779


🏁 Script executed:

grep -rn "make_allocs_csv\|generate_run_summary" . --include="*.sh" --include="*.nomad" 2>/dev/null | grep -v "Binary"

Repository: holochain/wind-tunnel

Length of output: 574


🏁 Script executed:

sed -n '25,50p' nomad/scripts/ci_allocs.sh

Repository: holochain/wind-tunnel

Length of output: 980


🏁 Script executed:

grep -rn "make_allocs_csv\|alloc_ids" . --include="*.sh" -B 10 -A 2 | grep -A 15 "make_allocs_csv"

Repository: holochain/wind-tunnel

Length of output: 2840


🏁 Script executed:

grep -rn "wait_for_jobs\|make_allocs_csv" . --include="*.sh" --include="*.yml" --include="*.yaml" 2>/dev/null | grep -v "Binary"

Repository: holochain/wind-tunnel

Length of output: 772


🏁 Script executed:

sed -n '155,170p' .github/workflows/nomad.yaml

Repository: holochain/wind-tunnel

Length of output: 799


🏁 Script executed:

sed -n '125,160p' .github/workflows/nomad.yaml

Repository: holochain/wind-tunnel

Length of output: 1850


Orchestration does not filter failed allocations; peer_end_count is set statically from config.

The risk scenario in your review is valid. In ci_allocs.sh, the generate_run_summary function sets both peer_count and peer_end_count to the behavior count from the job config (line 105: peer_end_count: ($bs | length)), regardless of allocation completion status.

The allocation IDs passed to make_allocs_csv come directly from Nomad's job run output without status filtering. If an allocation is created by Nomad but fails mid-execution, it remains in the alloc_ids list and gets written to the CSV. Since peer_end_count is derived statically from the configured behaviors array—not from actual work completion—the summarizer cannot detect if work was incomplete.

The script's assumption that the downstream summarizer will detect partial failures via peer_end_count is unfounded: that metric reflects only the configured behavior count, not the actual completion count. If allocations fail before completing their work, the summarizer has no signal to detect this.

🤖 Prompt for AI Agents
In nomad/scripts/wait_for_jobs.sh around lines 166-171, the script exits only on
all-allocations-failed but downstream summarization uses a statically set
peer_end_count from job config; to fix, change the pipeline that builds
alloc_ids and generates the run summary so peer_end_count is derived from actual
completed/healthy allocations rather than the configured behaviors count: query
Nomad for each allocation's status (e.g., running/complete/failed) after job
run, filter alloc_ids to include only allocations that reached a
successful/completed state (or add a status column to the CSV), and set
peer_end_count to the count of those successful allocations (or completed tasks)
before calling generate_run_summary/make_allocs_csv so partial failures are
reflected accurately.
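
A rough sketch of the approach described in the prompt above, assuming the nomad CLI and jq are available; the variable names and the final echo are illustrative, not the project's actual code.

#!/usr/bin/env bash
set -euo pipefail

# Placeholder for the allocation IDs collected when the job was started.
ALLOC_IDS=("<alloc-id-1>" "<alloc-id-2>")

# Count allocations whose ClientStatus actually reached "complete", so a summary
# metric such as peer_end_count could reflect real completions rather than the
# configured behaviour count.
completed=0
for alloc_id in "${ALLOC_IDS[@]}"; do
    client_status="$(nomad alloc status -json "$alloc_id" | jq -r '.ClientStatus')"
    if [[ "$client_status" == "complete" ]]; then
        completed=$((completed + 1))
    fi
done

echo "peer_end_count candidate: $completed of ${#ALLOC_IDS[@]} allocation(s) completed"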
