
fix: treat failed Nomad allocations as completed #401

Open

ThetaSinner wants to merge 1 commit into main from fix/treat-failed-allocations-as-complete

Conversation

ThetaSinner (Member) commented Dec 12, 2025

Summary

Previously, the wait_for_jobs.sh script would exit with an error code when an allocation reached a "failed" state, causing the CI pipeline to fail. This was problematic because machines can lose connectivity or power during scenario runs, resulting in failed allocations that have still finished running.

This change treats "failed" allocations as completed states. The script now logs the failure details but continues successfully, allowing the downstream summariser to catch any real issues. This prevents transient infrastructure problems from unnecessarily failing the entire CI run.
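
For illustration, here is a minimal sketch of the behaviour described above. It is not the actual script: the real wait_for_job polls nomad alloc status in a loop, whereas this sketch takes the final status as a parameter and stubs out print_failed_tasks_and_logs.

#!/usr/bin/env bash
set -euo pipefail

# Stub standing in for the real helper that prints failed task details and logs.
print_failed_tasks_and_logs() {
    echo "  (failed task details for allocation $1 would be printed here)"
}

# Sketch: log the failure and return 1 instead of exiting, so the caller can
# keep waiting on the remaining allocations.
wait_for_job() {
    local scenario_name="$1" alloc_id="$2" status="$3"
    if [[ "$status" == "complete" ]]; then
        echo "Scenario $scenario_name ($alloc_id) completed."
        return 0
    fi
    echo "Scenario $scenario_name ($alloc_id) finished with status=$status."
    print_failed_tasks_and_logs "$alloc_id"
    echo "Note: Failed allocations are treated as completed. The summariser will catch any issues."
    return 1
}

# Example: a failed allocation is logged but no longer aborts the run.
wait_for_job "my_scenario" "alloc-123" "failed" || echo "Recorded one failed allocation; continuing."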

TODO:

  • All code changes are reflected in docs, including module-level docs
  • I ran the Nomad CI workflow successfully on my branch

Note that all commits in a PR must follow Conventional Commits before it can be merged, as these are used to generate the changelog

Summary by CodeRabbit

  • Bug Fixes
    • Only the "running" status is treated as active; other statuses are considered not running.
    • Per-allocation waits now report clear completion or failure messages and capture failed-task details for troubleshooting.
    • Tracks total vs failed allocations, marks individual failures, and only exits with error when all allocations fail; otherwise proceeds normally.


coderabbitai bot commented Dec 12, 2025

Walkthrough

The nomad/scripts/wait_for_jobs.sh script was updated to treat only "running" as running, make wait_for_job return status codes (0/1) instead of exiting, and aggregate per-allocation results to exit non-zero only if every allocation failed.

Changes

Cohort / File(s): Nomad wait script (nomad/scripts/wait_for_jobs.sh)

Summary:
  • is_running now considers only "running" as running.
  • wait_for_job returns 0 on success and 1 on failure, prints completion/failure messages, and calls print_failed_tasks_and_logs on failure instead of exiting.
  • The main flow counts TOTAL_ALLOCATIONS and FAILED_ALLOCATIONS, marks allocations failed when wait_for_job returns non-zero, and exits non-zero only if all allocations failed.
  • Per-allocation messaging and the overall summary/control flow are updated to reflect aggregated results (a sketch of this flow follows below).
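
A minimal sketch of that aggregation flow, for illustration only: wait_for_job is reduced to a status check and the allocation list is hard-coded, whereas the real script derives both from Nomad.

#!/usr/bin/env bash
set -euo pipefail

TOTAL_ALLOCATIONS=0
FAILED_ALLOCATIONS=0

# Stand-in for the real wait_for_job: succeeds only when the final status is "complete".
wait_for_job() { [[ "$2" == "complete" ]]; }

# Each entry stands in for "<alloc_id> <final status>".
for entry in "alloc-1 complete" "alloc-2 failed" "alloc-3 complete"; do
    read -r alloc_id status <<< "$entry"
    TOTAL_ALLOCATIONS=$((TOTAL_ALLOCATIONS + 1))
    # "if !" keeps a non-zero return from tripping set -e.
    if ! wait_for_job "$alloc_id" "$status"; then
        FAILED_ALLOCATIONS=$((FAILED_ALLOCATIONS + 1))
    fi
done

# Exit non-zero only when every allocation failed.
if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
    echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
    exit 1
fi

echo "$FAILED_ALLOCATIONS of $TOTAL_ALLOCATIONS allocation(s) failed; continuing to the summariser."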

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify callers (CI or other scripts) handle the script's new return codes instead of relying on process exit.
  • Confirm is_running change doesn't break any consumers expecting other states as "running".
  • Validate aggregation/counting logic and final exit condition.
  • Review print_failed_tasks_and_logs output for clarity and side effects.

Suggested reviewers

  • veeso
  • jost-s
  • cdunster

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 33.33%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title clearly and concisely summarizes the main change: treating failed Nomad allocations as completed rather than as errors.
  • Description check (✅ Passed): The description includes a complete summary explaining the problem, solution, and rationale. Both TODO items from the template are present and addressed.


github-actions bot (Contributor) commented Dec 12, 2025

The following will be added to the changelog


[0.5.0-dev.1] - 2025-12-19

Bug Fixes

  • Treat failed Nomad allocations as completed
    • Previously, the wait_for_jobs.sh script would exit with an error code when an allocation reached a "failed" state, causing the CI pipeline to fail. This was problematic because machines can lose connectivity or power during scenario runs, resulting in failed allocations that have still finished running.
    • This change treats "failed" allocations as completed states. The script now logs the failure details but continues successfully, allowing the downstream summariser to catch any real issues. This prevents transient infrastructure problems from unnecessarily failing the entire CI run.

coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nomad/scripts/wait_for_jobs.sh (1)

29-42: get_status() uses $alloc_id without defining/passing it (can break under set -u in some paths).
Today it “works” only because alloc_id happens to exist globally in the main loop, but this is brittle and can produce misleading error messages.

Suggested fix:

-function get_status() {
-    local nomad_status="$1"
+function get_status() {
+    local alloc_id="$1"
+    local nomad_status="$2"
     # Fetch JSON and extract ClientStatus; fail fast if the command errors.
     local client_status
     if ! client_status="$(echo "$nomad_status" | jq -r '.ClientStatus')"; then
         echo "Failed to retrieve status for allocation ID: $alloc_id" >&2
         exit 1
     fi
@@
     echo "$client_status"
 }

And update the call site:

-        status="$(get_status "$nomad_status")"
+        status="$(get_status "$alloc_id" "$nomad_status")"
🧹 Nitpick comments (3)
nomad/scripts/wait_for_jobs.sh (3)

117-125: Treating failed allocations as “completed” matches the goal; tighten the failure path for robustness.
Now that failure is a “successful” exit path, please make sure the logs are consistently useful and don’t accidentally trip set -e. Concretely:

  • Consider prefixing failure log lines with a clear marker (so the summariser can reliably detect them).
  • Consider explicitly handling print_failed_tasks_and_logs errors so a logging failure doesn’t kill the script.

Proposed tweak (keeps semantics but avoids logging causing an exit):

-            print_failed_tasks_and_logs "$alloc_id" "$nomad_status"
+            print_failed_tasks_and_logs "$alloc_id" "$nomad_status" || \
+              echo "Warning: failed to print task failures/logs for $scenario_name ($alloc_id)" >&2

59-87: Avoid global leakage: make failed_tasks local; consider redacting potentially sensitive logs.
Because failures are now treated as completion, this function will run more often; avoid polluting global state (failed_tasks is currently global). Also, nomad alloc logs output can include secrets depending on workloads—worth an explicit acknowledgement/guardrail if this runs in CI logs.

 function print_failed_tasks_and_logs() {
   local alloc_id="$1"
   local status="$2"
@@
-  failed_tasks=$(jq -r '
+  local failed_tasks
+  failed_tasks=$(jq -r '
@@
   echo "Failed task: $failed_tasks"
 }

89-129: Scope ELAPSED_SECS as local to wait_for_job().
Prevents accidental reuse/corruption if this script grows.

-    ELAPSED_SECS=0
+    local ELAPSED_SECS=0
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 234dbbf and 774e61e.

📒 Files selected for processing (1)
  • nomad/scripts/wait_for_jobs.sh (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test (macos-latest)
  • GitHub Check: test (ubuntu-latest)

        else
            echo "Scenario $scenario_name ($alloc_id) finished with status=$status after $ELAPSED_SECS seconds."
            print_failed_tasks_and_logs "$alloc_id" "$nomad_status"
            echo "Note: Failed allocations are treated as completed. The summariser will catch any issues."
Contributor:

The summariser will catch any issues

That statement sounds a bit absolute to me. Are you sure that the summariser would really catch any issues? If a runner dies half-way, for example, I think this may not show up as a summariser failure, would it?

ThetaSinner (Member Author):

The rabbit picked up on this comment and I responded below. It's not that the summariser catches this in a try/catch/retry kind of sense. We do report the number of agents that ran to completion, which is supposed to be tied to these numbers. So if 1/10 succeeded in running to completion then we would be able to see that in the data.

matthme previously approved these changes Dec 15, 2025

matthme (Contributor) left a comment:

It looks okay to me but I have a question regarding whether we really catch all error cases still.

ThetaSinner force-pushed the fix/treat-failed-allocations-as-complete branch 2 times, most recently from 9d64836 to f7e6630 on December 16, 2025 at 13:09.
coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
nomad/scripts/wait_for_jobs.sh (1)

166-171: Consider adding summary logging for partial failures.

The exit logic correctly implements the all-or-nothing failure strategy described in the PR objectives. However, when some (but not all) allocations fail, the script exits successfully with no summary message about the partial failure.

Consider adding a summary before line 172:

+# Log summary of results
+if [[ $FAILED_ALLOCATIONS -gt 0 ]]; then
+    echo "Warning: $FAILED_ALLOCATIONS out of $TOTAL_ALLOCATIONS allocation(s) failed, but continuing as not all failed." >&2
+fi
+
 # If all allocations failed, exit with error
 if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
     echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
     exit 1
 fi

 exit 0
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d64836 and f7e6630.

📒 Files selected for processing (1)
  • nomad/scripts/wait_for_jobs.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: veeso
Repo: holochain/wind-tunnel PR: 245
File: .github/workflows/nomad.yaml:101-110
Timestamp: 2025-08-12T12:34:18.851Z
Learning: When extracting multiple allocation IDs from Nomad job output in wind-tunnel, use `grep -oE` with extended regex instead of `grep -oP` with PCRE for better portability. The pattern should capture full hex IDs including hyphens: `[0-9a-f-]+`. AWK-based solutions may not properly handle multiple allocation IDs across multiple lines.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test (macos-latest)
🔇 Additional comments (2)
nomad/scripts/wait_for_jobs.sh (2)

118-127: LGTM! Correctly changed exit behavior to return codes.

The change from break/exit to return 0/return 1 properly allows the script to continue processing remaining allocations after encountering a failed one, which is the core objective of this PR.


150-161: LGTM! Aggregation logic is correct.

The tracking of total and failed allocations is implemented correctly and integrates well with the per-allocation return codes from wait_for_job.

Comment on lines 44 to 53

 # See https://developer.hashicorp.com/nomad/commands/job/status#running
 function is_running() {
     local status="$1"
     case "$status" in
-      complete|failed|lost|stopped|dead|unknown)
-        return 1 ;;  # not running
-      *)
+      running)
         return 0 ;;  # still running (e.g., pending, running)
+      *)
+        return 1 ;;  # not running
     esac
 }
coderabbitai bot commented Dec 16, 2025

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n nomad/scripts/wait_for_jobs.sh | head -100

Repository: holochain/wind-tunnel

Length of output: 3359


🏁 Script executed:

cat -n nomad/scripts/wait_for_jobs.sh | sed -n '100,180p'

Repository: holochain/wind-tunnel

Length of output: 2786


Fix handling of "pending" status in is_running() function.

The is_running() function on line 48 only checks for "running" status, but Nomad allocations can also be in "pending" state when waiting to be scheduled. When an allocation is pending, the function returns 1, causing the wait loop to exit and treat the allocation as completed. Since pending ≠ complete, it gets marked as failed.

Additionally, the comment on line 49 states "still running (e.g., pending, running)" but the code no longer handles "pending" as running, creating an inconsistency.

Update the function to treat both pending and running as active states:

 function is_running() {
     local status="$1"
     case "$status" in
-      running)
-        return 0 ;;  # still running (e.g., pending, running)
+      pending|running)
+        return 0 ;;  # still running
       *)
         return 1 ;;  # not running
     esac
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

 # See https://developer.hashicorp.com/nomad/commands/job/status#running
 function is_running() {
     local status="$1"
     case "$status" in
-      running)
-        return 0 ;;  # still running (e.g., pending, running)
+      pending|running)
+        return 0 ;;  # still running
       *)
         return 1 ;;  # not running
     esac
 }
🤖 Prompt for AI Agents
In nomad/scripts/wait_for_jobs.sh around lines 44 to 53, the is_running()
function only treats "running" as active while the comment claims "pending"
should also be considered active; update the case pattern to treat both
"pending" and "running" as active states (return 0) and adjust the inline
comment to match, so the wait loop continues when allocations are pending or
running and only exits for non-active statuses.

ThetaSinner (Member Author):

According to the latest documentation, the status returned by the CLI is simpler than that https://developer.hashicorp.com/nomad/commands/job/status

coderabbitai bot:

🧩 Analysis chain

🌐 Web query:

Nomad alloc status ClientStatus values

💡 Result:

Common ClientStatus values seen on Nomad allocations include: pending, running, complete, failed, lost, and unknown. These appear in the API/CLI outputs for allocations (e.g., /v1/allocations, nomad job allocs and operator client-state examples). [1][2]

Sources:

  • Nomad API — Allocations (examples showing ClientStatus = "running" / "pending"). [1]
  • nomad job allocs / operator client-state documentation (CLI examples showing ClientStatus values such as "pending", "running", "complete"). [2]

🌐 Web query:

Nomad alloc status vs job status ClientStatus allocation

💡 Result:

Short answer — they’re different scopes:

  • Job status: a high-level view of the job as submitted to Nomad (CLI values: running or dead; UI shows richer states like Healthy, Recovering, Stopped, Scaled Down). It summarizes how the job’s allocations compare to the desired state. [1][2]

  • Allocation (alloc) ClientStatus: the emergent, per-allocation runtime state reported by the client for a specific allocation (common values: pending, running, complete, failed, lost). Allocation objects also include Desired Status (the scheduler’s goal for that alloc, e.g., run or stop). Use nomad alloc status to inspect an alloc’s ClientStatus and task events. [3][1]

Practical mapping:

  • A job can be “running” while some of its allocations are “pending” (being placed/restarted) or some are “failed” — the job status reflects the overall picture across its allocations, while ClientStatus is the concrete state of one allocation on a client. [2][4]

References:
[1] Nomad alloc status (command reference).
[2] Nomad Job concepts / job status.
[3] Node/alloc Client Status description (Desired vs Client Status).
[4] Job status / allocation examples showing pending vs running.


@ThetaSinner, I appreciate you checking the documentation! However, there's an important distinction here:

The script uses nomad alloc status (line 27) and extracts .ClientStatus (line 34), which refers to allocation status, not job status. These are different:

  • Job status (from nomad job status): High-level view with simpler values like "running" or "dead" [as shown in the link you provided]
  • Allocation ClientStatus (from nomad alloc status): Per-allocation runtime state with values including: pending, running, complete, failed, lost, and unknown [1][2]

According to the Nomad documentation, allocations can be in a pending state when they're waiting to be scheduled or placed on a client node. With the current code only checking for running, any allocation in pending state would be treated as "not running" and potentially marked as failed even though it's still active.

Could you verify in your testing whether you've observed allocations in pending state, or if they transition directly from another state to running? If allocations do pass through pending, the function should handle it to avoid premature failure detection.
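
For reference, the per-allocation status being discussed here can be inspected directly from a shell, using the same nomad alloc status and jq extraction the script relies on; the allocation ID below is a placeholder.

# Illustrative only: the allocation ID is a placeholder. ClientStatus is the
# per-allocation runtime state (pending, running, complete, failed, lost, or
# unknown), as opposed to the coarser job status.
nomad alloc status -json "<alloc-id>" | jq -r '.ClientStatus'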

ThetaSinner force-pushed the fix/treat-failed-allocations-as-complete branch from f7e6630 to 7de0e33 on December 16, 2025 at 22:04.
coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
nomad/scripts/wait_for_jobs.sh (1)

44-53: Comment-code mismatch: "pending" mentioned but not handled.

The comment on line 49 states "(e.g., pending, running)" but only running is matched. If an allocation is in pending state (waiting to be scheduled), is_running() returns 1, causing the wait loop to exit and treat the allocation as failed since it's not "complete".

Either update the case to include pending, or correct the comment if the intent is to treat pending as not-running:

 function is_running() {
     local status="$1"
     case "$status" in
-      running)
-        return 0 ;;  # still running (e.g., pending, running)
+      pending|running)
+        return 0 ;;  # still running
       *)
         return 1 ;;  # not running
     esac
 }
🧹 Nitpick comments (1)
nomad/scripts/wait_for_jobs.sh (1)

166-172: Consider a minimum success threshold.

Currently, the script succeeds if at least one allocation completes successfully. This means a run with 9/10 failed allocations still exits 0, potentially masking infrastructure issues.

Consider adding a configurable threshold (e.g., at least 50% must succeed):

+# Minimum percentage of allocations that must succeed (default: 50%)
+MIN_SUCCESS_PERCENT="${MIN_SUCCESS_PERCENT:-50}"
+
 # If all allocations failed, exit with error
-if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
+SUCCEEDED=$((TOTAL_ALLOCATIONS - FAILED_ALLOCATIONS))
+SUCCESS_PERCENT=$((SUCCEEDED * 100 / TOTAL_ALLOCATIONS))
+
+if [[ $SUCCESS_PERCENT -lt $MIN_SUCCESS_PERCENT ]]; then
+    echo "Error: Only $SUCCEEDED/$TOTAL_ALLOCATIONS allocation(s) succeeded ($SUCCESS_PERCENT% < $MIN_SUCCESS_PERCENT% threshold)." >&2
-    echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
     exit 1
 fi
+
+if [[ $FAILED_ALLOCATIONS -gt 0 ]]; then
+    echo "Warning: $FAILED_ALLOCATIONS/$TOTAL_ALLOCATIONS allocation(s) failed, but success threshold met."
+fi
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f7e6630 and 7de0e33.

📒 Files selected for processing (1)
  • nomad/scripts/wait_for_jobs.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: veeso
Repo: holochain/wind-tunnel PR: 245
File: .github/workflows/nomad.yaml:101-110
Timestamp: 2025-08-12T12:34:18.851Z
Learning: When extracting multiple allocation IDs from Nomad job output in wind-tunnel, use `grep -oE` with extended regex instead of `grep -oP` with PCRE for better portability. The pattern should capture full hex IDs including hyphens: `[0-9a-f-]+`. AWK-based solutions may not properly handle multiple allocation IDs across multiple lines.
📚 Learning: 2025-11-20T22:30:58.294Z
Learnt from: pdaoust
Repo: holochain/wind-tunnel PR: 349
File: summary-visualiser/templates/helpers/scenarioSummary.html.tmpl:19-31
Timestamp: 2025-11-20T22:30:58.294Z
Learning: In the wind-tunnel project's summary visualiser templates (summary-visualiser/templates/), peer_count and peer_end_count will never be 0 in practice, so truthiness checks like `{{ if (index .data "peer_count") }}` are acceptable and don't need to be changed to explicit presence checks.

Applied to files:

  • nomad/scripts/wait_for_jobs.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test (ubuntu-latest)
🔇 Additional comments (2)
nomad/scripts/wait_for_jobs.sh (2)

118-126: LGTM!

Good refactoring to use return codes instead of exiting, enabling proper aggregation of allocation results. Failed allocations are logged with details before returning.


150-164: LGTM!

The allocation counting logic is correct. Using if ! wait_for_job properly handles the return code in a conditional context, avoiding issues with set -e.

ThetaSinner force-pushed the fix/treat-failed-allocations-as-complete branch from 7de0e33 to 6586d18 on December 17, 2025 at 10:38.
coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
nomad/scripts/wait_for_jobs.sh (1)

166-172: Consider logging a summary when some (but not all) allocations failed.

When partial failures occur, the script exits 0 silently. Adding a summary line would improve observability without changing the exit behavior:

 # If all allocations failed, exit with error
 if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
     echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
     exit 1
 fi
+
+if [[ $FAILED_ALLOCATIONS -gt 0 ]]; then
+    echo "Warning: $FAILED_ALLOCATIONS of $TOTAL_ALLOCATIONS allocation(s) failed. Proceeding to summariser."
+fi

 exit 0
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7de0e33 and 6586d18.

📒 Files selected for processing (1)
  • nomad/scripts/wait_for_jobs.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: veeso
Repo: holochain/wind-tunnel PR: 245
File: .github/workflows/nomad.yaml:101-110
Timestamp: 2025-08-12T12:34:18.851Z
Learning: When extracting multiple allocation IDs from Nomad job output in wind-tunnel, use `grep -oE` with extended regex instead of `grep -oP` with PCRE for better portability. The pattern should capture full hex IDs including hyphens: `[0-9a-f-]+`. AWK-based solutions may not properly handle multiple allocation IDs across multiple lines.
📚 Learning: 2025-11-20T22:30:58.294Z
Learnt from: pdaoust
Repo: holochain/wind-tunnel PR: 349
File: summary-visualiser/templates/helpers/scenarioSummary.html.tmpl:19-31
Timestamp: 2025-11-20T22:30:58.294Z
Learning: In the wind-tunnel project's summary visualiser templates (summary-visualiser/templates/), peer_count and peer_end_count will never be 0 in practice, so truthiness checks like `{{ if (index .data "peer_count") }}` are acceptable and don't need to be changed to explicit presence checks.

Applied to files:

  • nomad/scripts/wait_for_jobs.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test (macos-latest)
🔇 Additional comments (3)
nomad/scripts/wait_for_jobs.sh (3)

118-127: LGTM!

The refactored completion handling correctly returns 0/1 instead of exiting, enabling proper aggregation of results. Failed allocations are still logged with details before returning.


150-152: LGTM!

Counter initialization is correctly placed after the shift that removes the scenario name, so $# accurately reflects the allocation count.


155-164: LGTM!

The if ! pattern correctly handles the non-zero return from wait_for_job under set -e, and the failure counting logic is sound.

ThetaSinner force-pushed the fix/treat-failed-allocations-as-complete branch from 6586d18 to 5543323 on December 19, 2025 at 11:49.
Previously, the wait_for_jobs.sh script would exit with an error code
when an allocation reached a "failed" state, causing the CI pipeline to
fail. This was problematic because machines can lose connectivity or
power during scenario runs, resulting in failed allocations that have
still finished running.

This change treats "failed" allocations as completed states. The script
now logs the failure details but continues successfully, allowing the
downstream summariser to catch any real issues. This prevents transient
infrastructure problems from unnecessarily failing the entire CI run.
ThetaSinner force-pushed the fix/treat-failed-allocations-as-complete branch from 5543323 to 2a04d3b on December 19, 2025 at 11:51.
cocogitto-bot commented Dec 19, 2025

✔️ 2a04d3b - Conventional commits check succeeded.

coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6586d18 and 2a04d3b.

📒 Files selected for processing (1)
  • nomad/scripts/wait_for_jobs.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: veeso
Repo: holochain/wind-tunnel PR: 245
File: .github/workflows/nomad.yaml:101-110
Timestamp: 2025-08-12T12:34:18.851Z
Learning: When extracting multiple allocation IDs from Nomad job output in wind-tunnel, use `grep -oE` with extended regex instead of `grep -oP` with PCRE for better portability. The pattern should capture full hex IDs including hyphens: `[0-9a-f-]+`. AWK-based solutions may not properly handle multiple allocation IDs across multiple lines.
📚 Learning: 2025-11-20T22:30:58.294Z
Learnt from: pdaoust
Repo: holochain/wind-tunnel PR: 349
File: summary-visualiser/templates/helpers/scenarioSummary.html.tmpl:19-31
Timestamp: 2025-11-20T22:30:58.294Z
Learning: In the wind-tunnel project's summary visualiser templates (summary-visualiser/templates/), peer_count and peer_end_count will never be 0 in practice, so truthiness checks like `{{ if (index .data "peer_count") }}` are acceptable and don't need to be changed to explicit presence checks.

Applied to files:

  • nomad/scripts/wait_for_jobs.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test (macos-latest)
🔇 Additional comments (1)
nomad/scripts/wait_for_jobs.sh (1)

118-126: LGTM! Return values enable proper aggregation.

The change from exit to return is correct and necessary for the new partial-failure handling. Both success and failure paths properly log allocation status and return appropriate codes (0 for success, 1 for failure), allowing the main loop to aggregate results across all allocations.

Comment on lines +44 to 53

 # See https://developer.hashicorp.com/nomad/commands/job/status#running
 function is_running() {
     local status="$1"
     case "$status" in
-      complete|failed|lost|stopped|dead|unknown)
-        return 1 ;;  # not running
+      running)
+        return 0 ;;  # still running
       *)
-        return 0 ;;  # still running (e.g., pending, running)
+        return 1 ;;  # not running
     esac
 }

⚠️ Potential issue | 🔴 Critical

The is_running() function should treat "pending" as an active state.

Nomad allocations can have a ClientStatus of "pending", which represents allocations waiting to be scheduled on a client node. The official Nomad documentation shows examples of allocations with DesiredStatus="run" and ClientStatus="pending" as valid intermediate states.

The current implementation only returns 0 for "running", treating "pending" allocations as not running. This causes:

  1. Loop exits when allocation enters "pending" state
  2. is_run_success("pending") returns false
  3. Allocation marked as failed

HashiCorp's own documentation recommends filtering for allocations with ClientStatus of "running" or "pending", which contradicts the current implementation that only checks "running".

Update the function to handle both states:

function is_running() {
    local status="$1"
    case "$status" in
      running|pending)
        return 0 ;;
      *)
        return 1 ;;
    esac
}
🤖 Prompt for AI Agents
In nomad/scripts/wait_for_jobs.sh around lines 44-53, the is_running() helper
currently returns success only for "running", treating "pending" as not active
and causing loops to exit and allocations to be marked failed; update the case
pattern so that both "running" and "pending" return 0 (i.e., treat pending as an
active intermediate state), leaving other statuses to return 1; after this
change, ensure any callers (e.g., is_run_success) rely on is_running() semantics
or are adjusted if they separately check for "pending".

Comment on lines +166 to +171

 # If all allocations failed, exit with error
 if [[ $FAILED_ALLOCATIONS -eq $TOTAL_ALLOCATIONS ]]; then
     echo "Error: All $TOTAL_ALLOCATIONS allocation(s) failed." >&2
     exit 1
 fi

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "ci_allocs.sh" -o -name "wait_for_jobs.sh" 2>/dev/null | head -20

Repository: holochain/wind-tunnel

Length of output: 126


🏁 Script executed:

cat nomad/scripts/ci_allocs.sh | head -100

Repository: holochain/wind-tunnel

Length of output: 3424


🏁 Script executed:

grep -n "peer_end_count\|peer_count\|ClientStatus\|failed.*alloc" nomad/scripts/ci_allocs.sh -A 3 -B 3

Repository: holochain/wind-tunnel

Length of output: 656


🏁 Script executed:

grep -n "make_allocs_csv\|generate_run_summary" nomad/scripts/ci_allocs.sh -A 10 -B 5

Repository: holochain/wind-tunnel

Length of output: 1446


🏁 Script executed:

grep -n "make_allocs_csv\|alloc_ids" nomad/scripts/wait_for_jobs.sh -A 5 -B 5

Repository: holochain/wind-tunnel

Length of output: 369


🏁 Script executed:

sed -n '146,180p' nomad/scripts/wait_for_jobs.sh

Repository: holochain/wind-tunnel

Length of output: 735


🏁 Script executed:

rg "wait_for_jobs" --type sh -A 5 -B 5 | head -80

Repository: holochain/wind-tunnel

Length of output: 47


🏁 Script executed:

grep -rn "wait_for_jobs\|nomad/scripts/wait_for_jobs.sh" . --include="*.sh" --include="*.nomad" 2>/dev/null | head -20

Repository: holochain/wind-tunnel

Length of output: 47


🏁 Script executed:

find . -name "*.nomad" -o -name "*.hcl" 2>/dev/null | xargs grep -l "wait_for_jobs" 2>/dev/null

Repository: holochain/wind-tunnel

Length of output: 49


🏁 Script executed:

cat nomad/scripts/wait_for_jobs.sh

Repository: holochain/wind-tunnel

Length of output: 4779


🏁 Script executed:

grep -rn "make_allocs_csv\|generate_run_summary" . --include="*.sh" --include="*.nomad" 2>/dev/null | grep -v "Binary"

Repository: holochain/wind-tunnel

Length of output: 574


🏁 Script executed:

sed -n '25,50p' nomad/scripts/ci_allocs.sh

Repository: holochain/wind-tunnel

Length of output: 980


🏁 Script executed:

grep -rn "make_allocs_csv\|alloc_ids" . --include="*.sh" -B 10 -A 2 | grep -A 15 "make_allocs_csv"

Repository: holochain/wind-tunnel

Length of output: 2840


🏁 Script executed:

grep -rn "wait_for_jobs\|make_allocs_csv" . --include="*.sh" --include="*.yml" --include="*.yaml" 2>/dev/null | grep -v "Binary"

Repository: holochain/wind-tunnel

Length of output: 772


🏁 Script executed:

sed -n '155,170p' .github/workflows/nomad.yaml

Repository: holochain/wind-tunnel

Length of output: 799


🏁 Script executed:

sed -n '125,160p' .github/workflows/nomad.yaml

Repository: holochain/wind-tunnel

Length of output: 1850


Orchestration does not filter failed allocations; peer_end_count is set statically from config.

The risk scenario in your review is valid. In ci_allocs.sh, the generate_run_summary function sets both peer_count and peer_end_count to the behavior count from the job config (line 105: peer_end_count: ($bs | length)), regardless of allocation completion status.

The allocation IDs passed to make_allocs_csv come directly from Nomad's job run output without status filtering. If an allocation is created by Nomad but fails mid-execution, it remains in the alloc_ids list and gets written to the CSV. Since peer_end_count is derived statically from the configured behaviors array—not from actual work completion—the summarizer cannot detect if work was incomplete.

The script's assumption that the downstream summarizer will detect partial failures via peer_end_count is unfounded: that metric reflects only the configured behavior count, not the actual completion count. If allocations fail before completing their work, the summarizer has no signal to detect this.

🤖 Prompt for AI Agents
In nomad/scripts/wait_for_jobs.sh around lines 166-171, the script exits only on
all-allocations-failed but downstream summarization uses a statically set
peer_end_count from job config; to fix, change the pipeline that builds
alloc_ids and generates the run summary so peer_end_count is derived from actual
completed/healthy allocations rather than the configured behaviors count: query
Nomad for each allocation's status (e.g., running/complete/failed) after job
run, filter alloc_ids to include only allocations that reached a
successful/completed state (or add a status column to the CSV), and set
peer_end_count to the count of those successful allocations (or completed tasks)
before calling generate_run_summary/make_allocs_csv so partial failures are
reflected accurately.
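
A rough sketch of the approach described in the prompt above, assuming the nomad CLI and jq are available; the variable names and the final echo are illustrative, not the project's actual code.

#!/usr/bin/env bash
set -euo pipefail

# Placeholder for the allocation IDs collected when the job was started.
ALLOC_IDS=("<alloc-id-1>" "<alloc-id-2>")

# Count allocations whose ClientStatus actually reached "complete", so a summary
# metric such as peer_end_count could reflect real completions rather than the
# configured behaviour count.
completed=0
for alloc_id in "${ALLOC_IDS[@]}"; do
    client_status="$(nomad alloc status -json "$alloc_id" | jq -r '.ClientStatus')"
    if [[ "$client_status" == "complete" ]]; then
        completed=$((completed + 1))
    fi
done

echo "peer_end_count candidate: $completed of ${#ALLOC_IDS[@]} allocation(s) completed"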
