Skip to content

fix: safer gpu usage - dropping SLURM env to avoid conflicting vars, …#46

Merged
cmeesters merged 4 commits into
mainfrom
fix/gpu_stabil
May 27, 2026
Merged

fix: safer gpu usage - dropping SLURM env to avoid conflicting vars, …#46
cmeesters merged 4 commits into
mainfrom
fix/gpu_stabil

Conversation

@cmeesters

@cmeesters cmeesters commented May 22, 2026

Copy link
Copy Markdown
Member
  • dropping SLURM env to avoid conflicting vars
  • explicit node requirement to avoid splitting onto several nodes, if not required.

Summary by CodeRabbit

  • Improvements

    • Logs full environment at initialization for easier debugging.
    • Appends GPU and node placement flags to task launches; defaults to one node when unset.
    • Sanitizes inherited SLURM-related environment variables for nested launches.
  • Bug Fixes

    • Stricter validation of integer resource values with clearer error reporting.
  • Tests

    • New unit tests covering GPU/node flag formatting, defaults, and error cases.

Review Change Stack

…explicit node requirement to avoid splitting onto several nodes, if not required.
@coderabbitai

coderabbitai Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f143c520-3208-4c03-a1b5-a304b67e3e92

📥 Commits

Reviewing files that changed from the base of the PR and between 6e505c0 and eadf5a5.

📒 Files selected for processing (2)
  • snakemake_executor_plugin_slurm_jobstep/__init__.py
  • tests/tests.py
💤 Files with no reviewable changes (2)
  • snakemake_executor_plugin_slurm_jobstep/init.py
  • tests/tests.py

📝 Walkthrough

Walkthrough

Adds strict integer validation for GPU/node resource helpers with tests, appends their flags to SMP srun invocations, logs the process environment at executor init, and sanitizes inherited SLURM-related environment variables before launching nested srun processes.

Changes

SLURM Resource Helpers and Environment Sanitization

Layer / File(s) Summary
Resource helper functions and validation
snakemake_executor_plugin_slurm_jobstep/__init__.py, tests/tests.py
Introduces _coerce_int_resource and reimplements get_gpu_setting and get_node_setting to coerce and validate integer resource values, return explicit CLI flags or empty strings, and raise WorkflowError on invalid values. Unit tests using _DummyJob verify flag generation, defaults, disable behavior, and error cases.
srun command construction with GPU and node options
snakemake_executor_plugin_slurm_jobstep/__init__.py
The SMP (non-MPI, non-first-array) srun command now appends GPU and node SLURM options from the refactored helpers.
Environment logging and sanitization
snakemake_executor_plugin_slurm_jobstep/__init__.py
Executor logs full os.environ in __post_init__. Before subprocess.Popen, a sanitized environment is built by retaining only select SLURM allocation/array variables and removing other SLURM_*, SRUN_*, and SBATCH_* entries; the filtered env is passed to Popen.

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 6.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title refers to GPU safety improvements and environment variable management, which aligns with the main changes: sanitizing SLURM environment variables and adding GPU/node placement logic.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/gpu_stabil

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
snakemake_executor_plugin_slurm_jobstep/__init__.py (1)

320-348: 💤 Low value

Inconsistent handling of zero/negative values between GPU and node settings.

get_gpu_setting silently returns "" for zero or negative values (lines 330-331), while get_node_setting raises WorkflowError for values ≤ 0 (lines 346-347). This asymmetry may be intentional (allowing GPU disable via gpus: 0 while enforcing valid node counts), but consider whether negative GPU values should also raise an error for consistency and clearer user feedback.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@snakemake_executor_plugin_slurm_jobstep/__init__.py` around lines 320 - 348,
The GPU handling should mirror node validation for negative values while still
allowing zero to explicitly disable GPUs: in get_gpu_setting (using
_coerce_int_resource and job.resources lookups for "gpus","gpu","nvidia_gpu")
explicitly treat None or 0 as the empty string (no --gpus flag) but raise
WorkflowError when gpu_value < 0 with a clear message, instead of silently
returning ""; keep positive gpu_value producing f"--gpus={gpu_value}" so
behavior is consistent with get_node_setting for invalid negatives.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@snakemake_executor_plugin_slurm_jobstep/__init__.py`:
- Around line 94-95: The debug logging currently prints the entire environment
via self.logger.debug(f"environment: {os.environ}"), which may leak secrets;
replace this with logic that filters and logs only SLURM-related variables
(e.g., keys starting with "SLURM_") or a small explicit whitelist, and if any
non-whitelisted keys must be shown, redact their values (e.g., replace with
"<REDACTED>"). Update the call that uses os.environ in __init__.py so it
constructs a filtered/sanitized dict (only SLURM_* keys or whitelist) and passes
that to self.logger.debug instead of the full os.environ.

---

Nitpick comments:
In `@snakemake_executor_plugin_slurm_jobstep/__init__.py`:
- Around line 320-348: The GPU handling should mirror node validation for
negative values while still allowing zero to explicitly disable GPUs: in
get_gpu_setting (using _coerce_int_resource and job.resources lookups for
"gpus","gpu","nvidia_gpu") explicitly treat None or 0 as the empty string (no
--gpus flag) but raise WorkflowError when gpu_value < 0 with a clear message,
instead of silently returning ""; keep positive gpu_value producing
f"--gpus={gpu_value}" so behavior is consistent with get_node_setting for
invalid negatives.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1a39fca6-b2d2-47ba-94d1-14553c9386a8

📥 Commits

Reviewing files that changed from the base of the PR and between 435d4aa and 6e505c0.

📒 Files selected for processing (2)
  • snakemake_executor_plugin_slurm_jobstep/__init__.py
  • tests/tests.py

Comment on lines +94 to +95
# print environment variables for debugging purposes
self.logger.debug(f"environment: {os.environ}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Logging full environment may expose sensitive data.

os.environ can contain secrets, API tokens, database credentials, or other sensitive values. Even at debug level, these logs may be captured and stored. Consider logging only SLURM-related variables or sanitizing sensitive keys.

Proposed fix to log only SLURM-related variables
         # print environment variables for debugging purposes
-        self.logger.debug(f"environment: {os.environ}")
+        slurm_env = {k: v for k, v in os.environ.items() if k.startswith("SLURM")}
+        self.logger.debug(f"SLURM environment: {slurm_env}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# print environment variables for debugging purposes
self.logger.debug(f"environment: {os.environ}")
# print environment variables for debugging purposes
slurm_env = {k: v for k, v in os.environ.items() if k.startswith("SLURM")}
self.logger.debug(f"SLURM environment: {slurm_env}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@snakemake_executor_plugin_slurm_jobstep/__init__.py` around lines 94 - 95,
The debug logging currently prints the entire environment via
self.logger.debug(f"environment: {os.environ}"), which may leak secrets; replace
this with logic that filters and logs only SLURM-related variables (e.g., keys
starting with "SLURM_") or a small explicit whitelist, and if any
non-whitelisted keys must be shown, redact their values (e.g., replace with
"<REDACTED>"). Update the call that uses os.environ in __init__.py so it
constructs a filtered/sanitized dict (only SLURM_* keys or whitelist) and passes
that to self.logger.debug instead of the full os.environ.

@cmeesters cmeesters merged commit d0c515c into main May 27, 2026
6 checks passed
@cmeesters cmeesters deleted the fix/gpu_stabil branch May 27, 2026 11:33
cmeesters pushed a commit that referenced this pull request May 27, 2026
🤖 I have created a release *beep* *boop*
---


##
[0.6.1](v0.6.0...v0.6.1)
(2026-05-27)


### Bug Fixes

* safer gpu usage - dropping SLURM env to avoid conflicting vars, …
([#46](#46))
([d0c515c](d0c515c))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant