-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: added the job resource 'gres' to be considered #173
base: main
Are you sure you want to change the base?
Conversation
WalkthroughThe changes in this pull request involve significant updates to the Changes
Possibly related PRs
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
docs/further.md (2)
121-121
: Enhance the GRES resource documentation with more details and examplesWhile the current description is accurate, it would be helpful to expand it with:
- Common GRES types (e.g., gpu, mps, fpga)
- Multiple resource syntax (e.g.,
gpu:tesla:2
)- Resource property specifications
- Links to SLURM's GRES documentation
Consider updating the table entry to something like:
-| `--gres` | `gres` | generic resource specification, e.g. `gres=gpu:1`, allows for arbitray strings after the `=` | +| `--gres` | `gres` | Generic Resource Scheduling specification. Examples: `gres=gpu:1` (1 GPU), `gres=gpu:tesla:2` (2 Tesla GPUs), `gres=mps:100` (100% MPS). See [SLURM GRES Guide](https://slurm.schedmd.com/gres.html) for details. |
Line range hint
123-130
: Add a dedicated GRES example for AI/ML workloadsGiven the PR's focus on GRES support for AI applications, it would be valuable to add a specific example showing how to request GPU resources.
Consider adding this example after the current generic example:
rule: input: ... output: ... resources: partition=<partition name> runtime=<some number> + +# Example: GPU-accelerated deep learning training +rule train_model: + input: "data/training_set.h5" + output: "models/trained_model.pt" + resources: + # Request 2 NVIDIA A100 GPUs + gres="gpu:a100:2", + # Ensure sufficient CPU cores and memory for data preprocessing + cpus_per_task=8, + mem_mb=32000snakemake_executor_plugin_slurm/__init__.py (1)
Line range hint
313-315
: Consider making status_attempts configurableThe number of status check attempts is hardcoded. Consider making it configurable through the ExecutorSettings class to allow users to adjust based on their cluster's characteristics.
+ status_check_attempts: int = field( + default=5, + metadata={ + "help": "Number of attempts to check job status before giving up", + "env_var": False, + "required": False, + }, + ) - status_attempts = 5 + status_attempts = self.workflow.executor_settings.status_check_attempts
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (2)
docs/further.md
(1 hunks)snakemake_executor_plugin_slurm/__init__.py
(1 hunks)
🔇 Additional comments (2)
snakemake_executor_plugin_slurm/__init__.py (2)
Line range hint 238-251
: LGTM: Robust error handling implementation
The error handling for job submission is comprehensive, covering both return code errors and submission failure messages, with clear error reporting.
Line range hint 174-198
: LGTM: Comprehensive resource management
The implementation handles various SLURM resources effectively with appropriate warnings and fallback mechanisms. The code properly manages:
- Runtime constraints
- Memory allocation (both per-job and per-CPU)
- Constraint specifications
@CodeRabbit no, this PR only adds |
Tip For best results, initiate chat on the files or code changes. @cmeesters: Understood, thank you for the clarification. I acknowledge that this PR only adds ✏️ Learnings added
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)
161-170
: LGTM! Well-implemented GRES support.The implementation correctly validates the GRES format and provides clear error messages. The regex pattern accurately matches SLURM's GRES syntax requirements.
Consider adding code comments with common GRES examples to help users:
if job.resources.get("gres"): + # Common GRES examples: + # gpu:1 - Request 1 GPU of any type + # gpu:tesla:2 - Request 2 Tesla GPUs + # mps:100 - Request 100% of MPS (CUDA Multi-Process Service) # Validate GRES format (e.g., "gpu:1", "gpu:tesla:2") gres = job.resources.gres
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable. In addition, I would however suggest to further support two resources gpu: int
and gpu_model: str
, which can be internally translated into a gres parametrization (while an explicit gres specification takes always precedence). This way, we are consistent with an emerging common practice, which is currently also being implemented for kubernetes here: snakemake/snakemake-executor-plugin-kubernetes#31
@johanneskoester SLURM's generic approach is to write I'm fine with providing two more resources for the sake of consistency. However, we need to keep an eye on ease, too. |
In the light of more and more accelerator applications (AI, base mapping, ...) the fall-back onto
slurm_extra
becomes a bit tedious to use. Hence, the resource support forgres
.Addresses issue #52 (and to a minor extent: #18 and #104). Supersedes PR #172 .
Summary by CodeRabbit
Documentation
Bug Fixes
New Features