
feat: added the job resource 'gres' to be considered #173

Open · wants to merge 2 commits into base: main
Conversation

@cmeesters (Member) commented Nov 25, 2024

In light of more and more accelerator applications (AI, base mapping, ...), the fallback to slurm_extra becomes tedious to use. Hence, this PR adds resource support for gres.

Addresses issue #52 (and, to a minor extent, #18 and #104). Supersedes PR #172.
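For illustration, a sketch of what a rule using the new resource could look like (rule, file, and tool names are hypothetical; the exact accepted syntax is defined by the plugin and the cluster's SLURM configuration):

```
# Snakefile fragment: with this PR, a `gres` resource can be used
# instead of the previous slurm_extra workaround.
rule basecall:
    input: "reads.fastq"
    output: "calls.bam"
    resources:
        gres="gpu:1",      # instead of slurm_extra="'--gres=gpu:1'"
        mem_mb=16000,
        runtime=120
    shell: "basecaller {input} {output}"
```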

Summary by CodeRabbit

  • Documentation

    • Enhanced clarity and detail of instructions for using the Snakemake plugin with SLURM.
    • Added new command line flags for job status check timings and automatic job resubmission.
    • Expanded resource specification mapping for SLURM and Snakemake.
    • Introduced dynamic parameterization section with code examples.
  • Bug Fixes

    • Improved error handling and feedback for SLURM job submission failures.
    • Refined job memory specification logic to issue warnings when missing.
  • New Features

    • Added flexible resource allocation for GPU resources during job submission.
    • Enhanced logging for better traceability of job status checks.

coderabbitai bot (Contributor) commented Nov 25, 2024

Walkthrough

The changes in this pull request involve significant updates to the docs/further.md documentation and enhancements to the job submission process in the snakemake_executor_plugin_slurm package. The documentation now includes new command line flags for SLURM, expanded resource specifications, and improved examples for using the Snakemake plugin. In the code, modifications enhance resource allocation flexibility, improve error handling, and refine job status checking, ensuring clearer feedback and better traceability during job submissions.

Changes

| File Path | Change Summary |
| --- | --- |
| docs/further.md | Added command line flags `--slurm_init_seconds_before_status_checks` and `--slurm_requeue`. Expanded resource specifications: `slurm_partition`, `runtime`, `mem_mb`, `tasks`, `cpus_per_task`, `nodes`, `clusters`, `gres`. Emphasized use of the `localrules` directive. Introduced a dynamic parameterization section with code snippets. Improved document structure and readability with tables. |
| snakemake_executor_plugin_slurm/\_\_init\_\_.py | Updated the `run_job` method to append `--gres` for GPU resources and refined memory handling. Enhanced error handling with specific checks for submission failures. Improved the `check_active_jobs` method for better job status tracking and logging. |

Possibly related PRs

  • feat: added requeue option to client #136: The addition of the --requeue option in the ExecutorSettings class is directly related to the new --slurm_requeue command line flag introduced in the main PR, both facilitating job resubmission.
  • docs: requeue #153: This PR documents the new --slurm_requeue command line flag, which is a key feature highlighted in the main PR, enhancing the usability of the SLURM executor plugin.
  • docs: fix headings in further.md #168: The documentation changes in this PR improve the clarity of instructions related to SLURM, which aligns with the main PR's focus on enhancing documentation for SLURM job specifications and command line flags.

Suggested reviewers

  • johanneskoester
  • fgvieira

🐇 In the garden, changes bloom bright,
New flags and resources, a splendid sight.
With clearer paths and jobs that flow,
Snakemake and SLURM, together they grow!
Errors now caught, like a swift little hare,
Hop on to success, with nary a care! 🌼



coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
docs/further.md (2)

121-121: Enhance the GRES resource documentation with more details and examples

While the current description is accurate, it would be helpful to expand it with:

  • Common GRES types (e.g., gpu, mps, fpga)
  • Multiple resource syntax (e.g., gpu:tesla:2)
  • Resource property specifications
  • Links to SLURM's GRES documentation

Consider updating the table entry to something like:

-| `--gres`  | `gres` | generic resource specification, e.g. `gres=gpu:1`, allows for arbitray strings after the `=` |
+| `--gres`  | `gres` | Generic Resource Scheduling specification. Examples: `gres=gpu:1` (1 GPU), `gres=gpu:tesla:2` (2 Tesla GPUs), `gres=mps:100` (100% MPS). See [SLURM GRES Guide](https://slurm.schedmd.com/gres.html) for details. |

Line range hint 123-130: Add a dedicated GRES example for AI/ML workloads

Given the PR's focus on GRES support for AI applications, it would be valuable to add a specific example showing how to request GPU resources.

Consider adding this example after the current generic example:

 rule:
     input: ...
     output: ...
     resources:
         partition=<partition name>
         runtime=<some number>
+
+# Example: GPU-accelerated deep learning training
+rule train_model:
+    input: "data/training_set.h5"
+    output: "models/trained_model.pt"
+    resources:
+        # Request 2 NVIDIA A100 GPUs
+        gres="gpu:a100:2",
+        # Ensure sufficient CPU cores and memory for data preprocessing
+        cpus_per_task=8,
+        mem_mb=32000
snakemake_executor_plugin_slurm/__init__.py (1)

Line range hint 313-315: Consider making status_attempts configurable

The number of status check attempts is hardcoded. Consider making it configurable through the ExecutorSettings class to allow users to adjust based on their cluster's characteristics.

+    status_check_attempts: int = field(
+        default=5,
+        metadata={
+            "help": "Number of attempts to check job status before giving up",
+            "env_var": False,
+            "required": False,
+        },
+    )

-        status_attempts = 5
+        status_attempts = self.workflow.executor_settings.status_check_attempts
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 531ebc6 and b58dc97.

📒 Files selected for processing (2)
  • docs/further.md (1 hunks)
  • snakemake_executor_plugin_slurm/__init__.py (1 hunks)
🔇 Additional comments (2)
snakemake_executor_plugin_slurm/__init__.py (2)

Line range hint 238-251: LGTM: Robust error handling implementation

The error handling for job submission is comprehensive, covering both return code errors and submission failure messages, with clear error reporting.


Line range hint 174-198: LGTM: Comprehensive resource management

The implementation handles various SLURM resources effectively with appropriate warnings and fallback mechanisms. The code properly manages:

  • Runtime constraints
  • Memory allocation (both per-job and per-CPU)
  • Constraint specifications

@cmeesters (Member, Author) commented:
@CodeRabbit no, this PR only adds gres as a possible resource specification. As stated by you, other PRs address other features.

coderabbitai bot (Contributor) commented Nov 25, 2024

@cmeesters: Understood, thank you for the clarification. I acknowledge that this PR only adds gres as a possible resource specification, and other features are addressed in other PRs.


✏️ Learnings added
Learnt from: cmeesters
URL: https://github.com/snakemake/snakemake-executor-plugin-slurm/pull/173

Timestamp: 2024-11-25T21:10:22.150Z
Learning: PR #173 adds `gres` as a possible resource specification in the Snakemake executor plugin for SLURM; other features are addressed in separate PRs.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)

161-170: LGTM! Well-implemented GRES support.

The implementation correctly validates the GRES format and provides clear error messages. The regex pattern accurately matches SLURM's GRES syntax requirements.

Consider adding code comments with common GRES examples to help users:

 if job.resources.get("gres"):
+    # Common GRES examples:
+    # gpu:1         - Request 1 GPU of any type
+    # gpu:tesla:2   - Request 2 Tesla GPUs
+    # mps:100       - Request 100% of MPS (CUDA Multi-Process Service)
     # Validate GRES format (e.g., "gpu:1", "gpu:tesla:2")
     gres = job.resources.gres
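The validation the review praises could look roughly like the following self-contained sketch. The function name and regex are assumptions modeled on SLURM's documented `name[:type][:count]` GRES syntax, not the plugin's actual implementation:

```python
import re

# SLURM GRES strings follow roughly name[:type][:count],
# e.g. "gpu:1", "gpu:tesla:2", or "mps:100".
# This pattern is an approximation of that syntax.
GRES_RE = re.compile(r"^[a-zA-Z0-9_]+(:[a-zA-Z0-9_]+)?(:\d+)?$")


def validate_gres(gres: str) -> str:
    """Return the gres string unchanged if it looks valid, else raise."""
    if not GRES_RE.match(gres):
        raise ValueError(
            f"Invalid GRES specification '{gres}'. Expected format: "
            "'<name>[:<type>][:<count>]', e.g. 'gpu:1' or 'gpu:tesla:2'."
        )
    return gres
```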
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b58dc97 and 4f9035f.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_slurm/__init__.py (1 hunks)

@johanneskoester (Contributor) left a comment

Sounds reasonable. In addition, however, I would suggest also supporting two resources, gpu: int and gpu_model: str, which can be internally translated into a gres parametrization (while an explicit gres specification always takes precedence). This way, we stay consistent with an emerging common practice, which is currently also being implemented for Kubernetes here: snakemake/snakemake-executor-plugin-kubernetes#31
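The suggested translation could be sketched as follows. The function name and resource-lookup interface are hypothetical; the precedence rule (explicit gres wins) follows the comment above:

```python
def build_gres(resources: dict):
    """Translate gpu/gpu_model resources into a SLURM gres string.

    An explicit 'gres' entry always takes precedence. Returns None
    when no GPU-related resources are requested.
    """
    if resources.get("gres"):
        return resources["gres"]
    gpu = resources.get("gpu")
    if not gpu:
        return None
    model = resources.get("gpu_model")
    # SLURM expects gpu[:model]:count, e.g. "gpu:a100:2" or "gpu:2"
    return f"gpu:{model}:{gpu}" if model else f"gpu:{gpu}"
```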

@cmeesters (Member, Author) commented:

@johanneskoester SLURM's generic approach is to write --gres=gpu:<hardwaremodel>:n in this case. However, SLURM allows for two different kinds of plugins to alter this behaviour, which is why I first settled on the slurm_extra string as a catch-all. Meanwhile, more and more admins have settled on the generic approach. Anyone on a deviating cluster will still have to use slurm_extra.

I'm fine with providing two more resources for the sake of consistency. However, we need to keep an eye on ease of use, too.
