Skip to content
Merged
Show file tree
Hide file tree
Changes from 42 commits
Commits
Show all changes
48 commits
Select commit Hold shift + click to select a range
b58dc97
feat: added the job resource 'gres' to be considered
cmeesters Nov 25, 2024
4f9035f
feat: added regex based validator for gres
cmeesters Nov 25, 2024
350577c
feat: supporting selection from gpu:int and gpu_module, too
cmeesters Jan 2, 2025
71eb4fe
feat: providing better gpu model validation and selection
cmeesters Jan 2, 2025
1a5c1b1
fix: bug associated with not reading my own last commit
cmeesters Jan 2, 2025
8d22386
fix: satisfiying the linter
cmeesters Jan 3, 2025
97aca06
fix: resolved merge conflict
cmeesters Jan 8, 2025
00ab13c
feat: using , if gpu resources are selected
cmeesters Jan 8, 2025
cb94eb9
feat: auto distinction between ntasks and ntasks_per_gpu
cmeesters Jan 10, 2025
3f36f86
feat: first half of auto selection between cpus-per-task and cpus-per…
cmeesters Jan 10, 2025
2b50601
fix: missing colon
cmeesters Jan 10, 2025
0e18736
feat: moved gres/gpu support to utils file - else the executor gets u…
cmeesters Jan 13, 2025
8914b0f
fix: added missing WorkflowError import
cmeesters Jan 13, 2025
3bcbe05
fix: removed duplicate check
cmeesters Jan 13, 2025
df62a6f
docs: added docs for the new features
cmeesters Jan 13, 2025
1610f76
Merge branch 'main' into feat/gres
cmeesters Feb 10, 2025
e44bcff
Merge branch 'main' into feat/gres
cmeesters Feb 10, 2025
9f2427c
Update docs/further.md
cmeesters Feb 14, 2025
7ebf5a7
fix: resource is 'gpus' not 'gpu'
cmeesters Feb 18, 2025
2bf4fb1
feat: adding information about model selection, gpu selection and adj…
cmeesters Feb 18, 2025
5db9967
fix: merge conflict
cmeesters Feb 18, 2025
bc87815
Merge branch 'main' into feat/gres
cmeesters Feb 18, 2025
f29641e
fix: solved lingering merge conflict ...
cmeesters Mar 5, 2025
7be34f1
gnarf
cmeesters Mar 5, 2025
0f3bd41
fix: merge conflict
cmeesters Mar 5, 2025
d38a8ef
fix: updated toml to account for the current jobstep executor plugin
cmeesters Mar 5, 2025
78252d7
fix: allowing both 'gpu' and 'gpus' resource strings
cmeesters Mar 6, 2025
bf9da9a
fix: added missing check for non-gpu-jobs
cmeesters Mar 6, 2025
3b9aa2a
Update snakemake_executor_plugin_slurm/utils.py
cmeesters Mar 6, 2025
17b28ea
Update snakemake_executor_plugin_slurm/utils.py
cmeesters Mar 6, 2025
32c8a21
fix: syntax error introduced by using coderabbitai
cmeesters Mar 6, 2025
261f94d
fix: line was too long
cmeesters Mar 6, 2025
69eb078
fix: formatting
cmeesters Mar 6, 2025
c63d34f
feat: added mock tests
cmeesters Mar 6, 2025
f89e7c5
fix: added missing import
cmeesters Mar 6, 2025
bcad610
fix: logic
cmeesters Mar 6, 2025
7110b8e
fix: added missing error case
cmeesters Mar 6, 2025
9a06f8b
fix: logic
cmeesters Mar 6, 2025
89f507d
fix: removed resource 'gpus' - only 'gpu' shall be supported within S…
cmeesters Mar 10, 2025
3aaefa3
fix: test string
cmeesters Mar 10, 2025
67d7805
Update docs/further.md
cmeesters Mar 10, 2025
0d7073b
fix: settled for resource 'gpu' and 'gpu_model', only
cmeesters Mar 11, 2025
7524293
fix: linting done
cmeesters Mar 11, 2025
0bab217
fix: assuming 'gpu' resourse validation is in snakemake
cmeesters Mar 11, 2025
3877405
Update docs/further.md
johanneskoester Mar 11, 2025
09d6f2b
Update snakemake_executor_plugin_slurm/__init__.py
johanneskoester Mar 11, 2025
9d71beb
Update snakemake_executor_plugin_slurm/utils.py
johanneskoester Mar 11, 2025
ae126ca
fix: linting
cmeesters Mar 11, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 30 additions & 0 deletions docs/further.md
Original file line number Diff line number Diff line change
Expand Up @@ -88,6 +88,36 @@ $ snakemake --set-resources calc_pi:mpi="mpiexec" ...

To submit "ordinary" MPI jobs, submitting with `tasks` (the MPI ranks) is sufficient. Alternatively, on some clusters, it might be convenient to just configure `nodes`. Consider using a combination of `tasks` and `cpus_per_task` for hybrid applications (those that use ranks (multiprocessing) and threads). A detailed topology layout can be achieved using the `slurm_extra` parameter (see below) using further flags like `--distribution`.

### GPU Jobs

SLURM allows to specify GPU request with the `--gres` or `--gpus` flags and Snakemake takes a similar approach. Resources can be asked for with

- Using the Snakemake resours `gres`, the syntax is `<string>:<number>` or `<string>:<model>:<number>`, i.e. `gres=gpu:1` or `gres=gpu:a100:2` (assuming GPU model).
- alternatively, the Snakemake resource `gpu` can be used, e.g. by just requesting the number of GPUs like `gpu=2`. This can be combined with the `gpu_model` resource, i.e. `gpu_model=a100` or independently. The combination will result in a flag to `sbatch` like `--gpus=a100:2`. The Snakemake `gpu` resource has to be number.

.. note:: Internally, Snakemake knows the resource `gpu_manufacturer`, too. However, SLURM does not know the distinction between model and manufacturer. Essentially, the preferred way to request an accelerator will depend on your specific cluster setup.
Also, to be consistent within Snakemake, the resource is called `gpu` not `gpus`.

Additionally, `cpus_per_gpu` can be set - Snakemakes `threads` settings will otherwise be used to set `cpus_per_gpu`. If `cpus_per_gpu` is lower or equal to zero, no CPU is requested from SLURM (and cluster defaults will kick in, if any).

A sample workflow profile might look like:

```YAML
set-resources:
gres_request_rule:
gres: "gpu:1"

multi_gpu_rule:
gpu: 2
gpu_model: "a100"
cpus_per_gpu: 4

no_cpu_gpu_rule:
gpu: 1
cpus_per_gpu: 0 # Values <= 0 indicate that NO CPU request string
# will be issued.
```

### Running Jobs locally

Not all Snakemake workflows are adapted for heterogeneous environments, particularly clusters. Users might want to avoid the submission of _all_ rules as cluster jobs. Non-cluster jobs should usually include _short_ jobs, e.g. internet downloads or plotting rules.
Expand Down
2 changes: 1 addition & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ keywords = ["snakemake", "plugin", "executor", "cluster", "slurm"]
python = "^3.11"
snakemake-interface-common = "^1.13.0"
snakemake-interface-executor-plugins = "^9.1.1"
snakemake-executor-plugin-slurm-jobstep = "^0.2.0"
snakemake-executor-plugin-slurm-jobstep = "^0.3.0"
throttler = "^1.2.2"

[tool.poetry.group.dev.dependencies]
Expand Down
21 changes: 16 additions & 5 deletions snakemake_executor_plugin_slurm/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,9 +26,9 @@
JobExecutorInterface,
)
from snakemake_interface_common.exceptions import WorkflowError
from snakemake_executor_plugin_slurm_jobstep import get_cpus_per_task
from snakemake_executor_plugin_slurm_jobstep import get_cpu_setting

from .utils import delete_slurm_environment, delete_empty_dirs
from .utils import delete_slurm_environment, delete_empty_dirs, set_gres_string


@dataclass
Expand Down Expand Up @@ -217,6 +217,8 @@ def run_job(self, job: JobExecutorInterface):
if self.workflow.executor_settings.requeue:
call += " --requeue"

call += set_gres_string(job)

if job.resources.get("clusters"):
call += f" --clusters {job.resources.clusters}"

Expand Down Expand Up @@ -247,7 +249,15 @@ def run_job(self, job: JobExecutorInterface):

# fixes #40 - set ntasks regardless of mpi, because
# SLURM v22.05 will require it for all jobs
call += f" --ntasks={job.resources.get('tasks', 1)}"
gpu_job = (
job.resources.get("gpu")
or job.resources.get("gpus")
or "gpu" in job.resources.get("gres", "")
)
if gpu_job:
call += f" --ntasks-per-gpu={job.resources.get('tasks', 1)}"
else:
call += f" --ntasks={job.resources.get('tasks', 1)}"
# MPI job
if job.resources.get("mpi", False):
if not job.resources.get("tasks_per_node") and not job.resources.get(
Expand All @@ -258,8 +268,9 @@ def run_job(self, job: JobExecutorInterface):
"specified. Assuming 'tasks_per_node=1'."
"Probably not what you want."
)

call += f" --cpus-per-task={get_cpus_per_task(job)}"
# we need to set cpus-per-task OR cpus-per-gpu, the function
# will return a string with the corresponding value
call += f" {get_cpu_setting(job, gpu_job)}"

if job.resources.get("slurm_extra"):
self.check_slurm_extra(job)
Expand Down
60 changes: 60 additions & 0 deletions snakemake_executor_plugin_slurm/utils.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,14 @@
# utility functions for the SLURM executor plugin

import os
import re
from pathlib import Path

from snakemake_interface_executor_plugins.jobs import (
JobExecutorInterface,
)
from snakemake_interface_common.exceptions import WorkflowError


def delete_slurm_environment():
"""
Expand Down Expand Up @@ -40,3 +46,57 @@ def delete_empty_dirs(path: Path) -> None:
except (OSError, FileNotFoundError) as e:
# Provide more context in the error message
raise OSError(f"Failed to remove empty directory {path}: {e}") from e


def set_gres_string(job: JobExecutorInterface) -> str:
"""
Function to set the gres string for the SLURM job
based on the resources requested in the job.
"""
# generic resources (GRES) arguments can be of type
# "string:int" or "string:string:int"
gres_re = re.compile(r"^[a-zA-Z0-9_]+(:[a-zA-Z0-9_]+)?:\d+$")
# gpu model arguments can be of type "string"
gpu_model_re = re.compile(r"^[a-zA-Z0-9_]+$")
# The Snakemake resources can be only be of type "int",
# hence no further regex is needed.

gpu_string = None
if job.resources.get("gpu"):
gpu_string = str(job.resources.get("gpu"))

gpu_model = None
if job.resources.get("gpu_model"):
gpu_model = job.resources.get("gpu_model")

# ensure that gres is not set, if gpu and gpu_model are set
if job.resources.get("gres") and gpu_string:
raise WorkflowError("GRES and GPU are set. Please only set one" " of them.")
elif not job.resources.get("gres") and not gpu_model and not gpu_string:
return ""

if job.resources.get("gres"):
# Validate GRES format (e.g., "gpu:1", "gpu:tesla:2")
gres = job.resources.gres
if not gres_re.match(gres):
raise WorkflowError(
f"Invalid GRES format: {gres}. Expected format: "
"'<name>:<number>' or '<name>:<type>:<number>' "
"(e.g., 'gpu:1' or 'gpu:tesla:2')"
)
return f" --gres={job.resources.gres}"

if gpu_model and gpu_string:
# validate GPU model format
if not gpu_model_re.match(gpu_model):
raise WorkflowError(
f"Invalid GPU model format: {gpu_model}."
" Expected format: '<name>' (e.g., 'tesla')"
)
return f" --gpus={gpu_model}:{gpu_string}"
elif gpu_model and not gpu_string:
raise WorkflowError("GPU model is set, but no GPU number is given")
elif gpu_string:
# we assume here, that the validator ensures that the 'gpu_string'
# is an integer
return f" --gpus={gpu_string}"
101 changes: 101 additions & 0 deletions tests/tests.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,12 @@
from typing import Optional
import snakemake.common.tests
from snakemake_interface_executor_plugins.settings import ExecutorSettingsBase
from unittest.mock import MagicMock
import pytest

from snakemake_executor_plugin_slurm import ExecutorSettings
from snakemake_executor_plugin_slurm.utils import set_gres_string
from snakemake_interface_common.exceptions import WorkflowError


class TestWorkflows(snakemake.common.tests.TestWorkflowsLocalStorageBase):
Expand All @@ -18,3 +22,100 @@ def get_executor_settings(self) -> Optional[ExecutorSettingsBase]:
class TestWorkflowsRequeue(TestWorkflows):
def get_executor_settings(self) -> Optional[ExecutorSettingsBase]:
return ExecutorSettings(requeue=True)


class TestGresString:
"""Test cases for the set_gres_string function."""

@pytest.fixture
def mock_job(self):
"""Create a mock job with configurable resources."""

def _create_job(**resources):
mock_resources = MagicMock()
# Configure get method to return values from resources dict
mock_resources.get.side_effect = lambda key, default=None: resources.get(
key, default
)
# Add direct attribute access for certain resources
for key, value in resources.items():
setattr(mock_resources, key, value)

mock_job = MagicMock()
mock_job.resources = mock_resources
return mock_job

return _create_job

def test_no_gres_or_gpu(self, mock_job):
"""Test with no GPU or GRES resources specified."""
job = mock_job()
assert set_gres_string(job) == ""

def test_valid_gres_simple(self, mock_job):
"""Test with valid GRES format (simple)."""
job = mock_job(gres="gpu:1")
assert set_gres_string(job) == " --gres=gpu:1"

def test_valid_gres_with_model(self, mock_job):
"""Test with valid GRES format including GPU model."""
job = mock_job(gres="gpu:tesla:2")
assert set_gres_string(job) == " --gres=gpu:tesla:2"

def test_invalid_gres_format(self, mock_job):
"""Test with invalid GRES format."""
job = mock_job(gres="gpu")
with pytest.raises(WorkflowError, match="Invalid GRES format"):
set_gres_string(job)

def test_invalid_gres_format_missing_count(self, mock_job):
"""Test with invalid GRES format (missing count)."""
job = mock_job(gres="gpu:tesla:")
with pytest.raises(WorkflowError, match="Invalid GRES format"):
set_gres_string(job)

def test_valid_gpu_number(self, mock_job):
"""Test with valid GPU number."""
job = mock_job(gpu="2")
assert set_gres_string(job) == " --gpus=2"

def test_valid_gpu_with_name(self, mock_job):
"""Test with valid GPU name and number."""
job = mock_job(gpu="tesla:2")
assert set_gres_string(job) == " --gpus=tesla:2"

def test_invalid_gpu_format(self, mock_job):
"""Test with invalid GPU format."""
job = mock_job(gpu="invalid:format:2")
with pytest.raises(WorkflowError, match="Invalid GPU format"):
set_gres_string(job)

def test_gpu_with_model(self, mock_job):
"""Test GPU with model specification."""
job = mock_job(gpu="2", gpu_model="tesla")
assert set_gres_string(job) == " --gpus=tesla:2"

def test_gpu_with_manufacturer(self, mock_job):
"""Test GPU with manufacturer specification."""
job = mock_job(gpu="2", gpu_manufacturer="nvidia")
assert set_gres_string(job) == " --gpus=nvidia:2"

def test_invalid_gpu_model_format(self, mock_job):
"""Test with invalid GPU model format."""
job = mock_job(gpu="2", gpu_model="invalid:model")
with pytest.raises(WorkflowError, match="Invalid GPU model format"):
set_gres_string(job)

def test_gpu_model_without_gpu(self, mock_job):
"""Test GPU model without GPU number."""
job = mock_job(gpu_model="tesla")
with pytest.raises(
WorkflowError, match="GPU model is set, but no GPU number is given"
):
set_gres_string(job)

def test_both_gres_and_gpu_set(self, mock_job):
"""Test error case when both GRES and GPU are specified."""
job = mock_job(gres="gpu:1", gpu="2")
with pytest.raises(WorkflowError, match="GRES and GPU are set"):
set_gres_string(job)
Loading