Commit cc3d21b

cmeesters, coderabbitai[bot], and johanneskoester authored
feat: custom log file behaviour (#159)
This PR should allow users to

- set a custom log directory for SLURM jobs,
- enable auto-deletion of older SLURM log files after 10 days,
- change that behaviour or disable auto-deletion altogether, and
- auto-delete log files of successful SLURM jobs (with the option to disable this default behaviour).

The idea behind this PR is to

- limit the number of SLURM log files kept, as a workflow can easily produce thousands of log files. In the case of a successful job, the information is redundant anyway, hence the proposed auto-deletion of logs of successful jobs.
- let users select a custom directory for SLURM log files; this way, different workflows can point to one prefix. Together with the auto-deletion of older log files, this further limits the number of log files present.

It addresses issues #94 and #123.

## Summary by CodeRabbit

## Release Notes

- **New Features**
  - Enhanced SLURM job log management with a configurable log directory.
  - Added options to control log file retention and automatic cleanup of old logs.
  - Introduced new command-line flags for specifying log management settings.
- **Documentation**
  - Updated documentation with new command-line flags for log management.
  - Improved guidance on managing job logs and their retention policies.
- **Improvements**
  - More flexible control over SLURM job log handling.
  - Better support for cleaning up empty directories after job completion.

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Johannes Köster <[email protected]>
1 parent 531ebc6 commit cc3d21b

File tree

3 files changed: +134 −43 lines changed

docs/further.md

Lines changed: 13 additions & 26 deletions
````diff
@@ -104,22 +104,7 @@ A workflow rule may support several
 [resource specifications](https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources).
 For a SLURM cluster, a mapping between Snakemake and SLURM needs to be performed.
 
-You can use the following specifications:
-
-| SLURM             | Snakemake         | Description                            |
-|-------------------|-------------------|----------------------------------------|
-| `--partition`     | `slurm_partition` | the partition a rule/job is to use     |
-| `--time`          | `runtime`         | the walltime per job in minutes        |
-| `--constraint`    | `constraint`      | may hold features on some clusters     |
-| `--mem`           | `mem`, `mem_mb`   | memory a cluster node must             |
-|                   |                   | provide (`mem`: string with unit), `mem_mb`: i |
-| `--mem-per-cpu`   | `mem_mb_per_cpu`  | memory per reserved CPU                |
-| `--ntasks`        | `tasks`           | number of concurrent tasks / ranks     |
-| `--cpus-per-task` | `cpus_per_task`   | number of cpus per task (in case of SMP, rather use `threads`) |
-| `--nodes`         | `nodes`           | number of nodes                        |
-| `--clusters`      | `clusters`        | comma separated string of clusters     |
-
-Each of these can be part of a rule, e.g.:
+Each of the listed command line flags can be part of a rule, e.g.:
 
 ``` python
 rule:
@@ -158,16 +143,6 @@ set-resources:
     cpus_per_task: 40
 ```
 
-#### Additional Command Line Flags
-
-This plugin defines additional command line flags.
-As always, these can be set on the command line or in a profile.
-
-| Flag | Meaning |
-|------|---------|
-| `--slurm_init_seconds_before_status_checks` | modify time before initial job status check; the default of 40 seconds avoids load on querying slurm databases, but shorter wait times are for example useful during workflow development |
-| `--slurm_requeue` | allows jobs to be resubmitted automatically if they fail or are preempted. See the [section "retries" for details](#retries) |
-
 #### Multicluster Support
 
 For reasons of scheduling multicluster support is provided by the `clusters` flag in resources sections. Note, that you have to write `clusters`, not `cluster`!
@@ -279,6 +254,18 @@ export SNAKEMAKE_PROFILE="$HOME/.config/snakemake"
 
 ==This is ongoing development. Eventually you will be able to annotate different file access patterns.==
 
+### Log Files - Getting Information on Failures
+
+Snakemake, via this SLURM executor, submits itself as a job. This ensures that all features are preserved in the job context. SLURM requires a logfile to be written for _every_ job. This is redundant information and only contains the Snakemake output already printed on the terminal. If a rule is equipped with a `log` directive, SLURM logs only contain Snakemake's output.
+
+This executor will remove SLURM logs of successful jobs immediately when they are finished. You can change this behaviour with the flag `--slurm-keep-successful-logs`. A log file for a failed job will be preserved per default for 10 days. You may change this value using the `--slurm-delete-logfiles-older-than` flag.
+
+The default location of Snakemake log files is relative to the directory where the workflow is started, or relative to the directory indicated with `--directory`. SLURM logs, produced by Snakemake, can be redirected using `--slurm-logdir`. If you want to avoid log files accumulating in different directories, you can store them in your home directory. It is then best to put the parameter in your profile, e.g.:
+
+```YAML
+slurm-logdir: "/home/<username>/.snakemake/slurm_logs"
+```
+
 ### Retries - Or Trying again when a Job failed
 
 Some cluster jobs may fail. In this case Snakemake can be instructed to try another submit before the entire workflow fails, in this example up to 3 times:
````
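The retention policy documented above reduces to a single mtime comparison per log file. For illustration, a standalone sketch of that check (`log_is_expired` is a hypothetical helper, not part of the plugin's API; a cutoff of <= 0 disables deletion, mirroring `--slurm-delete-logfiles-older-than`):

```python
import time
from pathlib import Path


def log_is_expired(logfile: Path, max_age_days: int) -> bool:
    """Return True if the log file is older than the day-based cutoff."""
    if max_age_days <= 0:
        # mirrors --slurm-delete-logfiles-older-than <= 0: keep everything
        return False
    age_seconds = time.time() - logfile.stat().st_mtime
    return age_seconds > max_age_days * 86400  # 86400 seconds per day
```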

snakemake_executor_plugin_slurm/__init__.py

Lines changed: 95 additions & 17 deletions
````diff
@@ -3,9 +3,11 @@
 __email__ = "[email protected]"
 __license__ = "MIT"
 
+import atexit
 import csv
 from io import StringIO
 import os
+from pathlib import Path
 import re
 import shlex
 import subprocess
````
````diff
@@ -26,30 +28,59 @@
 from snakemake_interface_common.exceptions import WorkflowError
 from snakemake_executor_plugin_slurm_jobstep import get_cpus_per_task
 
-from .utils import delete_slurm_environment
+from .utils import delete_slurm_environment, delete_empty_dirs
 
 
 @dataclass
 class ExecutorSettings(ExecutorSettingsBase):
+    logdir: Optional[Path] = field(
+        default=None,
+        metadata={
+            "help": "Per default the SLURM log directory is relative to "
+            "the working directory. "
+            "This flag allows setting an alternative directory.",
+            "env_var": False,
+            "required": False,
+        },
+    )
+    keep_successful_logs: bool = field(
+        default=False,
+        metadata={
+            "help": "Per default SLURM log files will be deleted upon "
+            "successful completion of a job. Whenever a SLURM job fails, "
+            "its log file will be preserved. "
+            "This flag allows keeping all SLURM log files, even those "
+            "of successful jobs.",
+            "env_var": False,
+            "required": False,
+        },
+    )
+    delete_logfiles_older_than: Optional[int] = field(
+        default=10,
+        metadata={
+            "help": "Per default SLURM log files in the SLURM log directory "
+            "of a workflow will be deleted after 10 days. For this, "
+            "best leave the default log directory unaltered. "
+            "Setting this flag allows changing this behaviour. "
+            "If set to <= 0, no old files will be deleted.",
+        },
+    )
     init_seconds_before_status_checks: Optional[int] = field(
         default=40,
         metadata={
-            "help": """
-            Defines the time in seconds before the first status
-            check is performed after job submission.
-            """,
+            "help": "Defines the time in seconds before the first status "
+            "check is performed after job submission.",
             "env_var": False,
             "required": False,
         },
     )
     requeue: bool = field(
         default=False,
         metadata={
-            "help": """
-            Allow requeuing preempted of failed jobs,
-            if no cluster default. Results in `sbatch ... --requeue ...`
-            This flag has no effect, if not set.
-            """,
+            "help": "Allow requeuing of preempted or failed jobs, "
+            "if no cluster default. Results in "
+            "`sbatch ... --requeue ...` "
+            "This flag has no effect, if not set.",
             "env_var": False,
             "required": False,
         },
````
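Each of the new dataclass fields surfaces as a `--slurm-…` command line flag (`--slurm-logdir`, `--slurm-keep-successful-logs`, `--slurm-delete-logfiles-older-than`, as documented above). A minimal sketch of constructing the settings directly, e.g. in a test (all values are made up; in real runs Snakemake populates these fields from the flags):

```python
from pathlib import Path

from snakemake_executor_plugin_slurm import ExecutorSettings

# Hypothetical values for illustration only:
settings = ExecutorSettings(
    logdir=Path("/home/user/.snakemake/slurm_logs"),  # --slurm-logdir
    keep_successful_logs=True,                        # --slurm-keep-successful-logs
    delete_logfiles_older_than=30,                    # --slurm-delete-logfiles-older-than
)
assert settings.delete_logfiles_older_than == 30
```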
````diff
@@ -91,6 +122,32 @@ def __post_init__(self):
         self._fallback_account_arg = None
         self._fallback_partition = None
         self._preemption_warning = False  # no preemption warning has been issued
+        self.slurm_logdir = None
+        atexit.register(self.clean_old_logs)
+
+    def clean_old_logs(self) -> None:
+        """Delete files older than specified age from the SLURM log directory."""
+        # shorthands:
+        age_cutoff = self.workflow.executor_settings.delete_logfiles_older_than
+        keep_all = self.workflow.executor_settings.keep_successful_logs
+        if age_cutoff <= 0 or keep_all:
+            return
+        cutoff_secs = age_cutoff * 86400  # seconds per day
+        current_time = time.time()
+        self.logger.info(f"Cleaning up log files older than {age_cutoff} day(s)")
+        for path in self.slurm_logdir.rglob("*.log"):
+            if path.is_file():
+                try:
+                    file_age = current_time - path.stat().st_mtime
+                    if file_age > cutoff_secs:
+                        path.unlink()
+                except (OSError, FileNotFoundError) as e:
+                    self.logger.warning(f"Could not delete logfile {path}: {e}")
+        # we need a second iteration to remove putatively empty directories
+        try:
+            delete_empty_dirs(self.slurm_logdir)
+        except (OSError, FileNotFoundError) as e:
+            self.logger.warning(
+                f"Could not delete empty directories in {self.slurm_logdir}: {e}"
+            )
 
     def warn_on_jobcontext(self, done=None):
         if not done:
````
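The cleanup hooks into interpreter shutdown via `atexit`, so it runs once when the Snakemake process exits rather than per job. A minimal standalone sketch of that pattern (class and method names are illustrative, not the plugin's API):

```python
import atexit


class Executor:
    def __init__(self):
        # Registering a bound method at construction time means the
        # cleanup fires exactly once, on normal interpreter exit.
        atexit.register(self.clean_old_logs)

    def clean_old_logs(self):
        print("cleaning old logs")


executor = Executor()
# ... submit and monitor jobs ...
# On normal exit, Python prints: "cleaning old logs"
```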
````diff
@@ -123,18 +180,22 @@ def run_job(self, job: JobExecutorInterface):
         except AttributeError:
             wildcard_str = ""
 
-        slurm_logfile = os.path.abspath(
-            f".snakemake/slurm_logs/{group_or_rule}/{wildcard_str}/%j.log"
+        self.slurm_logdir = (
+            Path(self.workflow.executor_settings.logdir)
+            if self.workflow.executor_settings.logdir
+            else Path(".snakemake/slurm_logs").resolve()
         )
-        logdir = os.path.dirname(slurm_logfile)
+
+        self.slurm_logdir.mkdir(parents=True, exist_ok=True)
+        slurm_logfile = self.slurm_logdir / group_or_rule / wildcard_str / "%j.log"
+        slurm_logfile.parent.mkdir(parents=True, exist_ok=True)
         # this behavior has been fixed in slurm 23.02, but there might be plenty of
         # older versions around, hence we should rather be conservative here.
-        assert "%j" not in logdir, (
+        assert "%j" not in str(self.slurm_logdir), (
             "bug: jobid placeholder in parent dir of logfile. This does not work as "
             "we have to create that dir before submission in order to make sbatch "
             "happy. Otherwise we get silent fails without logfiles being created."
         )
-        os.makedirs(logdir, exist_ok=True)
 
         # generic part of a submission string:
         # we use a run_uuid as the job-name, to allow `--name`-based
````
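In other words, the log directory now resolves to the value of `--slurm-logdir` when given, and to an absolute `.snakemake/slurm_logs` under the working directory otherwise. A condensed sketch of that resolution (`resolve_logdir` is a hypothetical helper; the configured path is made up):

```python
from pathlib import Path
from typing import Optional


def resolve_logdir(configured: Optional[Path]) -> Path:
    # Mirrors the conditional in the hunk above: an explicit setting wins,
    # otherwise fall back to the workflow-local default.
    return Path(configured) if configured else Path(".snakemake/slurm_logs").resolve()


print(resolve_logdir(None))                         # <cwd>/.snakemake/slurm_logs
print(resolve_logdir(Path("/scratch/slurm_logs")))  # /scratch/slurm_logs
```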
````diff
@@ -247,7 +308,9 @@ def run_job(self, job: JobExecutorInterface):
         slurm_jobid = out.strip().split(";")[0]
         if not slurm_jobid:
             raise WorkflowError("Failed to retrieve SLURM job ID from sbatch output.")
-        slurm_logfile = slurm_logfile.replace("%j", slurm_jobid)
+        slurm_logfile = slurm_logfile.with_name(
+            slurm_logfile.name.replace("%j", slurm_jobid)
+        )
         self.logger.info(
             f"Job {job.jobid} has been submitted with SLURM jobid {slurm_jobid} "
             f"(log: {slurm_logfile})."
````
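Since `slurm_logfile` is now a `Path` rather than a string, `str.replace` no longer applies to the whole path; the `%j` placeholder lives in the final component, so only the file name is rewritten. A small sketch of that pattern (the path and job id are made up):

```python
from pathlib import Path

slurm_logfile = Path("/home/user/.snakemake/slurm_logs/rule_a/%j.log")
slurm_jobid = "123456"  # hypothetical id parsed from sbatch output

# Path objects are immutable: with_name() builds a new path whose
# final component has the placeholder substituted.
slurm_logfile = slurm_logfile.with_name(slurm_logfile.name.replace("%j", slurm_jobid))
print(slurm_logfile)  # /home/user/.snakemake/slurm_logs/rule_a/123456.log
```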
````diff
@@ -380,6 +443,19 @@ async def check_active_jobs(
                         self.report_job_success(j)
                         any_finished = True
                         active_jobs_seen_by_sacct.remove(j.external_jobid)
+                        if not self.workflow.executor_settings.keep_successful_logs:
+                            self.logger.debug(
+                                "removing log for successful job "
+                                f"with SLURM ID '{j.external_jobid}'"
+                            )
+                            try:
+                                if j.aux["slurm_logfile"].exists():
+                                    j.aux["slurm_logfile"].unlink()
+                            except (OSError, FileNotFoundError) as e:
+                                self.logger.warning(
+                                    "Could not remove log file"
+                                    f" {j.aux['slurm_logfile']}: {e}"
+                                )
                     elif status == "PREEMPTED" and not self._preemption_warning:
                         self._preemption_warning = True
                         self.logger.warning(
````
````diff
@@ -404,7 +480,9 @@ async def check_active_jobs(
                             # with a new sentence
                             f"'{status}'. "
                         )
-                        self.report_job_error(j, msg=msg, aux_logs=[j.aux["slurm_logfile"]])
+                        self.report_job_error(
+                            j, msg=msg, aux_logs=[str(j.aux["slurm_logfile"])]
+                        )
                         active_jobs_seen_by_sacct.remove(j.external_jobid)
                     else:  # still running?
                         yield j
````

snakemake_executor_plugin_slurm/utils.py

Lines changed: 26 additions & 0 deletions
````diff
@@ -1,6 +1,7 @@
 # utility functions for the SLURM executor plugin
 
 import os
+from pathlib import Path
 
 
 def delete_slurm_environment():
@@ -14,3 +15,28 @@ def delete_slurm_environment():
     for var in os.environ:
         if var.startswith("SLURM_"):
             del os.environ[var]
+
+
+def delete_empty_dirs(path: Path) -> None:
+    """
+    Delete all empty directories below (and including) the given path.
+    This is needed to clean up the log directory after a job has
+    successfully finished. A dedicated function is needed because
+    shutil.rmtree() would also delete non-empty directories.
+    """
+    if not path.is_dir():
+        return
+
+    # Process subdirectories first (bottom-up)
+    for child in path.iterdir():
+        if child.is_dir():
+            delete_empty_dirs(child)
+
+    try:
+        # Check if directory is now empty after processing children
+        if not any(path.iterdir()):
+            path.rmdir()
+    except (OSError, FileNotFoundError) as e:
+        # Provide more context in the error message
+        raise OSError(f"Failed to remove empty directory {path}: {e}") from e
````
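A quick way to exercise `delete_empty_dirs` in isolation (a throw-away example assuming the module layout from this commit; the temporary tree and file names are made up):

```python
import tempfile
from pathlib import Path

from snakemake_executor_plugin_slurm.utils import delete_empty_dirs

# Build a nested tree where only one branch contains a file:
root = Path(tempfile.mkdtemp())
(root / "a" / "b").mkdir(parents=True)
(root / "c").mkdir()
(root / "c" / "keep.log").touch()

delete_empty_dirs(root)

# "a/b" and then "a" are removed bottom-up; "c" and the root survive
# because they are not empty.
assert not (root / "a").exists()
assert (root / "c" / "keep.log").exists()
```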
