Description
Software Versions
snakemake 9.1.9
snakemake-executor-plugin-slurm 1.1.0
Describe the bug
When using job groups and checkpoints together, the executor will submit duplicate jobs in different groups if the workflow is resumed after a checkpoint.
Minimal example
In this example, checkpoint A produces a variable number of files. Rule B operates on those files individually, and rule gather_B aggregates them via an input function that resolves the checkpoint and expands the id wildcard, forcing execution of the B jobs.
from pathlib import Path

output_path = Path('output')

checkpoint A:
    output:
        dir = directory(output_path / 'A')
    shell:
        'mkdir -p {output}; for id in {{1..10}}; do touch {output}/file_$id.txt; done'

rule B:
    input:
        Path(rules.A.output.dir) / 'file_{id}.txt'
    output:
        dir = directory(output_path / 'B/{id}/')
    group:
        'group_B'
    resources:
        gpus = 1
    shell:
        'mkdir -p {output}; echo id={wildcards.id} hostname=$(hostname) >> log.txt; sleep 10'

def gather_B_input(wildcards):
    As = checkpoints.A.get(**wildcards).output.dir
    A = Path(As) / 'file_{id}.txt'
    ids = glob_wildcards(A).id
    return sorted(expand(rules.B.output.dir, id=ids))

rule gather_B:
    input:
        gather_B_input
    output:
        output_path / 'gather_B.txt'
    shell:
        'echo {input} > {output}'
I run the workflow with the following profile:
executor: slurm
jobs: 5
resources:
  gpus: 1
set-resource-scopes:
  gpus: local
groups:
  group_B: group_B
group-components:
  group_B: 4
set-resources:
  B:
    slurm_partition: gpus
    runtime: 1h
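For concreteness, the resumed run that triggers the bug looks roughly like this (the profile location and the --workflow-profile flag are assumptions about my local setup):

# profile above saved as profile/config.yaml (hypothetical path)
snakemake --workflow-profile profile A          # first invocation: run the checkpoint
snakemake --workflow-profile profile gather_B   # second invocation: resume past the checkpoint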
With this profile, the workflow should run 3 group jobs with time limits of 4, 4, and 2 hours. (The GPU configuration is a holdover from the original workflow, but I don't think it's relevant to the issue.) Instead, if snakemake gather_B is run after running snakemake A, SLURM reports that 4 jobs are created:
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
647974 gpus 77291256 mdsingl RUNNING 0:15 4:00:00 1 gn125
647975 gpus 77291256 mdsingl RUNNING 0:15 4:00:00 1 gn126
647972 gpus 77291256 mdsingl RUNNING 0:16 2:00:00 1 gn65
647973 gpus 77291256 mdsingl RUNNING 0:16 10:00:00 1 gn66
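For comparison, here is a minimal sketch of the grouping arithmetic behind the expected 4/4/2-hour limits, assuming a group's time limit is the sum of its components' 1-hour runtimes (the local gpus: 1 scope should serialize the jobs within a group):

# Expected grouping: 10 checkpoint outputs split by group-components: 4
ids = list(range(1, 11))  # checkpoint A produces files 1..10
size = 4                  # group-components for group_B
groups = [ids[i:i + size] for i in range(0, len(ids), size)]
for g in groups:
    print(f'ids {g} -> time limit {len(g)}:00:00')
# ids [1, 2, 3, 4] -> time limit 4:00:00
# ids [5, 6, 7, 8] -> time limit 4:00:00
# ids [9, 10] -> time limit 2:00:00

By contrast, the 10:00:00 limit on job 647973 looks consistent with a duplicate group that bundles all 10 B jobs.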
Some of the B jobs are actually run multiple times, as shown by the log.txt that rule B appends to:
id=6 hostname=gn126
id=9 hostname=gn65
id=7 hostname=gn125
id=9 hostname=gn66
id=4 hostname=gn126
id=3 hostname=gn65
id=4 hostname=gn66
id=2 hostname=gn126
id=5 hostname=gn125
id=1 hostname=gn66
id=10 hostname=gn126
id=6 hostname=gn66
id=7 hostname=gn66
id=2 hostname=gn66
id=3 hostname=gn66
id=10 hostname=gn66
id=5 hostname=gn66
id=8 hostname=gn66
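A quick way to tally the duplicates (this assumes the log.txt format produced by rule B above):

cut -d' ' -f1 log.txt | sort | uniq -c | sort -rn

In the run above, most ids appear twice.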
Snakemake also reports more than 100% completion, since the repeated jobs are counted as extra finished steps:
Finished jobid: 11 (Rule: B)
20 of 11 steps (182%) done
[Wed Apr 9 22:19:25 2025]
Finished jobid: 0 (Rule: gather_B)
21 of 11 steps (191%) done
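For reference, the percentages line up with a resumed DAG of 11 steps (10 B jobs plus gather_B; checkpoint A completed in the first invocation) against which the duplicate completions are counted. A minimal sketch of that arithmetic:

planned = 11    # resumed DAG: 10 B jobs + gather_B (A already done)
finished = 21   # completions reported in the log above
print(f'{finished} of {planned} steps ({finished / planned:.0%}) done')
# -> 21 of 11 steps (191%) done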
Interestingly, this issue doesn't occur if the workflow is run start to finish in one command.