Jobs submitted multiple times in different groups #253

Open
@marcsingleton

Description

Software Versions
snakemake 9.1.9
snakemake-executor-plugin-slurm 1.1.0

Describe the bug
When using job groups and checkpoints together, the executor will submit duplicate jobs in different groups if the workflow is resumed after a checkpoint.

Minimal example
In this example, checkpoint A produces a variable number of files. Rule B operates on those files individually, and rule gather_B forces their execution via an input function that resolves the id wildcard from the checkpoint's output.

from pathlib import Path

output_path = Path('output')

# Checkpoint whose number of output files is only known after it runs
checkpoint A:
    output:
        dir = directory(output_path / 'A')
    shell:
        'mkdir -p {output}; for id in {{1..10}}; do touch {output}/file_$id.txt; done'

# Per-file rule; group_B packs several B jobs into one SLURM job
rule B:
    input:
        Path(rules.A.output.dir) / 'file_{id}.txt'
    output:
        dir = directory(output_path / 'B/{id}/')
    group:
        'group_B'
    resources:
        gpus = 1
    shell:
        'mkdir -p {output}; echo id={wildcards.id} hostname=$(hostname) >> log.txt; sleep 10'

def gather_B_input(wildcards):
    # Wait for checkpoint A, then expand B's output over the ids it produced
    As = checkpoints.A.get(**wildcards).output.dir
    A = Path(As) / 'file_{id}.txt'
    ids = glob_wildcards(A).id
    return sorted(expand(rules.B.output.dir, id=ids))

rule gather_B:
    input:
        gather_B_input
    output:
        output_path / 'gather_B.txt'
    shell:
        'echo {input} > {output}'
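
For reference, once checkpoint A has produced its 10 files, gather_B_input resolves to the 10 B output directories, so gather_B.txt should contain (sorted lexicographically, so 10 sorts before 2):

output/B/1 output/B/10 output/B/2 output/B/3 output/B/4 output/B/5 output/B/6 output/B/7 output/B/8 output/B/9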

With the following profile:

executor: slurm
jobs: 5

# GPUs are a node-local resource in the original workflow
resources:
  gpus: 1

set-resource-scopes:
  gpus: local

# Assign rule B to group_B and pack up to 4 of its jobs per group job
groups:
  group_B: group_B

group-components:
  group_B: 4

set-resources:
  B:
    slurm_partition: gpus
    runtime: 1h
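
To reproduce, the checkpoint is run to completion first and the workflow is then resumed:

snakemake A          # first invocation: runs checkpoint A only
snakemake gather_B   # second invocation: resumes and submits the grouped B jobs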

The workflow should run 3 group jobs with time limits of 4, 4, and 2 hours: the 10 one-hour B jobs are packed 4 per group, giving groups of 4, 4, and 2 jobs. (The GPU configuration is a holdover from the original workflow, but I don't think it's relevant to the issue.) Instead, when snakemake gather_B is run after snakemake A, SLURM reports 4 jobs:

JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
647974    gpus 77291256  mdsingl  RUNNING       0:15   4:00:00      1 gn125
647975    gpus 77291256  mdsingl  RUNNING       0:15   4:00:00      1 gn126
647972    gpus 77291256  mdsingl  RUNNING       0:16   2:00:00      1 gn65
647973    gpus 77291256  mdsingl  RUNNING       0:16  10:00:00      1 gn66

Note also the 10:00:00 time limit on job 647973, which suggests that a group containing all 10 one-hour B jobs was submitted alongside the expected groups. The B jobs really are run multiple times, as shown by log.txt, to which each B job appends a line (eight of the ten ids appear twice):

id=6 hostname=gn126
id=9 hostname=gn65
id=7 hostname=gn125
id=9 hostname=gn66
id=4 hostname=gn126
id=3 hostname=gn65

id=4 hostname=gn66
id=2 hostname=gn126
id=5 hostname=gn125
id=1 hostname=gn66
id=10 hostname=gn126
id=6 hostname=gn66

id=7 hostname=gn66
id=2 hostname=gn66
id=3 hostname=gn66
id=10 hostname=gn66
id=5 hostname=gn66
id=8 hostname=gn66

Snakemake also reports more than 100% completion as it repeats the jobs.

Finished jobid: 11 (Rule: B)
20 of 11 steps (182%) done
[Wed Apr  9 22:19:25 2025]
Finished jobid: 0 (Rule: gather_B)
21 of 11 steps (191%) done

Interestingly, this issue doesn't occur if the workflow is run start to finish in one command.
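
That is, a single invocation from a clean working directory submits only the three expected group jobs:

snakemake gather_B   # fresh run: checkpoint A, the B groups, and gather_B in one go, no duplicates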
