Description
Software Versions
snakemake version 9.6.0
snakemake-executor-plugin-slurm version 1.4.0
slurm 25.05.0
Describe the bug
A parsing problem occurs when I try to submit Snakemake rules with gres specified. I tried both gres="shard:1" and gres="gpu:1" (our cluster supports GPU shards). When gres is defined, the job is correctly sent to a GPU node, but it immediately crashes with this error message:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Provided resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, cpus_per_task=1, jobs=20, tasks=-1
Select jobs to execute...
Execute 1 jobs...
[Sun Jun 22 18:15:32 2025]
rule test_gpu:
output: gpu_test/.done
jobid: 0
reason: Forced execution
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, slurm_account=***, slurm_user=***, cpus_per_task=1, runtime=1, jobs=20, gres=gpu:1, slurm_partition=gpu, tasks=-1
Shell command:
ml conda
conda activate semibin-2.2.0
mkdir -p gpu_test
echo "Running on GPU node " > gpu_test/test_output.txt
echo "GPU test job completed." >> gpu_test/test_output.txt
SemiBin2 single_easy_bin -h >> gpu_test/test_output.txt
nvidia-smi >> gpu_test/test_output.txt
touch gpu_test/.done
[2025-06-22T18:15:32.749] error: TaskProlog failed status=230
PARSE ERROR: Argument: 496
Couldn't find match for argument
Usage:
dcgmi group --host <IP/FQDN> -l -j
dcgmi group --host <IP/FQDN> -c <groupName> --default
--defaultnvswitches
dcgmi group --host <IP/FQDN> -c <groupName> -a <entityId>
dcgmi group --host <IP/FQDN> -d <groupId>
dcgmi group --host <IP/FQDN> -g <groupId> -i -j
dcgmi group --host <IP/FQDN> -g <groupId> -a <entityId>
dcgmi group --host <IP/FQDN> -g <groupId> -r <entityId>
For complete USAGE and HELP type:
dcgmi group --help
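For reference (this is my assumption about what should happen, based on sbatch semantics rather than on the plugin's source), I would expect the gres resource to be forwarded to the submission call as a plain --gres flag, along the lines of:

sbatch --partition=gpu --gres=shard:1 ... <jobscript>

so I don't understand where the stray argument 496 that dcgmi complains about comes from.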
Minimal example
I use the command:
snakemake --snakefile Snakefile --profile slurm-lisc --jobs 1
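When debugging, adding --verbose (a standard Snakemake flag) makes Snakemake print debug output, which should include the exact sbatch invocation the executor plugin constructs and thus show how the gres value is translated:

snakemake --snakefile Snakefile --profile slurm-lisc --jobs 1 --verbose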
My global config.yaml is:
executor: slurm
jobname: "{rulename}.{jobid}.sh"
retries: 0
latency-wait: 60
keep-going: true
use-conda: false
printshellcmds: true
default-resources:
  mem_mb: 1000
  slurm_account: ''
  slurm_user: ''
  cpus_per_task: 1
  runtime: 1h
  jobs: 20
My slurm-profile.sh is:
#!/bin/bash
#SBATCH --job-name=smk-{rule}
#SBATCH --account={resources.slurm_account}
#SBATCH --output=logs/{rule}.{jobid}.out
#SBATCH --error=logs/{rule}.{jobid}.err
#SBATCH --mem={resources.mem_mb}M
#SBATCH --time={resources.runtime}
#SBATCH --mail-user=***
#SBATCH --mail-type=ALL
#SBATCH --no-requeue
#SBATCH --cpus-per-task={threads}
ml conda
{exec_job}
My Snakefile is:
rule test_gpu:
    output:
        "gpu_test/.done"
    threads: 1
    resources:
        mem_mb=1000,
        gres='shard:1',
        runtime=1
    shell:
        """
        ml conda
        conda activate semibin-2.2.0
        mkdir -p gpu_test
        echo "Running on GPU node " > gpu_test/test_output.txt
        echo "GPU test job completed." >> gpu_test/test_output.txt
        SemiBin2 single_easy_bin -h >> gpu_test/test_output.txt
        nvidia-smi >> gpu_test/test_output.txt
        touch {output}
        """
Additional context
If I run the same script but do not specify gres, it runs perfectly fine on a CPU node. The problem seems to be with parsing the gres value, but I lack the knowledge to solve it myself.
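A possible workaround (untested on my side; I'm assuming the plugin's documented slurm_extra resource, which forwards arbitrary flags to sbatch, applies here) would be to bypass the gres parsing entirely and pass the flag verbatim:

resources:
    mem_mb=1000,
    runtime=1,
    # hypothetical workaround: forward --gres directly to sbatch via slurm_extra
    slurm_extra="'--gres=shard:1'"

But I'd still like to understand why the gres resource itself fails.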
Thanks in advance!