
Parsing issue when using gres #324

Open

@sprdmt

Description

Software Versions
snakemake version 9.6.0
snakemake-executor-plugin-slurm version 1.4.0
slurm 25.05.0

Describe the bug
A parsing problem occurs when I try to submit Snakemake rules with gres specified. I tried both gres="shard:1" and gres="gpu:1" (our cluster supports GPU shards). When gres is defined, the job is sent correctly to a GPU node; however, it immediately crashes with the following error message:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Provided resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, cpus_per_task=1, jobs=20, tasks=-1
Select jobs to execute...
Execute 1 jobs...

[Sun Jun 22 18:15:32 2025]
rule test_gpu:
    output: gpu_test/.done
    jobid: 0
    reason: Forced execution
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, slurm_account=***, slurm_user=***, cpus_per_task=1, runtime=1, jobs=20, gres=gpu:1, slurm_partition=gpu, tasks=-1
Shell command: 
        ml conda
        conda activate semibin-2.2.0
        mkdir -p gpu_test
        echo "Running on GPU node " > gpu_test/test_output.txt
        echo "GPU test job completed." >> gpu_test/test_output.txt
        SemiBin2 single_easy_bin -h >> gpu_test/test_output.txt 
        nvidia-smi >> gpu_test/test_output.txt
        touch gpu_test/.done
        
[2025-06-22T18:15:32.749] error: TaskProlog failed status=230
PARSE ERROR: Argument: 496
             Couldn't find match for argument

Usage: 
   dcgmi group --host <IP/FQDN> -l -j 
   dcgmi group --host <IP/FQDN> -c <groupName> --default
        --defaultnvswitches 
   dcgmi group --host <IP/FQDN> -c <groupName> -a <entityId> 
   dcgmi group --host <IP/FQDN> -d <groupId> 
   dcgmi group --host <IP/FQDN> -g <groupId> -i -j 
   dcgmi group --host <IP/FQDN> -g <groupId> -a <entityId> 
   dcgmi group --host <IP/FQDN> -g <groupId> -r <entityId> 


For complete USAGE and HELP type: 
   dcgmi group --help

Minimal example

I use the command:
snakemake --snakefile Snakefile --profile slurm-lisc --jobs 1

My global config.yaml is:
executor: slurm
jobname: "{rulename}.{jobid}.sh"
retries: 0
latency-wait: 60
keep-going: true
use-conda: false
printshellcmds: true

default-resources:
  mem_mb: 1000
  slurm_account: ''
  slurm_user: ''
  cpus_per_task: 1
  runtime: 1h
  jobs: 20

My slurm-profile.sh is:
#!/bin/bash
#SBATCH --job-name=smk-{rule}
#SBATCH --account={resources.slurm_account}
#SBATCH --output=logs/{rule}.{jobid}.out
#SBATCH --error=logs/{rule}.{jobid}.err
#SBATCH --mem={resources.mem_mb}M
#SBATCH --time={resources.runtime}
#SBATCH --mail-user=***
#SBATCH --mail-type=ALL
#SBATCH --no-requeue
#SBATCH --cpus-per-task={threads}

ml conda

{exec_job}

My Snakefile is:
rule test_gpu:
    output:
        "gpu_test/.done"
    threads: 1
    resources:
        mem_mb=1000,
        gres='shard:1',
        runtime=1
    shell:
        """
        ml conda
        conda activate semibin-2.2.0
        mkdir -p gpu_test
        echo "Running on GPU node " > gpu_test/test_output.txt
        echo "GPU test job completed." >> gpu_test/test_output.txt
        SemiBin2 single_easy_bin -h >> gpu_test/test_output.txt
        nvidia-smi >> gpu_test/test_output.txt
        touch {output}
        """

Additional context

If I run the same script without specifying gres, it runs perfectly fine on a CPU node. The problem seems to be in how the gres value is parsed, but I lack the knowledge to solve it myself.
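
To narrow it down, a direct submission outside of Snakemake (assuming interactive access to the gpu partition) should show whether the TaskProlog/dcgmi failure is specific to the plugin or happens for any job requesting this GRES. An untested sketch:

    # request the same GRES directly via Slurm, bypassing snakemake,
    # to check whether the node's TaskProlog fails there as well
    srun --partition=gpu --gres=shard:1 --time=5 nvidia-smi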
Thanks in advance!
