
'fs' gets a literal place in the path string #19

Open
cmeesters opened this issue Jun 18, 2024 · 13 comments

@cmeesters
Member

Hi,

After a long while, I tested again and apparently the previous fix stopped working. That does not make any sense, so probably I am doing something wrong.

With

$ cat ~/.config/snakemake/config.yaml 
executor: slurm
latency-wait: 5
default-storage-provider: fs
shared-fs-usage:
  - persistence
  - sources
  - source-cache
remote-job-local-storage-prefix: /localscratch/$SLURM_JOB_ID
local-storage-prefix: /dev/shm/$USER

and a workflow like:

localrules: produce_input

rule all:
    input: "results/a.out"

rule produce_input:
    output: temp("foo.txt")
    shell: "touch {output}"

rule stagein_test:
    input: rules.produce_input.output
    output: "results/a.out"
    shell: """cp {input} {output};
        echo $(realpath {input}) >> {output}
    """

I get:

Error in rule stagein_test:
    message: SLURM-job '15703262' failed, SLURM status is: 'FAILED'. For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 1
    input: foo.txt (retrieve from storage)
    output: results/a.out (send to storage)
    log: /gpfs/fs1/home/meesters/projects/hpc-jgu-lifescience/snakemake-workflows/test-workflow/.snakemake/slurm_logs/rule_stagein_test/15703262.log (check log file(s) for error details)
    shell:
        cp /dev/shm/meesters/fs/foo.txt /dev/shm/meesters/fs/results/a.out;
       echo $(realpath /dev/shm/meesters/fs/foo.txt) >> /dev/shm/meesters/fs/results/a.out
     
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: 15703262

My assumption is that the local-storage-prefix is applied locally within the SLURM job, as the CPU executor is oblivious to the fact that it is running inside a job.

I also observe that the fs from default-storage-provider ends up as a literal component of the attempted path.

The intended behaviour would be that the input is copied to the remote-job-local-storage-prefix and any output is copied back to the actual (relative) path(s).
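
For illustration, this is my mental model of how the staged path is composed (a rough sketch, not the plugin's actual code): the local-storage-prefix is expanded and the provider name plus the file's relative path are appended as subdirectories, which matches the paths in the shell command above.

import os
import os.path

# Sketch of the assumed composition: <local-storage-prefix>/<provider>/<relative path>.
# staged_path is a hypothetical helper, not part of the plugin.
def staged_path(local_storage_prefix: str, provider: str, relpath: str) -> str:
    prefix = os.path.expandvars(local_storage_prefix)  # e.g. /dev/shm/$USER -> /dev/shm/meesters
    return os.path.join(prefix, provider, relpath)

print(staged_path("/dev/shm/$USER", "fs", "foo.txt"))
# -> /dev/shm/meesters/fs/foo.txt, as in the error above (with USER=meesters)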

@cmeesters
Member Author

My assumption appears to be wrong: the remote flag is assigned to the actual Snakemake process running on the compute node.

@johanneskoester
Contributor

Hmm, I am unsure what is wrong there. Maybe we should have a call to sort that out, perhaps after our call this Friday.

@johanneskoester
Contributor

One important thing to see would be the log of the slurm job that fails.

@johanneskoester
Contributor

The shell command in the error is likely misleading, because it is formatted with the local representations of the input and output files (as if the job were not running under SLURM). We should at least add a disclaimer to the shell command that is printed in the error case.

@cmeesters
Member Author

cmeesters commented Jul 3, 2024

Ah, yes, of course.

As to the log file:

WorkflowError:
Failed to create local storage prefix /localscratch/fs
PermissionError: [Errno 13] Permission denied: '/localscratch/fs'
  File "/gpfs/fs1/home/meesters/projects/hpc-jgu-lifescience/snakemake-interface-storage-plugins/snakemake_interface_storage_plugins/storage_provider.py", line 67, in __init__

This is the SLURM log. Of course, the path /localscratch/fs does not exist; it ought to be /localscratch/15703262, with my particular job ID at the time.
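
To illustrate (a quick sketch, not the plugin's actual code, and assuming the provider name is appended as a subdirectory): if the prefix were expanded inside the job, where SLURM_JOB_ID is set, the path would come out right; the observed /localscratch/fs looks as if the variable was empty at the time the prefix was built.

import os
import os.path

prefix = "/localscratch/$SLURM_JOB_ID"

# Inside the job, where the variable is set, I would expect:
os.environ["SLURM_JOB_ID"] = "15703262"
print(os.path.join(os.path.expandvars(prefix), "fs"))  # /localscratch/15703262/fs

# What the error suggests actually happened: the variable was empty
# (or replaced by an empty string) when the prefix was constructed.
os.environ["SLURM_JOB_ID"] = ""
print(os.path.join(os.path.expandvars(prefix), "fs"))  # /localscratch/fs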

The Snakemake log is like the one pasted above, just with more boilerplate. I noticed:

 Building DAG of jobs...
SLURM run ID: c32a7540-b225-401c-8ce3-7916a4fd0115
Using shell: /usr/bin/bash
Provided remote nodes: 9223372036854775807

What is this insane number for the provided remote nodes? (No, our cluster is slightly smaller ;-))

Cheers
Christian

@johanneskoester
Contributor

johanneskoester commented Jul 5, 2024

The problem was a premature replacement (by an empty string) of the SLURM job ID environment variable. The fix is here: snakemake/snakemake#2943. Basically, we now just use the base64-encoding mechanism of the Snakemake CLI to prevent any envvars from being evaluated by the shell when they are passed to the cluster backend.
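
Roughly, the idea is the following (a sketch, not the exact code from the PR): the value is base64-encoded before it is placed on the command line that is handed to the cluster backend, so no intermediate shell can expand $SLURM_JOB_ID; only the Snakemake process inside the job decodes it again, where the variable actually has a value.

import base64
import os

# Submit side: encode the raw value so that $SLURM_JOB_ID survives untouched
# through sbatch and any intermediate shells.
raw = "/localscratch/$SLURM_JOB_ID"
encoded = base64.b64encode(raw.encode()).decode()

# Job side: decode first, then expand, where SLURM_JOB_ID is actually defined.
decoded = base64.b64decode(encoded.encode()).decode()
print(os.path.expandvars(decoded))  # /localscratch/<actual job id>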

@cmeesters
Member Author

Great! Would you like to hold off and gather more fixes or features for the next release, or just release right away?

@cmeesters
Member Author

I'm afraid the issue was closed prematurely. It persists with Snakemake version 8.15.2.

@johanneskoester
Contributor

We definitely need a test case for this then. If the test case can be made without SLURM, it should go into snakemake/snakemake.

@cmeesters
Member Author

For this, the CI would need a tmp path containing an env variable. Do we have something like that?

@johanneskoester
Contributor

pytest allows you to get a tmp_path by adding it as a parameter to test functions. Does that help?
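
Something along these lines, as a rough sketch (test name and assertion are placeholders, not an existing test): tmp_path gives a fresh directory per test, and monkeypatch can inject the env variable that the prefix refers to.

import os

def test_storage_prefix_with_envvar(tmp_path, monkeypatch):
    # Simulate a job environment with a fake job id and a prefix containing an envvar.
    monkeypatch.setenv("SLURM_JOB_ID", "12345")
    prefix = f"{tmp_path}/$SLURM_JOB_ID"

    # The expanded prefix should contain the job id, not an empty string or a literal 'fs'.
    assert os.path.expandvars(prefix) == f"{tmp_path}/12345"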

@cmeesters
Member Author

Not sure; I will try as soon as I find time (read: hopefully tomorrow).

@johanneskoester
Contributor

For your current failures with this, could you post the slurm job log here?
