Scatter using Singularity fails creating SIF #5
Comments
Sometimes this is still happening. The file lock on the .sif that is created when the first job starts to convert the docker container doesn't seem to work right. I'm still not sure how to fix that.
I'm still not sure how to fix this. As a workaround, I have begun sending the workflow with a scatter of 1 first, so that a single job pulls the docker container and converts it; by the time the larger scatter runs, the image is already cached in Scratch.
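A minimal sketch of that warm-up idea as a plain shell step, assuming the shared cache lives on Scratch and borrowing the GATK image from the issue description (the path and filenames are illustrative, not what the backend actually uses):

```bash
# Hypothetical warm-up step run before the wide scatter is submitted:
# pull and convert the image once so later shards find it already built.
export SINGULARITY_CACHEDIR=/loc/scratch/singularity-cache   # assumed shared Scratch location
singularity pull gatk.sif docker://broadinstitute/gatk       # one pull, one conversion
```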
Just to clarify (it's been a while since I dug into this): we're still doing the image build as part of the ...

I've got the diy server up and running and have a minimal case for reproducing this. I'm not sure if the switch to the new home file server will change the behavior at all (i.e. will it still throw a "stale file handle"), but all of the shards are still trying to build the image concurrently, which will produce some... interesting... results.
Where is that lock file set? Is it set via ...?
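For reference, one way a lock like that can be taken at the shell level is with flock(1); this is only a sketch of the general mechanism, assuming all shards see the same filesystem path, and is not how our backend is actually configured:

```bash
# Hypothetical serialization of the image build: the first shard to take the
# lock converts the image; later shards block on the lock, then see the SIF
# already exists and skip the build. Paths and image are illustrative.
IMAGE=/loc/scratch/shared/gatk.sif
flock "${IMAGE}.lock" sh -c "[ -f '${IMAGE}' ] || singularity pull '${IMAGE}' docker://broadinstitute/gatk"
```

Whether a lock like this behaves reliably on an NFS-mounted directory is its own question, and may be exactly where the current lock falls down.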
By way of reference, it looks like there are others working on the issue: broadinstitute/cromwell#5063, and a possible helper in prepull.
I've put in a PR ... does this look like it might work for us here?
I'm not sure how to incorporate the biowdl ...
Part of the problem may be that ...

```
gizmok170[~]: echo $SINGULARITY_CACHEDIR/
/loc/scratch/49107063/
gizmok170[~]: singularity cache list -v
NAME                     DATE CREATED           SIZE       TYPE
There are 0 container file(s) using 0.00 kB and 0 oci blob file(s) using 0.00 kB of space
Total space used: 0.00 kB
gizmok170[~]: singularity pull docker://godlovedc/lolcow
FATAL: Image file already exists: "lolcow_latest.sif" - will not overwrite
gizmok170[~]: unset SINGULARITY_CACHEDIR
gizmok170[~]: singularity cache list -v
NAME                     DATE CREATED           SIZE       TYPE
lolcow_latest.sif        2022-02-14 16:09:49    92.23 MB   oci
3b61febd4aefe982e0cb9c   2022-02-14 16:09:17    0.85 kB    blob
73d5b1025fbfa138f2cacf   2022-02-14 16:09:21    3.41 kB    blob
7fac07fb303e0589b9c23e   2022-02-14 16:09:18    0.17 kB    blob
8e860504ff1ee5dc795367   2022-02-14 16:09:21    56.36 MB   blob
9d99b9777eb02b8943c0e7   2022-02-14 16:09:17    0.62 kB    blob
9fb6c798fa41e509b58bcc   2022-02-14 16:09:17    47.54 MB   blob
d010c8cf75d7eb5d2504d5   2022-02-14 16:09:18    0.85 kB    blob
f2a852991b0a36a9f3d6b2   2022-02-14 16:09:22    1.12 kB    blob
There are 1 container file(s) using 92.23 MB and 8 oci blob file(s) using 103.90 MB of space
Total space used: 196.12 MB
gizmok170[~]: singularity pull docker://godlovedc/lolcow
FATAL: Image file already exists: "lolcow_latest.sif" - will not overwrite
gizmok170[~]: singularity --version
singularity version 3.5.3
```

There is an issue from deep-dark history that suggests this should have been fixed by the time this came out, but it's apparent that ...
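As an aside on the "will not overwrite" failures in the transcript above: singularity pull has a --force flag to overwrite an existing image, and a plain existence check skips the rebuild entirely. A quick sketch using the same lolcow example:

```bash
# Only pull when the SIF is not already sitting in the working directory...
[ -f lolcow_latest.sif ] || singularity pull docker://godlovedc/lolcow

# ...or tell pull to overwrite an existing image instead of failing.
singularity pull --force docker://godlovedc/lolcow
```

Neither of these removes the underlying race, though: two shards can still pass the existence check (or start the forced pull) at the same moment.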
Oh. Well. So that's nice. Wow. So relying on some fancy smartness on the part of ...
I'm going to request an updated singularity to see if that's fixed in later versions.
When using `singularity exec` within tasks executed inside a scatter, there is a race condition when the Docker/Singularity image isn't in the cache. On NFS-mounted home directories this apparently results in a "stale NFS file handle" error.

To replicate it, it's necessary to remove the images from the Singularity cache (`~/.singularity`). It is also difficult to replicate with simple images (e.g. `ubuntu`); the Broad's GATK image seems to reproduce this error fairly reliably.
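A sketch of what a reproduction attempt might look like, assuming an NFS-mounted home directory and the GATK image mentioned above (the shard count is arbitrary):

```bash
# Clear the local Singularity cache so no shard finds a prebuilt image.
rm -rf ~/.singularity/cache

# Start several execs of the same uncached image at once; each one tries to
# pull and convert the image concurrently, which is the race described above.
for i in $(seq 1 8); do
  singularity exec docker://broadinstitute/gatk echo "shard $i" &
done
wait
```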